The journey, or rather, since I like to be fancy, the Odyssey towards earning the coveted title of "Data Scientist" has started.
It begins with Pandas: sadly not the animal, but the Python library. Pandas, at version 1.4.2 at the time of writing, is used to manipulate data. Not just any form of data, but tabulated data of the kind found in Excel sheets and CSV files.
Believe me, after seeing fellow colleagues go over thousands of rows of Excel data manually, my respect for Pandas grew astronomically (and so did my hope for myself).
Let's begin with the heart of this library, the DataFrame.
A DataFrame is essentially a table of data with labels attached for identification purposes: rows and columns, with column headers and a row index. Basically a 2D data structure.
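To make the "table with labels" idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The names, ages and row labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this example only.
df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "age": [31, 25, 42]},
    index=["r1", "r2", "r3"],  # custom row labels (the index)
)

print(df)          # the whole table
print(df.index)    # row labels
print(df.columns)  # column labels (the headers)
print(df.ndim)     # 2, since a DataFrame is two-dimensional
```

If no index is passed, pandas numbers the rows 0, 1, 2 and so on.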
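Pandas has built-in readers for many file formats, including:
- Comma-separated values (.csv)
- Excel (.xlsx)
- ZIP (compressed files such as a zipped CSV)
- Plain Text (.txt)
- JSON
- XML
- HTML (tables in a page)
- Hierarchical Data Format (HDF5)
- SQL
Images, DOCX, MP3 and MP4 files are not read by pandas directly; their contents have to be extracted with other libraries first.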
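As a rough sketch of how the readers for different formats share one pattern, the example below loads the same made-up table from CSV text and from JSON text. It uses in-memory strings instead of real files so that it runs anywhere; with real files you would pass a path instead.

```python
import io
import pandas as pd

# The same tiny, made-up table in two formats.
csv_text = "a,b\n1,2\n3,4\n"
json_text = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'

csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)

# Other readers follow the same naming, for example pd.read_excel,
# pd.read_xml, pd.read_html, pd.read_hdf and pd.read_sql. Some of them
# need extra packages installed (openpyxl, lxml and so on).
```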
The following lines of code help you understand your data and how pandas interprets it.
Reading the data and storing it in a variable
import pandas as pd
DataFrame = pd.read_csv('File path')
What type does pandas consider our variable to be?
type(DataFrame)
->pandas.core.frame.DataFrame
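Here is a small runnable version of the two steps above. Since the real file path is not known, this sketch first writes a made-up CSV to a temporary folder; the file name and its contents are assumptions for the example.

```python
import os
import tempfile
import pandas as pd

# Create a throwaway CSV so the example does not depend on any real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name
    with open(csv_path, "w") as f:
        f.write("city,population\nOslo,700000\nLima,10000000\n")
    df = pd.read_csv(csv_path)

print(type(df))          # the table as a whole is a DataFrame
print(type(df["city"]))  # a single column is a Series
```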
Display all the column labels of our df
DataFrame.columns
Number of rows and columns in our data, as a (rows, columns) tuple
DataFrame.shape
Size of our df, i.e. the total number of elements (rows x columns)
DataFrame.size
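A quick sketch of how columns, shape and size relate to each other, using a made-up table with 3 rows and 2 columns:

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

print(list(df.columns))  # ['a', 'b']
print(df.shape)          # (3, 2)
print(df.size)           # 6, which is n_rows * n_cols
```

Note that all three are attributes, not methods, so there are no parentheses after them.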
Setting the number of rows shown in Jupyter Notebook when a long df is truncated (x is a number of your choice)
pd.options.display.min_rows = x
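A minimal sketch of this option in action, assuming a made-up df with 100 rows. The min_rows setting only matters once the df has more rows than display.max_rows, so both are set here. pd.option_context is used instead of a plain assignment so that the settings are restored afterwards.

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # made-up data: 100 rows

# Inside the with-block the df is too long for max_rows (10),
# so pandas truncates the printout down to min_rows (4) rows.
with pd.option_context("display.max_rows", 10, "display.min_rows", 4):
    truncated_text = repr(df)

print(truncated_text)
```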
Get the first n rows (5 if n is not given)
DataFrame.head(), DataFrame.head(n)
Get the last n rows (5 if n is not given)
DataFrame.tail(), DataFrame.tail(n)
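To see head and tail side by side, here is a short sketch on a made-up df with 10 rows:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})  # made-up data: rows 0 to 9

first_default = df.head()  # no argument: first 5 rows
first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows

print(first_default)
print(first_three)
print(last_two)
```

Both methods return a new DataFrame and leave the original untouched.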
Get a summary of the df: index range, column names, non-null counts, data types and memory usage
DataFrame.info()
Data type of each column
DataFrame.dtypes
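A small sketch of info() and dtypes on a made-up df with mixed column types. One value is deliberately missing so the non-null counts in the info() summary differ between columns.

```python
import pandas as pd

# Hypothetical data with an integer, a float and a text column.
df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "price": [9.99, None, 4.50],  # None becomes NaN in a float column
        "label": ["x", "y", "z"],
    }
)

df.info()         # prints the summary and returns None
print(df.dtypes)  # one data type per column

missing_prices = int(df["price"].isna().sum())
print(missing_prices)  # 1
```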