Pandas (Summary Functions, Data Types, Missing Data)

Aman S
3 min read · Jul 4, 2022
Image made in Canva

Selecting the right data out of a dataset is critical to getting the work done. But the data does not always come in the format we want, so sometimes we have to do some work to reformat it for the task at hand.

DATA-SET LINK

We will be doing all the operations with this dataset.

I dropped the "Unnamed: 0" column from the dataset to keep the output easier to read.

Importing Libraries

import pandas as pd

Read and drop the unnamed column

df = pd.read_csv("winemag-data-130k-v2.csv")
df.drop("Unnamed: 0", axis=1, inplace=True)
df.head(n=5)
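As a side note, the "Unnamed: 0" column usually appears because the CSV stores the row index as its first column. The sketch below uses a small made-up CSV (the three rows are stand-ins, not the real file) to compare dropping that column with reading it as the index through the index_col parameter of read_csv.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for winemag-data-130k-v2.csv.
# The empty first header is what pandas names "Unnamed: 0".
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

# Option 1: read everything, then drop the extra column (as in the article).
df_dropped = pd.read_csv(io.StringIO(csv_text)).drop("Unnamed: 0", axis=1)

# Option 2: tell read_csv that the first column holds the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_dropped.columns))
print(list(df_indexed.columns))
```

Both approaches should leave the same three data columns, so which one to use is mostly a matter of taste.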

Summary Functions

Pandas provides many simple summary functions that condense the data into a more useful form.

Example: describe()

df.describe()

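The describe() method calculates summary statistics such as the count, mean, standard deviation, and percentiles for a Series or DataFrame. On a DataFrame with mixed data types it summarizes only the numeric columns by default. For an object (string) column it reports count, unique, top, and freq instead, and passing include="all" covers every column. For numeric data the output contains: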

  • count: the number of non-missing values.
  • mean: the average (mean) value.
  • std: the standard deviation.
  • min: the minimum value.
  • 25%: the 25th percentile.
  • 50%: the 50th percentile (the median).
  • 75%: the 75th percentile.
  • max: the maximum value.
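To see these statistics on something small enough to check by hand, here is a minimal sketch on a made-up four-row table. The df_demo name and its values are assumptions for illustration; only the column names mirror the wine dataset.

```python
import pandas as pd

# Hypothetical sample data, not taken from the real dataset.
df_demo = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US"],
    "points": [87, 87, 90, 92],
    "price": [float("nan"), 15.0, 14.0, 65.0],
})

# By default only the numeric columns (points, price) are summarized.
numeric_summary = df_demo.describe()

# A text column gets count, unique, top and freq instead.
text_summary = df_demo.country.describe()

# include="all" puts numeric and text columns in one table.
full_summary = df_demo.describe(include="all")

print(numeric_summary)
print(text_summary)
```

Notice that count for price is 3, not 4, because the missing value is skipped.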

If you want one particular summary statistic for a column in a DataFrame, you can call the matching pandas function on that column.

For example, to see the mean of the points allotted, we can call mean() on the points column, as in df.points.mean().
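Here is a small sketch of single-statistic calls on a made-up table (df_demo and its values are hypothetical). Besides mean(), it shows median(), unique(), and value_counts(), which are also standard pandas Series methods.

```python
import pandas as pd

# Hypothetical sample data with the same column names as the wine dataset.
df_demo = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US"],
    "points": [87, 87, 90, 92],
})

mean_points = df_demo.points.mean()      # average of the four scores
median_points = df_demo.points.median()  # middle value of the four scores

# For text columns, unique() lists distinct values and
# value_counts() counts how often each one occurs.
countries = df_demo.country.unique()
country_counts = df_demo.country.value_counts()

print(mean_points, median_points)
print(country_counts)
```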

Dtypes

The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to get the type of a specific column. For instance, we can get the dtype of the price column in the DataFrame with df.price.dtype. The dtypes property returns the dtype of every column in the DataFrame at once.

df.price.dtype
df.dtypes
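A quick sketch of both properties on a made-up table (df_demo is hypothetical). The integer result assumes a platform where pandas defaults to 64-bit integers, which is the common case.

```python
import pandas as pd

# Hypothetical sample data, not the real dataset.
df_demo = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [float("nan"), 15.0, 14.0],
})

# dtype works on a single column (a Series).
price_dtype = df_demo.price.dtype
points_dtype = df_demo.points.dtype

# dtypes works on the whole DataFrame and returns one entry per column.
all_dtypes = df_demo.dtypes

print(price_dtype)
print(points_dtype)
print(all_dtypes)
```

Columns made of text are reported with a string-like dtype, which in many pandas versions is shown as object.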

Changing the DataTypes

It is possible to convert a column of one type into another wherever such a conversion makes sense by using the astype() function.

We can transform the points column from its existing int64 data type into a float64 data type:

df.points.astype("float64")
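One thing that is easy to miss: astype() returns a new Series and leaves the original column untouched. The sketch below shows this on a made-up table (df_demo and the points_float column name are hypothetical).

```python
import pandas as pd

# Hypothetical sample data.
df_demo = pd.DataFrame({"points": [87, 87, 90]})

# astype() returns a converted copy.
points_as_float = df_demo.points.astype("float64")

# The original column keeps its integer dtype.
original_dtype = df_demo.points.dtype

# To keep the converted values, assign them back to the DataFrame.
df_demo["points_float"] = points_as_float

print(original_dtype)
print(df_demo.dtypes)
```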

Missing Data

Entries with missing values are given the value NaN, short for "Not a Number".

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()), like this:

df[pd.isnull(df.country)]
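To make the filtering concrete, here is a minimal sketch on a made-up table with one missing country (df_demo and its values are hypothetical). It also counts the missing entries with the isnull() method of the Series.

```python
import pandas as pd

# Hypothetical sample data with one missing country.
df_demo = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "points": [87, 88, 90],
})

# Rows where country is missing.
missing_country = df_demo[pd.isnull(df_demo.country)]

# Rows where country is present.
known_country = df_demo[pd.notnull(df_demo.country)]

# isnull() gives True/False per row, so sum() counts the missing ones.
n_missing = df_demo.country.isnull().sum()

print(missing_country)
print(n_missing)
```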

Replacing missing values is a common way to handle them.

Pandas provides a really handy method for this: fillna().

fillna() offers a few different strategies for filling in such data. For example, we can simply replace each NaN with "Unknown":

df.region_2.fillna("Unknown")
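Like astype(), fillna() returns a new Series by default, so the result has to be assigned back if you want to keep it. A small sketch on made-up data (df_demo and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data with one missing region.
df_demo = pd.DataFrame({"region_2": ["Napa", None, "Sonoma"]})

# fillna() returns a filled copy.
filled = df_demo.region_2.fillna("Unknown")

# The original column still has its missing value at this point.
missing_before = df_demo.region_2.isnull().sum()

# Assign the result back to make the change stick.
df_demo["region_2"] = filled
missing_after = df_demo.region_2.isnull().sum()

print(filled.tolist())
print(missing_before, missing_after)
```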

Know more about How to handle the missing values.

LEARN WITH ME. ✌️. Whatever thing I learn, I document it.
