Statistics Concepts in Data-Science

Aman S

4 min read · Jul 27, 2022

Why Statistics

Statistics summarizes information so that it is easier to interpret.

It also gives you an understanding of the data you are working with.

In this blog, we will work on a dataset to understand a few concepts of Statistics.

The dataset is a list of Genders, Heights, and Weights.

Import the required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the Data and get the first rows of the DataFrame

data = pd.read_csv('files/weight-height.csv')
data.head()

Count

Count is a descriptive statistic: the number of observations.
It is one of the most used statistics and is important for evaluating findings.
Example: suppose a study draws conclusions about childhood weights from only 12 children (observations). Is that trustworthy?
The count says something about the quality of the study.
Count the observations in each group to judge how much weight the results can bear.

data.groupby('Gender').count()
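Here is a small sketch of how count behaves, using a made-up DataFrame (the values are illustrative and not taken from weight-height.csv). It suggests that count() tallies only the non-missing observations in each column, so a group with missing values reports a smaller count than its number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the article's dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Height": [70.1, 68.4, np.nan, 63.2, 64.8],
    "Weight": [180.5, 172.0, 190.3, 130.1, np.nan],
})

# count() reports non-missing observations per column within each group.
counts = toy.groupby("Gender").count()
print(counts)

# size() reports rows per group, including rows with missing values.
sizes = toy.groupby("Gender").size()
print(sizes)
```

Comparing counts with sizes is a quick way to spot columns where a group has fewer usable observations than it first appears.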

Mean

The mean is the average of the values, computed over the requested axis (per column by default).

data.groupby('Gender').mean()
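To make "the requested axis" concrete, here is a minimal sketch on a made-up DataFrame (the numbers are illustrative, not from the dataset). It compares the default column-wise mean with a row-wise mean.

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset.
toy = pd.DataFrame({
    "Height": [60.0, 70.0, 80.0],
    "Weight": [100.0, 150.0, 200.0],
})

# axis=0 (the default): one mean per column.
column_means = toy.mean(axis=0)

# axis=1: one mean per row, averaging across columns.
row_means = toy.mean(axis=1)

print(column_means)
print(row_means)
```

A row-wise mean of height and weight mixes units and is shown only to illustrate the axis argument. The grouped mean above works column by column within each gender.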

Standard Deviation

Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
A low standard deviation means data is close to the mean.
A high standard deviation means data is spread out.

data.groupby('Gender').std()

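In the above diagram, we can observe that if we repeatedly take 3 samples of males, about 2 of them are likely to fall within one standard deviation of the mean. This follows from the fact that, for roughly normally distributed data, about 68% of values lie within one standard deviation of the mean.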
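The "2 out of 3" rule of thumb can be checked with a quick simulation. This is a sketch on simulated data: the mean of 69 and standard deviation of 3 are made-up values, and the data is assumed to be normally distributed.

```python
import numpy as np

# Simulated, normally distributed heights; loc and scale are made-up values.
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=69.0, scale=3.0, size=100_000)

mean = heights.mean()
std = heights.std(ddof=1)  # sample standard deviation, as pandas computes by default

# Share of observations within one standard deviation of the mean.
within_one_std = np.mean(np.abs(heights - mean) <= std)
print(f"share within one standard deviation: {within_one_std:.3f}")
```

The printed share should land close to 0.68, which is roughly 2 out of every 3 observations.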

Describe

Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

data.describe()
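To see what each row of the describe() output means, here is a minimal sketch on a made-up Series (the five values are illustrative, not from the dataset).

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset.
weights = pd.Series([100.0, 120.0, 140.0, 160.0, 180.0], name="Weight")

# describe() returns count, mean, std, min, the quartiles, and max.
summary = weights.describe()
print(summary)

# Individual statistics can be read off by label; "50%" is the median.
print(summary["count"], summary["mean"], summary["50%"])
```

The "25%", "50%", and "75%" rows are the quartiles that the box plots in the next section draw.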

Box Plots

Box plots are a great way to visualize descriptive statistics.
Notice that Q1 is the 25th percentile, Q2 is the 50th percentile (the median), and Q3 is the 75th percentile.
Box plots also help us identify outliers.
pandas can make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns.

data['Weight'].plot.box(vert=False)

data['Height'].plot.box(vert=False)

We can observe a few outliers in the above box plots. (Outliers should be investigated and, where they turn out to be errors, removed or otherwise handled before a model is built.)
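The points a box plot draws as outliers can also be listed in code. The sketch below uses the common 1.5 × IQR rule, which is the default whisker length in matplotlib box plots. The data is made up, with one planted outlier.

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset; 250.0 is a planted outlier.
weights = pd.Series([150.0, 155.0, 160.0, 162.0, 165.0, 170.0, 175.0, 250.0])

q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1  # interquartile range: the width of the box

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = weights[(weights < lower) | (weights > upper)]
print(outliers)
```

Listing the flagged rows first makes it easier to decide whether they are data errors or genuine extreme values before dropping anything.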

data.boxplot(column=['Height', 'Weight'])

BoxPlot grouped by gender

data.boxplot(column=['Height', 'Weight'], by='Gender')

Now we will look at one of the most important statistical properties.

Correlation

Correlation measures the linear relationship between two variables.
In a non-linear relationship, positive and negative associations can cancel out and give a correlation near zero even though the variables are related. A correlation of zero therefore means no linear relationship, not necessarily no relationship.
The correlation coefficient ranges from -1 to 1.
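The point about non-linear relationships can be illustrated with synthetic data. In this sketch, y_quadratic is completely determined by x, yet its Pearson correlation with x comes out at essentially zero because the relationship is not linear.

```python
import numpy as np

# Synthetic data, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

y_linear = 2.0 * x + 1.0   # perfectly linear relationship
y_quadratic = x ** 2       # fully determined by x, but not linear

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}")
```

This is why it helps to look at a plot of the data as well as the coefficient.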

A scatter plot is a great tool for identifying correlation.

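data.plot.scatter(x='Height', y='Weight', alpha=.1)

# Select the numeric columns; newer pandas versions do not silently drop the text column Gender
data[['Height', 'Weight']].corr()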

Correlation Matrix

A correlation matrix is simply a table showing the correlation coefficients between variables.

data.groupby('Gender').corr()

Using a heat map could be more helpful.

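import seaborn as sns

# Correlation matrix of the numeric columns (assumed to be what `correlation` refers to)
correlation = data[['Height', 'Weight']].corr()
axis_corr = sns.heatmap(
    correlation,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(50, 500, n=500),
    square=True)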

Thank you for reading to the end. If you found it helpful, give it some claps.

If you find any mistakes, feel free to mention them in the comments.

Check out my other blogs at amansinganamala.medium.com:

Machine Learning Roadmap

K Means Clustering Algorithm

Pandas (Indexing and Selecting)

Connect with me on Twitter.
