Statistics Concepts in Data-Science

Aman S
4 min readJul 27, 2022

Why Statistics

Statistics presents information in an easy way.

Gives you an understanding of the data you are working on.

In this blog, we will work on a dataset to understand a few concepts of Statistics.

The dataset is a list of Genders, Heights, and Weights.

Import the required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the Data and get the first rows of the DataFrame

data = pd.read_csv(‘files/weight-height.csv’)

Count

  • The count is descriptive statistics and counts observations.
  • The count is the most used in statistics and has high importance to evaluate findings.
  • Example: Making conclusion on childhood weights and the study only had 12 childing (observations). Is that trustworthy?
  • The count says something about the quality of the study.
  • Count groups to see the significance of the results.
data.groupby(‘Gender’).count()
Output

Mean

Return the mean of the values over the requested axis.

data.groupby(‘Gender’).mean()

Standard Deviation

  • Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
  • A low standard deviation means data is close to the mean.
  • A high standard deviation means data is spread out.
data.groupby(‘Gender’).std()

In the above diagram, we can observe if we take 3 samples of males(multiple times), 2 of them are likely to be within one standard deviation of the mean.

Describe

Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

data.describe()

Box Plots

  • Box plots are a great way to visualize descriptive statistics.
  • Notice that Q1: 25%, Q2: 50%, and Q3: 75%.
  • Box plots also help us to identify the Outliers.
  • Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns.
data[‘Weight’].plot.box(vert=False)
data[‘Height’].plot.box(vert=False)

We can observe a few outliers in the above BoxPlots.(Outliers should be removed for a proper model to be built.)

data.boxplot(column=[‘Height’, ‘Weight’])

BoxPlot grouped by gender

data.boxplot(column=[‘Height’, ‘Weight’], by=’Gender’)

Now we will look at one of the most important statistical properties.

Correlation

  • Correlation can be said to measure the linear relationship between two variables measured.
  • A non-linear relationship could cancel out positive and negative correlations which might end up with zero(no relationship between variables).
  • Measure the relationship between two variables and ranges from -1 to 1

SCATTER PLOT is a great tool for identifying correlation.

data.plot.scatter(x=’Height’, y=’Weight’, alpha=.1)
data.corr()

Correlation Matrix

A correlation matrix is simply a table showing the correlation coefficients between variables.

data.groupby('Gender').corr()

Using a heat map could be more helpful.

import seaborn as sns
axis_corr = sns.heatmap(
correlation,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(50, 500, n=500),
square=True)

--

--