10-Minute Exploratory Data Analysis with 3 Python Libraries

Introduction

When I’m unfamiliar with a dataset and don’t know what to explore, I find seaborn.pairplot useful for spotting relationships in the data. There are also several dedicated Exploratory Data Analysis (EDA) libraries; DataPrep, Pandas Profiling, and AutoViz are three popular ones.

I’m going to do a quick comparison of these libraries.

DataPrep

First of all, I need to install it in Jupyter Notebook (through Anaconda Navigator).

!conda install -y dataprep

DataPrep also provides an online demo in Colab.

Import libraries

from dataprep.datasets import load_dataset
from dataprep.datasets import get_dataset_names
from dataprep.eda import create_report
import warnings
warnings.filterwarnings('ignore')

# list all datasets
get_dataset_names()
['house_prices_test',
 'adult',
 'countries',
 'patient_info',
 'waste_hauler',
 'iris',
 'titanic',
 'house_prices_train',
 'covid19',
 'wine-quality-red']

Load dataset

Here I choose the dataset wine-quality-red.

df = load_dataset("wine-quality-red")
df.head()
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  class
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4      5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8      5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8      5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8      6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4      5

Analyze dataset

report = create_report(df)
report

[Figure: DataPrep report rendered in the notebook]

Pandas Profiling

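# Note: newer releases of this package are published as ydata-profiling
# (import path: ydata_profiling); the original name is kept here.
from pandas_profiling import ProfileReport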

profile = ProfileReport(df, title="Pandas Profiling Report")
profile

[Figure: Pandas Profiling report rendered in the notebook]

AutoViz

from autoviz.AutoViz_Class import AutoViz_Class
%matplotlib inline

AV = AutoViz_Class()
# The filename is left empty because the DataFrame is passed directly via dfte
dft = AV.AutoViz(
    "",
    dfte=df
)
dft

[Figures: seven plots generated by AutoViz]

All Plots done
Time to run AutoViz = 21 seconds 

 ###################### AUTO VISUALIZATION Completed ########################
      fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  class
   0            7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4      5
   1            7.8             0.880         0.00             2.6      0.098                 25.0                  67.0  0.99680  3.20       0.68      9.8      5
   2            7.8             0.760         0.04             2.3      0.092                 15.0                  54.0  0.99700  3.26       0.65      9.8      5
   3           11.2             0.280         0.56             1.9      0.075                 17.0                  60.0  0.99800  3.16       0.58      9.8      6
   4            7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4      5
 ...            ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...    ...
1594            6.2             0.600         0.08             2.0      0.090                 32.0                  44.0  0.99490  3.45       0.58     10.5      5
1595            5.9             0.550         0.10             2.2      0.062                 39.0                  51.0  0.99512  3.52       0.76     11.2      6
1596            6.3             0.510         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0      6
1597            5.9             0.645         0.12             2.0      0.075                 32.0                  44.0  0.99547  3.57       0.71     10.2      5
1598            6.0             0.310         0.47             3.6      0.067                 18.0                  42.0  0.99549  3.39       0.66     11.0      6

1599 rows × 12 columns

seaborn.pairplot

Similarly, seaborn’s pairplot function can be used to inspect the pairwise relationships between variables.

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, corner=True)
plt.show()

[Figure: seaborn pair plot of the dataset]

Some variable pairs show strong correlations, such as citric_acid and fixed_acidity, or density and fixed_acidity. This matches what the “Interactions” and “Correlations” sections of the DataPrep report show.
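To back up what the plots suggest with numbers, the strongest pairs can be ranked directly from a correlation matrix. The following is a minimal pandas-only sketch, not part of any of the three libraries: strongest_correlations and sample are hypothetical names, and sample contains only the five rows printed by df.head() above, so the coefficients it produces are illustrative and not representative of the full dataset.

```python
import numpy as np
import pandas as pd


def strongest_correlations(frame: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Return the top_n column pairs ranked by absolute Pearson correlation.

    Hypothetical helper for illustration only.
    """
    corr = frame.select_dtypes("number").corr()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Rank by magnitude but report the signed coefficient.
    ranked = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(ranked).head(top_n)


# Three columns from the five rows shown by df.head(); far too few rows for real conclusions.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "citric_acid": [0.00, 0.00, 0.04, 0.56, 0.00],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978],
})
print(strongest_correlations(sample, top_n=3))
```

On the real data, calling strongest_correlations(df) on the full DataFrame loaded earlier would give the ranking for all 1599 rows.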

Conclusion

DataPrep, Pandas Profiling, and AutoViz are similar to a large extent. Both DataPrep and Pandas Profiling visualize missing values, which is useful. For now I prefer DataPrep because of its concise UI, especially the Interactions section.
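For a quick text-only check of missing values before opening a full report, plain pandas is enough. This is a rough sketch rather than a replacement for the visual sections of either library; missing_summary and demo are hypothetical names, and demo is a made-up frame used only to show the output shape.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst column first.

    Hypothetical helper for illustration only.
    """
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up data (not from the wine dataset), only to demonstrate the function.
demo = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(demo))
```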

I’m thinking about wrapping these libraries in my own helper functions to speed up data analysis.
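One possible shape for such a helper is a small dispatcher that runs whichever tool is requested. This is only a sketch under assumptions: run_eda and its tool argument are hypothetical names, and the library calls are the same ones used earlier in this post. The imports are deferred so that only the selected library needs to be installed.

```python
def run_eda(frame, tool="dataprep"):
    """Run one of the three EDA tools on frame and return its result object.

    Hypothetical helper; the library calls mirror the ones used in this post.
    """
    if tool == "dataprep":
        from dataprep.eda import create_report
        return create_report(frame)
    if tool == "pandas_profiling":
        from pandas_profiling import ProfileReport
        return ProfileReport(frame, title="Pandas Profiling Report")
    if tool == "autoviz":
        from autoviz.AutoViz_Class import AutoViz_Class
        # Empty filename: the DataFrame is passed directly through dfte.
        return AutoViz_Class().AutoViz("", dfte=frame)
    raise ValueError(f"Unknown EDA tool: {tool!r}")


# Example usage in a notebook cell (requires the chosen library to be installed):
# report = run_eda(df, tool="dataprep")
# report
```

Keeping the tool name as a plain string makes it easy to switch between the three libraries in a notebook without rewriting the cell.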