10 Minutes Exploratory Data Analysis with 3 Python Libraries
Introduction
When I’m unfamiliar with a dataset and don’t know what to explore, I find seaborn.pairplot useful for spotting relationships in the data. There are also several dedicated Exploratory Data Analysis (EDA) libraries; DataPrep, Pandas Profiling, and AutoViz are three popular ones.
I’m going to do a quick comparison of these libraries.
DataPrep
First of all, I need to install it in Jupyter Notebook (through Anaconda Navigator).
!conda install -y dataprep
DataPrep also provides an online demo in Colab.
Import libraries
from dataprep.datasets import load_dataset
from dataprep.datasets import get_dataset_names
from dataprep.eda import create_report
import warnings
warnings.filterwarnings('ignore')
# list all datasets
get_dataset_names()
['house_prices_test',
'adult',
'countries',
'patient_info',
'waste_hauler',
'iris',
'titanic',
'house_prices_train',
'covid19',
'wine-quality-red']
Load dataset
Here I choose the dataset wine-quality-red.
df = load_dataset("wine-quality-red")
df.head()
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
Analyze dataset
report = create_report(df)
report

Pandas Profiling
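# Note: newer releases of this package are published as ydata-profiling
# (import name ydata_profiling); pandas_profiling is the older name.
from pandas_profiling import ProfileReport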
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

AutoViz
from autoviz.AutoViz_Class import AutoViz_Class
%matplotlib inline
AV = AutoViz_Class()
# An empty filename makes AutoViz use the DataFrame passed as dfte instead of reading a file
dft = AV.AutoViz(
    "",
    dfte=df,
)
dft







All Plots done
Time to run AutoViz = 21 seconds
###################### AUTO VISUALIZATION Completed ########################
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
seaborn.pairplot
seaborn can also show the relationships between variables, using its pairplot function.
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df, corner=True)
plt.show()

Some variables show strong correlations, such as citric_acid and fixed_acidity, or density and fixed_acidity. This matches what the “Interactions” and “Correlations” sections of the DataPrep report show.
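To back up what the pair plot suggests with numbers, the correlations can also be ranked directly with pandas. The sketch below is only an illustration: the helper name strongest_correlations is my own, and the demo frame holds just three columns from the five rows printed by df.head() above, which is far too few rows to draw conclusions from. In the notebook you would pass the full df instead.

```python
import numpy as np
import pandas as pd


def strongest_correlations(df: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Return up to top_n column pairs ranked by absolute Pearson correlation.

    Hypothetical helper for illustration; not part of any of the libraries above.
    """
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude so strong negative correlations rank as high as positive ones.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)


# Demo on the five rows shown by df.head(); use the full df for real analysis.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "citric_acid": [0.00, 0.00, 0.04, 0.56, 0.00],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978],
})
print(strongest_correlations(sample, top_n=3))
```

The result is a Series indexed by column pairs, so the top entries can be compared directly with the pairs that stand out in the pair plot.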
Conclusion
DataPrep, Pandas Profiling, and AutoViz are similar to a large extent. Both DataPrep and Pandas Profiling visualize missing values, which is useful. For now I prefer DataPrep because of its concise UI, especially the Interactions section.
I’m thinking about wrapping the libraries above in my own helper functions to speed up data analysis.
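As a starting point for that kind of helper, here is a minimal sketch that uses only pandas. The function name quick_overview and the sample values are hypothetical, chosen just for illustration; a real version would probably also call one of the report libraries above.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing count, missing percent, unique values.

    Hypothetical helper for illustration; not part of any of the libraries above.
    """
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),  # nunique ignores missing values by default
    })


# Made-up sample with one missing value, only to show the shape of the output.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 9.8],
    "class": [5, 5, 5, 6],
})
print(quick_overview(sample))
```

The output has one row per column of the input, which makes it easy to scan for columns with many missing values before running a heavier report.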