Four R packages for Automated Exploratory Data Analysis

Introduction

Exploratory data analysis (EDA) is a critical step in any data science workflow. Inspired by an article on Medium, I’d like to explore the four most popular R packages for EDA, ranked by download count. The dataset comes from my project Chicago Bike-Share Analysis; to keep things fast, I’m going to work with a 10% sample of the original data.

Import the data

library(tidyverse)
# for clean output, suppress the message showing the guessed column types
options(readr.show_col_types = FALSE)
set.seed(1)

# list.files() takes a regular expression, so anchor the ".csv" extension
df <- list.files(path = "Bike-Share-Data", pattern = "\\.csv$", full.names = TRUE) %>% 
  lapply(read_csv) %>%
  bind_rows()
# keep a random 10% of the original dataset
df_sample <- df %>% 
  sample_n(round(nrow(df) / 10), replace = FALSE)

head(df_sample)

A tibble: 6 × 13
ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual
<chr>	<chr>	<dttm>	<dttm>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>
D19CD62061C56CFC	classic_bike	2021-08-22 16:52:48	2021-08-22 17:28:14	Wood St & Hubbard St	13432	Wood St & Hubbard St	13432	41.88990	-87.67147	41.88990	-87.67147	member
EB24C059C9BEE023	electric_bike	2021-08-20 22:34:09	2021-08-20 23:06:58	Dearborn St & Erie St	13045	California Ave & Division St	13256	41.89419	-87.62946	41.90303	-87.69746	casual
064A7468D993FE0B	electric_bike	2022-04-14 07:15:08	2022-04-14 07:24:14	NA	NA	NA	NA	41.94000	-87.70000	41.95000	-87.72000	member
2E3852512AC429F6	classic_bike	2021-06-08 18:20:49	2021-06-08 18:42:45	Fort Dearborn Dr & 31st St	TA1307000048	Field Blvd & South Water St	15534	41.83856	-87.60822	41.88635	-87.61752	member
ACF8FF1A88DC6FC0	electric_bike	2021-10-12 14:36:18	2021-10-12 14:44:56	Wabash Ave & Wacker Pl	TA1307000131	Canal St & Adams St	13011	41.88686	-87.62648	41.87947	-87.64021	casual
BE48602F3CF4EE17	classic_bike	2021-07-04 12:57:19	2021-07-04 13:10:02	Clark St & Berwyn Ave	KA1504000146	Broadway & Thorndale Ave	15575	41.97800	-87.66805	41.98974	-87.66014	casual
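
One detail of the import step is worth a quick check: list.files() reads its pattern argument as a regular expression, not as a shell-style glob, so a glob-like pattern such as "*.csv" is not anchored to the end of the file name. The sketch below uses made-up file names (an assumption for illustration, not files from the project) to show an anchored regex, and how utils::glob2rx() can translate a glob when that notation is preferred.

```r
# Hypothetical file names, for illustration only
files <- c("202105-tripdata.csv", "notes_csv.txt", "202105-tripdata.csv.bak")

# Anchored regex: a literal dot, then "csv", at the very end of the name
strict <- grepl("\\.csv$", files)

# glob2rx() converts a shell-style glob into an equivalent anchored regex
from_glob <- grepl(utils::glob2rx("*.csv"), files)

# Only the real CSV file is kept by either approach
files[strict]
```

Either form can be passed as the pattern argument of list.files().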

Rank packages

Putatunda et al. (ref link) shared an insightful comparison between SmartEDA and other similar EDA packages available on CRAN.

Let’s rank these cited packages based on the downloads in the last 12 months.

library("dlstats")

# rank popular EDA related packages
stats <- cran_stats(c("SmartEDA", "DataExplorer", "tableone", "GGally", "Hmisc", 
                      "exploreR", "dlookr", "desctable", "summarytools"))

stats %>%
  filter(start >= "2021-08-01" & end < "2022-08-01") %>%
  select(package, downloads) %>%
  group_by(package) %>% 
  summarize(downloads = sum(downloads)) %>%
  arrange(desc(downloads))

A tibble: 9 × 2
package	downloads
<fct>	<int>
Hmisc	9741969
GGally	890258
DataExplorer	207483
SmartEDA	197655
summarytools	178863
tableone	144601
dlookr	44637
desctable	7015
exploreR	4800

I’m going to try the top 4 packages: Hmisc, GGally, DataExplorer, and SmartEDA.
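
Picking the top four can also be done in code rather than by eye. This is a minimal base R sketch that re-enters the download counts from the table above by hand; the object names (downloads, ranked, top4) are my own and not part of dlstats.

```r
# Download counts copied from the ranking output above
downloads <- data.frame(
  package = c("Hmisc", "GGally", "DataExplorer", "SmartEDA", "summarytools",
              "tableone", "dlookr", "desctable", "exploreR"),
  downloads = c(9741969, 890258, 207483, 197655, 178863, 144601, 44637, 7015, 4800),
  stringsAsFactors = FALSE
)

# Sort by downloads (descending) and keep the first four package names
ranked <- downloads[order(downloads$downloads, decreasing = TRUE), ]
top4 <- head(ranked$package, 4)
top4
```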

Hmisc

Hmisc has no function that creates an overall report of the input dataset; we have to call specific functions for specific needs. describe() is one of its most popular functions, so let’s give it a try.

library(Hmisc)

# describe all variables in a data frame
describe(df_sample)

Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

    src, summarize

The following objects are masked from ‘package:base’:

    format.pval, units

df_sample 

 13  Variables      575755  Observations
--------------------------------------------------------------------------------
ride_id 
       n  missing distinct 
  575755        0   575755 

lowest : 00000B4F1F71F9C2 000022C3D3CE7DD5 000043B681BFB305 0000453517CABB51 00005B17AF7201F6
highest: FFFFA5E4DD5C0668 FFFFB13489EEB3AE FFFFCCFF4A678336 FFFFCD8626C9CC4A FFFFD1346B5EADA0
--------------------------------------------------------------------------------
rideable_type 
       n  missing distinct 
  575755        0        3 
                                                    
Value       classic_bike   docked_bike electric_bike
Frequency         319986         29227        226542
Proportion         0.556         0.051         0.393
--------------------------------------------------------------------------------
started_at 
                  n             missing            distinct                Info 
             575755                   0              564875                   1 
               Mean                 Gmd                 .05                 .10 
2021-09-18 19:29:31             9129862 2021-05-20 15:18:44 2021-06-03 10:19:11 
                .25                 .50                 .75                 .90 
2021-07-07 14:56:31 2021-08-31 18:40:37 2021-11-04 06:21:56 2022-03-09 19:17:13 
                .95 
2022-04-09 14:28:10 

lowest : 2021-05-01 00:01:07 2021-05-01 00:05:30 2021-05-01 00:10:01 2021-05-01 00:12:46 2021-05-01 00:13:11
highest: 2022-04-30 23:55:08 2022-04-30 23:55:12 2022-04-30 23:57:38 2022-04-30 23:58:03 2022-04-30 23:58:34
--------------------------------------------------------------------------------
ended_at 
                  n             missing            distinct                Info 
             575755                   0              564658                   1 
               Mean                 Gmd                 .05                 .10 
2021-09-18 19:50:28             9129636 2021-05-20 15:41:39 2021-06-03 10:39:22 
                .25                 .50                 .75                 .90 
2021-07-07 15:17:55 2021-08-31 18:58:01 2021-11-04 06:37:02 2022-03-09 19:34:00 
                .95 
2022-04-09 14:42:12 

lowest : 2021-05-01 00:11:51 2021-05-01 00:15:29 2021-05-01 00:16:23 2021-05-01 00:18:02 2021-05-01 00:28:49
highest: 2022-05-01 00:14:02 2022-05-01 00:15:21 2022-05-01 00:15:24 2022-05-01 00:16:27 2022-05-01 01:32:41
--------------------------------------------------------------------------------
start_station_name 
       n  missing distinct 
  497037    78718      855 

lowest : 2112 W Peterson Ave          63rd St Beach                900 W Harrison St            Aberdeen St & Jackson Blvd   Aberdeen St & Monroe St     
highest: Woodlawn Ave & 55th St       Woodlawn Ave & 75th St       Woodlawn Ave & Lake Park Ave Yates Blvd & 75th St         Yates Blvd & 93rd St        
--------------------------------------------------------------------------------
start_station_id 
       n  missing distinct 
  497037    78718      847 

lowest : 13001                                 13006                                 13008                                 13011                                 13016                                
highest: Throop/Hastings Mobile Station        Wilton Ave & Diversey Pkwy - Charging WL-008                                WL-011                                WL-012                               
--------------------------------------------------------------------------------
end_station_name 
       n  missing distinct 
  491608    84147      847 

lowest : 2112 W Peterson Ave          63rd St Beach                900 W Harrison St            Aberdeen St & Jackson Blvd   Aberdeen St & Monroe St     
highest: Woodlawn Ave & 55th St       Woodlawn Ave & 75th St       Woodlawn Ave & Lake Park Ave Yates Blvd & 75th St         Yates Blvd & 93rd St        
--------------------------------------------------------------------------------
end_station_id 
       n  missing distinct 
  491608    84147      839 

lowest : 13001                                 13006                                 13008                                 13011                                 13016                                
highest: Throop/Hastings Mobile Station        Wilton Ave & Diversey Pkwy - Charging WL-008                                WL-011                                WL-012                               
--------------------------------------------------------------------------------
start_lat 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575755        0   116555        1     41.9  0.04922    41.80    41.86 
     .25      .50      .75      .90      .95 
   41.88    41.90    41.93    41.95    41.97 

lowest : 41.64850 41.64850 41.64854 41.64855 41.64857
highest: 42.06478 42.06478 42.06481 42.06485 42.07000
--------------------------------------------------------------------------------
start_lng 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575755        0   113875        1   -87.65  0.03164   -87.70   -87.68 
     .25      .50      .75      .90      .95 
  -87.66   -87.64   -87.63   -87.62   -87.60 

lowest : -87.84000 -87.83000 -87.82931 -87.82000 -87.81000
highest: -87.52838 -87.52836 -87.52823 -87.52823 -87.52000
--------------------------------------------------------------------------------
end_lat 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575287      468    94509        1     41.9  0.04935    41.80    41.86 
     .25      .50      .75      .90      .95 
   41.88    41.90    41.93    41.95    41.97 

lowest : 41.56000 41.57000 41.60000 41.62000 41.64000
highest: 42.06486 42.06497 42.07000 42.08000 42.11000
--------------------------------------------------------------------------------
end_lng 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575287      468    92145        1   -87.65  0.03186   -87.70   -87.68 
     .25      .50      .75      .90      .95 
  -87.66   -87.64   -87.63   -87.62   -87.60 

lowest : -87.86000 -87.84000 -87.83000 -87.82000 -87.81000
highest: -87.52823 -87.52823 -87.52452 -87.52000 -87.50000
--------------------------------------------------------------------------------
member_casual 
       n  missing distinct 
  575755        0        2 
                        
Value      casual member
Frequency  254015 321740
Proportion  0.441  0.559
--------------------------------------------------------------------------------

It does provide a quick view of all the variables, but I’m not impressed: the result reads like a combination of several functions, such as summary() and str(), and the output format is a little hard to read.
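
If the goal is only the headline numbers (non-missing, missing, distinct), one table with a row per column is easier to scan. The sketch below is my own base R illustration, not part of Hmisc: overview() is a hypothetical helper, and the toy data frame merely stands in for df_sample.

```r
# Toy data standing in for df_sample (values chosen for illustration)
toy <- data.frame(
  rideable_type = c("classic_bike", "electric_bike", "classic_bike", "docked_bike"),
  start_station_name = c("Wood St & Hubbard St", NA, NA, "Clark St & Berwyn Ave"),
  start_lat = c(41.89, 41.94, 41.84, 41.98),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per column with type, n, missing, distinct
overview <- function(data) {
  data.frame(
    variable = names(data),
    type     = vapply(data, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
    n        = vapply(data, function(x) sum(!is.na(x)), integer(1), USE.NAMES = FALSE),
    missing  = vapply(data, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
    distinct = vapply(data, function(x) length(unique(x[!is.na(x)])), integer(1),
                      USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

result <- overview(toy)
result
```

Calling overview(df_sample) should give the same three counts that describe() prints at the top of each block, just in a single table.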

GGally

library(GGally)

options(repr.plot.width = 14, repr.plot.height = 10)

# show the interactions of each variable with each of the others
# distinguish by member_casual
df_sample %>% 
  # keep member_casual in the data so aes() can refer to it by name
  select(rideable_type:ended_at, start_lat:end_lng, member_casual) %>%
  ggpairs(columns = 1:7, mapping = aes(color = member_casual, alpha = 0.5))

GGally ggpairs plot, colored by member_casual

The image above suggests that the two rider groups (casual and member) differ in their distributions of rideable_type, started_at, and ended_at. That’s a good starting point.
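
A visual impression like this can be followed up with a plain cross-tabulation. The sketch below is a base R illustration on a tiny made-up data frame (the toy values are assumptions, not results from the bike-share data); with the real data, the same two calls could be applied to df_sample.

```r
# Toy data standing in for df_sample (values are made up for illustration)
toy <- data.frame(
  rideable_type = c("classic_bike", "electric_bike", "classic_bike",
                    "docked_bike", "electric_bike", "classic_bike"),
  member_casual = c("member", "casual", "member", "casual", "member", "casual"),
  stringsAsFactors = FALSE
)

# Counts of each bike type per rider type
counts <- table(toy$rideable_type, toy$member_casual)

# Column-wise proportions: each rider type's column sums to 1
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```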

DataExplorer

library(DataExplorer)

df_sample %>% create_report(
  # avoid ":" and spaces in the timestamp so the file name is valid everywhere
  output_file = paste0("Report_", format(Sys.time(), "%Y-%m-%d_%H-%M-%S"), ".html"),
  report_title = "EDA Report - Chicago Bike Share"
)

It generated an .html file in the working directory. Below is a screenshot:

Data Explorer EDA Report

That report is very informative. I like its Missing Data Profile and Bar Chart.
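
At its core, a missing data profile is the share of missing values per column. As a rough base R sketch of that idea, the snippet below reuses the n and missing counts from the describe() output earlier in this post; the object names are my own. DataExplorer also offers this as a standalone function, plot_missing(), which is not called here.

```r
# Missing counts copied from the describe() output earlier in this post
n_rows <- 575755
missing <- c(
  start_station_name = 78718,
  start_station_id   = 78718,
  end_station_name   = 84147,
  end_station_id     = 84147,
  end_lat            = 468,
  end_lng            = 468
)

# Share of missing values per column, largest first, as a percentage
pct_missing <- sort(100 * missing / n_rows, decreasing = TRUE)
round(pct_missing, 2)
```

Columns not listed here (ride_id, rideable_type, and so on) have no missing values in the sample.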

SmartEDA

library(SmartEDA)

df_sample %>% ExpReport(op_file="SmartEDA_Report.html")

It also generated an .html file. Below is a screenshot:

SmartEDA Report

It’s very similar to DataExplorer’s report, but I think its ExpCatViz() function (section 5, “Distributions of categorical variables”, in its report) is better than DataExplorer’s plot_bar() (“Bar Chart (with frequency)”), because the former shows percentages instead of raw frequencies.
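
To make the frequency-versus-percentage point concrete, here is a small base R sketch that converts the member_casual counts from the describe() output into percentages. It only illustrates the arithmetic and is not how either package computes its charts.

```r
# Frequencies copied from the member_casual block of the describe() output
freq <- c(casual = 254015, member = 321740)

# Convert counts to percentages of the total
pct <- 100 * freq / sum(freq)
round(pct, 1)
```

The percentages agree with the Proportion row that describe() printed (0.441 and 0.559), and they are easier to compare across groups than the raw counts.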

The end

All of these packages can help us get a general idea of the dataset. Both DataExplorer and SmartEDA can create a report with just one function call, which is super convenient, but none of them is the answer to every question. We need to combine specific functions from these packages into our own EDA functions, and then dig deeper.

Qin Liu