Introduction

Exploratory data analysis (EDA) is a critical step in any data science workflow. Inspired by an article on Medium, I’d like to explore the four most popular R EDA packages, ranked by their download counts. The dataset comes from my project Chicago Bike-Share Analysis; to keep things fast, I’m going to work with a 10% random sample of it.

Import the data

library(tidyverse)
# for clean output, suppress the message that shows the guessed column types
options(readr.show_col_types = FALSE)
set.seed(1)

# `pattern` is a regular expression, not a glob, so match the extension explicitly
df <- list.files(path = "Bike-Share-Data", pattern = "\\.csv$", full.names = TRUE) %>% 
  lapply(read_csv) %>%
  bind_rows
# keep a 10% random sample of the original dataset (without replacement)
df_sample <- df %>% 
  sample_n(round(nrow(df) / 10))

head(df_sample)
A tibble: 6 × 13

| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <dttm> | <dttm> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
| D19CD62061C56CFC | classic_bike | 2021-08-22 16:52:48 | 2021-08-22 17:28:14 | Wood St & Hubbard St | 13432 | Wood St & Hubbard St | 13432 | 41.88990 | -87.67147 | 41.88990 | -87.67147 | member |
| EB24C059C9BEE023 | electric_bike | 2021-08-20 22:34:09 | 2021-08-20 23:06:58 | Dearborn St & Erie St | 13045 | California Ave & Division St | 13256 | 41.89419 | -87.62946 | 41.90303 | -87.69746 | casual |
| 064A7468D993FE0B | electric_bike | 2022-04-14 07:15:08 | 2022-04-14 07:24:14 | NA | NA | NA | NA | 41.94000 | -87.70000 | 41.95000 | -87.72000 | member |
| 2E3852512AC429F6 | classic_bike | 2021-06-08 18:20:49 | 2021-06-08 18:42:45 | Fort Dearborn Dr & 31st St | TA1307000048 | Field Blvd & South Water St | 15534 | 41.83856 | -87.60822 | 41.88635 | -87.61752 | member |
| ACF8FF1A88DC6FC0 | electric_bike | 2021-10-12 14:36:18 | 2021-10-12 14:44:56 | Wabash Ave & Wacker Pl | TA1307000131 | Canal St & Adams St | 13011 | 41.88686 | -87.62648 | 41.87947 | -87.64021 | casual |
| BE48602F3CF4EE17 | classic_bike | 2021-07-04 12:57:19 | 2021-07-04 13:10:02 | Clark St & Berwyn Ave | KA1504000146 | Broadway & Thorndale Ave | 15575 | 41.97800 | -87.66805 | 41.98974 | -87.66014 | casual |
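
As a side note, newer versions of dplyr offer slice_sample() as the successor to sample_n(). Here is a minimal sketch of the same 10% sampling step on a toy table; the `rides` data frame and its values are made up for illustration, and with the same seed it will not necessarily pick the same rows as sample_n().

```r
library(dplyr)

set.seed(1)

# Toy stand-in for the full ride table (hypothetical data)
rides <- tibble(
  ride_id = sprintf("R%04d", 1:1000),
  member_casual = sample(c("member", "casual"), 1000, replace = TRUE)
)

# Keep 10% of the rows, sampled without replacement (the default)
rides_sample <- rides %>% slice_sample(prop = 0.1)

nrow(rides_sample)
```

Using `prop` instead of a hand-computed row count keeps the intent (a fraction of the data) visible in the code.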

Rank packages

Putatunda et al. (ref link) shared an insightful comparison between SmartEDA and similar EDA packages available on CRAN.

Let’s rank these cited packages based on the downloads in the last 12 months.

library("dlstats")

# rank popular EDA related packages
stats <- cran_stats(c("SmartEDA", "DataExplorer", "tableone", "GGally", "Hmisc", 
                      "exploreR", "dlookr", "desctable", "summarytools"))

stats %>%
  filter(start >= "2021-08-01" & end < "2022-08-01") %>%
  select(package, downloads) %>%
  group_by(package) %>% 
  summarize(downloads = sum(downloads)) %>%
  arrange(desc(downloads))
A tibble: 9 × 2

| package | downloads |
|---|---|
| <fct> | <int> |
| Hmisc | 9741969 |
| GGally | 890258 |
| DataExplorer | 207483 |
| SmartEDA | 197655 |
| summarytools | 178863 |
| tableone | 144601 |
| dlookr | 44637 |
| desctable | 7015 |
| exploreR | 4800 |

I’m going to try the top 4 packages: Hmisc, GGally, DataExplorer, and SmartEDA.

Hmisc

Hmisc has no single function that creates an overall report of the input dataset; we have to call specific functions for specific needs. describe() is one of its most popular functions, so let’s give it a try.

library(Hmisc)

# describe all variables in a data frame
describe(df_sample)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

    src, summarize

The following objects are masked from ‘package:base’:

    format.pval, units

df_sample 

 13  Variables      575755  Observations
--------------------------------------------------------------------------------
ride_id 
       n  missing distinct 
  575755        0   575755 

lowest : 00000B4F1F71F9C2 000022C3D3CE7DD5 000043B681BFB305 0000453517CABB51 00005B17AF7201F6
highest: FFFFA5E4DD5C0668 FFFFB13489EEB3AE FFFFCCFF4A678336 FFFFCD8626C9CC4A FFFFD1346B5EADA0
--------------------------------------------------------------------------------
rideable_type 
       n  missing distinct 
  575755        0        3 
                                                    
Value       classic_bike   docked_bike electric_bike
Frequency         319986         29227        226542
Proportion         0.556         0.051         0.393
--------------------------------------------------------------------------------
started_at 
                  n             missing            distinct                Info 
             575755                   0              564875                   1 
               Mean                 Gmd                 .05                 .10 
2021-09-18 19:29:31             9129862 2021-05-20 15:18:44 2021-06-03 10:19:11 
                .25                 .50                 .75                 .90 
2021-07-07 14:56:31 2021-08-31 18:40:37 2021-11-04 06:21:56 2022-03-09 19:17:13 
                .95 
2022-04-09 14:28:10 

lowest : 2021-05-01 00:01:07 2021-05-01 00:05:30 2021-05-01 00:10:01 2021-05-01 00:12:46 2021-05-01 00:13:11
highest: 2022-04-30 23:55:08 2022-04-30 23:55:12 2022-04-30 23:57:38 2022-04-30 23:58:03 2022-04-30 23:58:34
--------------------------------------------------------------------------------
ended_at 
                  n             missing            distinct                Info 
             575755                   0              564658                   1 
               Mean                 Gmd                 .05                 .10 
2021-09-18 19:50:28             9129636 2021-05-20 15:41:39 2021-06-03 10:39:22 
                .25                 .50                 .75                 .90 
2021-07-07 15:17:55 2021-08-31 18:58:01 2021-11-04 06:37:02 2022-03-09 19:34:00 
                .95 
2022-04-09 14:42:12 

lowest : 2021-05-01 00:11:51 2021-05-01 00:15:29 2021-05-01 00:16:23 2021-05-01 00:18:02 2021-05-01 00:28:49
highest: 2022-05-01 00:14:02 2022-05-01 00:15:21 2022-05-01 00:15:24 2022-05-01 00:16:27 2022-05-01 01:32:41
--------------------------------------------------------------------------------
start_station_name 
       n  missing distinct 
  497037    78718      855 

lowest : 2112 W Peterson Ave          63rd St Beach                900 W Harrison St            Aberdeen St & Jackson Blvd   Aberdeen St & Monroe St     
highest: Woodlawn Ave & 55th St       Woodlawn Ave & 75th St       Woodlawn Ave & Lake Park Ave Yates Blvd & 75th St         Yates Blvd & 93rd St        
--------------------------------------------------------------------------------
start_station_id 
       n  missing distinct 
  497037    78718      847 

lowest : 13001                                 13006                                 13008                                 13011                                 13016                                
highest: Throop/Hastings Mobile Station        Wilton Ave & Diversey Pkwy - Charging WL-008                                WL-011                                WL-012                               
--------------------------------------------------------------------------------
end_station_name 
       n  missing distinct 
  491608    84147      847 

lowest : 2112 W Peterson Ave          63rd St Beach                900 W Harrison St            Aberdeen St & Jackson Blvd   Aberdeen St & Monroe St     
highest: Woodlawn Ave & 55th St       Woodlawn Ave & 75th St       Woodlawn Ave & Lake Park Ave Yates Blvd & 75th St         Yates Blvd & 93rd St        
--------------------------------------------------------------------------------
end_station_id 
       n  missing distinct 
  491608    84147      839 

lowest : 13001                                 13006                                 13008                                 13011                                 13016                                
highest: Throop/Hastings Mobile Station        Wilton Ave & Diversey Pkwy - Charging WL-008                                WL-011                                WL-012                               
--------------------------------------------------------------------------------
start_lat 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575755        0   116555        1     41.9  0.04922    41.80    41.86 
     .25      .50      .75      .90      .95 
   41.88    41.90    41.93    41.95    41.97 

lowest : 41.64850 41.64850 41.64854 41.64855 41.64857
highest: 42.06478 42.06478 42.06481 42.06485 42.07000
--------------------------------------------------------------------------------
start_lng 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575755        0   113875        1   -87.65  0.03164   -87.70   -87.68 
     .25      .50      .75      .90      .95 
  -87.66   -87.64   -87.63   -87.62   -87.60 

lowest : -87.84000 -87.83000 -87.82931 -87.82000 -87.81000
highest: -87.52838 -87.52836 -87.52823 -87.52823 -87.52000
--------------------------------------------------------------------------------
end_lat 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575287      468    94509        1     41.9  0.04935    41.80    41.86 
     .25      .50      .75      .90      .95 
   41.88    41.90    41.93    41.95    41.97 

lowest : 41.56000 41.57000 41.60000 41.62000 41.64000
highest: 42.06486 42.06497 42.07000 42.08000 42.11000
--------------------------------------------------------------------------------
end_lng 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  575287      468    92145        1   -87.65  0.03186   -87.70   -87.68 
     .25      .50      .75      .90      .95 
  -87.66   -87.64   -87.63   -87.62   -87.60 

lowest : -87.86000 -87.84000 -87.83000 -87.82000 -87.81000
highest: -87.52823 -87.52823 -87.52452 -87.52000 -87.50000
--------------------------------------------------------------------------------
member_casual 
       n  missing distinct 
  575755        0        2 
                        
Value      casual member
Frequency  254015 321740
Proportion  0.441  0.559
--------------------------------------------------------------------------------

It does give a quick view of every variable, but I’m not impressed: the result reads like the output of several functions, such as summary() and str(), stitched together, and the format is a little hard to read.
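
One way to make the output easier to digest is to describe only the columns you care about instead of the whole data frame. The sketch below uses a small made-up data frame (`toy`) in place of df_sample, so the values are purely illustrative.

```r
library(Hmisc)

# Toy stand-in for df_sample (hypothetical values)
toy <- data.frame(
  rideable_type = c("classic_bike", "electric_bike", "classic_bike", "docked_bike"),
  start_lat     = c(41.89, 41.94, 41.84, 41.98),
  member_casual = c("member", "casual", "member", "casual")
)

# Describe only the categorical columns of interest
d <- describe(toy[, c("rideable_type", "member_casual")])
print(d)

# Describe a single numeric variable
describe(toy$start_lat)
```

The same pattern should carry over to the real data, for example describe(df_sample[, c("rideable_type", "member_casual")]).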

GGally

library(GGally)

options(repr.plot.width = 14, repr.plot.height = 10)

# show the pairwise relationships among the selected variables,
# colored by rider type (member_casual)
df_sample %>% 
  select(rideable_type:ended_at, start_lat:end_lng, member_casual) %>%
  # plot only the first 7 columns; member_casual is kept just for the color mapping
  ggpairs(columns = 1:7, mapping = aes(color = member_casual, alpha = 0.5))

GGally Pairs Plot

The plot above suggests that the two rider types (casual and member) differ in their distributions of rideable_type, started_at, and ended_at. That’s a good starting point.
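
A visual impression like this is worth checking with numbers. Below is a minimal sketch that computes the share of each rideable_type within each rider type; the `toy` table is made up for illustration, and the same pipeline should apply to df_sample.

```r
library(dplyr)

# Toy stand-in for df_sample (hypothetical values)
toy <- tibble(
  member_casual = c("member", "member", "member", "member",
                    "casual", "casual", "casual", "casual"),
  rideable_type = c("classic_bike", "classic_bike", "electric_bike", "electric_bike",
                    "docked_bike", "electric_bike", "classic_bike", "electric_bike")
)

# Count rides per (rider type, bike type), then convert to within-group shares
type_share <- toy %>%
  count(member_casual, rideable_type) %>%
  group_by(member_casual) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

type_share
```

I use count() and mutate() here rather than summarize(), because Hmisc masks dplyr’s summarize() once it is loaded.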

DataExplorer

library(DataExplorer)

df_sample %>% create_report(
  # timestamp without spaces or colons, so the file name is valid on every OS
  output_file = paste0("Report_", format(Sys.time(), "%Y-%m-%d_%H-%M-%S"), ".html"),
  report_title = "EDA Report - Chicago Bike Share"
)

It generated an .html file in the working directory. Below is a screenshot:

Data Explorer EDA Report

That report is very informative. I like its Missing Data Profile and Bar Chart.
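
The sections of the report can also be produced one at a time, which is handy when you only need a single view. Here is a small sketch using DataExplorer’s profile_missing(), plot_missing(), and plot_bar() on a made-up data frame (`toy`), standing in for df_sample.

```r
library(DataExplorer)

# Toy stand-in for df_sample (hypothetical values)
toy <- data.frame(
  start_station_name = c("Wood St & Hubbard St", NA, "Clark St & Berwyn Ave", NA),
  rideable_type      = c("classic_bike", "electric_bike", "classic_bike", "docked_bike"),
  member_casual      = c("member", "casual", "member", "casual"),
  stringsAsFactors   = FALSE
)

# Tabular missing-value profile: one row per column
missing_tbl <- profile_missing(toy)
print(missing_tbl)

# The same information as a plot, followed by bar charts of the discrete columns
plot_missing(toy)
plot_bar(toy)
```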

SmartEDA

library(SmartEDA)

df_sample %>% ExpReport(op_file="SmartEDA_Report.html")

It also generated an .html file. Below is a screenshot:

SmartEDA Report

The report is very similar to DataExplorer’s, but I think its function ExpCatViz() (i.e., 5. Distributions of categorical variables in its report) is better than DataExplorer’s plot_bar() (i.e., Bar Chart (with frequency)), because the former shows percentages instead of raw frequencies.
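
To compare the two side by side outside the reports, both functions can be called directly. This is only a sketch on a made-up data frame (`toy`); I’m relying on the default arguments of ExpCatViz() and plot_bar(), and I’m assuming ExpCatViz() returns its bar charts as a list of plot objects.

```r
library(SmartEDA)
library(DataExplorer)

# Toy stand-in for df_sample (hypothetical values)
toy <- data.frame(
  rideable_type = c("classic_bike", "electric_bike", "classic_bike",
                    "docked_bike", "electric_bike", "classic_bike"),
  member_casual = c("member", "casual", "member", "casual", "casual", "member"),
  stringsAsFactors = FALSE
)

# SmartEDA: bar charts of the categorical variables, labelled with percentages
plots <- ExpCatViz(toy)
plots[[1]]

# DataExplorer: bar charts of the same variables, showing frequencies
plot_bar(toy)
```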

The end

All of these packages can help us get a general idea of the dataset. Both DataExplorer and SmartEDA can create a report with just one function call, which is super convenient, but none of them is the answer to all questions. We need to combine specific functions from these packages into our own EDA functions, and then dig deeper.
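
As a starting point, here is a rough sketch of what such a combined helper could look like. The name quick_eda() and its arguments are my own invention, not part of any package; it simply bundles DataExplorer’s profile_missing() with Hmisc’s describe() for a chosen set of columns, and the `toy` data frame is made up for illustration.

```r
library(DataExplorer)
library(Hmisc)

# quick_eda() is a hypothetical helper, not a function from any package
quick_eda <- function(data, cols = names(data)) {
  # Work on a plain data frame restricted to the requested columns
  selected <- as.data.frame(data)[, cols, drop = FALSE]
  list(
    missing = profile_missing(selected),  # missing counts and percentages per column
    summary = describe(selected)          # Hmisc per-variable summary
  )
}

# Toy stand-in for df_sample (hypothetical values)
toy <- data.frame(
  start_station_name = c("Wood St & Hubbard St", NA, "Clark St & Berwyn Ave", NA),
  rideable_type      = c("classic_bike", "electric_bike", "classic_bike", "docked_bike"),
  member_casual      = c("member", "casual", "member", "casual")
)

res <- quick_eda(toy, cols = c("start_station_name", "member_casual"))
res$missing
res$summary
```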