The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!
The titanic dataset is available in base R. The data has 5 variables and only 32 rows. Each row does NOT represent an observation. It is not tidy, instead the data set contains Frequencies! And it is not a data frame, therefore we we need to convert it first.
library(dplyr)
library(tibble)
library(explore)
titanic <- as_tibble(Titanic)titanic %>% describe_tbl(n = n)
#> 2 201 (2.2k) observations with 5 variables
#> 0 observations containing missings (NA)
#> 0 variables containing missings (NA)
#> 0 variables with no variancetitanic %>% describe()
#> # A tibble: 5 x 8
#>   variable type     na na_pct unique   min  mean   max
#>   <chr>    <chr> <int>  <dbl>  <int> <dbl> <dbl> <dbl>
#> 1 Class    chr       0      0      4    NA  NA      NA
#> 2 Sex      chr       0      0      2    NA  NA      NA
#> 3 Age      chr       0      0      2    NA  NA      NA
#> 4 Survived chr       0      0      2    NA  NA      NA
#> 5 n        dbl       0      0     22     0  68.8   670All variables are categorical except n, representing the number of observations.
The data look like this:
titanic %>% head(10)
#> # A tibble: 10 x 5
#>    Class Sex    Age   Survived     n
#>    <chr> <chr>  <chr> <chr>    <dbl>
#>  1 1st   Male   Child No           0
#>  2 2nd   Male   Child No           0
#>  3 3rd   Male   Child No          35
#>  4 Crew  Male   Child No           0
#>  5 1st   Female Child No           0
#>  6 2nd   Female Child No           0
#>  7 3rd   Female Child No          17
#>  8 Crew  Female Child No           0
#>  9 1st   Male   Adult No         118
#> 10 2nd   Male   Adult No         154As the normal explore() function of the {explore} package expects a tidy dataset (each row is an observation), we need add the parameter n (number of observations)
titanic %>% explore(Class, n = n)We get the exact numbers by using describe() together with the n-parameter (weight)
titanic %>% describe(Class, n = n)
#> variable = Class
#> type     = character
#> na       = 0 of 2 201 (0%)
#> unique   = 4
#>  1st     = 325 (14.8%)
#>  2nd     = 285 (12.9%)
#>  3rd     = 706 (32.1%)
#>  Crew    = 885 (40.2%)To explore all variables, we can simply use explore_all(). You automatically fit the height of the plot using fig.height=total_fig_height(titanic, var_name_n = "n") in the code chunk header.
titanic %>% explore_all(n = n)Now we want to check the relation between variables and Survived. We can use the explore() function with Survived as target.
titanic %>% explore(Class, target = Survived, n = n, split = FALSE)To get a better feeling of the relationship between Class and Survived, we switch to percentage and split the target into sperate bars. We can do that by using split = TRUE (which is default).
titanic %>% explore(Class, target = Survived, n = n, split = TRUE)Now we get a plot, where each color sum to 100%. So a big difference in bar length indicates an important relationship between the two variables. In this case, passengers of 1st Class had the highest probability to survive.
titanic %>% explore(Sex, target = Survived, n = n)Female are much more likely to survive!
titanic %>% explore(Age, target = Survived, n = n)Child had an advantage to survive.
Now we can create a simple decision tree. As we have count-data we need to pass parameter n.
titanic %>% explain_tree(target = Survived, n = n)We see that Sex and Class can give a good explanation who are more likely to survive.
titanic %>% explore(Age, target = Class, n = n)Child are unlikely in the 1st class! And all Crew members are adult as expected.
titanic %>% explore(Sex, target = Class, n = n)Almost no female Crew members! Female tend to have better Class!