Sciencificity's Blog: Using the tidyverse with Databases - Part II

Vebash Naidoo

The project on GitHub has the example SQLite database, the slides, and some code files.

Connect, and remind ourselves what we’re working with

Make a connection

library(DBI) # main DB interface
library(dplyr) 
library(dbplyr) # dplyr back-end for DBs

con <- dbConnect(drv = RSQLite::SQLite(), # give me a SQLite connection
        dbname = "data/great_brit_bakeoff.db") # To what? The DB named great_brit_bakeoff.db

dbListTables(con) # List me the tables at the connection

Let’s get familiar with our data

results

tbl(con, "results") %>% # Reach into my connection, and "talk" to results table
  head(10) %>%          # get me a subset of the data
  # sometimes if there are many columns, some columns are hidden, 
  # this option prints all columns for us
  print(width = Inf)

baker_results

tbl(con, "baker_results") %>% # Reach in and "talk" to baker_results
  head() %>%                  # get a glimpse of data
  collect() %>%               # bring that glimpsed data into R 
  DT::datatable(options = list(scrollX = TRUE)) # force DT horizontal scrollbar

	series	baker_full	baker	age	occupation	hometown	baker_last	baker_first	technical_winner	technical_top3	technical_bottom	technical_highest	technical_lowest	technical_median	series_winner	total_episodes_appeared	first_date_appeared	last_date_appeared	percent_episodes_appeared	percent_technical_top3
1	1	Annetha Mills	Annetha	30	Single mother	Essex	Mills	Annetha	0	1	1	2	7	4.5	0	2	14838	14845	33.3333333333333	50
2	1	David Chambers	David	31	Entrepreneur	Milton Keynes	Chambers	David	0	1	3	3	8	4.5	0	4	14838	14859	66.6666666666667	25
3	1	Edward "Edd" Kimber	Edd	24	Debt collector for Yorkshire Bank	Bradford	Kimber	Edward	2	4	1	1	6	2	1	6	14838	14873	100	66.6666666666667
4	1	Jasminder Randhawa	Jasminder	45	Assistant Credit Control Manager	Birmingham	Randhawa	Jasminder	0	2	2	2	5	3	0	5	14838	14866	83.3333333333333	40
5	1	Jonathan Shepherd	Jonathan	25	Research Analyst	St Albans	Shepherd	Jonathan	1	1	2	1	9	6	0	3	14838	14852	50	33.3333333333333
6	1	Lea Harris	Lea	51	Retired	Midlothian, Scotland	Harris	Lea	0	0	1	10	10	10	0	1	14838	14838	16.6666666666667	0


Notice the use of the collect() function in the code above. I wanted us to get a full glimpse of the data in a nice table, so I brought the first few rows into R with collect(). This allowed me to use datatable to display the results a bit better than the print(width = Inf) alternative.
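To make the effect of collect() concrete, here is a minimal sketch. It assumes a toy in-memory SQLite database with made-up rows (the baker names and values are hypothetical) so that it runs without the project's .db file. It compares the class of the object before and after collecting.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the bake-off database (hypothetical, made-up rows)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "baker_results",
             data.frame(series = c(1, 1),
                        baker  = c("baker_a", "baker_b"),
                        age    = c(30, 40)))

lazy_tbl  <- tbl(con, "baker_results") %>% head() # still a reference to the DB
local_tbl <- lazy_tbl %>% collect()               # rows are now pulled into R

class(lazy_tbl)   # includes "tbl_lazy"
class(local_tbl)  # a regular tibble, includes "tbl_df"
nrow(local_tbl)   # the row count is known once the data is local

dbDisconnect(con)
```

Only the collected tibble can be handed to functions that expect data in memory, such as DT::datatable().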

What are we interested in?

Let’s say we want to see how the WINNER and RUNNER-UP(s) did in the series they appeared in.

To do that we need to get all the baker_results for the WINNER and RUNNER-UP.

Joining data

When doing joins we want to find the common columns across the two tables that we can join on.
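One way to find those common columns is to compare the field names of the two tables. The sketch below assumes a toy in-memory database with a made-up schema; with the real database the result depends on the columns that actually exist there.

```r
library(DBI)

# Toy stand-in for the bake-off database (hypothetical, made-up rows)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "baker_results",
             data.frame(series = 1, baker = "baker_a", age = 30))
dbWriteTable(con, "results",
             data.frame(series = 1, baker = "baker_a",
                        episode = 1, result = "IN"))

# Column names that appear in both tables are candidates for the join key
common_cols <- intersect(dbListFields(con, "baker_results"),
                         dbListFields(con, "results"))
common_cols

dbDisconnect(con)
```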

Remember the tbl(con, "tbl_name") always

I’d like to bring to your attention the use of tbl(con, "table_1") and tbl(con, "table_2") in the join function.

We must always keep this in mind, because baker_results and results don’t exist in R yet. We’re talking to those tables in our relational database management system (RDBMS), so we always have to do so through our connection.

set.seed(42)
tbl(con, "baker_results") %>% # use connection to "talk" to baker_results
  inner_join(tbl(con, "results"), # use connection to "talk" to results and join both tables 
        by = c('baker' = 'baker',
               'series' = 'series')) %>% # join criteria 
  collect() %>% # get it into R
  sample_n(size = 3) %>% # take a random sample
  print(width = Inf) # print all columns

Notice that all columns of baker_results appear first and then we have the “extra” columns from results i.e. episode and result.

Common mistake

I included the above to show that each time we “talk” to a table we must do so through our connection. I often make the mistake of leaving out the tbl(con, "name_of_tbl_i_am_joining") in the join function; more times than I care to admit 🤦‍♀️, I write the join incorrectly.

I would like to help you avoid repeating my mistake 😕, so heads up, AVOID THE FOLLOWING 🛑:

tbl(con, "baker_results") %>% # use connection to "talk" to baker_results
  inner_join(results,  # OOPS! I forgot the tbl(con, "results")
        by = c('baker' = 'baker',
               'series' = 'series'))
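To see what goes wrong, here is a minimal sketch on a toy in-memory database (made-up rows, hypothetical baker names). It assumes no object called results exists in the R session, which is the situation described above. The bare name fails, while the version that goes through the connection works.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the bake-off database (hypothetical, made-up rows)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "baker_results",
             data.frame(series = c(1, 1),
                        baker  = c("baker_a", "baker_b"),
                        age    = c(30, 40)))
dbWriteTable(con, "results",
             data.frame(series  = c(1, 1),
                        baker   = c("baker_a", "baker_b"),
                        episode = c(6, 1),
                        result  = c("WINNER", "OUT")))

# The mistake: `results` is not reachable from R, so the join errors out.
# tryCatch() captures the error instead of stopping the script.
bad <- tryCatch(
  tbl(con, "baker_results") %>%
    inner_join(results, by = c("baker" = "baker", "series" = "series")),
  error = function(e) e
)
inherits(bad, "error")

# The fix: reach the second table through the connection as well
good <- tbl(con, "baker_results") %>%
  inner_join(tbl(con, "results"),
             by = c("baker" = "baker", "series" = "series")) %>%
  collect()

dbDisconnect(con)
```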

Collect

Ok, let us now do our entire pipeline, and only bring the data into R when we’ve got what we’re looking for.

(final_query <- tbl(con, "baker_results") %>% # use connection to "talk" to baker_results
  inner_join(tbl(con, "results"), # use connection to "talk" to results and join both tables 
        by = c('baker' = 'baker',
               'series' = 'series')) %>% # join criteria 
  filter(result %in% c("WINNER", "RUNNER-UP")) %>% # filter rows we're interested in
  select(series, baker:percent_technical_top3,
         result))

The above code just sets up the query that will be executed should we run (Ctrl + Enter) final_query in R (hence the lazy query [?? x 24] in the output). No data is collected (i.e. present in your local R environment) as yet.

What does the query look like?
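One way to inspect the SQL that dbplyr generates for a lazy query is show_query(). The sketch below rebuilds a smaller version of the pipeline on a toy in-memory database (made-up rows, hypothetical baker names), so the SQL it prints is for the toy schema, not the real one.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the bake-off database (hypothetical, made-up rows)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "baker_results",
             data.frame(series = c(1, 1),
                        baker  = c("baker_a", "baker_b"),
                        age    = c(30, 40)))
dbWriteTable(con, "results",
             data.frame(series  = c(1, 1),
                        baker   = c("baker_a", "baker_b"),
                        episode = c(6, 1),
                        result  = c("WINNER", "OUT")))

toy_query <- tbl(con, "baker_results") %>%
  inner_join(tbl(con, "results"),
             by = c("baker" = "baker", "series" = "series")) %>%
  filter(result %in% c("WINNER", "RUNNER-UP"))

toy_query %>% show_query()             # prints the SQL; no rows are collected

sql_text <- remote_query(toy_query)    # the same SQL, returned as an object

dbDisconnect(con)
```

With the real connection, the equivalent call would be final_query %>% show_query().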

Bring it into R

Now finally, we are ready to bring our filtered and joined data into R by using collect().
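A minimal sketch of this step, again on a toy in-memory database with made-up rows. The name top_performers is taken from the plotting code later in this post; everything else about the toy data is hypothetical.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the bake-off database (hypothetical, made-up rows)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "baker_results",
             data.frame(series = c(1, 1, 1),
                        baker  = c("baker_a", "baker_b", "baker_c"),
                        age    = c(30, 40, 50)))
dbWriteTable(con, "results",
             data.frame(series  = c(1, 1, 1),
                        baker   = c("baker_a", "baker_b", "baker_c"),
                        episode = c(6, 6, 1),
                        result  = c("WINNER", "RUNNER-UP", "OUT")))

toy_query <- tbl(con, "baker_results") %>%
  inner_join(tbl(con, "results"),
             by = c("baker" = "baker", "series" = "series")) %>%
  filter(result %in% c("WINNER", "RUNNER-UP"))

# collect() runs the query in the database and returns the rows as a tibble
top_performers <- toy_query %>% collect()
top_performers

dbDisconnect(con)
```

Against the real database, the same call on final_query (top_performers <- final_query %>% collect()) brings the filtered, joined rows into R.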

How about that? Notice the A tibble: 24 x 24! R now has the data in its local environment, and can definitively tell us there are 24 observations (no more lazy query) 😄.

Visualise Data

Now that we have finalised what data we wanted from our RDBMS, executed our query, and collected the data into our R environment we can do further processing, create plots for reports etc.

I am interested in understanding how the winner and runner-up(s) of series 6 did across the season in terms of technical challenges etc.

library(tidyverse)

top_performers %>% 
  # filter for season we're interested in
  filter(series == 6) %>%
  # format baker nicely so we see winner, then runner-up(s)
  mutate(baker_name = factor(str_glue("{result} - {baker}")),
         baker_name = fct_rev(baker_name)) %>% 
  # let's convert all the tech info cols to be a metric name, 
  # and put the value in the value column 
  # (by default values_to = "value" in pivot_longer())
  pivot_longer(cols = c(star_baker:technical_median),
               names_to = "metric") %>% 
  mutate(metric = fct_reorder(metric, value)) %>% 
  ggplot(aes(x = value, y = metric)) +
  geom_col(fill = "#727d97") +
  facet_wrap(~ baker_name) +
  labs(title = str_glue("Metrics for Season ",
    "{top_performers %>%  filter(series == 6) %>%
    select(series) %>% distinct()}'s Winner and Runner-Up(s)"),
    y = "") +
  theme_light() +
  theme(panel.spacing = unit(1, "lines")) +
  theme(strip.background = element_rect(fill = "#f4e4e7")) +
  theme(strip.text = element_text(colour = "#5196b4"))

Given that Nadiya was a technical winner more times than the other contestants, and that her technical_lowest was better (higher number is better), it looks like she had a good run throughout the series and was a deserved winner.
