Sciencificity's Blog: Using the tidyverse with Databases - Part I

Vebash Naidoo

Background

On the 1^st of December I certified as an RStudio Tidyverse Instructor. As part of the process I prepared a sample lesson that was delivered during the exam. In a series of posts I will go through an extension of that sample lesson.

Intro to using Databases in R, with Tidyverse tools (Part I)

Motivation

You often have data in a corporate or institutional database (often termed relational database management system, or RDBMS). The language to talk with such a database (DB) is some variant of the language SQL (Structured Query Language). I say some variant, since a vendor providing an RDBMS may adopt the standardised SQL, but they may also extend it with their own variations.

If you use different RDBMS’s from different vendors you need to understand how to query that DB using the particular SQL dialect used by that vendor. For example, we use PostgreSQL, and we also use PrestoDB (for Athena on AWS). JSON fields in the PostgreSQL DB are ingested as structured nested arrays in AWS and hence when querying the same field we use different query syntax.

You know R, especially the dplyr 📦. Even though the dplyr 📦 is so well written to mimic the SQL syntax - select(), group_by(), left_join() etc. there is still a cognitive load when you switch between using R syntax, and SQL syntax (ask me, who has often written == in SQL syntax on Athena only to wonder why I am getting an error 😶).

You only have so much memory in your local environment, and may want your RDBMS to do the heavy lifting (most of the computation), and only pull data into R when you need to (e.g. pull in aggregated data to create plots for a report).

In this tutorial you will learn how to use dbplyr, which is a database back-end of dplyr, to execute queries directly in your RDBMS all the while writing R tidyverse syntax 😮 ⭐.

Connecting to a Database

Packages needed

DBI: This is the main package which is a Database Interface for R.
install.packages("DBI")
dbplyr: The 📦 we may use to talk to our database, all the while using dplyr syntax in our code.
install.packages("dbplyr")
dplyr: To select(), group_by() as well as “talk”, via the connection we will establish, to a database table.
RSQLite: DBI compliant package that allows a connection to a lightweight single user SQLite DB.
install.packages("RSQLite")
Other DBI compliant R packages you may need for your use case - e.g. RPostgres is for connecting to a PostgreSQL RDBMS. Note: Not used in this tutorial (brief overview in slides), and RStudio’s DB Website is a treasure trove of information 👩‍💻.
odbc: This is a DBI compliant interface to allow you to interface to your RDBMS using an odbc driver.Note: Not used in this tutorial (brief overview in slides), and again RStudio’s DB Website has useful information in this regard too!
install.packages("odbc")

The database we’ll use

Alison Hill created a dataset for The Great British Bake Off. I used the data from her package to create a SQLite DB that we will use to practise on.

Connecting to a SQLite DB

For the purposes of this tutorial we will connect to a SQLite DB that I created. To talk to the DB we need to first make a connection.

Load the DBI package: library(DBI)
Make a connection: con <- dbConnect(RSQLite::SQLite(), "mydb.db")

library(DBI) # main DB interface
library(dplyr) 
library(dbplyr) # dplyr back-end for DBs

con <- dbConnect(drv = RSQLite::SQLite(), # give me a SQLite connection
        dbname = "data/great_brit_bakeoff.db") # To what? The DB named great_brit_bakeoff.db
summary(con) # What do we have?

          Length            Class             Mode 
               1 SQLiteConnection               S4

Ok, we have successfully managed to connect to the DB.

Take a look around

Alright, we’ve made a connection now what? Let’s have a look around using some DBI functions.

DBI Functions

dbListTables(con) # List me the tables at the connection

 [1] "baker_results"     "bakers"            "bakes"            
 [4] "challenge_results" "challenges"        "episode_results"  
 [7] "episodes"          "ratings"           "ratings_seasons"  
[10] "results"           "seasons"           "series"

dbListFields(con, # Reach into my connection and ...
      "bakers")   # tell me what fields does the `bakers` table have?

[1] "series"     "baker_full" "age"        "occupation" "hometown"

res <- dbSendQuery(con, "SELECT * FROM bakers LIMIT 3") # Execute a query
dbFetch(res) # get the result

  series          baker_full age                        occupation
1      1       Annetha Mills  30                           Midwife
2      1      David Chambers  31                      Entrepreneur
3      1 Edward "Edd" Kimber  24 Debt collector for Yorkshire Bank
       hometown
1         Essex
2 Milton Keynes
3      Bradford

dbClearResult(res) # clear the result

`dplyr` Functions

Ok, let’s flex our dplyr skills 😄.

tbl(con, "name_of_table") says using my connection “con” go execute a SELECT * FROM name_of_table.

tbl(con, "bakers") # let's have a look - works like glimpse()

# Source:   table<bakers> [?? x 5]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
   series baker_full                age occupation         hometown   
    <dbl> <chr>                   <dbl> <chr>              <chr>      
 1      1 "Annetha Mills"            30 Midwife            Essex      
 2      1 "David Chambers"           31 Entrepreneur       Milton Key~
 3      1 "Edward \"Edd\" Kimber"    24 Debt collector fo~ Bradford   
 4      1 "Jasminder Randhawa"       45 Assistant Credit ~ Birmingham 
 5      1 "Jonathan Shepherd"        25 Research Analyst   St Albans  
 6      1 "Lea Harris"               51 Retired            Midlothian~
 7      1 "Louise Brimelow"          44 Police Officer     Manchester 
 8      1 "Mark Whithers"            48 Bus Driver         South Wales
 9      1 "Miranda Gore Browne"      37 Food buyer for Ma~ Midhurst, ~
10      1 "Ruth Clemens"             31 Retail manager/Ho~ Poynton, C~
# ... with more rows

tbl(con, "bakers") %>% 
  head(3) # "SELECT * FROM bakers LIMIT 3"

# Source:   lazy query [?? x 5]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
  series baker_full                age occupation           hometown  
   <dbl> <chr>                   <dbl> <chr>                <chr>     
1      1 "Annetha Mills"            30 Midwife              Essex     
2      1 "David Chambers"           31 Entrepreneur         Milton Ke~
3      1 "Edward \"Edd\" Kimber"    24 Debt collector for ~ Bradford

Notice that each time we used con. We use our connection con to “talk” to our database, and we will use it throughout, even in our tidy pipelines.

In baker_results we have the baker with their details, as well as their standing in the series competition. Let’s say we wanted to know if the winners came from different areas in Britain, or from some areas in particular.

dbListFields(con, "baker_results")

 [1] "series"                    "baker_full"               
 [3] "baker"                     "age"                      
 [5] "occupation"                "hometown"                 
 [7] "baker_last"                "baker_first"              
 [9] "star_baker"                "technical_winner"         
[11] "technical_top3"            "technical_bottom"         
[13] "technical_highest"         "technical_lowest"         
[15] "technical_median"          "series_winner"            
[17] "series_runner_up"          "total_episodes_appeared"  
[19] "first_date_appeared"       "last_date_appeared"       
[21] "first_date_us"             "last_date_us"             
[23] "percent_episodes_appeared" "percent_technical_top3"

tbl(con, "baker_results") %>% 
  select(series, baker, hometown, series_winner) # normal dplyr select

# Source:   lazy query [?? x 4]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
   series baker     hometown              series_winner
    <dbl> <chr>     <chr>                         <int>
 1      1 Annetha   Essex                             0
 2      1 David     Milton Keynes                     0
 3      1 Edd       Bradford                          1
 4      1 Jasminder Birmingham                        0
 5      1 Jonathan  St Albans                         0
 6      1 Lea       Midlothian, Scotland              0
 7      1 Louise    Manchester                        0
 8      1 Mark      South Wales                       0
 9      1 Miranda   Midhurst, West Sussex             0
10      1 Ruth      Poynton, Cheshire                 0
# ... with more rows

Notice how our dplyr select() has trimmed down the number of columns from 24 to 4.

Time to see if our winning bakers hail from similar hometowns or not.

tbl(con, "baker_results") %>% 
  select(series, baker, hometown, series_winner) %>% 
  filter(series_winner == 1) %>% # normal dplyr filter
  count(hometown, sort = TRUE)   # normal dplyr count

# Source:     lazy query [?? x 2]
# Database:   sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
# Ordered by: desc(n)
  hometown                              n
  <chr>                             <int>
1 Wigan                                 1
2 West Molesey, Surrey                  1
3 Ongar, Essex                          1
4 Market Harborough, Leicestershire     1
5 Leeds / Luton                         1
6 Bradford                              1
7 Barton-Upon-Humber, Lincolnshire      1
8 Barton-Le-Clay, Bedfordshire          1

Looks like our winners are from very different areas.

Have a look at the SQL behind the scenes

Our dplyr syntax is converted behind the scenes to SQL which is executed directly on the DB table.

tbl(con, "baker_results") %>% 
  select(series, baker, hometown, series_winner) %>% 
  filter(series_winner == 1) %>% 
  count(hometown, sort = TRUE) %>% 
  show_query()

<SQL>
SELECT `hometown`, COUNT(*) AS `n`
FROM (SELECT `series`, `baker`, `hometown`, `series_winner`
FROM `baker_results`)
WHERE (`series_winner` = 1.0)
GROUP BY `hometown`
ORDER BY `n` DESC

That’s pretty nifty, and helps you learn SQL in the process - it is always good to know, and read SQL, which is used extensively in work and institutional environments.

What’s with this `lazy query` / `??` we’re seeing?

You might have noticed that our pipelines have a:

Source: table [?? x 5] OR
Source: lazy query [?? x 4]

in the output results.

Why is this?

Behind the scenes the dplyr code we write, gets converted to SQL (the particular dialect of the DB we’re connecting to).
The SQL is executed directly on the DB. I.e. the data is not first brought into R and manipulated in R. Instead we send the query to the DB, and the DB does the computation.
When we use a dplyr pipeline to execute a query on the DB, the DB does the computation and sends us back a glimpse of the executed end result.
It is for this reason, we see the ?? - we’re being told “Hey, I executed this query (your pipeline converted to SQL) on the DB at connection con, and here’s a snippet of the output, but I did not get the meta-info on how many rows were produced, I just know that I got x_number of columns in the output.”
We will see in the next part of the tutorial how to get the info into R, and understand how many rows were produced etc.

Done? Remember to disconnect!

Good housekeeping means always remembering to disconnect once you’re done.

dbDisconnect(con) # closes our DB connection

To see that is indeed the case let’s see if we can use the connection further …

dbListTables(con)

Error: Invalid or closed connection

Slides

If you’d like to dive in please check out the slides, and the project on GitHub.

Still to come

More dplyr, for example joining tables.
Getting the data into R.

Acknowledgements

Slide inspiration, as well as some of the background pictures are from Dr. Alison Hill.
The Great British Bake Off data from Dr. Alison Hill.
Edgar Ruiz’s database work, and teachings.

Using the tidyverse with Databases - Part I