Using the tidyverse with Databases - Part I

SQL databases tidyverse dbplyr

You don’t need to be a SQL expert to work with Databases through R

Vebash Naidoo https://twitter.com/Sciencificity (R-Ladies Johannesburg)https://twitter.com/RLadiesJozi
2020-12-12

Background

On the 1st of December I certified as an RStudio Tidyverse Instructor. As part of the process I prepared a sample lesson that was delivered during the exam. In a series of posts I will go through an extension of that sample lesson.

Intro to using Databases in R, with Tidyverse tools (Part I)

Motivation

You often have data in a corporate or institutional database (often termed relational database management system, or RDBMS). The language to talk with such a database (DB) is some variant of the language SQL (Structured Query Language). I say some variant, since a vendor providing an RDBMS may adopt the standardised SQL, but they may also extend it with their own variations.

If you use different RDBMS’s from different vendors you need to understand how to query that DB using the particular SQL dialect used by that vendor. For example, we use PostgreSQL, and we also use PrestoDB (for Athena on AWS). JSON fields in the PostgreSQL DB are ingested as structured nested arrays in AWS and hence when querying the same field we use different query syntax.

You know R, especially the dplyr 📦. Even though the dplyr 📦 is so well written to mimic the SQL syntax - select(), group_by(), left_join() etc. there is still a cognitive load when you switch between using R syntax, and SQL syntax (ask me, who has often written == in SQL syntax on Athena only to wonder why I am getting an error 😶).

You only have so much memory in your local environment, and may want your RDBMS to do the heavy lifting (most of the computation), and only pull data into R when you need to (e.g. pull in aggregated data to create plots for a report).

In this tutorial you will learn how to use dbplyr, which is a database back-end of dplyr, to execute queries directly in your RDBMS all the while writing R tidyverse syntax 😮 ⭐.

Connecting to a Database

Packages needed

The database we’ll use

Alison Hill created a dataset for The Great British Bake Off. I used the data from her package to create a SQLite DB that we will use to practise on.

Connecting to a SQLite DB

For the purposes of this tutorial we will connect to a SQLite DB that I created. To talk to the DB we need to first make a connection.

  1. Load the DBI package: library(DBI)
  2. Make a connection: con <- dbConnect(RSQLite::SQLite(), "mydb.db")
library(DBI) # main DB interface
library(dplyr) 
library(dbplyr) # dplyr back-end for DBs

con <- dbConnect(drv = RSQLite::SQLite(), # give me a SQLite connection
        dbname = "data/great_brit_bakeoff.db") # To what? The DB named great_brit_bakeoff.db
summary(con) # What do we have? 
          Length            Class             Mode 
               1 SQLiteConnection               S4 

Ok, we have successfully managed to connect to the DB.

Take a look around

Alright, we’ve made a connection now what? Let’s have a look around using some DBI functions.

DBI Functions

dbListTables(con) # List me the tables at the connection
 [1] "baker_results"     "bakers"            "bakes"            
 [4] "challenge_results" "challenges"        "episode_results"  
 [7] "episodes"          "ratings"           "ratings_seasons"  
[10] "results"           "seasons"           "series"           
dbListFields(con, # Reach into my connection and ...
      "bakers")   # tell me what fields does the `bakers` table have?
[1] "series"     "baker_full" "age"        "occupation" "hometown"  
res <- dbSendQuery(con, "SELECT * FROM bakers LIMIT 3") # Execute a query
dbFetch(res) # get the result
  series          baker_full age                        occupation
1      1       Annetha Mills  30                           Midwife
2      1      David Chambers  31                      Entrepreneur
3      1 Edward "Edd" Kimber  24 Debt collector for Yorkshire Bank
       hometown
1         Essex
2 Milton Keynes
3      Bradford
dbClearResult(res) # clear the result

dplyr Functions

Ok, let’s flex our dplyr skills 😄.

tbl(con, "bakers") # let's have a look - works like glimpse()
# Source:   table<bakers> [?? x 5]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
   series baker_full                age occupation         hometown   
    <dbl> <chr>                   <dbl> <chr>              <chr>      
 1      1 "Annetha Mills"            30 Midwife            Essex      
 2      1 "David Chambers"           31 Entrepreneur       Milton Key~
 3      1 "Edward \"Edd\" Kimber"    24 Debt collector fo~ Bradford   
 4      1 "Jasminder Randhawa"       45 Assistant Credit ~ Birmingham 
 5      1 "Jonathan Shepherd"        25 Research Analyst   St Albans  
 6      1 "Lea Harris"               51 Retired            Midlothian~
 7      1 "Louise Brimelow"          44 Police Officer     Manchester 
 8      1 "Mark Whithers"            48 Bus Driver         South Wales
 9      1 "Miranda Gore Browne"      37 Food buyer for Ma~ Midhurst, ~
10      1 "Ruth Clemens"             31 Retail manager/Ho~ Poynton, C~
# ... with more rows
tbl(con, "bakers") %>% 
  head(3) # "SELECT * FROM bakers LIMIT 3"
# Source:   lazy query [?? x 5]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
  series baker_full                age occupation           hometown  
   <dbl> <chr>                   <dbl> <chr>                <chr>     
1      1 "Annetha Mills"            30 Midwife              Essex     
2      1 "David Chambers"           31 Entrepreneur         Milton Ke~
3      1 "Edward \"Edd\" Kimber"    24 Debt collector for ~ Bradford  

Notice that each time we used con. We use our connection con to “talk” to our database, and we will use it throughout, even in our tidy pipelines.

In baker_results we have the baker with their details, as well as their standing in the series competition. Let’s say we wanted to know if the winners came from different areas in Britain, or from some areas in particular.

dbListFields(con, "baker_results")
 [1] "series"                    "baker_full"               
 [3] "baker"                     "age"                      
 [5] "occupation"                "hometown"                 
 [7] "baker_last"                "baker_first"              
 [9] "star_baker"                "technical_winner"         
[11] "technical_top3"            "technical_bottom"         
[13] "technical_highest"         "technical_lowest"         
[15] "technical_median"          "series_winner"            
[17] "series_runner_up"          "total_episodes_appeared"  
[19] "first_date_appeared"       "last_date_appeared"       
[21] "first_date_us"             "last_date_us"             
[23] "percent_episodes_appeared" "percent_technical_top3"   
tbl(con, "baker_results") %>% 
  select(series, baker, hometown, series_winner) # normal dplyr select
# Source:   lazy query [?? x 4]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
   series baker     hometown              series_winner
    <dbl> <chr>     <chr>                         <int>
 1      1 Annetha   Essex                             0
 2      1 David     Milton Keynes                     0
 3      1 Edd       Bradford                          1
 4      1 Jasminder Birmingham                        0
 5      1 Jonathan  St Albans                         0
 6      1 Lea       Midlothian, Scotland              0
 7      1 Louise    Manchester                        0
 8      1 Mark      South Wales                       0
 9      1 Miranda   Midhurst, West Sussex             0
10      1 Ruth      Poynton, Cheshire                 0
# ... with more rows

Notice how our dplyr select() has trimmed down the number of columns from 24 to 4.

Time to see if our winning bakers hail from similar hometowns or not.

tbl(con, "baker_results") %>% 
  select(series, baker, hometown, series_winner) %>% 
  filter(series_winner == 1) %>% # normal dplyr filter
  count(hometown, sort = TRUE)   # normal dplyr count
# Source:     lazy query [?? x 2]
# Database:   sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
#   Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
# Ordered by: desc(n)
  hometown                              n
  <chr>                             <int>
1 Wigan                                 1
2 West Molesey, Surrey                  1
3 Ongar, Essex                          1
4 Market Harborough, Leicestershire     1
5 Leeds / Luton                         1
6 Bradford                              1
7 Barton-Upon-Humber, Lincolnshire      1
8 Barton-Le-Clay, Bedfordshire          1

Looks like our winners are from very different areas.

Have a look at the SQL behind the scenes

Our dplyr syntax is converted behind the scenes to SQL which is executed directly on the DB table.

tbl(con, "baker_results") %>% 
  select(series, baker, hometown, series_winner) %>% 
  filter(series_winner == 1) %>% 
  count(hometown, sort = TRUE) %>% 
  show_query()
<SQL>
SELECT `hometown`, COUNT(*) AS `n`
FROM (SELECT `series`, `baker`, `hometown`, `series_winner`
FROM `baker_results`)
WHERE (`series_winner` = 1.0)
GROUP BY `hometown`
ORDER BY `n` DESC

That’s pretty nifty, and helps you learn SQL in the process - it is always good to know, and read SQL, which is used extensively in work and institutional environments.

What’s with this lazy query / ?? we’re seeing?

You might have noticed that our pipelines have a:

in the output results.

Why is this?

Done? Remember to disconnect!

Good housekeeping means always remembering to disconnect once you’re done.

dbDisconnect(con) # closes our DB connection

To see that is indeed the case let’s see if we can use the connection further …

Error: Invalid or closed connection

Slides

If you’d like to dive in please check out the slides, and the project on GitHub.

Still to come

Acknowledgements

Citation

For attribution, please cite this work as

Naidoo (2020, Dec. 12). Sciencificity's Blog: Using the tidyverse with Databases - Part I. Retrieved from https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/

BibTeX citation

@misc{naidoo2020using,
  author = {Naidoo, Vebash},
  title = {Sciencificity's Blog: Using the tidyverse with Databases - Part I},
  url = {https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/},
  year = {2020}
}