You don’t need to be a SQL expert to work with Databases through R
On the 1st of December I certified as an RStudio Tidyverse Instructor. As part of the process I prepared a sample lesson that was delivered during the exam. In a series of posts I will go through an extension of that sample lesson.
You often have data in a corporate or institutional database (often termed a relational database management system, or RDBMS). The language used to talk with such a database (DB) is some variant of SQL (Structured Query Language). I say some variant, since a vendor providing an RDBMS may adopt the standardised SQL, but they may also extend it with their own variations.
If you use RDBMSs from different vendors you need to understand how to query each DB using that vendor’s particular SQL dialect. For example, we use PostgreSQL, and we also use PrestoDB (for Athena on AWS). JSON fields in the PostgreSQL DB are ingested as structured nested arrays in AWS, and hence when querying the same field we use different query syntax.
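As a small taste of how dialects differ, dbplyr (which we will meet shortly) can show you the SQL that one and the same R expression gets translated to for different back-ends. This is only a sketch - simulate_postgres() and simulate_sqlite() stand in for real connections, and the exact translations depend on your dbplyr version:
library(dbplyr)
# the same R expression, rendered in two different SQL dialects
translate_sql(paste(baker_first, baker_last), con = simulate_postgres())
translate_sql(paste(baker_first, baker_last), con = simulate_sqlite())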
You know R, especially the dplyr 📦. Even though the dplyr 📦 is so well written that it mimics SQL syntax - select(), group_by(), left_join(), etc. - there is still a cognitive load when you switch between using R syntax and SQL syntax (ask me, who has often written == in SQL on Athena, only to wonder why I am getting an error 😶).
You only have so much memory in your local environment, so you may want your RDBMS to do the heavy lifting (most of the computation) and only pull data into R when you need to (e.g. pull in aggregated data to create plots for a report).
In this tutorial you will learn how to use dbplyr, the database back-end for dplyr, to execute queries directly in your RDBMS, all the while writing R tidyverse syntax 😮 ⭐.
install.packages("DBI")
dplyr
syntax in our code.install.packages("dbplyr")
select()
, group_by()
as well as “talk”, via the connection we will establish, to a database table.install.packages("RSQLite")
RPostgres
is for connecting to a PostgreSQL RDBMS. Note: Not used in this tutorial (brief overview in slides), and RStudio’s DB Website is a treasure trove of information 👩💻.install.packages("odbc")
Alison Hill created a dataset for The Great British Bake Off. I used the data from her package to create a SQLite DB that we will use to practise on.
For the purposes of this tutorial we will connect to a SQLite DB that I created. To talk to the DB we need to first make a connection.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "mydb.db")
library(DBI) # main DB interface
library(dplyr)
library(dbplyr) # dplyr back-end for DBs
con <- dbConnect(drv = RSQLite::SQLite(), # give me a SQLite connection
dbname = "data/great_brit_bakeoff.db") # To what? The DB named great_brit_bakeoff.db
summary(con) # What do we have?
Length Class Mode
1 SQLiteConnection S4
Ok, we have successfully managed to connect to the DB.
Alright, we’ve made a connection now what? Let’s have a look around using some DBI functions.
dbListTables(con) # List me the tables at the connection
[1] "baker_results" "bakers" "bakes"
[4] "challenge_results" "challenges" "episode_results"
[7] "episodes" "ratings" "ratings_seasons"
[10] "results" "seasons" "series"
dbListFields(con, # Reach into my connection and ...
"bakers") # tell me what fields does the `bakers` table have?
[1] "series" "baker_full" "age" "occupation" "hometown"
res <- dbSendQuery(con, "SELECT * FROM bakers LIMIT 3") # Execute a query
dbFetch(res) # get the result
series baker_full age occupation
1 1 Annetha Mills 30 Midwife
2 1 David Chambers 31 Entrepreneur
3 1 Edward "Edd" Kimber 24 Debt collector for Yorkshire Bank
hometown
1 Essex
2 Milton Keynes
3 Bradford
dbClearResult(res) # clear the result
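If you don’t need to hang on to the result set yourself, DBI’s dbGetQuery() wraps the send, fetch and clear steps into a single call - a small convenience sketch:
dbGetQuery(con, "SELECT * FROM bakers LIMIT 3") # send + fetch + clear in one go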
dplyr Functions
Ok, let’s flex our dplyr skills 😄.
tbl(con, "name_of_table")
says using my connection “con” go execute a SELECT * FROM name_of_table
.tbl(con, "bakers") # let's have a look - works like glimpse()
# Source: table<bakers> [?? x 5]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
# Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
series baker_full age occupation hometown
<dbl> <chr> <dbl> <chr> <chr>
1 1 "Annetha Mills" 30 Midwife Essex
2 1 "David Chambers" 31 Entrepreneur Milton Key~
3 1 "Edward \"Edd\" Kimber" 24 Debt collector fo~ Bradford
4 1 "Jasminder Randhawa" 45 Assistant Credit ~ Birmingham
5 1 "Jonathan Shepherd" 25 Research Analyst St Albans
6 1 "Lea Harris" 51 Retired Midlothian~
7 1 "Louise Brimelow" 44 Police Officer Manchester
8 1 "Mark Whithers" 48 Bus Driver South Wales
9 1 "Miranda Gore Browne" 37 Food buyer for Ma~ Midhurst, ~
10 1 "Ruth Clemens" 31 Retail manager/Ho~ Poynton, C~
# ... with more rows
tbl(con, "bakers") %>%
  head(3) # dplyr verbs on the table are lazy - we get a lazy query back
# Source: lazy query [?? x 5]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
# Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
series baker_full age occupation hometown
<dbl> <chr> <dbl> <chr> <chr>
1 1 "Annetha Mills" 30 Midwife Essex
2 1 "David Chambers" 31 Entrepreneur Milton Ke~
3 1 "Edward \"Edd\" Kimber" 24 Debt collector for ~ Bradford
Notice that each time we used con: our connection con is how we “talk” to our database, and we will use it throughout, even in our tidy pipelines.
In baker_results we have each baker and their details, as well as their standing in the series competition. Let’s say we wanted to know whether the winners came from different areas in Britain, or from some areas in particular.
dbListFields(con, "baker_results")
[1] "series" "baker_full"
[3] "baker" "age"
[5] "occupation" "hometown"
[7] "baker_last" "baker_first"
[9] "star_baker" "technical_winner"
[11] "technical_top3" "technical_bottom"
[13] "technical_highest" "technical_lowest"
[15] "technical_median" "series_winner"
[17] "series_runner_up" "total_episodes_appeared"
[19] "first_date_appeared" "last_date_appeared"
[21] "first_date_us" "last_date_us"
[23] "percent_episodes_appeared" "percent_technical_top3"
tbl(con, "baker_results") %>%
  select(series, baker, hometown, series_winner) # pick out only the columns we need
# Source: lazy query [?? x 4]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
# Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
series baker hometown series_winner
<dbl> <chr> <chr> <int>
1 1 Annetha Essex 0
2 1 David Milton Keynes 0
3 1 Edd Bradford 1
4 1 Jasminder Birmingham 0
5 1 Jonathan St Albans 0
6 1 Lea Midlothian, Scotland 0
7 1 Louise Manchester 0
8 1 Mark South Wales 0
9 1 Miranda Midhurst, West Sussex 0
10 1 Ruth Poynton, Cheshire 0
# ... with more rows
Notice how our dplyr select() has trimmed down the number of columns from 24 to 4.
Time to see if our winning bakers hail from similar hometowns or not.
tbl(con, "baker_results") %>%
select(series, baker, hometown, series_winner) %>%
filter(series_winner == 1) %>% # normal dplyr filter
count(hometown, sort = TRUE) # normal dplyr count
# Source: lazy query [?? x 2]
# Database: sqlite 3.36.0 [C:\Users\Vebashini\OneDrive -
# Rain\Documents\Main-Documents\Personal\Blog_Vebash\_posts\2020-12-12-using-the-tidyverse-with-databases\data\great_brit_bakeoff.db]
# Ordered by: desc(n)
hometown n
<chr> <int>
1 Wigan 1
2 West Molesey, Surrey 1
3 Ongar, Essex 1
4 Market Harborough, Leicestershire 1
5 Leeds / Luton 1
6 Bradford 1
7 Barton-Upon-Humber, Lincolnshire 1
8 Barton-Le-Clay, Bedfordshire 1
Looks like our winners are from very different areas.
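This ties back to letting the DB do the heavy lifting: the counting above ran on the DB side. When you do want the (small, aggregated) result in your R session, e.g. to plot it for a report, collect() pulls it into a regular in-memory tibble. A sketch (winner_counts is just an illustrative name):
winner_counts <- tbl(con, "baker_results") %>%
  select(series, baker, hometown, series_winner) %>%
  filter(series_winner == 1) %>%
  count(hometown, sort = TRUE) %>%
  collect() # bring the aggregated result into R as a tibble
winner_counts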
Our dplyr syntax is converted behind the scenes to SQL, which is executed directly on the DB table.
tbl(con, "baker_results") %>%
select(series, baker, hometown, series_winner) %>%
filter(series_winner == 1) %>%
count(hometown, sort = TRUE) %>%
show_query()
<SQL>
SELECT `hometown`, COUNT(*) AS `n`
FROM (SELECT `series`, `baker`, `hometown`, `series_winner`
FROM `baker_results`)
WHERE (`series_winner` = 1.0)
GROUP BY `hometown`
ORDER BY `n` DESC
That’s pretty nifty, and it helps you learn SQL in the process - it is always good to be able to read and understand SQL, which is used extensively in work and institutional environments.
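If you’d like to hold on to the generated SQL - say, to share it, or to run it yourself through DBI - dbplyr can hand it back as a query. A sketch, assuming dbplyr’s sql_render():
query <- tbl(con, "baker_results") %>%
  select(series, baker, hometown, series_winner) %>%
  filter(series_winner == 1) %>%
  count(hometown, sort = TRUE) %>%
  sql_render() # the translated SQL, as a query object
dbGetQuery(con, query) # run that same SQL directly via DBI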
What’s the lazy query / ?? we’re seeing?
You might have noticed that our pipelines have a lazy query and a ?? in the output results.
- The dplyr code we write gets converted to SQL (in the particular dialect of the DB we’re connecting to).
- When we run a dplyr pipeline to execute a query on the DB, the DB does the computation and sends us back a glimpse of the executed end result.
- The ?? is telling us: “Hey, I executed this query (your pipeline converted to SQL) on the DB at connection con, and here’s a snippet of the output, but I did not get the meta-info on how many rows were produced, I just know that I got x_number of columns in the output.”
Good housekeeping means always remembering to disconnect once you’re done.
dbDisconnect(con) # closes our DB connection
To see that this is indeed the case, let’s check whether we can still use the connection …
dbListTables(con)
Error: Invalid or closed connection
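As an aside, if you are ever unsure whether a connection is still open, DBI’s dbIsValid() will tell you without throwing an error - a tiny sketch:
dbIsValid(con) # FALSE once the connection has been closed, TRUE while it is open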
If you’d like to dive in, please check out the slides and the project on GitHub. In the upcoming posts in this series we will look at more dplyr, for example joining tables.
The Great British Bake Off data is from Dr. Alison Hill.