+ - 0:00:00
Notes for current slide
Notes for next slide

Introduction to
using Databases in R,
with Tidyverse tools

Working directly in a database

Vebash Naidoo

Who am I?

Who are you?

You've got data in a database

Who are you?

You've got data in a database

You know enough of the tidyverse

You have good tidyverse knowledge...most especially dplyr.

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

You have good tidyverse knowledge...most especially dplyr.

You want to leverage that knowledge and connect directly to your database

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Lighter cognitive load



You have good tidyverse knowledge...most especially dplyr.

You want to leverage that knowledge and connect directly to your database

It's hard switching context between SQL and R tidyverse. While it is good to know both sometimes in an analysis you'd like to have a lower cognitive load.

Quick Knowledge Check

Do you understand what this piece of code is doing? Will you discuss with your neighbour for a minute, and make a choice on the next tab.

palmerpenguins::penguins %>%
select(species, sex, flipper_length_mm) %>%
drop_na() %>%
group_by(species, sex) %>%
summarise(avg_flipper_len = mean(flipper_length_mm))
  1. Returns only the average flipper length of each species.

  2. Returns the entire dataset, with the average flipper length
    of the species, tagged on as new column.

  3. It returns the average flipper length split by species and sex. Only observations which have values for species, sex and flipper_length_mm are included.

  4. It gives an error because it should be summarize() not summarise().

The answer is c.

palmerpenguins::penguins %>%
select(species, sex,
flipper_length_mm) %>%
drop_na() %>%
group_by(species, sex) %>%
summarise(avg_flipper_len =
mean(flipper_length_mm))
## `summarise()` regrouping output by 'species' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups: species [3]
## species sex avg_flipper_len
## <fct> <fct> <dbl>
## 1 Adelie female 188.
## 2 Adelie male 192.
## 3 Chinstrap female 192.
## 4 Chinstrap male 200.
## 5 Gentoo female 213.
## 6 Gentoo male 222.

Connecting to a Database

  • Install the DBI package.
  • The DBI package is a DataBase Interface for R.
  • Install the odbc package.
  • The odbc package is a DBI compliant interface to Open DataBase Connectivity (ODBC) drivers.
  • Install the dbplyr package which is a dplyr back-end for databases.
install.packages("DBI")
install.packages("odbc")
install.packages("dbplyr")
  • Install the RSQLite package.
install.packages("RSQLite")
  • We make a connection to the DB, using dbConnect(), providing two arguments:

    • drv: The driver we're using, here RSQLite::SQLite().
    • A path to the DB - the example here uses :memory: which is an in memory SQLite DB.
library(DBI)
# Create an in-memory RSQLite DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# List tables available through our connection
dbListTables(con) # nothing there yet -> character(0)
  • Install the DBI-compliant R 📦
  • We'll use the RPostgres 📦 as an example here (similar process for RMariaDB etc.).
install.packages("RPostgres")
  • We make a connection to the DB, using dbConnect() as before, this time providing more info such as user name, password, host name etc.
con <- dbConnect(drv = RPostgres::Postgres(), # driver we're using this time
dbname = "DB_name",
host = "URL_or_localhost",
user = "user_name",
password = "pswd",
port = "5432" # common port for PostgreSQL, but check with DBA
)
  • You can download the ODBC driver from your database vendor's website.
  • List the drivers you have available using odbcListDrivers().
library(odbc)
sort(unique(odbcListDrivers()[[1]]))
  • We make a connection to the DB, using dbConnect() as before, this time through the applicable odbc driver.
con <- dbConnect(
odbc::odbc(),
Driver = "e.g.PostgreSQL ODBC Driver(UNICODE)",
Database = "DB_name",
Server = "URL_or_localhost",
UID = "user_name",
PWD = "pswd",
port = "5432"
)
  • The RSQLite package embeds a SQLite database engine in R, and is a DBI-compliant interface.

We're connected, now what?

  • Write data into a database table




  • List all tables




  • List the fields in a particular table




dbWriteTable(con, # using connection con
"mtcars", # write into a table "mtcars"
mtcars, # the dataset mtcars
overwrite = TRUE # overwrite the data there if it exists
)
dbListTables(con)
## [1] "mtcars"
dbListFields(con, "mtcars")
## [1] "mpg" "cyl" "disp" "hp" "drat"
# "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
  • Connect to table tbl(con, "tbl_name")

  • Select subsets of data



  • Anayse data





  • Filter data

tbl(con, # using our connection
'mtcars' # reach into the mtcars table
) %>%
select(cyl) %>% distinct()
# Source: lazy query [?? x 1]
# Database: sqlite 3.33.0 [:memory:]
# cyl
# 1 6 ...
tbl(con, "mtcars") %>%
count(cyl)
# cyl n
# <dbl> <int>
# 1 4 11
# 2 6 7 ...
tbl(con, "mtcars") %>%
filter(am == 0)
  • Join tables from your database NOTE: tbl(con, "tbl_name1") and tbl(con, "tbl_name2").



  • What's the SQL Looking like?



tbl(con, "tbl_name1") %>%
left_join(tbl(con, "tbl_name2"),
by = c("col1" = "name_1",
"coln" = "name_n")
tbl(con, "mtcars") %>%
count(cyl) %>%
show_query()
# <SQL>
# SELECT `cyl`, COUNT(*) AS `n`
# FROM `mtcars`
# GROUP BY `cyl`

Always remember to ...

  • Collect your data only when you need to (close to the end).
my_db_mtcars <- tbl(con, "mtcars") %>%
count(cyl) %>%
collect() # finally bringing data into R
  • Disconnect when you're done!
dbDisconnect(con)

Let's take stock

  1. dbWriteTable(con, "band_members", dplyr::band_members)
  2. library(DBI)
  3. dbListFields(con, "band_members")
  4. dbDisconnect(con)
  5. con <- dbConnect(RSQLite::SQLite(), ":memory:")
  6. tbl(con, "band_members")

2: library(DBI)
5: con <- dbConnect(RSQLite::SQLite(), ":memory:")
1: dbWriteTable(con, "band_members", dplyr::band_members)
3: dbListFields(con, "band_members")
6: tbl(con, "band_members")
4: dbDisconnect(con)

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "band_members", dplyr::band_members)
dbListFields(con, "band_members")
## [1] "name" "band"
tbl(con, "band_members")
## # Source: table<band_members> [?? x 2]
## # Database: sqlite 3.33.0 [:memory:]
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
dbDisconnect(con)

Our dataset

Alison Hill has put together some data from the Great British Bake off here.

I converted it to a SQLite DB that we can use to practise our skills on.

tbl(con, "baker_results") %>%
select(series, baker,
star_baker:technical_top3) %>%
head(3)
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]
## series baker star_baker technical_winner technical_top3
## <dbl> <chr> <int> <int> <int>
## 1 1 Annetha 0 0 1
## 2 1 David 0 0 1
## 3 1 Edd 0 2 4
tbl(con, "results") %>%
select(series, baker, result) %>%
filter(result == "WINNER") %>%
head(3)
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]
## series baker result
## <int> <chr> <chr>
## 1 1 Edd WINNER
## 2 2 Joanne WINNER
## 3 3 John WINNER
  • It is a SQLite DB in your working directory.
  • It is named great_brit_bakeoff.db

Which code will make the connection?

a.

library(DBI)
con <- dbConnect(
drv = RSQLite::SQLite(),
dbname = "great_brit_bakeoff.db"
)

b.

library(DBI)
con <- dbConnect(
drv = RSQLite::SQLite(),
user = "admin", password = "password",
dbname = "great_brit_bakeoff.db"
)

c.

library(DBI)
con <- dbConnect(
drv = RPostgres::Postgres(), host = "localhost",
user = "admin", password = "password",
dbname = "great_brit_bakeoff.db"
)

Open up exercise-01.Rmd and:

  1. Complete the connection code.
  2. Recalling that we are joining the baker_results and the results tables, complete the join code.
  3. Amend eval = FALSE in the setup code chunk to eval = TRUE.
  4. Knit the document.
  5. Interested in a different winner? Fill in the code to view how your baker performed across the season.

Acknowledgements

Slides inspired by Dr. Alison Hill

Teaching in Production

Yihui Xie's and Garrick Aden-Buie's excellent

xaringan 📦 xaringanExtra 📦

Mine Çetinkaya-Rundel "Start with Cake"

Let them eat cake (first)!

More Resources

Looking for more practise

Check out:

  1. Databases using R 🎉 ❤️.
  2. I have a package called reclues which contains a database which mimics the SQL Murder Mystery done by Northwestern’s Knight Lab. The 📦 site is here.

  3. I included the SQL Create scripts in the scripts folder - you can make a PostgreSQL version of The Great British Bake Off data to practise on 🙋 🍪.

Thank you!

Who am I?

Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow