Introduction to
 using Databases in R,
with Tidyverse toolsWorking directly in a databaseVebash Naidoo
  

Who am I?

Who are you?

You've got data in a database

Who are you?

You've got data in a database

You know enough of the tidyverse

You have good tidyverse knowledge...most especially dplyr.

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

You have good tidyverse knowledge...most especially dplyr.

You want to leverage that knowledge and connect directly to your database

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Lighter cognitive load

You have good tidyverse knowledge...most especially dplyr.

You want to leverage that knowledge and connect directly to your database

It's hard switching context between SQL and R tidyverse. While it is good to know both sometimes in an analysis you'd like to have a lower cognitive load.

Quick Knowledge Check

Do you understand what this piece of code is doing? Will you discuss with your neighbour for a minute, and make a choice on the next tab.

palmerpenguins::penguins %>% 
  select(species, sex, flipper_length_mm) %>%
  drop_na() %>% 
  group_by(species, sex) %>% 
  summarise(avg_flipper_len = mean(flipper_length_mm))

Returns only the average flipper length of each species.
Returns the entire dataset, with the average flipper length
of the species, tagged on as new column.
It returns the average flipper length split by species and sex. Only observations which have values for species, sex and flipper_length_mm are included.
It gives an error because it should be summarize() not summarise().

The answer is c.

palmerpenguins::penguins %>% 
  select(species, sex, 
         flipper_length_mm) %>%
  drop_na() %>% 
  group_by(species, sex) %>% 
  summarise(avg_flipper_len =
      mean(flipper_length_mm))

## `summarise()` regrouping output by 'species' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups:   species [3]
##   species   sex    avg_flipper_len
##   <fct>     <fct>            <dbl>
## 1 Adelie    female            188.
## 2 Adelie    male              192.
## 3 Chinstrap female            192.
## 4 Chinstrap male              200.
## 5 Gentoo    female            213.
## 6 Gentoo    male              222.

Connecting to a Database

Install the DBI package.
The DBI package is a DataBase Interface for R.
Install the odbc package.
The odbc package is a DBI compliant interface to Open DataBase Connectivity (ODBC) drivers.
Install the dbplyr package which is a dplyr back-end for databases.

install.packages("DBI")
install.packages("odbc")
install.packages("dbplyr")

Install the RSQLite package.

install.packages("RSQLite")

We make a connection to the DB, using dbConnect(), providing two arguments:
- drv: The driver we're using, here RSQLite::SQLite().
- A path to the DB - the example here uses :memory: which is an in memory SQLite DB.

library(DBI)
# Create an in-memory RSQLite DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# List tables available through our connection
dbListTables(con) # nothing there yet -> character(0)

Install the DBI-compliant R 📦
We'll use the RPostgres 📦 as an example here (similar process for RMariaDB etc.).

install.packages("RPostgres")

We make a connection to the DB, using dbConnect() as before, this time providing more info such as user name, password, host name etc.

con <- dbConnect(drv = RPostgres::Postgres(), # driver we're using this time
    dbname = "DB_name", 
    host = "URL_or_localhost",
    user = "user_name", 
    password = "pswd",
    port = "5432" # common port for PostgreSQL, but check with DBA
)

You can download the ODBC driver from your database vendor's website.
List the drivers you have available using odbcListDrivers().

library(odbc)
sort(unique(odbcListDrivers()[[1]]))

We make a connection to the DB, using dbConnect() as before, this time through the applicable odbc driver.

con <- dbConnect(
    odbc::odbc(),
    Driver = "e.g.PostgreSQL ODBC Driver(UNICODE)",
    Database = "DB_name",
    Server = "URL_or_localhost",
    UID = "user_name",
    PWD = "pswd",
    port = "5432"
)

The RSQLite package embeds a SQLite database engine in R, and is a DBI-compliant interface.

We're connected, now what?

Write data into a database table
List all tables
List the fields in a particular table

dbWriteTable(con, # using connection con
    "mtcars", # write into a table "mtcars"
     mtcars, # the dataset mtcars
     overwrite = TRUE # overwrite the data there if it exists
     )

dbListTables(con)
## [1] "mtcars"

dbListFields(con, "mtcars")
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" 
# "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Connect to table tbl(con, "tbl_name")
Select subsets of data
Anayse data
Filter data

tbl(con, # using our connection
    'mtcars' # reach into the mtcars table
    ) %>% 
  select(cyl) %>%  distinct()
# Source:   lazy query [?? x 1]
# Database: sqlite 3.33.0 [:memory:]
#     cyl
# 1     6 ...

tbl(con, "mtcars") %>% 
  count(cyl) 
#     cyl     n
#   <dbl> <int>
# 1     4    11
# 2     6     7 ...

tbl(con, "mtcars") %>% 
  filter(am == 0)

Join tables from your database NOTE: tbl(con, "tbl_name1") and tbl(con, "tbl_name2").

What's the SQL Looking like?

tbl(con, "tbl_name1") %>% 
  left_join(tbl(con, "tbl_name2"),
      by = c("col1" = "name_1",
             "coln" = "name_n")

tbl(con, "mtcars") %>% 
  count(cyl) %>% 
  show_query()
# <SQL>
# SELECT `cyl`, COUNT(*) AS `n`
# FROM `mtcars`
# GROUP BY `cyl`

Always remember to ...

Collect your data only when you need to (close to the end).

my_db_mtcars <- tbl(con, "mtcars") %>% 
  count(cyl) %>% 
  collect() # finally bringing data into R

Disconnect when you're done!

dbDisconnect(con)

Let's take stock

dbWriteTable(con, "band_members", dplyr::band_members)
library(DBI)
dbListFields(con, "band_members")
dbDisconnect(con)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
tbl(con, "band_members")

2: library(DBI)
5: con <- dbConnect(RSQLite::SQLite(), ":memory:")
1: dbWriteTable(con, "band_members", dplyr::band_members)
3: dbListFields(con, "band_members")
6: tbl(con, "band_members")
4: dbDisconnect(con)

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "band_members", dplyr::band_members)
dbListFields(con, "band_members")
## [1] "name" "band"
tbl(con, "band_members")
## # Source:   table<band_members> [?? x 2]
## # Database: sqlite 3.33.0 [:memory:]
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
dbDisconnect(con)

Our dataset

Alison Hill has put together some data from the Great British Bake off here.

I converted it to a SQLite DB that we can use to practise our skills on.

tbl(con, "baker_results") %>% 
  select(series, baker,
    star_baker:technical_top3) %>% 
  head(3)
## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]
##   series baker   star_baker technical_winner technical_top3
##    <dbl> <chr>        <int>            <int>          <int>
## 1      1 Annetha          0                0              1
## 2      1 David            0                0              1
## 3      1 Edd              0                2              4

tbl(con, "results") %>% 
  select(series, baker, result) %>% 
  filter(result == "WINNER") %>% 
  head(3)
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]
##   series baker  result
##    <int> <chr>  <chr> 
## 1      1 Edd    WINNER
## 2      2 Joanne WINNER
## 3      3 John   WINNER

It is a SQLite DB in your working directory.
It is named great_brit_bakeoff.db

Which code will make the connection?

library(DBI)
con <- dbConnect(
    drv = RSQLite::SQLite(),
    dbname = "great_brit_bakeoff.db"
)

library(DBI)
con <- dbConnect(
    drv = RSQLite::SQLite(),
    user = "admin", password = "password",
    dbname = "great_brit_bakeoff.db"
)

library(DBI)
con <- dbConnect(
    drv = RPostgres::Postgres(), host = "localhost",
    user = "admin", password = "password",
    dbname = "great_brit_bakeoff.db"
)

Open up exercise-01.Rmd and:

Complete the connection code.
Recalling that we are joining the baker_results and the results tables, complete the join code.
Amend eval = FALSE in the setup code chunk to eval = TRUE.
Knit the document.
Interested in a different winner? Fill in the code to view how your baker performed across the season.

Acknowledgements

More Resources

Looking for more practise

Check out:

Databases using R 🎉 ❤️.
I have a package called reclues which contains a database which mimics the SQL Murder Mystery done by Northwestern’s Knight Lab. The 📦 site is here.
I included the SQL Create scripts in the scripts folder - you can make a PostgreSQL version of The Great British Bake Off data to practise on 🙋 🍪.

Thank you!

Introduction to
 using Databases in R,
with Tidyverse toolsWorking directly in a databaseVebash Naidoo
  

Who am I?

Who are you?

You've got data in a database

Who are you?

You've got data in a database

You know enough of the tidyverse

You have good tidyverse knowledge...most especially dplyr.

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

You have good tidyverse knowledge...most especially dplyr.

You want to leverage that knowledge and connect directly to your database

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Lighter cognitive load

You have good tidyverse knowledge...most especially dplyr.

You want to leverage that knowledge and connect directly to your database

It's hard switching context between SQL and R tidyverse. While it is good to know both sometimes in an analysis you'd like to have a lower cognitive load.

Quick Knowledge Check

Do you understand what this piece of code is doing? Will you discuss with your neighbour for a minute, and make a choice on the next tab.

palmerpenguins::penguins %>% 
  select(species, sex, flipper_length_mm) %>%
  drop_na() %>% 
  group_by(species, sex) %>% 
  summarise(avg_flipper_len = mean(flipper_length_mm))

Returns only the average flipper length of each species.
Returns the entire dataset, with the average flipper length
of the species, tagged on as new column.
It returns the average flipper length split by species and sex. Only observations which have values for species, sex and flipper_length_mm are included.
It gives an error because it should be summarize() not summarise().

The answer is c.

palmerpenguins::penguins %>% 
  select(species, sex, 
         flipper_length_mm) %>%
  drop_na() %>% 
  group_by(species, sex) %>% 
  summarise(avg_flipper_len =
      mean(flipper_length_mm))

## `summarise()` regrouping output by 'species' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups:   species [3]
##   species   sex    avg_flipper_len
##   <fct>     <fct>            <dbl>
## 1 Adelie    female            188.
## 2 Adelie    male              192.
## 3 Chinstrap female            192.
## 4 Chinstrap male              200.
## 5 Gentoo    female            213.
## 6 Gentoo    male              222.

Connecting to a Database

Install the DBI package.
The DBI package is a DataBase Interface for R.
Install the odbc package.
The odbc package is a DBI compliant interface to Open DataBase Connectivity (ODBC) drivers.
Install the dbplyr package which is a dplyr back-end for databases.

install.packages("DBI")
install.packages("odbc")
install.packages("dbplyr")

Install the RSQLite package.

install.packages("RSQLite")

We make a connection to the DB, using dbConnect(), providing two arguments:
- drv: The driver we're using, here RSQLite::SQLite().
- A path to the DB - the example here uses :memory: which is an in memory SQLite DB.

library(DBI)
# Create an in-memory RSQLite DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# List tables available through our connection
dbListTables(con) # nothing there yet -> character(0)

Install the DBI-compliant R 📦
We'll use the RPostgres 📦 as an example here (similar process for RMariaDB etc.).

install.packages("RPostgres")

We make a connection to the DB, using dbConnect() as before, this time providing more info such as user name, password, host name etc.

con <- dbConnect(drv = RPostgres::Postgres(), # driver we're using this time
    dbname = "DB_name", 
    host = "URL_or_localhost",
    user = "user_name", 
    password = "pswd",
    port = "5432" # common port for PostgreSQL, but check with DBA
)

You can download the ODBC driver from your database vendor's website.
List the drivers you have available using odbcListDrivers().

library(odbc)
sort(unique(odbcListDrivers()[[1]]))

We make a connection to the DB, using dbConnect() as before, this time through the applicable odbc driver.

con <- dbConnect(
    odbc::odbc(),
    Driver = "e.g.PostgreSQL ODBC Driver(UNICODE)",
    Database = "DB_name",
    Server = "URL_or_localhost",
    UID = "user_name",
    PWD = "pswd",
    port = "5432"
)

The RSQLite package embeds a SQLite database engine in R, and is a DBI-compliant interface.

We're connected, now what?

Write data into a database table
List all tables
List the fields in a particular table

dbWriteTable(con, # using connection con
    "mtcars", # write into a table "mtcars"
     mtcars, # the dataset mtcars
     overwrite = TRUE # overwrite the data there if it exists
     )

dbListTables(con)
## [1] "mtcars"

dbListFields(con, "mtcars")
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" 
# "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Connect to table tbl(con, "tbl_name")
Select subsets of data
Anayse data
Filter data

tbl(con, # using our connection
    'mtcars' # reach into the mtcars table
    ) %>% 
  select(cyl) %>%  distinct()
# Source:   lazy query [?? x 1]
# Database: sqlite 3.33.0 [:memory:]
#     cyl
# 1     6 ...

tbl(con, "mtcars") %>% 
  count(cyl) 
#     cyl     n
#   <dbl> <int>
# 1     4    11
# 2     6     7 ...

tbl(con, "mtcars") %>% 
  filter(am == 0)

Join tables from your database NOTE: tbl(con, "tbl_name1") and tbl(con, "tbl_name2").

What's the SQL Looking like?

tbl(con, "tbl_name1") %>% 
  left_join(tbl(con, "tbl_name2"),
      by = c("col1" = "name_1",
             "coln" = "name_n")

tbl(con, "mtcars") %>% 
  count(cyl) %>% 
  show_query()
# <SQL>
# SELECT `cyl`, COUNT(*) AS `n`
# FROM `mtcars`
# GROUP BY `cyl`

Always remember to ...

Collect your data only when you need to (close to the end).

my_db_mtcars <- tbl(con, "mtcars") %>% 
  count(cyl) %>% 
  collect() # finally bringing data into R

Disconnect when you're done!

dbDisconnect(con)

Let's take stock

dbWriteTable(con, "band_members", dplyr::band_members)
library(DBI)
dbListFields(con, "band_members")
dbDisconnect(con)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
tbl(con, "band_members")

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "band_members", dplyr::band_members)
dbListFields(con, "band_members")
## [1] "name" "band"
tbl(con, "band_members")
## # Source:   table<band_members> [?? x 2]
## # Database: sqlite 3.33.0 [:memory:]
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
dbDisconnect(con)

Our dataset

Alison Hill has put together some data from the Great British Bake off here.

I converted it to a SQLite DB that we can use to practise our skills on.

tbl(con, "baker_results") %>% 
  select(series, baker,
    star_baker:technical_top3) %>% 
  head(3)
## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]
##   series baker   star_baker technical_winner technical_top3
##    <dbl> <chr>        <int>            <int>          <int>
## 1      1 Annetha          0                0              1
## 2      1 David            0                0              1
## 3      1 Edd              0                2              4

tbl(con, "results") %>% 
  select(series, baker, result) %>% 
  filter(result == "WINNER") %>% 
  head(3)
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]
##   series baker  result
##    <int> <chr>  <chr> 
## 1      1 Edd    WINNER
## 2      2 Joanne WINNER
## 3      3 John   WINNER

It is a SQLite DB in your working directory.
It is named great_brit_bakeoff.db

Which code will make the connection?

library(DBI)
con <- dbConnect(
    drv = RSQLite::SQLite(),
    dbname = "great_brit_bakeoff.db"
)

library(DBI)
con <- dbConnect(
    drv = RSQLite::SQLite(),
    user = "admin", password = "password",
    dbname = "great_brit_bakeoff.db"
)

library(DBI)
con <- dbConnect(
    drv = RPostgres::Postgres(), host = "localhost",
    user = "admin", password = "password",
    dbname = "great_brit_bakeoff.db"
)

Open up exercise-01.Rmd and:

Complete the connection code.
Recalling that we are joining the baker_results and the results tables, complete the join code.
Amend eval = FALSE in the setup code chunk to eval = TRUE.
Knit the document.
Interested in a different winner? Fill in the code to view how your baker performed across the season.

Acknowledgements

Slides inspired by Dr. Alison Hill

Teaching in Production

Yihui Xie's and Garrick Aden-Buie's excellent

xaringan 📦 xaringanExtra 📦

Mine Çetinkaya-Rundel "Start with Cake"

Let them eat cake (first)!

More Resources

Looking for more practise

Check out:

Databases using R 🎉 ❤️.
I have a package called reclues which contains a database which mimics the SQL Murder Mystery done by Northwestern’s Knight Lab. The 📦 site is here.
I included the SQL Create scripts in the scripts folder - you can make a PostgreSQL version of The Great British Bake Off data to practise on 🙋 🍪.

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help
o	Tile View: Overview of Slides

Introduction to using Databases in R,with Tidyverse tools

Working directly in a database

Vebash Naidoo

Who am I?

Who are you?

You've got data in a database

Who are you?

You've got data in a database

You know enough of the tidyverse

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Lighter cognitive load

Quick Knowledge Check

Connecting to a Database

We're connected, now what?

Always remember to ...

Let's take stock

Our dataset

Acknowledgements

Slides inspired by Dr. Alison Hill

Yihui Xie's and Garrick Aden-Buie's excellent

Mine Çetinkaya-Rundel "Start with Cake"

More Resources

Looking for more practise

Thank you!

Who am I?

Help

Introduction to using Databases in R, with Tidyverse tools

Introduction to using Databases in R,with Tidyverse tools

Working directly in a database

Vebash Naidoo

Who am I?

Who are you?

You've got data in a database

Who are you?

You've got data in a database

You know enough of the tidyverse

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Who are you?

You've got data in a database

You know enough of the tidyverse

You want to flex that skill

Lighter cognitive load

Quick Knowledge Check

Connecting to a Database

We're connected, now what?

Always remember to ...

Let's take stock

Our dataset

Acknowledgements

Slides inspired by Dr. Alison Hill

Yihui Xie's and Garrick Aden-Buie's excellent

Mine Çetinkaya-Rundel "Start with Cake"

More Resources

Looking for more practise

Thank you!

Introduction to
using Databases in R,
with Tidyverse tools

Introduction to
using Databases in R,
with Tidyverse tools