You have good tidyverse knowledge...most especially dplyr.
You want to leverage that knowledge and connect directly to your database.
It's hard switching context between SQL and the R tidyverse. While it is good to know both, sometimes in an analysis you'd like a lower cognitive load.
Do you understand what this piece of code is doing? Discuss with your neighbour for a minute, then make a choice on the next tab.
palmerpenguins::penguins %>%
  select(species, sex, flipper_length_mm) %>%
  drop_na() %>%
  group_by(species, sex) %>%
  summarise(avg_flipper_len = mean(flipper_length_mm))
a. Returns only the average flipper length of each species.
b. Returns the entire dataset, with the average flipper length of the species tagged on as a new column.
c. It returns the average flipper length split by species and sex. Only observations which have values for species, sex and flipper_length_mm are included.
d. It gives an error because it should be summarize(), not summarise().
The answer is c.
palmerpenguins::penguins %>%
  select(species, sex, flipper_length_mm) %>%
  drop_na() %>%
  group_by(species, sex) %>%
  summarise(avg_flipper_len = mean(flipper_length_mm))
## `summarise()` regrouping output by 'species' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups:   species [3]
##   species   sex    avg_flipper_len
##   <fct>     <fct>            <dbl>
## 1 Adelie    female            188.
## 2 Adelie    male              192.
## 3 Chinstrap female            192.
## 4 Chinstrap male              200.
## 5 Gentoo    female            213.
## 6 Gentoo    male              222.
The DBI package is a DataBase Interface for R.
The odbc package is a DBI-compliant interface to Open DataBase Connectivity (ODBC) drivers.
The dbplyr package is a dplyr back-end for databases.

install.packages("DBI")
install.packages("odbc")
install.packages("dbplyr")
For the examples that follow we'll use SQLite via the RSQLite package.

install.packages("RSQLite")
We make a connection to the DB using dbConnect(), providing two arguments:
drv: the driver we're using, here RSQLite::SQLite().
:memory:, which is an in-memory SQLite DB.

library(DBI)
# Create an in-memory RSQLite DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# List tables available through our connection
dbListTables(con) # nothing there yet -> character(0)
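If you want to confirm the connection is usable before going further, DBI's dbIsValid() gives a quick check; a minimal sketch:

# Quick sanity check that the connection is open and usable
dbIsValid(con) # TRUE while the connection is open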
Using the RPostgres 📦 as an example here (similar process for RMariaDB etc.).

install.packages("RPostgres")
We use dbConnect() as before, this time providing more info such as user name, password, host name etc.

con <- dbConnect(
  drv = RPostgres::Postgres(), # driver we're using this time
  dbname = "DB_name",
  host = "URL_or_localhost",
  user = "user_name",
  password = "pswd",
  port = "5432" # common port for PostgreSQL, but check with DBA
)
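Hard-coding credentials in a script isn't ideal. One option is to keep them in environment variables (e.g. in your .Renviron) and read them with Sys.getenv(); a sketch, where DB_USER and DB_PASSWORD are hypothetical variable names you would set yourself:

con <- dbConnect(
  drv = RPostgres::Postgres(),
  dbname = "DB_name",
  host = "URL_or_localhost",
  user = Sys.getenv("DB_USER"),         # hypothetical env var set in .Renviron
  password = Sys.getenv("DB_PASSWORD"), # hypothetical env var set in .Renviron
  port = "5432"
)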
We can see which ODBC drivers are installed on our system with odbcListDrivers().

library(odbc)
sort(unique(odbcListDrivers()[[1]]))
We use dbConnect() as before, this time through the applicable odbc driver.

con <- dbConnect(
  odbc::odbc(),
  Driver = "e.g. PostgreSQL ODBC Driver(UNICODE)",
  Database = "DB_name",
  Server = "URL_or_localhost",
  UID = "user_name",
  PWD = "pswd",
  port = "5432"
)
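If an ODBC data source name (DSN) has already been configured on the machine (in odbc.ini or the Windows ODBC manager), you can connect by name instead of spelling out every field; a sketch, assuming a DSN called "my_dsn" exists:

con <- dbConnect(
  odbc::odbc(),
  dsn = "my_dsn", # hypothetical DSN configured outside of R
  UID = "user_name",
  PWD = "pswd"
)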
The RSQLite package embeds a SQLite database engine in R, and is a DBI-compliant interface.

dbWriteTable(con,      # using connection con
             "mtcars", # write into a table "mtcars"
             mtcars,   # the dataset mtcars
             overwrite = TRUE # overwrite the data there if it exists
)
dbListTables(con)
## [1] "mtcars"
dbListFields(con, "mtcars")
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
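You can also skip dplyr entirely and talk to the table through DBI itself, either reading it whole or running hand-written SQL; a small sketch:

# Read the whole table into R as a data frame
head(dbReadTable(con, "mtcars"), 3)

# Or send an arbitrary SQL query and get the result back
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")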
Connect to a table: tbl(con, "tbl_name")

Select subsets of data:

tbl(con,     # using our connection
    'mtcars' # reach into the mtcars table
) %>%
  select(cyl) %>%
  distinct()
# Source:   lazy query [?? x 1]
# Database: sqlite 3.33.0 [:memory:]
#   cyl
# 1   6
# ...
tbl(con, "mtcars") %>% count(cyl) # cyl n# <dbl> <int># 1 4 11# 2 6 7 ...
tbl(con, "mtcars") %>% filter(am == 0)
tbl(con, "tbl_name1") %>% left_join(tbl(con, "tbl_name2"), by = c("col1" = "name_1", "coln" = "name_n")
tbl(con, "mtcars") %>% count(cyl) %>% show_query()# <SQL># SELECT `cyl`, COUNT(*) AS `n`# FROM `mtcars`# GROUP BY `cyl`
my_db_mtcars <- tbl(con, "mtcars") %>%
  count(cyl) %>%
  collect() # finally bringing data into R
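After collect() the result is an ordinary local tibble, so any further wrangling happens in R memory rather than on the database; for example:

my_db_mtcars %>%
  mutate(pct = n / sum(n)) # plain dplyr on a local tibble now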
dbDisconnect(con)
2: library(DBI)
5: con <- dbConnect(RSQLite::SQLite(), ":memory:")
1: dbWriteTable(con, "band_members", dplyr::band_members)
3: dbListFields(con, "band_members")
6: tbl(con, "band_members")
4: dbDisconnect(con)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "band_members", dplyr::band_members)
dbListFields(con, "band_members")
## [1] "name" "band"
tbl(con, "band_members")
## # Source:   table<band_members> [?? x 2]
## # Database: sqlite 3.33.0 [:memory:]
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
dbDisconnect(con)
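As an optional extra, dplyr's copy_to() is another way to load local data frames into a connection, which makes it easy to practise an in-database join using dplyr::band_instruments; a sketch:

library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, band_members, "band_members")         # same data as above
copy_to(con, band_instruments, "band_instruments") # a second practice table

tbl(con, "band_members") %>%
  left_join(tbl(con, "band_instruments"), by = "name") %>%
  collect()

dbDisconnect(con)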
Alison Hill has put together some data from the Great British Bake Off here.
I converted it to a SQLite DB that we can use to practise our skills on.
tbl(con, "baker_results") %>% select(series, baker, star_baker:technical_top3) %>% head(3)## # Source: lazy query [?? x 5]## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]## series baker star_baker technical_winner technical_top3## <dbl> <chr> <int> <int> <int>## 1 1 Annetha 0 0 1## 2 1 David 0 0 1## 3 1 Edd 0 2 4
tbl(con, "results") %>% select(series, baker, result) %>% filter(result == "WINNER") %>% head(3)## # Source: lazy query [?? x 3]## # Database: sqlite 3.33.0 [/cloud/project/great_brit_bakeoff.db]## series baker result## <int> <chr> <chr> ## 1 1 Edd WINNER## 2 2 Joanne WINNER## 3 3 John WINNER
We want to connect to the great_brit_bakeoff.db file in our project. Which code will make the connection?
a.

library(DBI)
con <- dbConnect(
  drv = RSQLite::SQLite(),
  dbname = "great_brit_bakeoff.db"
)

b.

library(DBI)
con <- dbConnect(
  drv = RSQLite::SQLite(),
  user = "admin",
  password = "password",
  dbname = "great_brit_bakeoff.db"
)

c.

library(DBI)
con <- dbConnect(
  drv = RPostgres::Postgres(),
  host = "localhost",
  user = "admin",
  password = "password",
  dbname = "great_brit_bakeoff.db"
)
Open up exercise-01.Rmd and:
Using the baker_results and the results tables, complete the join code.
Change eval = FALSE in the setup code chunk to eval = TRUE.

Check out:
I have a package called reclues, which contains a database that mimics the SQL Murder Mystery by Northwestern's Knight Lab. The 📦 site is here.
I included the SQL Create scripts in the scripts folder - you can make a PostgreSQL version of The Great British Bake Off data to practise on 🙋 🍪.