Sciencificity's Blog: Using the tidyverse with Databases - Part III

Vebash Naidoo

In Part II we progressed a bit further, with more in-depth {dplyr} workflows, and we also brought the data into R after doing most of the computation on the database itself.

What are we tackling in Part III?

Set up a MySQL DB

Create a local MySQL instance

That’s it for the MySQL DBMS itself; the rest of the work of setting up our DB will be done in R.

Our data

The {dplyr} 📦 ships with a dataset, starwars, which holds information on the characters in the films.

I broke this dataset apart into separate tables, each containing a subset of the information, so that we have something to practise on:

Data Dictionary for the Star Wars Database
Field	Type	Notes
films
id	integer	Identification field for the film e.g. 5
films	string	Movie Name e.g. "The Empire Strikes Back"
year	double	Year movie was released e.g. 1980
vehicles
id	integer	Identification field for the vehicle e.g. 3
vehicles	string	Name of vehicle e.g. "Tribubble bongo"
starships
id	integer	Identification field for the starship e.g. 10
starships	string	Name of starship e.g. "Millennium Falcon"
characters
name	string	Character Name e.g. Leia Organa
height	integer	Height of character
mass	double	Mass of character
hair_color	string	Hair Color of character
skin_color	string	Skin Color of character
eye_color	string	Eye Color of character
birth_year	double	Birth Year of character
sex	string	Sex of character
gender	string	Gender of character
homeworld	string	Homeworld of character
species	string	Species of character
appearances
film_id	integer	Link into the films table, e.g. 4 (denoting character appeared in "A New Hope")
name	string	Character Name e.g. Leia Organa
vehicles_piloted
vehicle_id	integer	Link into the vehicles table, e.g. 3 (denoting character drove a "Tribubble bongo")
name	string	Character Name e.g. Obi-Wan Kenobi
starships_piloted
starship_id	integer	Link into the starships table, e.g. 10 (denoting character piloted the "Millennium Falcon")
name	string	Character Name e.g. Chewbacca
survey_levels
id	integer	Identification for the survey response e.g. 1
level	string	The survey response e.g. "Very unfavorably"
survey
respondent_id	integer	Identification for the survey respondent e.g. 3292879998
any_of_6	string	Did respondent watch any of the 6 movies? (Original 6 before reboot) - e.g. Yes, No
star_wars_fan	string	Is the respondent a Star Wars fan? - e.g. Yes, No
watched_The Phantom Menace	string	Did the respondent watch said movie? - e.g. Yes, No
watched_A New Hope	string	Did the respondent watch said movie? - e.g. Yes, No
watched_Attack of the Clones	string	Did the respondent watch said movie? - e.g. Yes, No
watched_Return of the Jedi	string	Did the respondent watch said movie? - e.g. Yes, No
watched_Revenge of the Sith	string	Did the respondent watch said movie? - e.g. Yes, No
watched_The Empire Strikes Back	string	Did the respondent watch said movie? - e.g. Yes, No
rank_A New Hope	double	How does the respondent rank the movie? 1 = Best, 6 = Worst
rank_Attack of the Clones	double	How does the respondent rank the movie? 1 = Best, 6 = Worst
rank_Return of the Jedi	double	How does the respondent rank the movie? 1 = Best, 6 = Worst
rank_Revenge of the Sith	double	How does the respondent rank the movie? 1 = Best, 6 = Worst
rank_The Empire Strikes Back	double	How does the respondent rank the movie? 1 = Best, 6 = Worst
rank_The Phantom Menace	double	How does the respondent rank the movie? 1 = Best, 6 = Worst
`Han Solo`	integer	Link into survey_levels table where 5 = Very favorably
`Luke Skywalker`	integer	Link into survey_levels table where 5 = Very favorably
`Leia Organa`	integer	Link into survey_levels table where 5 = Very favorably
`Anakin Skywalker`	integer	Link into survey_levels table where 5 = Very favorably
`Obi-Wan Kenobi`	integer	Link into survey_levels table where 5 = Very favorably
`Palpatine`	integer	Link into survey_levels table where 5 = Very favorably
`Darth Vader`	integer	Link into survey_levels table where 5 = Very favorably
`Lando Calrissian`	integer	Link into survey_levels table where 5 = Very favorably
`Boba Fett`	integer	Link into survey_levels table where 5 = Very favorably
`C-3P0`	integer	Link into survey_levels table where 5 = Very favorably
`R2-D2`	integer	Link into survey_levels table where 5 = Very favorably
`Jar Jar Binks`	integer	Link into survey_levels table where 5 = Very favorably
`Padme Amidala`	integer	Link into survey_levels table where 5 = Very favorably
`Yoda`	integer	Link into survey_levels table where 5 = Very favorably
who_shot_first	string	Han, Greedo, or "I don't understand the question"
know_expanded_universe	string	Does respondent know the expanded universe? Yes / No
fan_expanded_universe	string	Does respondent like the expanded universe? Yes / No
trekkie	string	Is respondent a Star Trek fan? Yes / No
Gender	string	Gender e.g. Male
Age	string	Age range e.g. 30-44
Household Income	string	Income range e.g. $50,000 - $99,999
Education	string	Education level e.g. Some college or Associate degree
Location (Census Region)	string	Location of respondent e.g. East South Central
franchise
franchise	string	Star Wars
revenue_category	string	Category of revenue generation e.g. Book sales
revenue_billion_dollars	double	Revenue earned from category, in billions of dollars
year_created	double	Year the franchise was created
original_media	string	Original media the franchise was released on e.g. Book, Film
creators	string	Who created the franchise
owners	string	The owners of the franchise

Create some tables

Here’s an example of how I created the films and the associated appearances tables.

library(tidyverse)

# Release year for each film
film_years <- tribble(~name,      ~year,
        #------------------------#------
        "The Empire Strikes Back", 1980,
        "Revenge of the Sith"    , 2005,
        "Return of the Jedi"     , 1983,
        "A New Hope"             , 1977,
        "The Force Awakens"      , 2015,
        "Attack of the Clones"   , 2002,
        "The Phantom Menace"     , 1999)

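# Assumption: `films` already exists as a data frame with a `films`
# column holding one row per distinct film name
films <- films %>% 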
  # Join the tables to tag on the `year` column
  inner_join(film_years,
             # left table column = films, right table column = name
             by = c("films" = "name"))

# Order of films in terms of episodes, not release date
# We're going to use this to create a factor
film_levels <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith",
                 "A New Hope", "The Empire Strikes Back", "Return of the Jedi",
                 "The Force Awakens")
(films <- films %>% 
  # make "films" a factor using the film_levels we created above
  # this ensures that id = 4 is associated with 'A New Hope'
  mutate(films = factor(films, levels = film_levels)) %>% 
  # create an id column - we will use this later
  # .before places the id column ahead of the films column
  mutate(id = as.integer(films), .before = "films") %>% 
  arrange(id))
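The appearances table is not shown above. A minimal sketch of how it could be derived from starwars follows; it is not the author's code. The `film_lookup` tibble is a hypothetical stand-in for the `films` table (with `films` kept as plain character), and the sketch relies on `starwars$films` being a list column of film names.

```r
library(dplyr)
library(tidyr)

# Episode order, mirroring the factor levels used for the films table
film_levels <- c("The Phantom Menace", "Attack of the Clones",
                 "Revenge of the Sith", "A New Hope",
                 "The Empire Strikes Back", "Return of the Jedi",
                 "The Force Awakens")

# Hypothetical stand-in for the films table: id follows episode order
film_lookup <- tibble(id = seq_along(film_levels), films = film_levels)

appearances <- starwars %>%
  select(name, films) %>%
  # films is a list column: expand it to one row per character-film pair
  unnest(cols = films) %>%
  # attach the film id, then keep only the two linking columns
  inner_join(film_lookup, by = "films") %>%
  select(film_id = id, name)
```

Each row of `appearances` then links a character name to a film id, matching the shape described in the data dictionary.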

Connections Pane

Make a connection

ODBC

We use the connection string generated via the Connections Pane to connect to our MySQL DB, and write our data frames into database tables.
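As a rough sketch, an ODBC connection from R might look like the helper below. The helper name, driver name, database name, and environment variables are all placeholder assumptions; the real values come from the connection string that the Connections Pane generates for your machine.

```r
library(DBI)

# Hypothetical helper: every default here is a placeholder, not the author's setup
connect_starwars_odbc <- function(database = "starwars",
                                  driver = "MySQL ODBC 8.0 Unicode Driver") {
  dbConnect(
    odbc::odbc(),
    Driver   = driver,                       # whichever MySQL ODBC driver is installed
    Server   = "localhost",
    Port     = 3306,
    Database = database,
    UID      = Sys.getenv("MYSQL_USER"),     # assumed environment variables,
    PWD      = Sys.getenv("MYSQL_PASSWORD")  # so credentials stay out of the script
  )
}
```

Calling `con <- connect_starwars_odbc()` would then open the connection, provided the {odbc} package and a MySQL ODBC driver are installed.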

Appropriate DBI-compliant package

Alternatively, we may use the appropriate DBI-compliant package, which here is {RMariaDB} via RMariaDB::MariaDB().
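A minimal sketch of that route is shown here. The helper name, host, database name, and environment variables are assumptions made for illustration only.

```r
library(DBI)

# Hypothetical helper: host, database name and credentials are placeholders
connect_starwars_mariadb <- function(database = "starwars") {
  dbConnect(
    RMariaDB::MariaDB(),
    host     = "localhost",
    port     = 3306,
    dbname   = database,
    username = Sys.getenv("MYSQL_USER"),     # assumed environment variables
    password = Sys.getenv("MYSQL_PASSWORD")
  )
}
```

The resulting connection object is used with DBI and {dplyr} in the same way as the ODBC one.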

Writing to a MySQL Database from RStudio

Once we have connected to the database, we’re ready to write our data frames into tables in the DBMS.
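Writing a data frame into a table might look roughly like this. To keep the sketch runnable without a server, it uses an in-memory SQLite connection as a stand-in (an assumption, and it needs {RSQLite}); against MySQL, `con` would be the connection opened earlier and the DBI calls would stay the same. The two-row `films` tibble is a cut-down illustration.

```r
library(DBI)
library(tibble)

# Stand-in connection (assumption): replace with the MySQL connection
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A cut-down films table, for illustration only
films <- tribble(
  ~id, ~films,                    ~year,
  4L,  "A New Hope",              1977,
  5L,  "The Empire Strikes Back", 1980
)

# Create (or replace) the table in the database from the data frame
dbWriteTable(con, "films", films, overwrite = TRUE)

# Read it back to confirm the rows arrived
written <- dbReadTable(con, "films")

dbDisconnect(con)
```

The same call would be repeated for each data frame that should become a table.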

Communicate with our MySQL Database

Alright, we’re all set. We can now start querying the tables in our MySQL database.

Connect

Take a look around

Connecting via the Connections Pane has some additional perks, in that you can have a look at your tables as though you’re in the DBMS itself. You may also preview the first 1000 rows.
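A similar look around can be done from code. The sketch below again uses an in-memory SQLite connection with a tiny `films` table as a stand-in for the MySQL database (an assumption, so that it runs anywhere), and it needs {dbplyr} installed for `tbl()` to work.

```r
library(DBI)
library(dplyr)

# Stand-in for the MySQL connection (assumption), with one small table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "films", data.frame(
  id    = 4:5,
  films = c("A New Hope", "The Empire Strikes Back"),
  year  = c(1977, 1980)
))

tables <- dbListTables(con)           # which tables exist?
fields <- dbListFields(con, "films")  # which columns does films have?

# Lazy reference to the table; head() + collect() mimics the
# "preview the first 1000 rows" option in the Connections Pane
preview <- tbl(con, "films") %>%
  head(1000) %>%
  collect()

dbDisconnect(con)
```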

Explore data

Let’s look at the popularity of the characters according to the survey dataset.

character_info <- character_info %>% 
  mutate(name = as.factor(name)) %>% 
  # Let's create aggregated survey levels by combining, for example,
  # "Somewhat favorably" and "Very favorably" into a Favourable category.
  # We're essentially trimming down the categories as per the
  # FiveThirtyEight article
  mutate(sub_level = case_when(
   str_detect(level, "Neither favorably nor unfavorably")  ~  "Neutral",
   (str_detect(level, " unfavorably") |
      str_detect(level, "Somewhat unfavorably"))           ~  "Unfavourable",
   (str_detect(level, "Very favorably") |
      str_detect(level, "Somewhat favorably"))             ~ "Favourable",
   str_detect(level, "Unfamiliar")                         ~ "Unfamiliar",
   TRUE                                                    ~ "None"
   )) %>% 
  mutate(sub_level = factor(sub_level, levels = c("Favourable", 
                                                  "Neutral", "Unfavourable", 
                                                  "Unfamiliar"))) 

# Processing to set up the waffle plot
# We want to understand each character's popularity
character_info <- character_info %>% 
  select(respondent_id, name, sub_level) %>% 
  distinct() %>% 
  group_by(name) %>% 
  mutate(n = n()) %>% 
  ungroup() %>% 
  group_by(name, sub_level) %>% 
  mutate(
    nn = n(),
    perc = nn / n * 100.0) %>% 
  ungroup() %>% 
  select(name, sub_level, n, nn, perc) %>% 
  distinct() %>% 
  inner_join(character_info) %>% 
  select(respondent_id:respondent_gender, name, 
         survey_id, level, sub_level:perc, gender:year)

We’ll create a waffle plot to have a look at how popular the characters are. In a few cases the overall percentage is slightly less than 100%. This is due to rounding quirks.
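The rounding effect can be reproduced with a tiny example. The counts are hypothetical, and the sketch assumes percentages are rounded to whole squares before plotting, which may not be exactly what the plotting code does.

```r
# Hypothetical character whose responses split evenly three ways
counts <- c(Favourable = 1, Neutral = 1, Unfavourable = 1)

perc    <- counts / sum(counts) * 100  # 33.33...% each
squares <- round(perc)                 # whole squares per category

total <- sum(squares)                  # 99, one square short of 100
```

Each share rounds down to 33, so the three categories fill only 99 of the 100 squares.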

Done? Remember to disconnect!
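With DBI, disconnecting is a single call. The sketch uses an in-memory SQLite connection as a stand-in for the MySQL one (an assumption, so that it runs without a server).

```r
library(DBI)

# Stand-in for the MySQL connection opened earlier (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Close the connection once all queries are done
dbDisconnect(con)

# A closed connection is no longer valid
still_open <- dbIsValid(con)
```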

Fin

That’s it for this series on working with databases. I hope it was useful. If you have any comments or feedback, please let me know on Twitter.
