Feel like the cat that got the cream with {forcats}

Categorical Variables Factors forcats

Do mi do mi do so mi do,

Every truly cultured tidyverse stud-ent knows,

You must learn your forcats and your d-plyr.

– Adapted from ‘Scales and Arpeggios’ from the Aristocats

Vebash Naidoo https://twitter.com/Sciencificity
2021-01-31
No need to be a scared-y cat, “A lot of good tricks. I will show them to you. Your mother will not mind at all if I do.”
— The Cat In The Hat

Photo by MIKHAIL VASILYEV on Unsplash

Terminology

To set the stage, let’s talk about the types of data. These are concepts more for beginners, so if you’re familiar with these please feel free to skip ahead.

We have two types of data:

Numerical (Quantitative) data

Non-Numerical (Qualitative) data


      Numb, Smells Like Teen Spirit, Sweet Child O' Mine, Smells Like Teen Spirit, Numb,
      Smells Like Teen Spirit, Sweet Child O' Mine, Numb, Smells Like Teen Spirit, Numb,
      Sweet Child O' Mine, Sweet Child O' Mine, Smells Like Teen Spirit
      
      I.e. 5, 2, 1, 2, 5, 2, 1, 5, 2, 5, 1, 1, 2
      

I can’t say Numb is 4 points better, or 4 points worse, than Sweet Child O’ Mine (in other words, it makes no sense to take the Level Number of Numb [5] and subtract the Level Number of Sweet Child O’ Mine [1]).

Neither can I say that Hotel California is the average category chosen (notice it was not chosen once) but if I take the average of the numerical placeholders for the responses received, I would get 2.62 which is 3 if I round up. In other words, we can’t understand what the “average” song is.
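Here is a small sketch of the arithmetic above, using only the thirteen level numbers listed earlier. The object names `codes` and `responses` are mine, purely for illustration. R will happily average the numbers, but the result does not point at any song, whereas counting the responses per level does tell us something.

```r
# the numeric placeholders for the 13 responses listed above
codes <- c(5, 2, 1, 2, 5, 2, 1, 5, 2, 5, 1, 1, 2)

# the "average" placeholder: a number, but not a meaningful song
avg_code <- mean(codes)
round(avg_code, 2)
# 2.62

# counting the responses per level is what makes sense for categories
responses <- factor(codes, levels = 1:5)
level_counts <- table(responses)
level_counts
```

Levels 3 and 4 get a count of 0 here, which is exactly why "level 3 is the average song" is not a useful statement.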

{forcats} 📦

The forcats 📦 is meant to handle factors, which are R’s data type for categorical data. The name says it is for categorical data, and it is also an anagram of factors 🆒.

Most of the functions in the package start with fct_.
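If you want to check the naming convention yourself, one possible way (a sketch, assuming forcats is installed) is to list what the package exports and filter on the prefix. The names `exported` and `fct_functions` are just illustrative.

```r
library(forcats)

# everything the package exports
exported <- getNamespaceExports("forcats")

# the functions that follow the fct_ naming convention
fct_functions <- sort(grep("^fct_", exported, value = TRUE))
head(fct_functions)

# a few helpers, such as as_factor(), do not use the prefix
setdiff(exported, fct_functions)
```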

There is non-numeric data where it is useful to work with the data as factors (age-ranges, occupation, etc.), but we must also keep in mind that some non-numeric data should be kept as character data.

We’re going to work with non-numeric data that may be treated as factors in this post, and learn how to use the forcats 📦 to make that task easier for us.

Data

We’re going to use the data from the awesome TidyTuesday Project ✨:

library(tidyverse)
brewing_materials <-
   read_csv(
    # hacky solution to show readers the full path of file
    # str_glue just pastes the various strings next to each other
    str_glue('https://raw.githubusercontent.com/rfordatascience/',
            'tidytuesday/master/data/2020/2020-03-31/',
            'brewing_materials.csv'))
beer_taxed <- 
    read_csv(
     str_glue('https://raw.githubusercontent.com/rfordatascience/', 
              'tidytuesday/master/data/2020/2020-03-31/',
              'beer_taxed.csv'))

brewer_size <- 
    read_csv(
     str_glue('https://raw.githubusercontent.com/rfordatascience/',
     'tidytuesday/master/data/2020/2020-03-31/',
     'brewer_size.csv'))

beer_states <- 
    read_csv(
     str_glue('https://raw.githubusercontent.com/rfordatascience/', 
     'tidytuesday/master/data/2020/2020-03-31/',
     'beer_states.csv'))

beer_awards <- 
    read_csv(
     str_glue('https://raw.githubusercontent.com/rfordatascience/', 
     'tidytuesday/master/data/2020/2020-10-20/',
     'beer_awards.csv'))

Let’s have a look at the data - here we’re showing a sample from each table.

brewing_materials %>% 
  sample_n(50) %>% 
  DT::datatable(filter = 'top', list(scrollX = TRUE,
                     pageLength = 5),
    caption = htmltools::tags$caption(
      style = 'caption-side: bottom; text-align: center;',
     'Table Name: ', htmltools::em('Brewing Materials')
  ))
beer_taxed %>% 
  head(50) %>% 
  DT::datatable(filter = 'top', list(scrollX = TRUE,
                     pageLength = 5),    
    caption = htmltools::tags$caption(
      style = 'caption-side: bottom; text-align: center;',
     'Table Name: ', htmltools::em('Beer Taxed')
  ))
brewer_size %>% 
  sample_n(50) %>% 
  DT::datatable(filter = 'top', list(scrollX = TRUE,
                     pageLength = 5),
    caption = htmltools::tags$caption(
      style = 'caption-side: bottom; text-align: center;',
     'Table Name: ', htmltools::em('Brewer Size')
  ))
beer_states %>% 
  sample_n(50) %>% 
  DT::datatable(filter = 'top', list(scrollX = TRUE,
                     pageLength = 5),
    caption = htmltools::tags$caption(
      style = 'caption-side: bottom; text-align: center;',
     'Table Name: ', htmltools::em('Beer States')
  ))
beer_awards %>% 
  sample_n(50) %>% 
  DT::datatable(filter = 'top', list(scrollX = TRUE,
                     pageLength = 5),
    caption = htmltools::tags$caption(
      style = 'caption-side: bottom; text-align: center;',
     'Table Name: ', htmltools::em('Beer Awards')
  ))

Convert variable to factor

To convert a variable to a factor we use factor() / as.factor() or forcats::as_factor(). These functions convert each distinct category to an integer placeholder in the background.

Let’s get a better feel for the non-numeric data in the datasets we will be considering here.

brewing_materials

Let’s have a look at the material_type and type fields in the brewing_materials dataset.

brewing_materials %>% 
  count(material_type)
# A tibble: 5 x 2
  material_type                n
  <chr>                    <int>
1 Grain Products             600
2 Non-Grain Products         480
3 Total Grain products       120
4 Total Non-Grain products   120
5 Total Used                 120
brewing_materials %>% 
  count(type)
# A tibble: 12 x 2
   type                           n
   <chr>                      <int>
 1 Barley and barley products   120
 2 Corn and corn products       120
 3 Hops (dry)                   120
 4 Hops (used as extracts)      120
 5 Malt and malt products       120
 6 Other                        120
 7 Rice and rice products       120
 8 Sugar and syrups             120
 9 Total Grain products         120
10 Total Non-Grain products     120
11 Total Used                   120
12 Wheat and wheat products     120
brewing_materials %>% 
  filter(stringr::str_to_lower(material_type) %in% 
           c('grain products',
             'non-grain products')) %>% 
  count(material_type, type)
# A tibble: 9 x 3
  material_type      type                           n
  <chr>              <chr>                      <int>
1 Grain Products     Barley and barley products   120
2 Grain Products     Corn and corn products       120
3 Grain Products     Malt and malt products       120
4 Grain Products     Rice and rice products       120
5 Grain Products     Wheat and wheat products     120
6 Non-Grain Products Hops (dry)                   120
7 Non-Grain Products Hops (used as extracts)      120
8 Non-Grain Products Other                        120
9 Non-Grain Products Sugar and syrups             120

factor() / as.factor()

  1. These are conversion functions to convert a variable to a factor in Base R.

  2. To convert a variable to a factor we may use:

    1. df <- df %>% mutate(var = factor(var))
    2. df <- df %>% mutate(var = as.factor(var))
  3. To see the levels, use levels(). The position of a category in this output is the number placeholder it was given behind the scenes.

  4. The default order of factor() is sorted. According to the help page: “The levels of a factor are by default sorted, but the sort order may well depend on the locale at the time of creation, and should not be assumed to be ASCII.”

    brewing_materials %>% 
       mutate(material_type = factor(material_type)) %>% 
       # use dplyr::pull which acts like $ to get the variable
       pull(material_type) %>% 
       # let us get the number placeholder attached to each category
       levels()
     
    [1] "Grain Products"           "Non-Grain Products"      
    [3] "Total Grain products"     "Total Non-Grain products"
    [5] "Total Used"              
    brewing_materials %>% 
       mutate(material_type = as.factor(material_type)) %>% 
       # can also use count() to count how many in each level
       count(material_type)
     
    # A tibble: 5 x 2
      material_type                n
      <fct>                    <int>
    1 Grain Products             600
    2 Non-Grain Products         480
    3 Total Grain products       120
    4 Total Non-Grain products   120
    5 Total Used                 120

    Notice that the base R functions factor() / as.factor() created the levels in alphabetically sorted order (as determined by my locale). The output of the first code block shows that "Grain Products" was coded as 1, "Total Non-Grain products" as 4, and "Total Used" as the last level, 5.

  5. What if I wanted to specify the levels myself? I could specify the levels in an argument as shown: levels = c("Level1", ..., "LevelN").

    brewing_materials %>% 
       mutate(material_type = factor(material_type,
                     # I want to make a factor but I want the order to be
                     # as follows:
                     levels = c("Grain Products",
                                "Total Grain products",
                                "Non-Grain Products",
                                "Total Non-Grain products",
                                "Total Used"))) %>% 
       pull(material_type) %>% 
       levels()
     
    [1] "Grain Products"           "Total Grain products"    
    [3] "Non-Grain Products"       "Total Non-Grain products"
    [5] "Total Used"              

    My level specification is used to create the levels, so the numeric encoding follows my specification this time. "Total Grain products" is coded as level 2 this time (it was level 3 in the default creation where no levels argument was specified).

  6. What if I wanted to include levels that may exist in future datasets, but don’t as yet in the dataset we have? This is similar to the months example in R for Data Science. Let’s try it with adding a Not Applicable level, which is not in our dataset’s material_type variable.

    brewing_materials %>% 
       mutate(material_type = factor(material_type,
                               levels = c("Grain Products",
                                          "Total Grain products",
                                          "Non-Grain Products",
                                          "Total Non-Grain products",
                                          "Total Used",
                                          "Not Applicable"))) %>% 
       pull(material_type) %>% 
       levels()
     
    [1] "Grain Products"           "Total Grain products"    
    [3] "Non-Grain Products"       "Total Non-Grain products"
    [5] "Total Used"               "Not Applicable"          

    Has it been created?

    brewing_materials %>% 
       mutate(material_type = factor(material_type,
                               levels = c("Grain Products",
                                          "Total Grain products",
                                          "Non-Grain Products",
                                          "Total Non-Grain products",
                                          "Total Used",
                                          "Not Applicable"))) %>%
       # we can also count as before but notice that
       # one category that has no data is missing -
       # the artificial `Not Applicable` we added
       count(material_type)
     
    # A tibble: 5 x 2
      material_type                n
      <fct>                    <int>
    1 Grain Products             600
    2 Total Grain products       120
    3 Non-Grain Products         480
    4 Total Non-Grain products   120
    5 Total Used                 120

    A simple count() does not quite let us know, but if we add an argument .drop = FALSE we get counts for all categories, even those with no observations (i.e. that category has a count of 0). By default the count() function drops categories with 0 counts from the output. By adding .drop = FALSE we’re asking for these to be included.

    brewing_materials %>% 
       mutate(material_type = factor(material_type,
                               levels = c("Grain Products",
                                          "Total Grain products",
                                          "Non-Grain Products",
                                          "Total Non-Grain products",
                                          "Total Used",
                                          "Not Applicable"))) %>%
       # we can get all categories by adding the .drop = FALSE
       count(material_type, .drop = FALSE)
     
    # A tibble: 6 x 2
      material_type                n
      <fct>                    <int>
    1 Grain Products             600
    2 Total Grain products       120
    3 Non-Grain Products         480
    4 Total Non-Grain products   120
    5 Total Used                 120
    6 Not Applicable               0
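To make the "number placeholder" idea from the list above concrete, here is a minimal base R sketch on a made-up vector (the values in `x` are invented for illustration and are not rows from the dataset). as.integer() reveals the code stored for each observation, which is simply the position of its category in levels().

```r
# a toy vector; these values are made up for illustration
x <- c("Total Used", "Grain Products", "Total Used", "Non-Grain Products")

f <- factor(x)

# the categories, in level order (sorted by default)
levels(f)

# the integer code stored for each observation
codes_default <- as.integer(f)
codes_default
# 3 1 3 2

# specifying the levels ourselves changes the codes, not the data
g <- factor(x, levels = c("Total Used", "Non-Grain Products", "Grain Products"))
codes_custom <- as.integer(g)
codes_custom
# 1 3 1 2
```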

forcats::as_factor()

as_factor() behaves differently to as.factor(): for character vectors it creates levels in the order in which they first appear, hence we get the same factor levels across different locales.

Base R’s as.factor()

test_factor_var <-  c("012star", "DogsRule", "!this", "%abc",
            "Abc#", "abc$", "$bb", "AreYouKiddingCatsRule!")

test_factor_var %>% 
  as.factor() %>% 
  print(width = Inf)
[1] 012star                DogsRule              
[3] !this                  %abc                  
[5] Abc#                   abc$                  
[7] $bb                    AreYouKiddingCatsRule!
Levels: !this $bb %abc 012star Abc# abc$ AreYouKiddingCatsRule! DogsRule

Note that the levels (seen in the output Levels: !this $bb ...) follow the sort order of my locale. This may be completely different in your locale.

Contrast with forcats::as_factor()

Now let’s consider as_factor().

# here is the raw variable again
test_factor_var
[1] "012star"                "DogsRule"              
[3] "!this"                  "%abc"                  
[5] "Abc#"                   "abc$"                  
[7] "$bb"                    "AreYouKiddingCatsRule!"
# Now let's make it a factor, this time using
# as_factor()
test_factor_var %>% 
  as_factor() %>% 
  print(width = Inf)
[1] 012star                DogsRule              
[3] !this                  %abc                  
[5] Abc#                   abc$                  
[7] $bb                    AreYouKiddingCatsRule!
Levels: 012star DogsRule !this %abc Abc# abc$ $bb AreYouKiddingCatsRule!

Note that as_factor() kept the order as it appears (Levels: 012star DogsRule ...). This will be the same for you, even if your locale is different.

Convert brewer data

Let’s perform the same conversion we did with Base R functions, but now using forcats::as_factor().
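As a minimal sketch of what that conversion could look like, here is a small stand-in for brewing_materials. The tibble `toy_materials` and the order of its rows are invented, so the level order you get from the real data will depend on which category appears first there.

```r
library(dplyr)
library(forcats)

# a small stand-in for brewing_materials; rows and their order are invented
toy_materials <- tibble(
  material_type = c("Total Used", "Grain Products", "Non-Grain Products",
                    "Grain Products", "Total Used")
)

# forcats: levels follow the order of first appearance
lv_forcats <- toy_materials %>%
  mutate(material_type = as_factor(material_type)) %>%
  pull(material_type) %>%
  levels()
lv_forcats
# "Total Used" "Grain Products" "Non-Grain Products"

# base R: levels are sorted instead
lv_base <- toy_materials %>%
  mutate(material_type = as.factor(material_type)) %>%
  pull(material_type) %>%
  levels()
lv_base
```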

fct_inorder()

We can also explicitly use the fct_inorder() function to reorder the factor levels by first appearance. I add it here just so you’re aware of this option.

For example, the beer_awards$medal column would be made alphabetical in my locale if I use as.factor().

beer_awards %>% 
  head(3)
# A tibble: 3 x 7
  medal  beer_name        brewery      city    state category     year
  <chr>  <chr>            <chr>        <chr>   <chr> <chr>       <dbl>
1 Gold   Volksbier Vienna Wibby Brewi~ Longmo~ CO    American A~  2020
2 Silver Oktoberfest      Founders Br~ Grand ~ MI    American A~  2020
3 Bronze Amber Lager      Skipping Ro~ Staunt~ VA    American A~  2020
beer_awards %>% 
  mutate(medal = as.factor(medal)) %>% 
  pull(medal) %>% 
  levels()
[1] "Bronze" "Gold"   "Silver"

Notice the alphabetical ordering of levels (Bronze, Gold, Silver).

If I follow this with a fct_inorder() the ordering of levels is now using the order of appearance instead.

beer_awards %>% 
  mutate(medal = as.factor(medal) %>% fct_inorder()) %>% 
  pull(medal) %>% 
  levels()
[1] "Gold"   "Silver" "Bronze"

In most cases we’d want this to be Bronze, Silver, Gold in order of increasing award type. We’ll see how to do that just now.

Manually order levels

fct_relevel()
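A minimal sketch of how fct_relevel() could produce the Bronze, Silver, Gold order mentioned earlier. The `medal` vector here is a toy one, not the beer_awards column itself: we name the levels in the order we want, and any levels we leave out keep their existing relative order after the named ones.

```r
library(forcats)

# a toy medal vector; the values are made up for illustration
medal <- as_factor(c("Gold", "Silver", "Bronze", "Gold", "Bronze"))
levels(medal)
# "Gold" "Silver" "Bronze"

# name the levels in the order we want them
medal_ordered <- fct_relevel(medal, "Bronze", "Silver", "Gold")
levels(medal_ordered)
# "Bronze" "Silver" "Gold"

# naming only some levels moves those to the front
medal_partial <- fct_relevel(medal, "Bronze")
levels(medal_partial)
# "Bronze" "Gold" "Silver"
```

Only the level order changes. The observations themselves stay exactly as they were.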

Collapse multiple levels

brewer_size

Let’s have a look at the brewer_size field in the brewer_size dataset.

brewer_size %>% 
  count(brewer_size)
# A tibble: 16 x 2
   brewer_size                        n
   <chr>                          <int>
 1 1 to 1,000 Barrels                11
 2 1,000,000 to 6,000,000 Barrels     1
 3 1,000,001 to 1,999,999 Barrels     9
 4 1,000,001 to 6,000,000 Barrels     1
 5 1,001 to 7,500 Barrels            11
 6 100,001 to 500,000 Barrels        11
 7 15,001 to 30,000 Barrels          11
 8 2,000,000 to 6,000,000 Barrels     9
 9 30,001 to 60,000 Barrels          11
10 500,001 to 1,000,000 Barrels      11
11 6,000,001 Barrels and Over        11
12 60,001 to 100,000 Barrels         11
13 7,501 to 15,000 Barrels           11
14 Total                             11
15 Under 1 Barrel                     6
16 Zero Barrels                       1
brewer_size %>% 
  count(brewer_size, 
        # count's default is to consider the number of rows
        # in each group, we can change it using wt (weight)
        # weight in this example is:
        # the number of brewers in each brewer size category, 
        # so count will sum up `n_of_brewers` for each category of brewer_size
        wt = n_of_brewers)
# A tibble: 16 x 2
   brewer_size                        n
   <chr>                          <dbl>
 1 1 to 1,000 Barrels             27956
 2 1,000,000 to 6,000,000 Barrels     5
 3 1,000,001 to 1,999,999 Barrels    45
 4 1,000,001 to 6,000,000 Barrels     4
 5 1,001 to 7,500 Barrels          8368
 6 100,001 to 500,000 Barrels       439
 7 15,001 to 30,000 Barrels         728
 8 2,000,000 to 6,000,000 Barrels    47
 9 30,001 to 60,000 Barrels         556
10 500,001 to 1,000,000 Barrels      92
11 6,000,001 Barrels and Over       174
12 60,001 to 100,000 Barrels        291
13 7,501 to 15,000 Barrels         1163
14 Total                          41946
15 Under 1 Barrel                  1602
16 Zero Barrels                     476

Notice that the brewer_size variable has a few categories which are slightly different, but which overlap.

brewer_size
1,000,000 to 6,000,000 Barrels
1,000,001 to 6,000,000 Barrels
1,000,001 to 1,999,999 Barrels
2,000,000 to 6,000,000 Barrels

If you look closely it seems as though 1,000,000 to 6,000,000 Barrels may be a typo, since 500,001 to 1,000,000 Barrels is already a category in that year.

It also looks as if 1,000,001 to 6,000,000 Barrels was split into 1,000,001 to 1,999,999 Barrels and 2,000,000 to 6,000,000 Barrels from 2011 onwards.

fct_collapse()

We can consolidate these levels into one level by using fct_collapse().

# what are the current levels in this variable
brewer_size %>% 
  mutate(brewer_size = as_factor(brewer_size)) %>% 
  pull(brewer_size) %>% 
  levels()
 [1] "6,000,001 Barrels and Over"     "1,000,001 to 6,000,000 Barrels"
 [3] "500,001 to 1,000,000 Barrels"   "100,001 to 500,000 Barrels"    
 [5] "60,001 to 100,000 Barrels"      "30,001 to 60,000 Barrels"      
 [7] "15,001 to 30,000 Barrels"       "7,501 to 15,000 Barrels"       
 [9] "1,001 to 7,500 Barrels"         "1 to 1,000 Barrels"            
[11] "Under 1 Barrel"                 "Total"                         
[13] "1,000,000 to 6,000,000 Barrels" "2,000,000 to 6,000,000 Barrels"
[15] "1,000,001 to 1,999,999 Barrels" "Zero Barrels"                  
brewer_size %>% 
  mutate(brewer_size = as_factor(brewer_size)) %>% 
  mutate(brewer_size = fct_collapse(brewer_size,
            # the new category name
            "1,000,000 to 6,000,000 Barrels" = 
            # the current categories that must become
            # the new category
              c("1,000,000 to 6,000,000 Barrels",
                "1,000,001 to 6,000,000 Barrels",
                "1,000,001 to 1,999,999 Barrels",
                "2,000,000 to 6,000,000 Barrels")
        )) %>% 
  pull(brewer_size) %>% 
  levels()
 [1] "6,000,001 Barrels and Over"     "1,000,000 to 6,000,000 Barrels"
 [3] "500,001 to 1,000,000 Barrels"   "100,001 to 500,000 Barrels"    
 [5] "60,001 to 100,000 Barrels"      "30,001 to 60,000 Barrels"      
 [7] "15,001 to 30,000 Barrels"       "7,501 to 15,000 Barrels"       
 [9] "1,001 to 7,500 Barrels"         "1 to 1,000 Barrels"            
[11] "Under 1 Barrel"                 "Total"                         
[13] "Zero Barrels"                  

Notice that our previous 16 levels are now 13.

In this case we’d also want to reorder the levels further by using fct_relevel().

brewer_size %>% 
  mutate(brewer_size = as_factor(brewer_size)) %>% 
  mutate(brewer_size = fct_collapse(brewer_size,
            "1,000,000 to 6,000,000 Barrels" = 
              c("1,000,000 to 6,000,000 Barrels",
                "1,000,001 to 6,000,000 Barrels",
                "1,000,001 to 1,999,999 Barrels",
                "2,000,000 to 6,000,000 Barrels")
        )) %>% 
  mutate(brewer_size = fct_relevel(brewer_size,
        "Zero Barrels",
        "Under 1 Barrel",
        "1 to 1,000 Barrels",
        "1,001 to 7,500 Barrels",
        "7,501 to 15,000 Barrels",
        "15,001 to 30,000 Barrels",
        "30,001 to 60,000 Barrels",
        "60,001 to 100,000 Barrels",
        "100,001 to 500,000 Barrels",
        "500,001 to 1,000,000 Barrels",
        "1,000,000 to 6,000,000 Barrels",
        "6,000,001 Barrels and Over")) %>% 
  pull(brewer_size) %>% 
  levels()
 [1] "Zero Barrels"                   "Under 1 Barrel"                
 [3] "1 to 1,000 Barrels"             "1,001 to 7,500 Barrels"        
 [5] "7,501 to 15,000 Barrels"        "15,001 to 30,000 Barrels"      
 [7] "30,001 to 60,000 Barrels"       "60,001 to 100,000 Barrels"     
 [9] "100,001 to 500,000 Barrels"     "500,001 to 1,000,000 Barrels"  
[11] "1,000,000 to 6,000,000 Barrels" "6,000,001 Barrels and Over"    
[13] "Total"                         

Reduce categories

We saw that fct_collapse() is used to reduce categories. In the above example, there was some order to the levels so the best we can do is collapse levels into fewer categories, i.e. an Other category does not make much sense in the example we used above.

Sometimes you have way too many levels to visualise, or to be useful in considerations, but there isn’t any inherent order in the levels. We will discuss different category reduction strategies for these (i.e. where an Other category is a viable option).

beer_awards

Let’s use the beer_awards dataset for this part.

beer_awards %>% 
  count(brewery, sort = TRUE) %>% 
  DT::datatable()
beer_awards %>% 
  count(category, sort = TRUE) %>% 
  DT::datatable()
beer_awards %>% 
  count(city, sort = TRUE) %>% 
  DT::datatable()

fct_other()

We can also collapse levels by grouping together some levels into Other using fct_other().

In fct_other() we can either:

- specify which categories we want to keep, where all the rest will be bucketed into the Other category.
- specify which categories we want to drop, i.e. which categories we want to be bucketed into the Other category.



  1. Let’s say that we’re only interested in the Pilseners in the category variable.

    We can keep these Pilseners, and combine all others into a Non-Pilseners category using fct_other() with the keep argument.

    beer_awards %>% 
       mutate(category = as_factor(category)) %>% 
       mutate(category = fct_other(category,
                 # which levels do you want to keep?
                 keep = c('German-Style Pilsener',
                 'Bohemian Style Pilsener',
                 'Bohemian-Style Pilsener',
                 'European-Style Pilsener',
                 'American-Style or International-Style Pilsener',
                 'International-Style Pilsener',
                 'European Pilsner',
                 'European Style Pilsener',
                 'American Light Pilsners',
                 'American-Style Lager or American-Style Pilsener',
                 'American-Style Pilsener',
                 'American-Style Pilsener or International-Style Pilsener',
                 'American Pilsener',
                 'American Pilseners',
                 'American Pilsners',
                 'American Premium Dark Pilseners',
                 'Continental Pilsners',
                 'European Classic Pilseners',
                 'German Style Pilsener',
                 'International Pilsener',
                 'Mixed, European Pilsener',
                 'American Premium Dark Pilsners',
                 'American Premium Pilseners',
                 'American Premium Pilsners',
                 'Contemporary American-Style Pilsener',
                 'Pilsener'),
                 # relabel the 'Other' level
                 other_level = "Non-Pilseners"
             )) %>% 
       pull(category) %>% 
       levels()
     
     [1] "American Pilsener"                                      
     [2] "Bohemian-Style Pilsener"                                
     [3] "German-Style Pilsener"                                  
     [4] "International Pilsener"                                 
     [5] "American-Style Lager or American-Style Pilsener"        
     [6] "Contemporary American-Style Pilsener"                   
     [7] "International-Style Pilsener"                           
     [8] "American-Style Pilsener"                                
     [9] "American-Style Pilsener or International-Style Pilsener"
    [10] "American-Style or International-Style Pilsener"         
    [11] "Bohemian Style Pilsener"                                
    [12] "European-Style Pilsener"                                
    [13] "European Style Pilsener"                                
    [14] "German Style Pilsener"                                  
    [15] "Pilsener"                                               
    [16] "European Pilsner"                                       
    [17] "Mixed, European Pilsener"                               
    [18] "European Classic Pilseners"                             
    [19] "American Light Pilsners"                                
    [20] "American Pilsners"                                      
    [21] "American Premium Dark Pilsners"                         
    [22] "American Premium Pilsners"                              
    [23] "American Pilseners"                                     
    [24] "American Premium Dark Pilseners"                        
    [25] "American Premium Pilseners"                             
    [26] "Continental Pilsners"                                   
    [27] "Non-Pilseners"                                          

    We have successfully kept all Pilseners, and all others have been grouped into the “Non-Pilseners” category.

  2. What if we want to keep everything else, and group all Pilseners into a separate category? We can instead use the drop argument.

        beer_awards %>% 
           mutate(category = as_factor(category)) %>% 
           mutate(category = fct_other(category,
                     # which levels do you want to drop?
                     drop = c('German-Style Pilsener',
                     'Bohemian Style Pilsener',
                     'Bohemian-Style Pilsener',
                     'European-Style Pilsener',
                     'American-Style or International-Style Pilsener',
                     'International-Style Pilsener',
                     'European Pilsner',
                     'European Style Pilsener',
                     'American Light Pilsners',
                     'American-Style Lager or American-Style Pilsener',
                     'American-Style Pilsener',
                     'American-Style Pilsener or International-Style Pilsener',
                     'American Pilsener',
                     'American Pilseners',
                     'American Pilsners',
                     'American Premium Dark Pilseners',
                     'Continental Pilsners',
                     'European Classic Pilseners',
                     'German Style Pilsener',
                     'International Pilsener',
                     'Mixed, European Pilsener',
                     'American Premium Dark Pilsners',
                     'American Premium Pilseners',
                     'American Premium Pilsners',
                     'Contemporary American-Style Pilsener',
                     'Pilsener'),
                     # relabel the 'Other' level
                     other_level = "Pilseners"
                 )) %>% 
           pull(category) %>% 
           levels() %>% 
           as_tibble() %>% 
           DT::datatable()
     

    There are no individual “Pilsener” observations now. Instead, all Pilseners have been grouped into the “Other” category, which we renamed to “Pilseners”.
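The difference between keep and drop is easier to see side by side on a tiny vector. This is a sketch with made-up category names ("Amber Lager" and "Stout" are just placeholders here), not the real category column.

```r
library(forcats)

# a toy vector of beer categories; the values are made up for illustration
beers <- as_factor(c("German-Style Pilsener", "Amber Lager", "Pilsener",
                     "Stout", "Amber Lager"))

# keep: the named levels survive, everything else is bucketed together
kept <- fct_other(beers,
                  keep = c("German-Style Pilsener", "Pilsener"),
                  other_level = "Non-Pilseners")
levels(kept)
# "German-Style Pilsener" "Pilsener" "Non-Pilseners"

# drop: the named levels are bucketed together, everything else survives
dropped <- fct_other(beers,
                     drop = c("German-Style Pilsener", "Pilsener"),
                     other_level = "Pilseners")
levels(dropped)
# "Amber Lager" "Stout" "Pilseners"
```

In both cases the bucketed level is placed last in the level order.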

fct_lump()

In the beer_awards$city variable we have 803 cities. Say we’re only interested in the top 10 cities represented in the awards.
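One way this could be done is with fct_lump() and its n argument, the same call the plotting code later in this post uses for breweries. Below is a minimal sketch on a made-up vector of cities (the city names and counts are invented), keeping the top 2 rather than the top 10 so the result is easy to read.

```r
library(forcats)

# a toy vector of cities; names and counts are made up for illustration
city <- c("Denver", "Denver", "Denver", "Portland", "Portland",
          "Austin", "Boise")

# keep the 2 most frequent cities, lump the rest into "Other"
top_cities <- fct_lump(city, n = 2)
levels(top_cities)
# "Denver" "Portland" "Other"

city_counts <- table(top_cities)
city_counts
```

For the real data the call would presumably look like fct_lump(city, n = 10) inside a mutate().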

Re-labeling levels

Sometimes your factor categories will have long names. You may want to shorten these for graphs etc.

brewing_materials$type

brewing_materials %>% 
  count(type, sort = TRUE)
# A tibble: 12 x 2
   type                           n
   <chr>                      <int>
 1 Barley and barley products   120
 2 Corn and corn products       120
 3 Hops (dry)                   120
 4 Hops (used as extracts)      120
 5 Malt and malt products       120
 6 Other                        120
 7 Rice and rice products       120
 8 Sugar and syrups             120
 9 Total Grain products         120
10 Total Non-Grain products     120
11 Total Used                   120
12 Wheat and wheat products     120

fct_recode()

The brewing_materials$type variable has some long names. Let’s use fct_recode() to rename these.

brewing_materials %>% 
  mutate(type = as_factor(type)) %>% 
  mutate(type = fct_recode(type,
          # "new_name" = "old_name"                 
          "Barley"     = "Barley and barley products",
          "Malt"       = "Malt and malt products",
          "Rice"       = "Rice and rice products",
          "Corn"       = "Corn and corn products",
          "Wheat"      = "Wheat and wheat products",
          # Notice here I am kinda doing the equivalent of fct_collapse()
          # by assigning 2 categories to new "Hops" category
          "Hops"       = "Hops (dry)",
          "Hops"       = "Hops (used as extracts)"
          )) %>% 
  count(type, sort = TRUE)
# A tibble: 11 x 2
   type                         n
   <fct>                    <int>
 1 Hops                       240
 2 Malt                       120
 3 Corn                       120
 4 Rice                       120
 5 Barley                     120
 6 Wheat                      120
 7 Total Grain products       120
 8 Sugar and syrups           120
 9 Other                      120
10 Total Non-Grain products   120
11 Total Used                 120

Reorder levels

When visualising data we often want to reorder the levels of our factors. We can use fct_reorder(), fct_reorder2(), fct_infreq() and fct_rev() for ordering our factors for visuals.

fct_infreq() and fct_rev()

fct_infreq() orders by the frequency of each category.

Let’s say we were interested in which breweries did the best over time in terms of winning gold medals.

theme_set(theme_light())

beer_awards %>% 
  # I only care about Gold medals
  filter(medal == "Gold") %>% 
  # I am only interested in top 10 most successful breweries
  mutate(brewery = fct_lump(brewery, n = 10,
                            # if there are ties in 10th and 11th position, keep the first one
                            ties.method = "first")) %>% 
  # add a count variable which will be named `n`
  add_count(brewery, medal) %>% 
  # remove the Other category since it overwhelms the plot
  filter(brewery != "Other") %>% 
  # order the brewery by the frequency in each brewery
  ggplot(aes(y = fct_infreq(brewery))) +
  geom_bar(fill = "#f4e4e7") +
  geom_text(aes(label = as.character(n),
                x = n + .06), hjust = "center",
            colour = "#5196b4") +
  labs(x = "",
       y = "",
       title = "Breweries that won the most gold medals over time") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

fct_infreq() orders the levels from most frequent to least frequent. Because ggplot draws the first level at the bottom of the y-axis and works upwards, the most frequent brewery appears at the bottom of the plot and the least frequent at the top.

To see what fct_infreq() does let’s look at the levels.

beer_awards %>% 
  filter(medal == "Gold") %>% 
  mutate(brewery = fct_lump(brewery, n = 10,
                            ties.method = "first")) %>%
  filter(brewery != "Other") %>%
  mutate(brewery = fct_infreq(brewery)) %>% 
  pull(brewery) %>% 
  levels()
 [1] "Pabst Brewing Co."               
 [2] "Firestone Walker Brewing Co."    
 [3] "Boston Beer Co."                 
 [4] "Miller Brewing Co."              
 [5] "Anheuser-Busch, Inc"             
 [6] "New Belgium Brewing Co."         
 [7] "Alaskan Brewing and Bottling Co."
 [8] "Marin Brewing Co."               
 [9] "Pelican Pub & Brewery"           
[10] "New Glarus Brewing Co."          
[11] "Other"                           
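Notice that "Other" is still listed as a level even though we filtered those rows out: filter() removes rows, not levels. If the unused level gets in the way, fct_drop() removes it. A small sketch on made-up data (f and kept are hypothetical names, and the subsetting only mimics the filter() step above):

```r
library(forcats)

# Made-up factor with explicit levels so the order is predictable
f <- factor(c("a", "b", "Other", "a"), levels = c("a", "b", "Other"))

# Mimics filter(brewery != "Other"): the rows go, the level stays
kept <- f[f != "Other"]
levels(kept)            # "a" "b" "Other"

# fct_drop() removes levels that no longer occur in the data
levels(fct_drop(kept))  # "a" "b"
```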

It is sometimes better to see bar plots in descending order on the visual (i.e. we want to see the most frequent at top of plot, and least frequent at the bottom). This can be accomplished by combining fct_rev() with fct_infreq().

beer_awards %>% 
  filter(medal == "Gold") %>% 
  mutate(brewery = fct_lump(brewery, n = 10,
                            ties.method = "first")) %>% 
  add_count(brewery, medal) %>% 
  filter(brewery != "Other") %>%
  # show in decreasing order of winners on plot
  ggplot(aes(y = fct_rev(fct_infreq(brewery)))) +
  geom_bar(fill = "#f4e4e7") +
  geom_text(aes(label = as.character(n),
                x = n + .06), hjust = 1.2,
            colour = "#5196b4",
            size = 4.5,
            position = position_dodge(width = 1)) +
  labs(x = "",
       y = "",
       title = "Breweries that won the most gold medals over time") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
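To see the effect of the combination on the levels alone, here is a minimal sketch with made-up data (x is illustrative only):

```r
library(forcats)

# Made-up data: "b" appears 3 times, "a" twice, "c" once
x <- factor(c("b", "a", "b", "c", "b", "a"))

by_freq  <- fct_infreq(x)     # most frequent first
reversed <- fct_rev(by_freq)  # flip that order

levels(by_freq)   # "b" "a" "c"
levels(reversed)  # "c" "a" "b"
```

Since ggplot draws the first level at the bottom of the y-axis, putting the least frequent level first is what moves the most frequent one to the top of the plot.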

The same approach works across all medal categories, with one panel per medal:

beer_awards %>% 
  mutate(brewery = fct_lump(brewery, n = 10)) %>% 
  add_count(brewery, medal) %>% 
  filter(brewery != "Other") %>% 
  ggplot(aes(y = fct_rev(fct_infreq(brewery)))) +
  geom_bar(fill = "#f4e4e7") +
  geom_text(aes(label = as.character(n),
                x = n + .06), hjust = 1.2,
            colour = "#5196b4",
            size = 4.2,
            position = position_dodge(width = 1)) +
  facet_wrap(~ fct_inorder(medal)) +
  labs(x = "",
       y = "",
       title = "Most successful breweries over time",
       subtitle = "In all medal categories") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

fct_reorder()

On occasion you may want to reorder to make your visuals easier to read. You can do this using fct_reorder().

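fct_reorder(var, some_other_var, some_func) where:

var is the factor whose levels we want to reorder,

some_other_var is the variable whose values determine the new order, and

some_func is the function applied to the values of some_other_var within each level of var (the median is used if none is given).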

Here we want to plot the top 10 breweries, ordered from most successful to least successful.

beer_awards %>% 
  # keep top 10 breweries, lump all the rest into Other
  mutate(brewery = fct_lump(brewery, n = 10)) %>% 
  add_count(brewery, medal) %>% 
  select(brewery, medal, n) %>% 
  distinct() %>% 
  # remove the "Other" category because it makes it hard
  # to see all the rest
  filter(brewery != "Other") %>% 
  # reorder the breweries by their total number of medals
  # so that the brewery with the most medals is at the top
  mutate(brewery = fct_reorder(brewery, n, sum)) %>% 
  # put the medals in the order Bronze, Silver, Gold
  mutate(medal = fct_relevel(medal,
                             "Bronze",
                             "Silver",
                             "Gold")) %>% 
  ggplot(aes(n, brewery, 
             fill = medal)) +
  geom_col() +
  geom_text(aes(label = n, fontface = "bold"), 
            position = position_stack(vjust = 0.5),
            colour = "white") +
  scale_fill_viridis_d() +
  labs(x = "",
       y = "",
       title = "Breweries with the most medals",
       fill = "Medal") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))
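To see what fct_reorder() does to the levels on their own, here is a minimal sketch with made-up counts (the breweries "A", "B", "C" and their n values are hypothetical):

```r
library(forcats)

# Made-up data: two rows per brewery, n = medals in some category
brewery <- factor(c("A", "A", "B", "B", "C", "C"))
n       <- c(1, 2, 10, 20, 3, 4)

# Sum of n per brewery: A = 3, B = 30, C = 7
by_total <- fct_reorder(brewery, n, sum)
levels(by_total)       # ascending by sum:  "A" "C" "B"

# .desc = TRUE flips the direction
by_total_desc <- fct_reorder(brewery, n, sum, .desc = TRUE)
levels(by_total_desc)  # descending by sum: "B" "C" "A"
```

The default ascending order is what we want on a y-axis, since ggplot draws the first level at the bottom and so the brewery with the largest total ends up at the top.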

Code

You will find the RMarkdown replica of this post here.

Session Info

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_South Africa.1252 
[2] LC_CTYPE=English_South Africa.1252   
[3] LC_MONETARY=English_South Africa.1252
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_South Africa.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
 [5] readr_2.0.1     tidyr_1.1.3     tibble_3.1.5    ggplot2_3.3.5  
 [9] tidyverse_1.3.1 formatR_1.11   

loaded via a namespace (and not attached):
 [1] httr_1.4.2        sass_0.4.0        viridisLite_0.4.0
 [4] bit64_4.0.5       vroom_1.5.5       jsonlite_1.7.2   
 [7] modelr_0.1.8      bslib_0.3.0       assertthat_0.2.1 
[10] highr_0.9         cellranger_1.1.0  yaml_2.2.1       
[13] pillar_1.6.4      backports_1.2.1   glue_1.4.2       
[16] digest_0.6.28     rvest_1.0.2       colorspace_2.0-2 
[19] htmltools_0.5.2   pkgconfig_2.0.3   broom_0.7.9      
[22] haven_2.4.3       scales_1.1.1      distill_1.3      
[25] tzdb_0.1.2        downlit_0.2.1     generics_0.1.1   
[28] farver_2.1.0      ellipsis_0.3.2    DT_0.19          
[31] withr_2.4.2       cli_3.0.1         magrittr_2.0.1   
[34] crayon_1.4.2      readxl_1.3.1      evaluate_0.14    
[37] fs_1.5.0          fansi_0.5.0       xml2_1.3.2       
[40] tools_4.1.1       hms_1.1.0         lifecycle_1.0.1  
[43] emoji_0.2.0       munsell_0.5.0     reprex_2.0.1     
[46] compiler_4.1.1    jquerylib_0.1.4   rlang_0.4.12     
[49] grid_4.1.1        rstudioapi_0.13   htmlwidgets_1.5.4
[52] crosstalk_1.1.1   labeling_0.4.2    rmarkdown_2.11   
[55] gtable_0.3.0      codetools_0.2-18  DBI_1.1.1        
[58] curl_4.3.2        R6_2.5.1          lubridate_1.7.10 
[61] knitr_1.36        fastmap_1.1.0     bit_4.0.4        
[64] utf8_1.2.2        stringi_1.7.5     parallel_4.1.1   
[67] Rcpp_1.0.7        vctrs_0.3.8       dbplyr_2.1.1     
[70] tidyselect_1.1.1  xfun_0.27        

Further Resources

  1. Chapter on Factors in R for Data Science.
  2. {forcats} Tidyverse Page.
  3. Vignette on {forcats} by Emily Robinson.
  4. David Robinson’s TidyTuesday screencasts often include much factor wrangling.

Citation

For attribution, please cite this work as

Naidoo (2021, Jan. 31). Sciencificity's Blog: Feel like the cat that got the cream with {forcats}. Retrieved from https://sciencificity-blog.netlify.app/posts/2021-01-30-control-your-factors-with-forcats/

BibTeX citation

@misc{naidoo2021feel,
  author = {Naidoo, Vebash},
  title = {Sciencificity's Blog: Feel like the cat that got the cream with {forcats}},
  url = {https://sciencificity-blog.netlify.app/posts/2021-01-30-control-your-factors-with-forcats/},
  year = {2021}
}