Chapter 10 dplyr - data manipulation

While base R has many tools that can do the job, dplyr and other Tidyverse packages can easily work together and allow the easy creation of pipelines to accomplish a task as was demonstration in the previous chapter 8.3.1.

In this chapter we’ll explore a few dplyr verbs

  • select to choose columns
  • filter to check data on rows
  • arrange to order data
  • mutate to compute new values
  • summarise to create condensed data
  • group_by to select specific data

In addition we’ll learn about some conditional selection within the functions described by these verbs.

We’ll use the Master4 large data file that was created previously. (See all download and merge code in Appendix D if you need to recreated it.) Master4 is a merge, in order of the following datasets (links are to the documentation):

DEMO_I, BMI_I, PFAS_I, TCHOL_I, ALB_CR_I

10.1 selecting columns

To limit the output length most examples will be piped in the head() function to only print the column headers followed by 6 data lines. %>% head() may terminate the command but the pipeline can be extended further.

The selection of columns is easy and just requires the name of the data frame containing the data and specifying the column names that we want to keep for further use in an analysis. The total length (number of rows) would be printed, here limited by piping into head().

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% head()
   SEQN LBXMFOS URXUCR BMXBMI
1 83732      NA     41   27.8
2 83733      NA    181   30.8
3 83734      NA     70   28.8
4 83735      NA    102   42.4
5 83736     0.6    315   20.3
6 83737      NA     64   28.6

We can note that in column LBXMFOS there are 5 NA, in the next section we’ll see how to get rid of them.

Note that the square bracket subsetting also works, for example to select the first 7 columns we could write select(Master4, 1:7) which would be equivallent to Master4[, 1:7] in this case.

10.2 Filtering rows

The filtering of rows depends on the desired outcome. One step that is useful and often necessary is to remove the NA values. If we continue with the selected columns we can remove those with:

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% head() %>%
  filter(!is.na(LBXMFOS))
   SEQN LBXMFOS URXUCR BMXBMI
1 83736     0.6    315   20.3

Since the head() function was first, we are only filtering the first 6 rows, and 5 are therefore eliminated. If we move the head() function after, we’ll select the first 6 rows that do not have any NA.

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% 
  filter(!is.na(LBXMFOS)) %>%
  head() 
   SEQN LBXMFOS URXUCR BMXBMI
1 83736     0.6    315   20.3
2 83745     0.8    178   25.0
3 83750     1.9     81   24.1
4 83754     5.4    148   43.7
5 83762     0.4    317   38.0
6 83767     1.0     65   26.3

We can use other operator (Appendix B) to filter the rows with conditional statements. For example we could ask to keep the BMI values below or equal to 25.0 and more, and at the same time removing NA values from chosen columns:

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% 
  filter(BMXBMI <= 25.0) %>%
  head() 
   SEQN LBXMFOS URXUCR BMXBMI
1 83736     0.6    315   20.3
2 83738      NA    100   18.1
3 83739      NA     25   15.7
4 83745     0.8    178   25.0
5 83746      NA     34   16.1
6 83748      NA     14   16.1

We could have the filter() function more than once within the pipeline, but we can also have multiple statements at the same time:

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% 
  filter(!is.na(BMXBMI),
         BMXBMI <= 25.0,
         !is.na(LBXMFOS),
         URXUCR == 25.0) %>%
  head() 
   SEQN LBXMFOS URXUCR BMXBMI
1 85253     2.8     25   24.2
2 86086     1.1     25   20.3
3 88829     0.5     25   20.1
4 89326     0.7     25   22.9
5 89399     1.1     25   24.0
6 90178     0.6     25   20.4

Note on style: writing each conditional statement on a separate line makes it easier to read and understand the code.

The dyplr function drop_na() would also remove NA from all rows, or specified rows.

10.3 Arrange data

For dplyr to arrange data is to sort data in order. The following example accomplishes many things at the same time and the “seed” makes the results always the same. To see just 6 lines of ordered data based on age, the data is first randomly sampled to keep only 100 rows with a dyplr function sample_n(). All rows in the age column RIDAGEYR without value are removed with drop_na(). Four columns are selected, and the age is filtered to keep only ages above 12. The data is then arranged (ordered) by the first (age) and then second column (creatinine) as designated. The data is first ordered by age and then by the second parameter.

set.seed(18) ; 
Master4 %>%
sample_n(100) %>% 
drop_na(RIDAGEYR)  %>%   
select(SEQN, URXUCR, BMXBMI, RIDAGEYR) %>% 
filter(RIDAGEYR >= 12) %>%
arrange(RIDAGEYR, URXUCR) %>%
  head() 
   SEQN URXUCR BMXBMI RIDAGEYR
1 89664    114   16.1       12
2 91777    148   20.3       13
3 90183    330   26.2       13
4 92077    116   17.8       14
5 84253    144   20.6       14
6 92721    161   25.2       14

Note that the order in the pipe may be important, but the modularity of the code makes it easy to try if there is an effect. For example, since we are only looking at 6 values here, it makes no difference if the sampling of 100 rows occurs before or after the dropping of NA in the age column.

10.4 mutating data

As we explored in the previous demonstration (8.3.1) we can create a new column when computing data with the mutate() function. As an example we can compute a ratio of \(\frac{height}{weight}\) as it is often a useful measure. In the BMX_I data we can find the relevant column names:

  • BMXWT - Weight (kg)
  • BMXHT - Standing Height (cm)
  • BMXBMI - Body Mass Index (kg/m**2)

We can then make a selection of columns and compute the ratio which will be saved in a new column with the name that we define for example HWRATIO. As an example the data is then rounded to just 1 decimal point as most of the rest of the data:

Master4 %>%
select(SEQN, BMXWT, BMXHT, BMXBMI) %>%
mutate(HWRATIO = BMXHT / BMXWT,
       HWRATIO2 = round(HWRATIO, digit=1)) %>% 
head()
   SEQN BMXWT BMXHT BMXBMI  HWRATIO HWRATIO2
1 83732  94.8 184.5   27.8 1.946203      1.9
2 83733  90.4 171.4   30.8 1.896018      1.9
3 83734  83.4 170.1   28.8 2.039568      2.0
4 83735 109.8 160.9   42.4 1.465392      1.5
5 83736  55.2 164.9   20.3 2.987319      3.0
6 83737  64.4 150.0   28.6 2.329193      2.3

10.4.1 mutate with conditional statement

When analyzing tabular data we may have the choice will be made for each row. Here is an example found online: Creating New Variables in R with mutate() and ifelse()32.

# Create 3 vectors
section <- c("MATH111", "MATH111", "ENG111")
grade <- c(78, 93, 56)
student <- c("David", "Kristina", "Mycroft")
# Combine vectors into a dataframe
gradebook <- data.frame(section, grade, student)
# Test of grade level and assign an outcome
mutate(gradebook, Pass.Fail = ifelse(grade > 60, "Pass", "Fail"))
  section grade  student Pass.Fail
1 MATH111    78    David      Pass
2 MATH111    93 Kristina      Pass
3  ENG111    56  Mycroft      Fail
# Results are printed out

By “nesting” multiple ifelse() statements we can manage to provide multiple choices: the second argument is replaced by another ifelse() statements in succession. The final iteration has a single option that is not an ifelse() statements. In the example below, from the same source, the gradebook is read one row at a time, and then we need to provide a grade based on the letter system: A through D or F.

mutate(gradebook, 
       letter = 
          ifelse(grade %in% 60:69, "D",
                 ifelse(grade %in% 70:79, "C",
                        ifelse(grade %in% 80:89, "B",
                               ifelse(grade %in% 90:99, "A", 
                                      "F")))))
  section grade  student letter
1 MATH111    78    David      C
2 MATH111    93 Kristina      A
3  ENG111    56  Mycroft      F

This makes use of the %in% operator that can easily be understood as being able to test if a value is within the proposed range. On the first line we ask if the test of column grade for each row is between 60 and 69. If the answer is “yes” which means the test is TRUE, then the value will be “D”. If the test is FALSE we’ll go to the next question and so on until the last line where the final choice is F. (Note that for readability the code is indented, and “F” is the alternate option for the test grade %in% 90:99.)

We’ll use this statement is a very useful way in a future section (10.7.)

10.5 Summarising and grouping data

The following data can be found in the DEMO_I demographic data file:

  • RIAGENDR: 1-male, 2-female (none missing)
  • DMDMARTL: 1 Married, 2 Widowed, etc.
  • DMDEDUC2 - Education level - Adults 20+
  • INDHHIN2 - Annual household income

We want to get some information by group, in this case we’ll group by marital status and then count the total number of observations for each case and we’ll store this in column Counts. For each marital status code, we’ll then count how many women and men are in each category that we’ll report in columns Mnum and Wnum (the sum of these 2 on each line should be equal to the reported Counts column.)

For simplicity with income and education we’ll simply compute the mean of the codes, which should still give us an indication of the level of income and education for each category of marital status.

We’ll place the results in a user-defined R object so that we can reuse it in the next section without re-writing the complete pipeline. Let’s call it Xsum for example

Xsum <- Master4 %>%
select(SEQN, RIAGENDR, DMDMARTL, DMDEDUC2, INDHHIN2) %>%
drop_na() %>%
group_by(DMDMARTL) %>%
summarise(Counts = n() ,
         Mnum = sum(RIAGENDR == 1),
         Wnum = sum(RIAGENDR == 2),
         MeanIncCode =  mean(INDHHIN2),
         MeanEducCode =  mean(DMDEDUC2))

Xsum # print output
# A tibble: 8 × 6
  DMDMARTL Counts  Mnum  Wnum MeanIncCode MeanEducCode
     <dbl>  <int> <int> <int>       <dbl>        <dbl>
1        1   2792  1481  1311       12.5          3.52
2        2    401   106   295       11.1          3.00
3        3    597   239   358       10.1          3.49
4        4    184    66   118        9.48         2.80
5        5    999   475   524       11.6          3.60
6        6    533   270   263       10.7          3.25
7       77      2     1     1       42            4   
8       99      1     0     1       99            9   

10.6 Recoding: string replacement

The above results are OK but it would be nice to be able to change some of the code numbers to actual English word such as “married” or “single” in text. This can be accomplished in many ways in dplyr but one of the simplest is to use recode() as a special case within mutate() to overwrite a column.

The DMDMARTL codes from NHANES DEMO_I
Code or Value Value Description Count Cumulative
1 Married 2886 2886
2 Widowed 421 3307
3 Divorced 614 3921
4 Separated 192 4113
5 Never married 1048 5161
6 Living with partner 555 5716
77 Refused 2 5718
99 Don’t Know 1 5719
. Missing 4252 9971

We can then recode the DMDMARTL column of Xsum with the following pipeline:

Xsum2 <- Xsum %>% mutate(DMDMARTL = recode(DMDMARTL, 
                                  `1` = "Married",
                                  `2` = "Widowed",
                                  `3` = "Divorced",
                                  `4` = "Separated",
                                  `5` = "Nevermarried",
                                  `6` = "Living with partner",
                                  `77` = "Refused",
                                  `99` = "Don' Know")
                         )
Xsum2 # print out
# A tibble: 8 × 6
  DMDMARTL            Counts  Mnum  Wnum MeanIncCode MeanEducCode
  <chr>                <int> <int> <int>       <dbl>        <dbl>
1 Married               2792  1481  1311       12.5          3.52
2 Widowed                401   106   295       11.1          3.00
3 Divorced               597   239   358       10.1          3.49
4 Separated              184    66   118        9.48         2.80
5 Nevermarried           999   475   524       11.6          3.60
6 Living with partner    533   270   263       10.7          3.25
7 Refused                  2     1     1       42            4   
8 Don' Know                1     0     1       99            9   

(Credit: this example inspired by How to Recode a Column with dplyr in R33)

In fact 2 of the columns are not well named and we should rename Mnum and Wnum to simply Men and Women in the table. This is done with the dplyr rename() function:

Xsum2 %>% rename(Men = Mnum, Women = Wnum)
# A tibble: 8 × 6
  DMDMARTL            Counts   Men Women MeanIncCode MeanEducCode
  <chr>                <int> <int> <int>       <dbl>        <dbl>
1 Married               2792  1481  1311       12.5          3.52
2 Widowed                401   106   295       11.1          3.00
3 Divorced               597   239   358       10.1          3.49
4 Separated              184    66   118        9.48         2.80
5 Nevermarried           999   475   524       11.6          3.60
6 Living with partner    533   270   263       10.7          3.25
7 Refused                  2     1     1       42            4   
8 Don' Know                1     0     1       99            9   

10.7 Getting it all together

We now know enough that we should be able to get it all together in an annotated pipeline starting with Master4.

10.7.1 Example 1: by gender

Let’s try to answer the question: “What is the average level of total cholesterol in men and women from the Master4 dataset.?

To answer the question we’ll need the following:

  1. create an object to contain the results and redirect to it
  2. Start with Master4 (inject data in the pipeline)
  3. Select relevant columns
  4. Remove NA rows
  5. Group by gender
  6. Summarize.

We now know how to do this in a data stream with a pipeline. Let’s write it and add comment lines so we can remember later the purpose of the code:

TcholGender <- Master4 %>%
# select columns
select(SEQN, RIAGENDR, LBXTC ) %>%
# fitler all rows to remove NAs
drop_na() %>% 
# Group by gender
group_by(RIAGENDR) %>%
summarise(
         Men = sum(RIAGENDR == 1),
         Women = sum(RIAGENDR == 2),
         MeanTChol = mean(LBXTC)) 

# Print results:
TcholGender
# A tibble: 2 × 4
  RIAGENDR   Men Women MeanTChol
     <dbl> <int> <int>     <dbl>
1        1  3545     0      178.
2        2     0  3711      183.

The final table will be short but that is exactly what a summary should be.

The value for men is 177.75 based on a total of 3545 observations.

For women the we see is 182.65 based on a total of 3711 observations.

(See section 13.2.4 later to learn how these numbers were embedded within the text automatically without copy/paste!)

10.7.2 Example 2: by gender and age

Let’s make things a bit more interesting with the question:

Base on gender and age group, what is the mean and standard deviation of PFAS compounds, as well as the mean, standard deviation, minimum, and maximum values of total cholesterol in men and women?

One of the key word is “age group” as in the NHANES data age is a single integer number with a range from 1 to 80 in RIDAGEYR column and therefore it is “up to us” to create the age groups! This can be done rather easily within a combination of mutate() and with nested ifelse() statements within. Nesting the ifelse() function within itself allows making multiple choices. (For a refresher on ifelse() see section 10.4.1.)

We’ll have to use the appropriate columns of data, compute the creatinine adjustment and summarize these values by age group for men and women.

A new base R function is introduced here formatC() which will force the numbers to be printed in scientific exponent for better clarity. (Try without and see the difference! - this option was suggested on Stack Overflow.)

The pipeline below is annotated to specify the function of each line and the results are saved within a user-defined R object Example2.

Example2 <- Master4 %>%
# select columns
select(SEQN, RIAGENDR, RIDAGEYR, LBXMFOS, URXUCR, LBXTC ) %>%
# fitler all rows to remove NAs
drop_na() %>% 
# Creatinine adjustment
mutate (RATIO = (LBXMFOS/URXUCR)*10^-4) %>%
# categorize ages in  5 groups: 
#  Children: G0TO18, younger adults: G19TO35, 
# and older adults: G36TO65, seniors: G66TO79, 
# and 80 and older: G80.
mutate(AGEGROUP = ifelse(RIDAGEYR %in% 0:18, "G0TO18",
                ifelse(RIDAGEYR %in% 19:35, "G19TO35",
                        ifelse(RIDAGEYR %in% 36:65, "G36TO65",
                               ifelse(RIDAGEYR %in% 66:79, 
                                      "G66TO79", "G80"))))) %>%
#
# NOTE: this would be a place to split the pipeline in 2 sections
#
# Group by gender first, age second
group_by(RIAGENDR, AGEGROUP) %>%
# Summarize: count totals, 
summarise(
         Men = sum(RIAGENDR == 1),
         Women = sum(RIAGENDR == 2),
         MeanPFASR =  formatC(mean(RATIO), format = "e", digits = 2),
         sdPFASR =  formatC(sd(RATIO), format = "e", digits = 2),
         MeanTChol = mean(LBXTC),
         sdTChol =  sd(LBXTC),
         minTChol =  min(LBXTC),
         maxTChol =  max(LBXTC)
         ) 
`summarise()` has grouped output by 'RIAGENDR'. You can override using the
`.groups` argument.
# Print out the results
Example2
# A tibble: 10 × 10
# Groups:   RIAGENDR [2]
   RIAGENDR AGEGROUP   Men Women MeanPFASR sdPFASR  MeanTChol sdTChol minTChol
      <dbl> <chr>    <int> <int> <chr>     <chr>        <dbl>   <dbl>    <dbl>
 1        1 G0TO18     175     0 8.47e-07  8.68e-07      152.    28.6       94
 2        1 G19TO35    235     0 1.68e-06  1.88e-06      180.    35.6       98
 3        1 G36TO65    382     0 3.24e-06  5.99e-06      198.    46.7      111
 4        1 G66TO79    117     0 3.95e-06  3.92e-06      169.    35.1       90
 5        1 G80         44     0 5.12e-06  6.42e-06      165.    35.0      101
 6        2 G0TO18       0   141 7.70e-07  7.88e-07      158.    29.5      103
 7        2 G19TO35      0   238 9.68e-07  1.31e-06      176.    38.5      100
 8        2 G36TO65      0   459 2.35e-06  3.80e-06      201.    36.3      106
 9        2 G66TO79      0   124 4.58e-06  6.32e-06      202.    48.5       84
10        2 G80          0    53 5.86e-06  6.29e-06      187.    34.1      126
# … with 1 more variable: maxTChol <dbl>

The print-out might be truncated (depending on the format of this document.)

As expected the output is a tibble. A subtle but important note is in the second line of the output: # Groups: RIAGENDR [2]. It may be important at a later stage to use the dplyr function ungroup() to modify e.g. the column name or the values within that column.

The highest cholesterol value is 545 and is for a person of gender code 1 in age group G36TO65.

(See section 13.2.4 later to learn how these numbers were embedded within the text automatically without copy/paste!)

10.7.2.1 Sorting / arranging

The Example2 is a summary table (in tibble / data frame format) and can be further sorted with the arrange() function. For example to see a table with the cholesterol levels:

Example2 %>%
select(RIAGENDR, AGEGROUP, maxTChol) %>%
arrange(-maxTChol) %>%
head(2)
# A tibble: 2 × 3
# Groups:   RIAGENDR [2]
  RIAGENDR AGEGROUP maxTChol
     <dbl> <chr>       <dbl>
1        1 G36TO65       545
2        2 G66TO79       358

10.7.3 Base R Bar plot

This summary table can be used with the base R function barplot() to create a representation of the mean cholesterol data. The default plot yields just gray bars and not horizontal label. Here are examples plotted together on a single graph that show the effect of some options.

As a reminder:

  • las=2 is rotating the labels (seen in section 6.3.)
  • col = RIAGENDR + 1 selects two of the nine colors in the base R color palette (see colors in section 5.5.) Since RIAGENDR values are 1 and 2 the resulting colors would be black and red which is not very appealing. Adding + 1 will choose red and green. Other numbers will select the next colors in the list.
  • names.arg= specifies the source of the horizontal axis names displayed.

These commands use the with() function but could also be written with the $ method, for example with Example2$AGEGROUP.

par(mfrow = c(1,2))
# Plain version
with(Example2,barplot(MeanTChol))
# Add white/gray alternate colors, rotated horiz labels
with(Example2,barplot(MeanTChol, 
                      col = RIAGENDR + 1, 
                      names.arg=AGEGROUP, las =2 ))

par(mfrow =  c(1,1))

Error bars? There are methods in base R to add error bars and various examples can be found online. However, the methods are all quirky and the best is now to use ggplot to create such graphs.

10.7.4 ggplot2 versions

Example plots have been moved to section 11.2.