Chapter 12 Using NHANES weights

Using weights

An excellent demonstration of incorporating NHANES provided weights as a commented R code page is available on this blog post: How to Use Survey Weights in R35 by Mike Burke.

The code has a few base R commands but most of the code is a perfect demonstration of the usefulness of the dplyr package and how to combine commands in streams on small pipelines. It is also worth noting how the comments within the code facilitates the understanding of the successive steps.

The code contains all the relevant and necessary information, including the activation of packages with the library() function. If all packages are already installed the code can be “Copied/Pasted” in its entirety and proceed without errors.

To review the code we’ll cut it in smaller portion and perhaps add a few commands the check the content or status of the R objects that are defined along the way.

12.1 Header comments and packages

The top of the code contains well defined titles for each section informing the purpose of the program and the needed packages.

When it is run typical information on the loading of tidyverse is displayed.

################################################################################
# General Information #
################################################################################
 
# This is an RScript for showing how to use survey weights with NHANES data. You
# can find a narrative to this script at:
# https://stylizeddata.com/how-to-use-survey-weights-in-r/
 
################################################################################
# Packages #
################################################################################
 
# For reading SAS XPT file from NHANES website
# haven::read_xpt
 
library(haven)
 
# For using survey weights
# survey::svydesign, svymean, svyglm
 
library(survey)
 
# For data wrangling
#dplyr::select, mutate, select, recode
 
library(dplyr)

12.2 Acquiring NHANES data

This chunk uses the haven package that has an easier method to download the data directly from the web site without the need of an intermediate file as we saw in section 6.2.1.

################################################################################
# Load the dataset #
################################################################################
 
# Import NHANES demographic data
 
nhanesDemo <- read_xpt(url(
  "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))

12.3 Data wrangling: renaming and selecting data

This section of the code reorganizes and renames for easier understanding.

12.3.1 Renaming columns

This code portion has a goal to make the data easier for humans to understand by changing the name of the data columns. It is performed with base R subsetting with $ and overwriting the specified column. However, this could also have been done with Tidyverse method by using the rename() function as we have seen in section 10.6.

Only columns that will be selected in the next step are altered. The only data specification is for INDFMPIR others can be found on the DEMO_I web page and listed below:

Chosen columns and their description
Code Description range
INDFMPIR Ratio of family income to poverty 0 - 5
RIDAGEYR Age in years at screening 0 - 80
RIAGENDR Gender 1 -2
WTINT2YR Full sample 2 year interview weight 3293.928267 - 233755.84185
SDMVPSU Masked variance pseudo-PSU 1 to 2
SDMVSTRA Masked variance pseudo-stratum 119 to 133

NHANES data uses the method described in article Primary sampling unit (PSU) masking and variance estimation in complex surveys available online36.

################################################################################
# Data Wrangling #
################################################################################
 
# Copy and rename variables so they are more intuitive. "fpl" is percent
# of the federal poverty level. It ranges from 0 to 5.
 
nhanesDemo$fpl        <- nhanesDemo$INDFMPIR
 
nhanesDemo$age        <- nhanesDemo$RIDAGEYR
 
nhanesDemo$gender     <- nhanesDemo$RIAGENDR
 
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
 
nhanesDemo$psu        <- nhanesDemo$SDMVPSU
 
nhanesDemo$strata     <- nhanesDemo$SDMVSTRA

12.3.2 Selecting columns

This chunck selects only the columns that are desired with a pipeline using the select() function.

# Since there are 47 variables, we will select only the variables we will use in
# this analysis.
 
nhanesAnalysis <- nhanesDemo %>%
                    select(fpl,
                           age,
                           gender,
                           persWeight,
                           psu,
                           strata)

12.3.3 Changing variable status to a factor

We encountered this problem before (section 10.6) when we needed to convert a variable from “continuous” to a “factor” so that the data would be seen as a “category” with just a few “options”. Just changing Integer numbers to characters could accomplish that.

Here the author chooses to keep the code as integers but change the value for men from 1 to 0 and for women from 2 to 1. The cryptic L in 0L and 1L in the code below is a special code that means “Long” and therefore does not represent the letter L as a character but is a form of “coercion” “forcing” the number to be an integer in its forms recorded by the computer.

What does L mean?

There are multiple entries in Stack Overflow for this, including:

The author of this in R never explained why he chose the notation, but it is shorter than as.integer(10), and more efficient as the coercion is done at parse time.

See more info in links:

Also in the R Language Definition book (R Core Team (2020b))37,38 it is stated, in reference to the number 1 (as in: class(c(1)))

“Perhaps unexpectedly, the number returned from the expression 1 is a numeric. In most cases, the difference between an integer and a numeric value will be unimportant as R will do the right thing when using the numbers. There are, however, times when we would like to explicitly create an integer value for a constant. We can do this by calling the function as.integer() or using various other techniques. But perhaps the simplest approach is to qualify our constant with the suffix character L.”

“We can use the L suffix to qualify any number with the intent of making it an explicit integer.”

Finally, internally this is related also to the amount of memory (RAM) used by the computer to hold an integer and that may depend if it is a 32 bit or a 64 bit system. To know the larger Integer that R can hold can be revealed by command .Machine$integer.max (starts with a dot.) For a 32 bit computer this will be exactly 2147483647. Using 64 bit may be extremely beneficial for very large dataset. A 64 bit version of R can be downloaded from info at http://r.research.att.com/ or https://mac.r-project.org/

Note that the first command mutating the gender column nhanesAnalysis is in tidyverse format using the pipe, while the last command is in base R format.

# Recode gender
 
nhanesAnalysis <- nhanesAnalysis %>%
                    mutate(gender = recode(gender, `1` = 0L,
                                                   `2` = 1L))
 
 
# Convert "gender" to a factor variable. We need to do this so it isn't treated
# as a continuous variable in our analyses
 
nhanesAnalysis$gender <- as.factor(nhanesAnalysis$gender)

At this point we can add commands to get a better understanding of the data format and content. For example:

head(nhanesAnalysis$gender)
[1] 0 0 0 1 1 1
Levels: 0 1
class(nhanesAnalysis)
[1] "tbl_df"     "tbl"        "data.frame"
dim(nhanesAnalysis)
[1] 9971    6
str( nhanesAnalysis)
tibble [9,971 × 6] (S3: tbl_df/tbl/data.frame)
 $ fpl       : num [1:9971] 4.39 1.32 1.51 5 1.23 2.82 1.18 4.22 NA 2.08 ...
  ..- attr(*, "label")= chr "Ratio of family income to poverty"
 $ age       : num [1:9971] 62 53 78 56 42 72 11 4 1 22 ...
  ..- attr(*, "label")= chr "Age in years at screening"
 $ gender    : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 1 1 ...
 $ persWeight: num [1:9971] 134671 24329 12400 102718 17628 ...
  ..- attr(*, "label")= chr "Full sample 2 year interview weight"
 $ psu       : num [1:9971] 1 1 1 1 2 1 1 2 1 2 ...
  ..- attr(*, "label")= chr "Masked variance pseudo-PSU"
 $ strata    : num [1:9971] 125 125 131 131 126 128 120 124 119 128 ...
  ..- attr(*, "label")= chr "Masked variance pseudo-stratum"
head(nhanesAnalysis)
# A tibble: 6 × 6
    fpl   age gender persWeight   psu strata
  <dbl> <dbl> <fct>       <dbl> <dbl>  <dbl>
1  4.39    62 0         134671.     1    125
2  1.32    53 0          24329.     1    125
3  1.51    78 0          12400.     1    131
4  5       56 1         102718.     1    131
5  1.23    42 1          17628.     2    126
6  2.82    72 1          11252.     1    128

The gender assignment can be confusing as the values are still 1 and 2 but the levels are now 0 and 1 as shown in the short table showing gender as <fct> meaning factor, and as can be deciphered from the str() output for the gender line:

$ gender    : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 1 1 ...

12.3.4 Adding the weight information

################################################################################
# Survey Weights #
################################################################################
 
# Here we use "svydesign" to assign the weights. We will use this new design
# variable "nhanesDesign" when running our analyses.
 
nhanesDesign <- svydesign(id      = ~psu,
                          strata  = ~strata,
                          weights = ~persWeight,
                          nest    = TRUE,
                          data    = nhanesAnalysis)
 
# Here we use "subset" to tell "nhanesDesign" that we want to only look at a
# specific subpopulation (i.e., those age between 18-79 years). This is
# important to do. If you don't do this and just restrict it in a different way
# your estimates won't have correct SEs.
 
ageDesign <- subset(nhanesDesign, age > 17 &
                                  age < 80)

We can printout the content of these:

nhanesDesign
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
svydesign(id = ~psu, strata = ~strata, weights = ~persWeight, 
    nest = TRUE, data = nhanesAnalysis)
ageDesign
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
subset(nhanesDesign, age > 17 & age < 80)

12.3.5 Statistics

################################################################################
# Statistics #
################################################################################
 
# We will use "svymean" to calculate the population mean for age. The na.rm
# argument "TRUE" excludes missing values from the calculation. We see that
# the mean age is 45.648 and the standard error is 0.5131.
 
svymean(~age, ageDesign, na.rm = TRUE)
      mean     SE
age 45.648 0.5131
# Since gender is a factor variable, "svymean" will treat it as such and give us
# the proportion of women. We see that men are 48.601% and woman are 51.399% of
# the population in this age of 18 to 79.
 
svymean(~gender, ageDesign, na.rm = TRUE)
           mean    SE
gender0 0.48601 0.006
gender1 0.51399 0.006
# Now we will run a general linear model (glm) with a gaussian link function.
# We tell svyglm that nhanesAnalysis is the dataset to use and to apply the
# "svydesign" object "ageDesign." I won't dive into the results here, but you
# can see that age is positively correlated with FPL and that women are
# predicted to have a lower FPL than men.
 
output <- svyglm(fpl ~ age +
                       gender,
          family = gaussian(),
          data   = nhanesAnalysis,
          design = ageDesign)

We can add the following to show output results:

output
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
subset(nhanesDesign, age > 17 & age < 80)

Call:  svyglm(formula = fpl ~ age + gender, design = ageDesign, family = gaussian(), 
    data = nhanesAnalysis)

Coefficients:
(Intercept)          age      gender1  
    2.44519      0.01488     -0.21055  

Degrees of Freedom: 5013 Total (i.e. Null);  13 Residual
  (602 observations deleted due to missingness)
Null Deviance:      14170 
Residual Deviance: 13820    AIC: 20920

In the comments it is said “… you can see that age is positively correlated with FPL and that women are predicted to have a lower FPL than men.”

The definition of the function svyglm() in help is: “Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.”

The final data are located in output$coefficients:

output$coefficients
(Intercept)         age     gender1 
  2.4451935   0.0148763  -0.2105470 

Most likely from the conclusion the following is done:

  1. Compute the regression line for age
  2. Compute a correlation coefficient for gender

From the printed results we can see:

  • The intercept 2.4451935 is where the regression line crosses the vertical \(y\) axis.

  • The slope defining the relationship with age is 0.0148763

  • The third item would be the correlation coefficient with gender, and is negative at -0.210547. Since it is labeled as gender1 that should mean “gender level 1” which is the code for women after the changes added with the mutate() function to change the label numbers.

References

———. 2020b. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.