Using weights.
An excellent demonstration of incorporating NHANES provided weights as a commented R
code page is available on this blog post: How to Use Survey Weights in R1 by Mike Burke.
2024-06-10
Using weights.
An excellent demonstration of incorporating NHANES provided weights as a commented R
code page is available on this blog post: How to Use Survey Weights in R1 by Mike Burke.
The code has a few base R
commands but most of the code is a perfect demonstration of the usefulness of the dplyr
package and how to combine commands in streams on small pipelines.
It is also worth noting how the comments within the code facilitates the understanding of the successive steps.
The code contains all the relevant and necessary information, including the activation of packages with the library()
function.
If all packages are already installed the code can be “Copied/Pasted” in its entirety and proceed without errors.
To review the code we’ll cut it in smaller portion and perhaps add a few commands the check the content or status of the R
objects that are defined along the way.
The top of the code contains well defined titles for each section informing the purpose of the program and the needed packages.
When it is run typical information on the loading of tidyverse
is displayed.
################################################################################ ## General Information ## ################################################################################ ## This is an RScript for showing how to use survey weights with NHANES data. You ## can find a narrative to this script at: ## https://stylizeddata.com/how-to-use-survey-weights-in-r/ ################################################################################ ## Packages ## ################################################################################ ## For reading SAS XPT file from NHANES website ## haven::read_xpt library(haven) ## For using survey weights ## survey::svydesign, svymean, svyglm library(survey) ## For data wrangling #dplyr::select, mutate, select, recode library(dplyr)
This chunk uses the haven
package that has an easier method to download the data directly from the web site without the need of an intermediate file as we saw in section “download PFAS data.”
################################################################################ ## Load the dataset ## ################################################################################ ## Import NHANES demographic data nhanesDemo <- read_xpt(url( "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
wrangling: renaming and selecting data
This section of the code reorganizes and renames for easier understanding. .
This code portion has a goal to make the data easier for humans to understand by changing the name of the data columns.
It is performed with base R
subsetting with $
and overwriting the specified column.
However, this could also have been done with Tidyverse method by using the rename()
function as we have seen in section @ref(recodingstringreplacement).
Only columns that will be selected in the next step are altered.
The only data specification is for INDFMPIR
others can be found on the DEMO_I web page and listed below:
Code | Description | range |
---|---|---|
INDFMPIR |
Ratio of family income to poverty | 0 - 5 |
RIDAGEYR |
Age in years at screening | 0 - 80 |
RIAGENDR |
Gender | 1 -2 |
WTINT2YR |
Full sample 2 year interview weight | 3293.928267 - 233755.84185 |
SDMVPSU |
Masked variance pseudo-PSU | 1 to 2 |
SDMVSTRA |
Masked variance pseudo-stratum | 119 to 133 |
NHANES data uses the method described in article.
Primary sampling unit (PSU) masking and variance estimation in complex surveys available online2.
################################################################################ ## Data Wrangling ## ################################################################################ ## Copy and rename variables so they are more intuitive. "fpl" is percent ## of the federal poverty level. It ranges from 0 to 5. nhanesDemo$fpl <- nhanesDemo$INDFMPIR nhanesDemo$age <- nhanesDemo$RIDAGEYR nhanesDemo$gender <- nhanesDemo$RIAGENDR nhanesDemo$persWeight <- nhanesDemo$WTINT2YR nhanesDemo$psu <- nhanesDemo$SDMVPSU nhanesDemo$strata <- nhanesDemo$SDMVSTRA
This chunck selects only the columns that are desired with a pipeline using the select()
function.
## Since there are 47 variables, we will select only the variables we will use in ## this analysis. nhanesAnalysis <- nhanesDemo %>% select(fpl, age, gender, persWeight, psu, strata)
We encountered this problem before (section “recoding string replacement”) when we needed to convert a variable from “continuous” to a “factor” so that the data would be seen as a “category” with just a few “options”.
Just changing Integer numbers to characters could accomplish that.
Here the author chooses to keep the code as integers but change the value for men from 1
to 0
and for women from 2
to 1
.
The cryptic L
in 0L
and 1L
in the code below is a special code that means “Long” and therefore does not represent the letter L
as a character but is a form of “coercion” “forcing” the number to be an integer in its forms recorded by the computer.
What does L
mean?.
There are multiple entries in Stack Overflow for this, including:
The author of this in R
never explained why he chose the notation, but it is shorter than as.integer(10)
, and more efficient as the coercion is done at parse time.
See more info in links:
- Clarification of L in R.
- What’s the difference between 1L
and 1
?.
Also in the R Language Definition book (@R-base)3,4 it is stated, in reference to the number 1
(as in: class(c(1))
).
“Perhaps unexpectedly, the number returned from the expression 1
is a numeric. In most cases, the difference between an integer and a numeric value will be unimportant as R will do the right thing when using the numbers.
There are, however, times when we would like to explicitly create an integer value for a constant. We can do this by calling the function as.integer()
or using various other techniques. But perhaps the simplest approach is to qualify our constant with the suffix character L
.”
“We can use the L
suffix to qualify any number with the intent of making it an explicit integer.” Finally, internally this is related also to the amount of memory (RAM) used by the computer to hold an integer and that may depend if it is a 32 bit
or a 64 bit
system.
To know the larger Integer that R
can hold can be revealed by command .Machine$integer.max
(starts with a dot.) For a 32 bit
computer this will be exactly 2147483647
. Using 64 bit
may be extremely beneficial for very large dataset.
A 64 bit
version of R
can be downloaded from info at http://r.research.att.com/ or https://mac.r-project.org/.
Note that the first command mutating the gender
column nhanesAnalysis
is in tidyverse format using the pipe, while the last command is in base R
format.
## Recode gender nhanesAnalysis <- nhanesAnalysis %>% mutate(gender = recode(gender, `1` = 0L, `2` = 1L)) ## Convert "gender" to a factor variable. We need to do this so it isn't treated ## as a continuous variable in our analyses nhanesAnalysis$gender <- as.factor(nhanesAnalysis$gender)
.
At this point we can add commands to get a better understanding of the data format and content.
For example:
head(nhanesAnalysis$gender)
[1] 0 0 0 1 1 1 Levels: 0 1
class(nhanesAnalysis)
[1] "tbl_df" "tbl" "data.frame"
dim(nhanesAnalysis)
[1] 9971 6
str( nhanesAnalysis) tibble [9,971 × 6] (S3: tbl_df/tbl/data.frame) $ fpl : num [1:9971] 4.39 1.32 1.51 5 1.23 2.82 1.18 4.22 NA 2.08 ... ..- attr(*, "label")= chr "Ratio of family income to poverty" $ age : num [1:9971] 62 53 78 56 42 72 11 4 1 22 ... ..- attr(*, "label")= chr "Age in years at screening" $ gender : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 1 1 ... $ persWeight: num [1:9971] 134671 24329 12400 102718 17628 ... ..- attr(*, "label")= chr "Full sample 2 year interview weight" $ psu : num [1:9971] 1 1 1 1 2 1 1 2 1 2 ... ..- attr(*, "label")= chr "Masked variance pseudo-PSU" $ strata : num [1:9971] 125 125 131 131 126 128 120 124 119 128 ... ..- attr(*, "label")= chr "Masked variance pseudo-stratum"
The gender assignment can be confusing as the values are still 1
and 2
but the levels are now 0
and 1
as shown in the short table showing gender
as <fct>
meaning factor, and as can be deciphered from the str()
output for the gender
line: $ gender : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 1 1 ...
head(nhanesAnalysis)
# A tibble: 6 × 6 fpl age gender persWeight psu strata <dbl> <dbl> <fct> <dbl> <dbl> <dbl> 1 4.39 62 0 134671. 1 125 2 1.32 53 0 24329. 1 125 3 1.51 78 0 12400. 1 131 4 5 56 1 102718. 1 131 5 1.23 42 1 17628. 2 126 6 2.82 72 1 11252. 1 128
################################################################################ ## Survey Weights ## ################################################################################ ## Here we use "svydesign" to assign the weights. We will use this new design ## variable "nhanesDesign" when running our analyses. nhanesDesign <- svydesign(id = ~psu, strata = ~strata, weights = ~persWeight, nest = TRUE, data = nhanesAnalysis) ## Here we use "subset" to tell "nhanesDesign" that we want to only look at a ## specific subpopulation (i.e., those age between 18-79 years). This is ## important to do. If you don't do this and just restrict it in a different way ## your estimates won't have correct SEs. ageDesign <- subset(nhanesDesign, age > 17 & age < 80)
We can printout the content of these:
nhanesDesign
Stratified 1 - level Cluster Sampling design (with replacement) With (30) clusters. svydesign(id = ~psu, strata = ~strata, weights = ~persWeight, nest = TRUE, data = nhanesAnalysis)
ageDesign
Stratified 1 - level Cluster Sampling design (with replacement) With (30) clusters. subset(nhanesDesign, age > 17 & age < 80)
################################################################################ ## Statistics ## ################################################################################ ## We will use "svymean" to calculate the population mean for age. The na.rm ## argument "TRUE" excludes missing values from the calculation. We see that ## the mean age is 45.648 and the standard error is 0.5131. svymean(~age, ageDesign, na.rm = TRUE) mean SE age 45.6 0.51 ## Since gender is a factor variable, "svymean" will treat it as such and give us ## the proportion of women. We see that men are 48.601% and woman are 51.399% of ## the population in this age of 18 to 79. svymean(~gender, ageDesign, na.rm = TRUE) mean SE gender0 0.486 0.01 gender1 0.514 0.01
################################################################################ ## Now we will run a general linear model (glm) with a gaussian link function. ## We tell svyglm that nhanesAnalysis is the dataset to use and to apply the ## "svydesign" object "ageDesign." I won't dive into the results here, but you ## can see that age is positively correlated with FPL and that women are ## predicted to have a lower FPL than men. output <- svyglm(fpl ~ age + gender, family = gaussian(), data = nhanesAnalysis, design = ageDesign)
We can add the following to show output results:
output Stratified 1 - level Cluster Sampling design (with replacement) With (30) clusters. subset(nhanesDesign, age > 17 & age < 80) Call: svyglm(formula = fpl ~ age + gender, design = ageDesign, family = gaussian(), data = nhanesAnalysis) Coefficients: (Intercept) age gender1 2.4452 0.0149 -0.2105 Degrees of Freedom: 5013 Total (i.e. Null); 13 Residual (602 observations deleted due to missingness) Null Deviance: 14200 Residual Deviance: 13800 AIC: 20900
In the comments it is said “… you can see that age is positively correlated with FPL and that women are predicted to have a lower FPL than men.”.
The definition of the function svyglm()
in help is: “Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.” The final data are located in output$coefficients
:
output$coefficients (Intercept) age gender1 2.4452 0.0149 -0.2105
Most likely from the conclusion the following is done:
From the printed results we can see:
The intercept 2.445 is where the regression line crosses the vertical \(y\) axis.
The slope defining the relationship with age is 0.015.
The third item would be the correlation coefficient with gender, and is negative at -0.211.
Since it is labeled as gender1
that should mean “gender level 1” which is the code for women after the changes added with the mutate()
function to change the label numbers.