Acknowledgements

This is a slightly modified version of Martin Morgan’s original vignette.

1 Colors

Colors should reflect the nature of the data and be carefully chosen to convey equivalent information to all viewers. The RColorBrewer package provides an easy way to choose colors; see also the colorbrewer2 web site.

library(RColorBrewer)
display.brewer.all()

We’ll use a color scheme from the ‘qualitative’ series, which is designed to distinguish the levels of a factor, and take the first four colors of the Dark2 palette.

palette <- brewer.pal(4, "Dark2")
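
As a minimal sketch of checking a palette before relying on it (the names dark2_info and pal4 are illustrative and not part of the original vignette), the brewer.pal.info table in RColorBrewer records each palette's maximum number of colors, its category, and whether it is flagged as colorblind friendly:

```r
library(RColorBrewer)

# Look up the metadata RColorBrewer stores for the Dark2 palette
dark2_info <- brewer.pal.info["Dark2", ]
print(dark2_info)

# The four colors used in this document, as hex codes
pal4 <- brewer.pal(4, "Dark2")
print(pal4)

# Show only the palettes flagged as colorblind friendly
display.brewer.all(colorblindFriendly = TRUE)
```

Restricting the display to colorblind-friendly palettes is one way to act on the advice that colors should convey equivalent information to all viewers.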

2 Quick overview of ‘Base’ Graphics

We’ll illustrate ‘base’ graphics using the built-in mtcars data set.

data(mtcars)     # load the data set
head(mtcars)     # show header and top 6 lines

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

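The basic model is to plot data, e.g., the relationship between miles per gallon and horsepower. The relationship is written as a formula using ~: mpg ~ hp reads ‘mpg as a function of hp’, so mpg is drawn on the y-axis and hp on the x-axis.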

plot(mpg ~ hp, mtcars)

Figure 1: Plotting example
Within the default plot each point is represented by an open circle.
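
To make the role of the formula concrete, here is a small sketch comparing the formula interface with the x, y interface of plot(). The variable mf is introduced here only for illustration:

```r
# Formula interface: response (y) on the left of ~, predictor (x) on the right
plot(mpg ~ hp, data = mtcars)

# Equivalent call using the x, y interface; note that x comes first here
plot(mtcars$hp, mtcars$mpg, xlab = "hp", ylab = "mpg")

# model.frame() shows which columns the formula picks out, response first
mf <- model.frame(mpg ~ hp, data = mtcars)
head(mf, 3)
```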

The appearance can be changed with additional arguments; see ?plot, then ?plot.default and ?par.

pch sets the plotting character (or symbol).
cex scales the default character size.
col sets the color, here taken from the palette we defined previously.

For example, the code below draws each point as a filled green circle, twice the default size.

plot(mpg ~ hp, mtcars, pch=20, cex=2, col=palette[1])

Figure 2: Modified plot example
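
Since the qualitative palette is meant to distinguish the levels of a factor, a natural next step is to color the points by a factor. The sketch below uses the number of cylinders (cyl) as that factor; this choice, and the names pal, cyl and point_cols, are assumptions made for illustration:

```r
library(RColorBrewer)

pal <- brewer.pal(4, "Dark2")

# Treat the number of cylinders as a factor with levels 4, 6, 8
cyl <- factor(mtcars$cyl)

# Pick one palette color per level via the factor's integer codes
point_cols <- pal[as.integer(cyl)]

plot(mpg ~ hp, mtcars, pch = 20, cex = 2, col = point_cols)

# Add a legend so the color coding can be read off the plot
legend("topright", legend = levels(cyl), title = "cyl",
       pch = 20, col = pal[seq_along(levels(cyl))])
```

In base graphics the legend has to be added by hand, so it is worth building it from the same palette and factor levels that were used for the points.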

More complicated plots can be composed via a series of commands. For example, to plot a linear regression we first make the plot and then add the regression line using abline().

The line is first computed with the lm() linear model function.

We can also define the line width (lwd) and its color (col).

# Make the default plot
plot(mpg ~ hp, mtcars)
# Compute regression line
fit <- lm(mpg ~ hp, mtcars)
# Add line to the plot
abline(fit, col=palette[1], lwd=3)
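
To see what abline() actually draws, the following sketch extracts the fitted intercept and slope and passes them explicitly. The names b, new_hp and pred, and the horsepower values used for prediction, are chosen here only for illustration:

```r
fit <- lm(mpg ~ hp, data = mtcars)

# Named vector holding the intercept and the slope for hp
b <- coef(fit)
print(b)

plot(mpg ~ hp, mtcars)
# abline(a, b) draws the line y = a + b * x; this matches abline(fit)
abline(a = b[["(Intercept)"]], b = b[["hp"]], lty = 2)

# Predicted mpg at a few horsepower values of our choosing
new_hp <- data.frame(hp = c(100, 200, 300))
pred <- predict(fit, newdata = new_hp)
print(pred)
```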

3 Overview of ggplot2 Graphics

3.1 Grammar of Graphics

ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics (Wilkinson 2006) as a data visualization package for R.

Just as a sentence is composed of words, a graphic can be decomposed into elements. This approach to data visualization breaks graphs up into semantic components such as scales and layers.

As we’ll see below, the basic ggplot() command plots nothing but the framework of the plot. Other layers then have to be added.

3.2 Start ggplot2

Start by loading the ggplot2 package:

library(ggplot2)

3.3 Basics

Tell ggplot2 what to plot using ggplot() and aes(); we’ll use the columns hp (horsepower) and mpg (miles per gallon).

ggplot() is the main command, and we tell it to use mtcars as the dataset.
aes(), the aesthetic mapping, describes how variables in the data are mapped to visual properties (aesthetics) of geoms (the geometric objects, such as points or lines, that are drawn).

ggplot(mtcars, aes(x=hp, y=mpg))

Figure 3: ggplot2 does not show data points until specified by a geom_* command
See below.

Note the neutral gray background with white gridlines to provide unobtrusive orientation. Note the relatively small size of the axis and tick labels, to avoid distracting from the pattern provided by the data.

ggplot2 uses different geom_*() functions to add layers to the basic plot.

Add points with geom_point() and note how additional elements are added with the + sign:

ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()

Figure 4: + geom_point() adds the data points to the base plot
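
Because a ggplot is an ordinary R object, it can be stored in a variable and extended later. The sketch below illustrates this; the names p and p_points are illustrative, and inspecting the layers element relies on the internal structure of the plot object, which may differ between ggplot2 versions:

```r
library(ggplot2)

# The bare plot: data and aesthetic mapping, but no layers yet
p <- ggplot(mtcars, aes(x = hp, y = mpg))

# Adding a geom returns a new plot object with one more layer
p_points <- p + geom_point()

length(p$layers)         # number of layers in the bare plot
length(p_points$layers)  # number of layers after adding points

# Printing the object is what draws it
print(p_points)
```

Storing the base plot once and adding layers to it avoids repeating the ggplot() call when several variants of the same figure are needed.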

Add a linear regression line and standard error…

Several smoothing methods are built into ggplot2 through geom_smooth() (see ?geom_smooth for the available methods). The uncertainty of the fit is calculated at the same time and displayed as a gray band around the line (a 95% confidence interval by default).

Note the continuation of adding to the plot with the + sign.

ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() +
    geom_smooth(method=lm, col=palette[1])

Figure 5: geom_smooth() can add linear regression line and standard error…
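
The gray band can be adjusted or switched off. The following sketch uses the se and level arguments of geom_smooth() and passes the formula explicitly; the object names (base, p_band, p_wide, p_line, band) are illustrative only:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Default: regression line with a 95% confidence band
p_band <- base + geom_smooth(method = "lm", formula = y ~ x)

# Wider band: 99% confidence level
p_wide <- base + geom_smooth(method = "lm", formula = y ~ x, level = 0.99)

# Line only: suppress the band
p_line <- base + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the values computed for a layer (here layer 2, the smooth)
band <- layer_data(p_band, 2)
head(band[, c("x", "y", "ymin", "ymax")])
```

Looking at the computed ymin and ymax columns is a quick way to confirm what the band represents before interpreting it in a figure.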

If the method is not specified (or is specified as "auto"), the smoothing method is chosen based on the size of the largest group (across all panels). loess() (local regression) is used for fewer than 1,000 observations, as is the case here.

The choice of loess() is reported in the text output.

Here the default smoother is added as a further layer, again with the + sign.

ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() +
    geom_smooth(method=lm, col=palette[1]) +
    geom_smooth(col=palette[2])

`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 6: loess() local smoothing is added by default
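
The amount of smoothing applied by loess can be tuned with the span argument of geom_smooth(). This is a sketch only: the span values 0.5 and 1 are arbitrary choices for illustration, as are the object names:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Naming the method explicitly avoids relying on the automatic choice
p_default <- base + geom_smooth(method = "loess", formula = y ~ x)

# Larger span: each local fit uses more of the data, giving a smoother curve
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x, span = 1)

# Smaller span: the curve follows local structure more closely
p_wiggly <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.5)

print(p_wiggly)
```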

3.4 Density plots

The following section calls on a dataset used in a different session.

We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey; see the CDC’s BRFSS web site for more information.

The comma-delimited (csv) file BRFSS-subset.csv can be located interactively with file.choose(), which opens a graphical dialog in which you navigate to the file.

To illustrate additional features, load the BRFSS data subset:

# Use ONE of the two options below.

# Option 1: locate the file interactively, then read it
path <- file.choose()
brfss <- read.csv(path)

# Option 2: if the file is in the current working directory
brfss <- read.csv("BRFSS-subset.csv")
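
The two ways of locating the file can be combined so that the script works whether or not the file sits in the working directory. This is a sketch under assumptions: the file name comes from the text above, while its location on your machine and the variable csv_name are assumed here:

```r
# File name taken from the text; its location on your machine is an assumption
csv_name <- "BRFSS-subset.csv"

if (file.exists(csv_name)) {
    # Found in the current working directory
    path <- csv_name
} else if (interactive()) {
    # Fall back to the graphical file chooser
    path <- file.choose()
} else {
    # Non-interactive session and no file: nothing to read
    path <- NA_character_
}

if (!is.na(path)) {
    brfss <- read.csv(path)
    str(brfss)   # check column names and types before plotting
}
```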

Plot the distribution of weights using geom_density():

ggplot(brfss, aes(x=Weight)) + geom_density()

Plot the weights separately for each year by mapping fill=factor(Year) inside aes(); alpha=0.5 is set as an argument to geom_density(), outside aes(), so that the overlapping densities are semi-transparent.

ggplot(brfss, aes(x=Weight, fill=factor(Year))) +
    geom_density(alpha=0.5)

Americans are getting heavier, and the variation in weights is increasing.
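
The same pattern can be tried without the BRFSS file. The sketch below substitutes the built-in mtcars data, with mpg in place of Weight and cyl in place of Year; that substitution is made here only so the example runs anywhere. It shows the difference between mapping an aesthetic to a variable and setting it to a constant:

```r
library(ggplot2)

# Mapped: fill depends on a variable, so it belongs inside aes()
# Set: alpha is one constant for the whole layer, so it goes in the geom
p_dens <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
    geom_density(alpha = 0.5)
print(p_dens)

# One density curve, and one fill color, per level of cyl
dens <- layer_data(p_dens, 1)
length(unique(dens$group))
length(unique(dens$fill))
```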

3.5 Facets

Create separate panels for each sex using facet_grid(), with a formula describing the factor(s) to use for rows (left-hand side of the formula) and columns (right-hand side).

ggplot(brfss, aes(x=Weight, fill=factor(Year))) +
    geom_density() +
    facet_grid(Sex ~ .)
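
To illustrate how the two sides of the facet_grid() formula control the layout, here is a sketch on the built-in mtcars data. The use of am (transmission type) and cyl as faceting variables, and the object names, are choices made for this example:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Rows only: one panel per transmission type (am), stacked vertically
p_rows <- base + facet_grid(am ~ .)

# Columns only: one panel per number of cylinders, side by side
p_cols <- base + facet_grid(. ~ cyl)

# Both: am defines the rows, cyl defines the columns
p_grid <- base + facet_grid(am ~ cyl)
print(p_grid)
```

Stacking panels in rows, as in the BRFSS example, keeps a shared x-axis and so makes it easier to compare where the distributions sit.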

REFERENCES

Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.

R Visualization: Base and ggplot2

16 October, 2018
