This is a slightly modified version of Martin Morgan’s original vignette.
Colors should reflect the nature of the data and be carefully chosen to convey equivalent information to all viewers. The RColorBrewer package provides an easy way to choose colors; see also the colorbrewer2 web site.
library(RColorBrewer)
display.brewer.all()
We’ll use a color scheme from the ‘qualitative’ series, which is suited to representing the different levels of a factor. We’ll get the first four colors.
palette <- brewer.pal(4, "Dark2")
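Since colors should convey equivalent information to all viewers, it can help to check which Brewer palettes are flagged as colorblind-friendly. The following is a minimal sketch, assuming a reasonably recent RColorBrewer in which brewer.pal.info has a colorblind column and display.brewer.all() accepts a colorblindFriendly argument; the name qual_ok is only illustrative.

```r
library(RColorBrewer)

# Same palette as above: four colors from the qualitative "Dark2" scheme
palette <- brewer.pal(4, "Dark2")
palette   # a character vector of hex color codes

# Metadata for one palette: maximum number of colors, category, colorblind flag
brewer.pal.info["Dark2", ]

# Names of the qualitative palettes flagged as colorblind-friendly
qual_ok <- rownames(subset(brewer.pal.info, category == "qual" & colorblind))
qual_ok

# Display only the colorblind-friendly palettes
display.brewer.all(colorblindFriendly = TRUE)
```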
We’ll illustrate ‘base’ graphics using the built-in mtcars data set.
data(mtcars) # load the data set
head(mtcars) # show header and top 6 lines
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The basic model is to plot data, e.g., the relationship between miles per gallon and horsepower. The relationship is expressed as a formula with ~, so mpg ~ hp reads as "mpg as a function of hp".
plot(mpg ~ hp, mtcars)
The appearance can be influenced by arguments; see ?plot, then ?plot.default and ?par.
pch represents the plotting character (or symbol).
cex scales the character size relative to the default.
col chooses the color, here from the palette we defined previously.
For example, the code below changes the points to green filled circles, larger than the default size.
plot(mpg ~ hp, mtcars, pch=20, cex=2, col=palette[1])
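Because a qualitative palette is meant to distinguish the levels of a factor, col can also be given one color per point. The sketch below colors the points by number of cylinders; the choice of cyl as the grouping variable and the names cyl_f and point_cols are illustrative assumptions, not part of the original example.

```r
library(RColorBrewer)

palette <- brewer.pal(4, "Dark2")

# Treat the number of cylinders as a factor (three levels in mtcars)
cyl_f <- factor(mtcars$cyl)

# Look up one palette color per observation via the integer level codes
point_cols <- palette[as.integer(cyl_f)]

plot(mpg ~ hp, mtcars, pch = 20, cex = 2, col = point_cols)

# Legend mapping each factor level to its color
legend("topright", legend = levels(cyl_f),
       col = palette[seq_along(levels(cyl_f))], pch = 20, title = "cyl")
```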
More complicated plots can be composed via a series of commands, e.g., to plot a linear regression we first make the plot, and then add the regression line using abline(). The line is first computed with the lm() linear model function. We can also define the line width (lwd) and its color (col).
# Make the default plot
plot(mpg ~ hp, mtcars)
# Compute regression line
fit <- lm(mpg ~ hp, mtcars)
# Add line to the plot
abline(fit, col=palette[1], lwd=3)
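To see what abline() actually draws, the fitted intercept and slope can be extracted from the model object. This is a small sketch; the value hp = 150 used for the prediction is an arbitrary illustration.

```r
# Fit the same linear model as above
fit <- lm(mpg ~ hp, mtcars)

# Named vector with the intercept and the slope for hp
coefs <- coef(fit)
coefs

plot(mpg ~ hp, mtcars)

# abline(a, b) draws y = a + b * x, which is the same line as abline(fit)
abline(a = coefs[["(Intercept)"]], b = coefs[["hp"]], lty = 2)

# Predicted mpg for an illustrative horsepower value
pred <- predict(fit, newdata = data.frame(hp = 150))
pred
```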
ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics (Wilkinson 2006) as a data visualization package for R.
Just as a sentence is composed of words, a graphic can be decomposed into elements. This data visualization method breaks up graphs into semantic components such as scales and layers.
As we’ll see below, the basic ggplot() command draws nothing but the framework of the plot. Other layers then have to be added.
Start by loading the ggplot2 library
library(ggplot2)
Tell ggplot2 what to plot using ggplot() and aes(); we’ll use the columns hp (horsepower) and mpg (miles per gallon).
ggplot() is the main command, and we tell it to use mtcars as the dataset.
aes(), or aesthetic mapping, describes how variables in the data are mapped to visual properties (aesthetics) of geoms (geometric objects).
ggplot(mtcars, aes(x=hp, y=mpg))
Note the neutral gray background with white gridlines to provide unobtrusive orientation. Note the relatively small size of the axis and tick labels, to avoid distracting from the pattern provided by the data.
ggplot2 uses different geom_* functions to add layers to the basic plot.
Add points with geom_point() and note how additional elements are added with the + sign:
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
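The layered structure can be made visible by storing the plot in a variable and inspecting it. This is a sketch that assumes the plot object exposes its layers as a list through $layers, which holds for the ggplot2 versions we are aware of; the names p and p_points are illustrative.

```r
library(ggplot2)

# The bare plot: data and aesthetic mapping only
p <- ggplot(mtcars, aes(x = hp, y = mpg))
length(p$layers)          # the framework alone carries no layers

# Adding a geom with + returns a new plot object with one more layer
p_points <- p + geom_point()
length(p_points$layers)

# Printing the object is what actually draws it
print(p_points)
```

Storing intermediate plots this way makes it easy to reuse the same base plot with different layers.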
Add a linear regression line and standard error…
Different smoothing models are built into ggplot2 through geom_smooth() (see ??geom_smooth for a list of methods). The uncertainty is calculated at the same time and displayed as a gray band, by default a 95% confidence interval.
Note the continuation of adding to the plot with the + sign.
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() +
geom_smooth(method=lm, col=palette[1])
If the method is not specified (or specified as auto), the smoothing method is chosen based on the size of the largest group (across all panels). loess() (local polynomial regression) is used for fewer than 1,000 observations, as is the case here.
The use of loess() is reported in the text output.
As before, each layer is added to the plot with the + sign.
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() +
geom_smooth(method=lm, col=palette[1]) +
geom_smooth(col=palette[2])
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
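The method, formula, and confidence band can also be stated explicitly, which avoids the informational message and makes the choices visible in the code. The sketch below is one possible variation; turning off the band for the linear fit (se = FALSE) and using a 90% band for the loess fit (level = 0.90) are arbitrary illustrative choices.

```r
library(ggplot2)
library(RColorBrewer)

palette <- brewer.pal(4, "Dark2")

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # Linear fit, drawn without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, col = palette[1], se = FALSE) +
  # Loess fit with an explicit formula and a 90% confidence band
  geom_smooth(method = "loess", formula = y ~ x, col = palette[2], level = 0.90)

print(p)
```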
The following section calls on a dataset used in a different session.
We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. See the CDC’s BRFSS web site for more information.
The comma-delimited (csv) file BRFSS-subset.csv can be found via the file.choose() graphical interactive method (navigate to the location of the file).
To illustrate additional features, load the BRFSS data subset
# Option 1: locate the file interactively, then read it
path <- file.choose()
brfss <- read.csv(path)
# Option 2: if the file is in the current working directory, read it directly instead
# brfss <- read.csv("BRFSS-subset.csv")
Plot the distribution of weights using geom_density()
ggplot(brfss, aes(x=Weight)) + geom_density()
Plot the weights separately for each year, using fill=factor(Year) inside aes() and alpha=0.5 as an argument to geom_density(), so that overlapping densities remain visible.
ggplot(brfss, aes(x=Weight, fill=factor(Year))) +
geom_density(alpha=0.5)
Americans are getting heavier, and the variation in weights is increasing.
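The visual impression can be checked numerically by comparing the mean and standard deviation of Weight for each Year. This is a sketch under the assumption that brfss has numeric Weight and Year columns, as used in the plots above; the helper name weight_by_year is hypothetical and not part of the original material.

```r
# Hypothetical helper: mean and standard deviation of Weight per Year
weight_by_year <- function(df) {
  # Drop rows with missing weights so mean() and sd() are defined
  df <- df[!is.na(df$Weight), ]
  data.frame(
    Year        = sort(unique(df$Year)),
    mean_weight = as.vector(tapply(df$Weight, df$Year, mean)),
    sd_weight   = as.vector(tapply(df$Weight, df$Year, sd))
  )
}

# Apply it only if the data set has been loaded as described above
if (exists("brfss")) {
  print(weight_by_year(brfss))
}
```

A larger mean in the later year would correspond to the shift of the density to the right, and a larger standard deviation to its wider spread.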
Create separate panels for each sex using facet_grid(), with a formula describing the factor(s) to use for rows (left-hand side of the formula) and columns (right-hand side).
ggplot(brfss, aes(x=Weight, fill=factor(Year))) +
geom_density() +
facet_grid(Sex ~ .)
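The row and column roles of the facet formula can be tried out without the BRFSS file by using the built-in mtcars data. In this sketch, the choice of am (transmission) for rows and cyl (cylinders) for columns is purely illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Left-hand side gives the rows, right-hand side gives the columns
p_grid <- base_plot + facet_grid(am ~ cyl)
print(p_grid)

# A dot means "no faceting" in that direction: a single row, one column per cyl
p_cols <- base_plot + facet_grid(. ~ cyl)
print(p_cols)
```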
REFERENCES
Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.