04-1: Data Visualization with ggplot2: Basics

Tips to make the most of the lecture notes

  • Click on the three horizontally stacked lines at the bottom left corner of the slide to see the table of contents, from which you can jump to the section you want.

  • Press the letter “o” on your keyboard to see a panel view of all the slides.

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the code on your computer, first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it onto your computer.
  • Click on the reload button (top right corner of the code chunk, to the left of the copy button) to revert to the original code.

Preparation

Install the package if you have not already done so.

install.packages("ggplot2")


Then load the package. Loading the tidyverse package also loads ggplot2 automatically.

#--- load ggplot2 along with others in the tidyverse package ---#
library(tidyverse)

#--- or ---#
library(ggplot2)

We use county_yield, which records corn and soybean yield data by county over multiple years.

  • soy_yield: soybean yield (bu/acre)
  • corn_yield: corn yield (bu/acre)
  • d0_5_9: ratio of weeks under drought severity of 0 from May to September
  • d1_5_9: same as d0_5_9, but for drought severity of 1 from May to September
  • d2_5_9: same as d0_5_9, but for drought severity of 2 from May to September
  • d3_5_9: same as d0_5_9, but for drought severity of 3 from May to September
  • d4_5_9: same as d0_5_9, but for drought severity of 4 from May to September

We also use a dataset derived from county_yield, which records average corn yield by year.

ggplot2 basics


The first step in creating a figure with the ggplot2 package is to tell R which dataset you want to visualize. You do this with ggplot(), for example g_fig <- ggplot(data = county_yield).
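The original code chunk is not reproduced in these notes, so here is a minimal sketch of that first step. Because county_yield itself is not available here, the sketch builds a small stand-in with made-up values; only the variable names d3_5_9 and corn_yield come from the notes.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

# Tell ggplot2 which dataset to use; no geom has been added yet
g_fig <- ggplot(data = county_yield)

# Printing the object draws an empty canvas
g_fig
```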



When you create a figure using the ggplot2 package, ggplot() is always the function you call first.

Let’s now see what is inside g_fig:



Well, it’s blank. Obviously, g_fig does not yet have enough information to create any kind of figure. You have not told R anything specific about how you would like to use the information in the dataset.

The next thing you need to do is tell g_fig what type of figure you want by adding a geom_*() function. For example, we use geom_point() to create a scatter plot. To create a scatter plot, R needs to know which variables should be on the y-axis and x-axis. This information is passed by adding geom_point(aes(x = d3_5_9, y = corn_yield)) to g_fig, with the result stored here as g_fig_scatter.
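A sketch of the scatter plot step, again using a made-up stand-in for county_yield (the data values are assumptions; the function calls follow the description in the notes):

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

g_fig <- ggplot(data = county_yield)

# Add a point layer and map dataset variables to the x- and y-axes
g_fig_scatter <- g_fig +
  geom_point(aes(x = d3_5_9, y = corn_yield))

g_fig_scatter
```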



Here,

  • geom_point() was added to g_fig to declare that you want a scatter plot
  • aes(x = d3_5_9, y = corn_yield) inside geom_point() tells R that you want to create a scatter plot where you have d3_5_9 on the x-axis and corn_yield on the y-axis

This is what g_fig_scatter looks like:

Going back to the code, note that x = d3_5_9, y = corn_yield are inside aes().


Important

aes() makes the aesthetics of the figure a function of variables in the dataset that you told ggplot() to use (here, county_yield).


aes(x = d3_5_9, y = corn_yield) is telling ggplot to use d3_5_9 and corn_yield variables in the county_yield dataset for the x-axis and y-axis, respectively.

If you do not have x = d3_5_9, y = corn_yield inside aes(), R looks for objects named d3_5_9 and corn_yield in your R environment rather than in county_yield, and fails because you have not defined them.
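To see this failure without stopping your session, the sketch below wraps the call in tryCatch() and keeps the error message. It assumes that no objects named d3_5_9 or corn_yield exist in your environment; the exact wording of the message may differ across R versions.

```r
library(ggplot2)

# Variables passed outside aes() are evaluated as ordinary R objects,
# so R fails when it cannot find d3_5_9 outside the dataset
msg <- tryCatch(
  geom_point(x = d3_5_9, y = corn_yield),
  error = function(e) conditionMessage(e)
)

msg
```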

In summary, the basic steps are:

  • ggplot(data = dataset) to initiate the process of creating a figure

  • add geom_*() to declare what kind of figure you would like to make

  • specify what variables in the dataset to use and how they are used inside aes()

  • place the aes() you defined above in the geom_*() you specified above

Different types of figures



ggplot2 lets you create lots of different kinds of figures via various geom_*() functions.

  • geom_histogram()/geom_density()
  • geom_line()
  • geom_boxplot()
  • geom_bar()

How you specify aesthetics varies by geom_*().

Note

geom_histogram() only needs x.
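A minimal histogram sketch, assuming a made-up stand-in for county_yield. Setting bins explicitly is a choice made here to avoid the default-bins message.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(corn_yield = rnorm(50, mean = 170, sd = 15))

# Only x is mapped; the counts on the y-axis are computed by ggplot2
p_hist <- ggplot(data = county_yield) +
  geom_histogram(aes(x = corn_yield), bins = 30)

p_hist
```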

Note

geom_density() only needs x.
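The density version looks almost the same as a histogram call. This sketch again uses made-up data in place of county_yield.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(corn_yield = rnorm(50, mean = 170, sd = 15))

# Only x is mapped; the density on the y-axis is estimated by ggplot2
p_density <- ggplot(data = county_yield) +
  geom_density(aes(x = corn_yield))

p_density
```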

Note

geom_line() needs x and y.
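A line plot sketch using average corn yield by year. The data values are made up, and the name mean_yield for the derived dataset is hypothetical, since the notes do not name it.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(
  year = rep(2000:2009, each = 10),
  corn_yield = rnorm(100, mean = 170, sd = 15)
)

# Derived dataset: average corn yield by year (the name is an assumption)
mean_yield <- aggregate(corn_yield ~ year, data = county_yield, FUN = mean)

p_line <- ggplot(data = mean_yield) +
  geom_line(aes(x = year, y = corn_yield))

p_line
```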

Note

  • geom_boxplot() needs x and y.
  • Why factor(year)?
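A boxplot sketch that also suggests an answer to the question above, assuming year is stored as a number (as it is in this made-up stand-in). Wrapping it in factor() makes ggplot2 treat each year as a discrete group and draw one box per year. With a numeric x and no group aesthetic, geom_boxplot() would draw a single box instead.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(
  year = rep(2000:2009, each = 10),
  corn_yield = rnorm(100, mean = 170, sd = 15)
)

# factor(year) turns the numeric year into discrete groups
p_box <- ggplot(data = county_yield) +
  geom_boxplot(aes(x = factor(year), y = corn_yield))

p_box
```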

Note

geom_bar() needs x and y when used with stat = "identity". With its default stat = "count", it needs only x.
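A bar chart sketch with bar heights taken directly from the data, which is why stat = "identity" is set. The data values and the name mean_yield are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(
  year = rep(2000:2009, each = 10),
  corn_yield = rnorm(100, mean = 170, sd = 15)
)
mean_yield <- aggregate(corn_yield ~ year, data = county_yield, FUN = mean)

# stat = "identity" uses the y values as bar heights instead of counting rows
p_bar <- ggplot(data = mean_yield) +
  geom_bar(aes(x = year, y = corn_yield), stat = "identity")

p_bar
```

geom_col() is a shorthand for the same thing.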

Modifying how figures look

All the elements in the figures we have created so far are in black and white.

You can change how figure elements look by providing options inside geom_*().

Here is a list of options that control the appearance of figure elements:

  • fill
  • color
  • size
  • shape
  • linetype

The elements of a figure that you can modify differ by geom type.

The same option name can mean different things depending on the geom type.
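The sketch below illustrates both points with made-up data. Options placed outside aes() apply a fixed value to the whole layer. In geom_point(), color sets the color of the points, while in geom_histogram() it sets the border of the bars and fill sets their interior.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

# color, size, and shape control how the points look
p_point <- ggplot(data = county_yield) +
  geom_point(aes(x = d3_5_9, y = corn_yield),
             color = "red", size = 2, shape = 17)

# color is the bar border here, fill is the bar interior
p_hist <- ggplot(data = county_yield) +
  geom_histogram(aes(x = corn_yield), bins = 30,
                 color = "blue", fill = "white")
```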

Exercises

This exercise uses the diamonds dataset from the ggplot2 package. First, load the dataset and extract observations with Premium cut whose color is one of E, I, and F:
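The original data preparation code is not reproduced here. One possible way to build premium, sketched with base R's subset() (dplyr::filter() would work equally well if the tidyverse is loaded):

```r
library(ggplot2)

# diamonds ships with ggplot2; keep Premium-cut diamonds of color E, I, or F
premium <- subset(diamonds, cut == "Premium" & color %in% c("E", "I", "F"))

head(premium)
```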

Using carat and price variables from premium, generate the figure below:

Code
ggplot(data = premium) +
  geom_point(aes(x = carat, y = price), color = "red")

Using the price variable from premium, generate the histogram of price shown below:

Code
ggplot(data = premium) +
  geom_histogram(aes(x = price), fill = "white", color = "blue")

Other supplementary geom_*()s


Here is a list of useful supplementary geom_*() functions:

  • geom_vline(): draw a vertical line
  • geom_hline(): draw a horizontal line
  • geom_abline(): draw a line with the specified intercept and slope
  • geom_smooth(): draw a smoothed fit line (loess or GAM by default; use method = "lm" for an OLS-estimated regression line)
  • geom_ribbon(): create a shaded area
  • geom_text() and annotate(): add text to the figure

We will use g_fig_scatter to illustrate how these functions work.

Note

  • xintercept in geom_vline: where the vertical line is placed
  • yintercept in geom_hline: where the horizontal line is placed
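A sketch of both reference lines added to g_fig_scatter. The data are a made-up stand-in for county_yield, and the positions 0.5 and 160 are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

g_fig_scatter <- ggplot(data = county_yield) +
  geom_point(aes(x = d3_5_9, y = corn_yield))

# Vertical line at x = 0.5 and horizontal line at y = 160
p_lines <- g_fig_scatter +
  geom_vline(xintercept = 0.5, color = "red") +
  geom_hline(yintercept = 160, color = "blue")

p_lines
```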

Note

\[y = a + b\times x\]

  • intercept: \(a\)
  • slope: \(b\)
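A sketch of geom_abline() on top of g_fig_scatter, with a = 180 and b = -40. These values are arbitrary; they happen to match how the made-up stand-in data below were generated.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

g_fig_scatter <- ggplot(data = county_yield) +
  geom_point(aes(x = d3_5_9, y = corn_yield))

# Draw the line y = 180 - 40 * x
p_abline <- g_fig_scatter +
  geom_abline(intercept = 180, slope = -40, color = "red")

p_abline
```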

Note

Also try adding method = "lm".
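A sketch of geom_smooth() with method = "lm", using made-up data. Because g_fig_scatter defines its aes() inside geom_point() rather than inside ggplot(), the mapping is not shared across layers, so geom_smooth() needs its own aes().

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

g_fig_scatter <- ggplot(data = county_yield) +
  geom_point(aes(x = d3_5_9, y = corn_yield))

# method = "lm" fits a linear regression instead of the default smoother
p_smooth <- g_fig_scatter +
  geom_smooth(aes(x = d3_5_9, y = corn_yield), method = "lm")

p_smooth
```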

Note

  • ymin: lower bound of the ribbon
  • ymax: upper bound of the ribbon

It is useful when drawing confidence intervals.
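As one possible illustration of the confidence interval use case, the sketch below fits a linear model to made-up data, computes a 95% confidence band with predict(), and draws it with geom_ribbon(). The object names fit and pred are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

# Fit OLS and compute the confidence band on a grid of x values
fit <- lm(corn_yield ~ d3_5_9, data = county_yield)
pred <- data.frame(d3_5_9 = seq(0, 1, by = 0.05))
ci <- predict(fit, newdata = pred, interval = "confidence")
pred$fit <- ci[, "fit"]
pred$lwr <- ci[, "lwr"]
pred$upr <- ci[, "upr"]

# The ribbon spans ymin to ymax at each x; the line shows the fitted values
p_ribbon <- ggplot(data = pred) +
  geom_ribbon(aes(x = d3_5_9, ymin = lwr, ymax = upr),
              fill = "grey70", alpha = 0.5) +
  geom_line(aes(x = d3_5_9, y = fit))

p_ribbon
```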

Note

  • x, y: position where the text is placed
  • label: variable to print
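A geom_text() sketch that prints each year's average yield next to its point on a line plot. The data values and the name mean_yield are assumptions; vjust is used here only to nudge the labels above the line.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(
  year = rep(2000:2009, each = 10),
  corn_yield = rnorm(100, mean = 170, sd = 15)
)
mean_yield <- aggregate(corn_yield ~ year, data = county_yield, FUN = mean)

# label is mapped inside aes() because it varies by observation
p_text <- ggplot(data = mean_yield) +
  geom_line(aes(x = year, y = corn_yield)) +
  geom_text(aes(x = year, y = corn_yield, label = round(corn_yield, 1)),
            vjust = -0.5)

p_text
```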

Note

  • x: position on the x-axis
  • y: position on the y-axis
  • label: text to print (use \n to break the line)
  • size: font size
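An annotate() sketch on top of g_fig_scatter, with made-up data. Unlike geom_text(), the position and text are given directly rather than mapped from a dataset. The coordinates and label below are arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for county_yield (made-up values, illustration only)
set.seed(1)
county_yield <- data.frame(d3_5_9 = runif(50))
county_yield$corn_yield <- 180 - 40 * county_yield$d3_5_9 + rnorm(50, sd = 10)

g_fig_scatter <- ggplot(data = county_yield) +
  geom_point(aes(x = d3_5_9, y = corn_yield))

# "\n" inside the label starts a new line
p_annotate <- g_fig_scatter +
  annotate("text", x = 0.75, y = 185,
           label = "Corn yield\nvs. drought", size = 5)

p_annotate
```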