07-1: Date and String Management

Tips to make the most of the lecture notes

Interactive navigation tools
Running and writing code

Click on the three horizontally stacked lines at the bottom left corner of the slide to see the table of contents, from which you can jump to the section you want
Hit letter “o” on your keyboard and you will have a panel view of all the slides

The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
Hit the “Run Code” button to execute all the code inside the code area.
You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
If you want to run the code on your computer, first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it onto your computer.
You can click on the reload button (top right corner of the code chunk, to the left of the copy button) to revert to the original code.

Data Preparation

We use the pizzaplace dataset, which is available in the gt package.
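A minimal way to load the data, assuming the gt package is installed (the guard below is only there so the sketch does not fail when it is not):

```r
# Load pizzaplace from the gt package if gt is available
if (requireNamespace("gt", quietly = TRUE)) {
  pizzaplace <- gt::pizzaplace
  head(pizzaplace)
}
```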

Date

R has an object class called Date.

This is a date as character.

This is a date as Date.
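As an illustration of the two cases above (the object names date_chr and date_obj are made up for this sketch):

```r
# A date stored as a character string
date_chr <- "2010-12-15"
class(date_chr) # "character"

# The same date stored as a Date object
date_obj <- as.Date("2010-12-15")
class(date_obj) # "Date"
```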

Recording dates as a Date object instead of a string has several benefits:

calendar math is possible with Date objects
you can filter() based on the chronological order of dates
converting a date into another format is easy

Dates stored as strings come in various formats. Here are several of them:

2010-12-15
12/15/2010
Dec 15 10
15 December 2010

They all represent the same date.

We can use as.Date() to transform dates stored as characters into Dates.

#--- NOT RUN ---#  
as.Date(date in character, format)

In format you specify how day, month, and year are represented in the date characters you intend to convert using special symbols including:

%d: day of the month as a number (01-31)
%m: month as a number (01, 02, ..., 12)
%b: abbreviated month (Jan, ..., Dec)
%B: unabbreviated month (January, ..., December)
%y: 2-digit year (96 for 1996, 02 for 2002)
%Y: 4-digit year (1996, 2012)

Example
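A sketch that converts the four date strings listed earlier. The object names d1 to d4 are made up here, and the month-name formats (%b, %B) assume an English locale:

```r
# Each call should return the same Date (2010-12-15) in an English locale
d1 <- as.Date("2010-12-15", format = "%Y-%m-%d")
d2 <- as.Date("12/15/2010", format = "%m/%d/%Y")
d3 <- as.Date("Dec 15 10", format = "%b %d %y")
d4 <- as.Date("15 December 2010", format = "%d %B %Y")

d1
d2
```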

Alternatively, you can use the lubridate package to easily convert dates recorded in characters into Dates.

Using lubridate, you do not need to provide the format information, unlike with as.Date().

Instead, you simply use y (year), m (month), and d (day) in the order they appear in the date strings.

Example
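A sketch with the same four date strings as before; the function name simply spells out the order of year, month, and day in each string:

```r
library(lubridate)

ymd("2010-12-15")       # year, month, day
mdy("12/15/2010")       # month, day, year
mdy("Dec 15 10")        # abbreviated month name and 2-digit year
dmy("15 December 2010") # day, month, year
```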

It is often the case that date values are not formatted in the way you want (e.g., when you are creating figures).

While you can use string manipulation functions to reformat dates (which we learn next in this lecture), it is easier to just use the format() function.

#--- NOT RUN ---#  
format(Date, format)

You can use the same rule for the format argument as the one we saw earlier when using as.Date().

Example
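A small sketch of format() applied to a Date (the object name d is made up; the month name assumes an English locale):

```r
d <- as.Date("2010-12-15")

format(d, "%m/%d/%y") # "12/15/10"
format(d, "%Y")       # "2010"
format(d, "%d %B %Y") # "15 December 2010" in an English locale
```

Note that format() returns a character string, not a Date.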

You can extract components (year, month, day) from a Date object using various helper functions offered by lubridate.

year(): year
month(): month
mday(): day of month
yday(): day of year
wday(): day of week

Examples
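A sketch of the helper functions listed above, applied to a single made-up Date d:

```r
library(lubridate)

d <- ymd("2010-12-15")

year(d)  # 2010
month(d) # 12
mday(d)  # 15
yday(d)  # 349
wday(d)  # 4 (Wednesday, when the week is taken to start on Sunday, the default)
wday(d, label = TRUE) # day of week as a labeled factor
```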

Unlike dates in character, you can do some math on Date objects.

addition and subtraction
sequence of dates
filter (logical evaluation)

You can use years(), months(), days() from the lubridate package to add specified years, months, and days, respectively.

You can use seq() to create a sequence of dates, where the incremental step is defined by the by option.
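The three kinds of Date math listed above can be sketched as follows (d is a made-up example date):

```r
library(lubridate)

d <- ymd("2010-12-15")

# addition and subtraction
d + years(1)          # "2011-12-15"
d + months(2)         # "2011-02-15"
d - days(20)          # "2010-11-25"
ymd("2010-12-25") - d # Time difference of 10 days

# sequence of dates, one month apart
seq(d, by = "month", length.out = 3)

# logical evaluation, which is what filter() relies on
d > ymd("2010-01-01") # TRUE
```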

String manipulation

Introduction
Concatenate
Split
Replace
Detect
Letter case
Padding

Package

For string (character) manipulation, we use the stringr package, which is part of the tidyverse package. So, you have installed it already.

stringr is loaded automatically when you load tidyverse. So, just load tidyverse.

library(tidyverse)

Resources

Functions

Here are the selected functions we cover in this lecture:

join and split
- stringr::str_c()
- stringr::str_split() (tidyr::separate())
mutate strings
- stringr::str_replace()
detect matches
- stringr::str_detect()
manage lengths
- stringr::str_trim()
- stringr::str_pad()

stringr::str_c() lets you concatenate a vector of strings. It is basically the same as paste().

join 1
join 2
join 3
use cases

concatenate

order matters

separator

more than two strings
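The four cases above (basic concatenation, order, separator, more than two strings) can be sketched with made-up strings:

```r
library(stringr)

str_c("R", "rocks")            # "Rrocks"
str_c("rocks", "R")            # "rocksR" (order matters)
str_c("R", "rocks", sep = " ") # "R rocks"
str_c("R", "rocks", "big", "time", sep = " ") # "R rocks big time"
```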

a string and a vector of strings

Each of the vector elements (verbs) is concatenated with a string ("R")
The separator ("+") is applied to all the vector elements
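A sketch of this case. The contents of verbs are hypothetical, chosen only for illustration:

```r
library(stringr)

verbs <- c("rocks", "rules") # hypothetical values

# "R" is paired with every element of verbs
str_c("R", verbs, sep = "+") # "R+rocks" "R+rules"
```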

collapsing a vector of strings to a single string

The collapse option collapses all the vector elements into a single string, with the collapse separator (here, %) placed between the individual vector elements
sep = "+" is applied when concatenating a vector of strings and a string, and collapse = "%" is applied when concatenating the resulting vector of strings.
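A sketch of sep and collapse working together, again with hypothetical values for verbs:

```r
library(stringr)

verbs <- c("rocks", "rules") # hypothetical values

# sep joins "R" with each verb; collapse then joins the results
str_c("R", verbs, sep = "+", collapse = "%") # "R+rocks%R+rules"

# collapse alone turns a vector into a single string
str_c(verbs, collapse = "%") # "rocks%rules"
```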

two vectors of equal length

The nth element of one vector (software_types) is paired with the nth element of the other vector (verbs).
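A sketch with two vectors of length two (the values of software_types and verbs are hypothetical):

```r
library(stringr)

software_types <- c("R", "Python") # hypothetical values
verbs <- c("rocks", "rules")       # hypothetical values

# element-wise pairing: first with first, second with second
str_c(software_types, verbs, sep = " ") # "R rocks" "Python rules"
```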

two vectors of different lengths

The nth element of one vector (software_types) is paired with the nth element of the other vector (verbs), with verbs recycled for the elements of software_types that have no positional match.
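A sketch of recycling with hypothetical values. It uses base R's paste() rather than str_c(), because stringr 1.5.0 and later recycle only length-1 vectors and raise an error for vectors of lengths 3 and 2:

```r
software_types <- c("R", "Python", "Stata") # hypothetical values
verbs <- c("rocks", "rules")                # hypothetical values

# verbs is recycled: the third element of software_types is paired with verbs[1]
paste(software_types, verbs) # "R rocks" "Python rules" "Stata rocks"
```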

all combinations

take advantage of the recycling feature to create all possible combinations of values
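One possible way to do this, sketched with hypothetical values and base R's paste() (chosen here because recent versions of stringr recycle only length-1 vectors): repeat each element of one vector, then let the other vector recycle against it.

```r
software_types <- c("R", "Python", "Stata") # hypothetical values
verbs <- c("rocks", "rules")                # hypothetical values

# each software type is repeated once per verb; verbs is then recycled
all_combinations <- paste(rep(software_types, each = length(verbs)), verbs)

all_combinations
# "R rocks" "R rules" "Python rocks" "Python rules" "Stata rocks" "Stata rules"
```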

Concatenating string variables in a dataset
Reading files

Sometimes, you want to concatenate two (or more) string variables into one variable.

For example, suppose you would like to combine pizza size and type into a single variable to make it easier to create faceted figures by size-type.
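A sketch of the idea on a small stand-in data frame. The object pizza and its values are hypothetical; only the column names size and type mirror pizzaplace:

```r
library(dplyr)
library(stringr)

# hypothetical stand-in for pizzaplace with just the two relevant columns
pizza <- tibble(
  size = c("S", "M", "L"),
  type = c("classic", "veggie", "chicken")
)

# combine size and type into a single variable
pizza_combined <- pizza %>%
  mutate(size_type = str_c(size, type, sep = "-"))

pizza_combined
```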

You can use stringr::str_c() to create a vector of file names that have a common pattern.

For example, suppose you have files that are named following this convention: “corn_yield_X.csv”, where X represents the year.

You have such csv files starting from 2000 to 2020. Then,

file_names <- stringr::str_c("corn_yield_", 2000:2020, ".csv")

head(file_names)

[1] "corn_yield_2000.csv" "corn_yield_2001.csv" "corn_yield_2002.csv"
[4] "corn_yield_2003.csv" "corn_yield_2004.csv" "corn_yield_2005.csv"

Now, you can easily read each of them iteratively using a loop.
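A self-contained sketch of such a loop. So that it runs anywhere, it first writes three small made-up files to a temporary folder; the folder, the years, and the yield values are all hypothetical:

```r
# create three small stand-in csv files in a temporary folder
tmp_dir <- tempdir()
years <- 2000:2002
file_paths <- file.path(tmp_dir, stringr::str_c("corn_yield_", years, ".csv"))

for (i in seq_along(years)) {
  write.csv(
    data.frame(year = years[i], yield = 100 + i), # made-up values
    file_paths[i],
    row.names = FALSE
  )
}

# read each file in turn and stack the results into one data frame
corn_data <- do.call(rbind, lapply(file_paths, read.csv))

corn_data
```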

stringr::str_split() splits a string based on a pattern you provide:
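A sketch with made-up strings; note that the result is a list with one element per input string:

```r
library(stringr)

str_split("R-rocks", "-") # a list whose only element is c("R", "rocks")

str_split(c("S-classic", "L-veggie"), "-") # a list of length 2

# simplify = TRUE returns a character matrix instead of a list
str_split("R-rocks", "-", simplify = TRUE)
```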

But, if you are splitting a variable into two variables, tidyr::separate() is a better option.
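A sketch of tidyr::separate() on a hypothetical data frame with a single size_type column:

```r
library(tidyr)

# hypothetical data with one combined variable
size_type_data <- data.frame(size_type = c("S-classic", "L-veggie"))

# split size_type into two new variables at the "-"
separated <- separate(size_type_data, size_type, into = c("size", "type"), sep = "-")

separated
```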

Introduction
Use case

How

You can use stringr::str_replace() to replace the parts of strings that match a pattern with text you specify.

#--- Syntax ---#
stringr::str_replace(string, pattern, replacement)

Example

Note that only the first occurrence of “rock” in each element of the string vector was replaced with “rock big time.”

You need to use stringr::str_replace_all() to replace all the occurrences.
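The difference between the two functions can be sketched with a hypothetical vector x:

```r
library(stringr)

x <- c("we rock, you rock", "they rock") # hypothetical strings

# only the first "rock" in each element is replaced
str_replace(x, "rock", "rock big time")
# "we rock big time, you rock" "they rock big time"

# every "rock" in each element is replaced
str_replace_all(x, "rock", "rock big time")
# "we rock big time, you rock big time" "they rock big time"
```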

Suppose you would like to have a particular format of date in a figure you are trying to create using pizzaplace: e.g., 07/08/20 (month, day, year without the first 2 digits).

Pretend that date_text is the variable that indicates date and it looks like this:

So, you would like to replace “20” with “” (nothing).
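A sketch with hypothetical values for date_text. One caveat worth keeping in mind: because str_replace() replaces the first match, a plain "20" pattern also hits a day of 20, so an anchored pattern is shown as one possible safeguard.

```r
library(stringr)

date_text <- c("07/08/2015", "01/20/2015") # hypothetical values

# plain replacement: fine for the first date, but it removes the day in the second
str_replace(date_text, "20", "") # "07/08/15" "01//2015"

# anchored alternative: only a "20" that starts the 4-digit year at the end is dropped
str_replace(date_text, "/20(\\d{2})$", "/\\1") # "07/08/15" "01/20/15"
```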

Now you can create a figure with the dates in the desired format. From pizzaplace, you could have just done this:
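A sketch of that shortcut, using a hypothetical vector of dates written as year-month-day strings in place of the pizzaplace date variable:

```r
# hypothetical stand-in for the date variable
date_chr <- c("2015-07-08", "2015-01-20")

# convert to Date, then format as month/day/2-digit year
format(as.Date(date_chr), "%m/%d/%y") # "07/08/15" "01/20/15"
```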

Introduction
use cases

You can use stringr::str_detect() to check whether strings contain a pattern you specify.

It takes a vector of strings and a text pattern, and then returns a vector of TRUE/FALSE.

Example
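A sketch with a hypothetical vector of file names:

```r
library(stringr)

files <- c("corn_experiment_1.rds", "soy_experiment_1.rds") # hypothetical names

str_detect(files, "soy") # FALSE TRUE

# the logical vector can be used to subset the original vector
files[str_detect(files, "soy")] # "soy_experiment_1.rds"
```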

get the list of file names
Define a group from a variable

First clone this repository.

Inside data/data-for-loop-demo, there are two sets of files in a single folder: corn_experiment_x.rds and soy_experiment_y.rds, where both x and y range from 1 to 30.

You want to read only the soy files.

First, let’s get the names of all the files in that folder:

all_files <- 
  list.files(
    here::here("supplementary-material/data/data-for-loop-demo"),
    full.names = TRUE
  )

head(all_files, 2)

[1] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/corn_experiment_1.rds" 
[2] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/corn_experiment_10.rds"

tail(all_files, 2)

[1] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_8.rds"
[2] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_9.rds"

Now use stringr::str_detect() to find which elements of all_files include “soy.”

is_soy <- stringr::str_detect(all_files, "soy")

Okay so, here is the list of all the “soy” files:

all_files[is_soy]

 [1] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_1.rds" 
 [2] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_10.rds"
 [3] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_11.rds"
 [4] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_12.rds"
 [5] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_13.rds"
 [6] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_14.rds"
 [7] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_15.rds"
 [8] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_16.rds"
 [9] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_17.rds"
[10] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_18.rds"
[11] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_19.rds"
[12] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_2.rds" 
[13] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_20.rds"
[14] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_21.rds"
[15] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_22.rds"
[16] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_23.rds"
[17] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_24.rds"
[18] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_25.rds"
[19] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_26.rds"
[20] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_27.rds"
[21] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_28.rds"
[22] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_29.rds"
[23] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_3.rds" 
[24] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_30.rds"
[25] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_4.rds" 
[26] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_5.rds" 
[27] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_6.rds" 
[28] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_7.rds" 
[29] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_8.rds" 
[30] "/Users/tmieno2/Dropbox/TeachingUNL/Data-Science-with-R-Quarto/supplementary-material/data/data-for-loop-demo/soy_experiment_9.rds"

Now, you can loop to read all the files.

# read only the soy files (selected with is_soy) and stack them
(
soy_data <- 
  lapply(all_files[is_soy], \(x) readRDS(x)) %>%
  bind_rows()
)

# A tibble: 60,000 × 4
   N_rate     v corn_yield field_id
    <dbl> <dbl>      <dbl>    <dbl>
 1   248.  85.7       106.        1
 2   237.  56.4       105.        1
 3   227.  15.5       105.        1
 4   175.  33.3       105.        1
 5   236.  25.6       105.        1
 6   169. -13.6       105.        1
 7   237.  30.8       105.        1
 8   240.  32.4       105.        1
 9   158. -18.8       105.        1
10   247. -81.3       106.        1
# ℹ 59,990 more rows

Consider the following dataset of plant genes.

gene_data <- expand.grid(
  id = c("Zm_1", "Zm_2"), 
  gene = c("20_WW_BL_TP1", "20_WW_BL_TP", "20_WW_ML_TP1", "20_WW_ML_TP", "20_WW_TL_TP1", "20_WW_TL_TP3")
)

     id         gene
1  Zm_1 20_WW_BL_TP1
2  Zm_2 20_WW_BL_TP1
3  Zm_1  20_WW_BL_TP
4  Zm_2  20_WW_BL_TP
5  Zm_1 20_WW_ML_TP1
6  Zm_2 20_WW_ML_TP1
7  Zm_1  20_WW_ML_TP
8  Zm_2  20_WW_ML_TP
9  Zm_1 20_WW_TL_TP1
10 Zm_2 20_WW_TL_TP1
11 Zm_1 20_WW_TL_TP3
12 Zm_2 20_WW_TL_TP3

There are three different types of genes: those that contain _BL_, _ML_, and _TL_. The objective here is to create a variable that indicates the gene group based on the gene variable.

gene_data %>% 
  mutate(gene_group = case_when(
    stringr::str_detect(gene, "_BL_") ~ "BL",
    stringr::str_detect(gene, "_ML_") ~ "ML",
    stringr::str_detect(gene, "_TL_") ~ "TL"
  ))

     id         gene gene_group
1  Zm_1 20_WW_BL_TP1         BL
2  Zm_2 20_WW_BL_TP1         BL
3  Zm_1  20_WW_BL_TP         BL
4  Zm_2  20_WW_BL_TP         BL
5  Zm_1 20_WW_ML_TP1         ML
6  Zm_2 20_WW_ML_TP1         ML
7  Zm_1  20_WW_ML_TP         ML
8  Zm_2  20_WW_ML_TP         ML
9  Zm_1 20_WW_TL_TP1         TL
10 Zm_2 20_WW_TL_TP1         TL
11 Zm_1 20_WW_TL_TP3         TL
12 Zm_2 20_WW_TL_TP3         TL

Here is a collection of functions that let you change the letter case of strings.

To upper case

To lower case

Only the first letter is capitalized
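The three cases above can be sketched with stringr's case functions. The source does not say which function was used for the last case, so both str_to_title() and str_to_sentence() are shown:

```r
library(stringr)

str_to_upper("R rocks")    # "R ROCKS"
str_to_lower("R Rocks")    # "r rocks"
str_to_title("r rocks")    # "R Rocks" (first letter of each word)
str_to_sentence("r rocks") # "R rocks" (first letter of the string only)
```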

You can pad strings with a symbol of your choice so that the resulting strings are of the length you specify.

#--- NOT RUN ---#
# width: target string length; side: "left", "right", or "both"; pad: padding symbol
stringr::str_pad(string, width, side, pad)

Examples
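A sketch with made-up strings, covering the three values of side:

```r
library(stringr)

# pad on the left with zeros so every string has length 3
str_pad(c("1", "10", "100"), width = 3, side = "left", pad = "0") # "001" "010" "100"

# pad on the right
str_pad("R", width = 3, side = "right", pad = "!") # "R!!"

# pad on both sides
str_pad("R", width = 5, side = "both", pad = "*") # "**R**"
```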

Exercises

Data preparation
Exercise 1
Exercise 2

We will work with the following data:
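The original data are not reproduced here. As a hypothetical stand-in, a small date_data with the columns the exercises refer to (year, month, day, day_of_year) could be built like this; the values are made up:

```r
library(dplyr)
library(lubridate)

# hypothetical stand-in for date_data (values chosen only for illustration)
date_data <- tibble(
  year = c(2015, 2016),
  month = c(3, 12),
  day = c(14, 31)
) %>%
  # day_of_year is derived from the three components above
  mutate(day_of_year = yday(make_date(year, month, day)))

date_data
```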

Use stringr::str_c() to combine year, month, and day using “-” as the separator, and convert the combined text to Date using lubridate.

Work here
Answer

Code

date_data %>%
  mutate(date_as_str = str_c(year, month, day, sep = "-")) %>%
  mutate(date_as_Date = ymd(date_as_str)) %>%
  select(date_as_Date)

Use Date math to recover the dates from year and day_of_year.

Work here
Answer

Code

date_data %>%
  # "-01-01" (with a leading hyphen) gives a well-formed "YYYY-01-01" string
  mutate(first_day_of_year = ymd(str_c(year, "-01-01"))) %>%
  mutate(date = first_day_of_year + day_of_year - 1)