4  Handling Dates

Date information is common. Often, getting dates into R for processing requires special attention.

The following sections overview some of the techniques, using several important packages, to get date variables into a proper format.

4.1 Common Date Formats

Usually, you’ll have dates written with all of the information (year, month, and day) placed together.

4.1.1 Notational Variations

There are a few notational differences to the components.

  • Day: is always an integer number (15).
  • Month: either an integer number (1), an abbreviation (JAN), or the full month name (January). Note that sometime, people have used Roman numerals for the date. Those can be converted to a number and used in the functions.
  • Year: either a two digit number (20) or a four digit number (2020).

4.1.2 Component Order

Besides these notational differences, the order in which the values are supplied can vary. Here are two common variations.

  • dmy: this is often done in the US (1-15-2020.
  • ymd: the format generally used in the rest of the world (2020-1-15).

4.1.3 Separating Symbols

There are several different symbols used to separate the three parts of the date. Here are two of the most common styles.

  • minus symbol: a minus or short-dash symbol (2020-1-15).
  • slash symbol: a “forward” slash (2020/1/15).
  • blank space: just a single blank space (2020 1 15).

The important thing is consistency. With all of the dates given in the same order (e.g., ymd, mdy), it is possible to convert them from a character string to a true R date value.

## Activate the Core Packages
library(tidyverse) ## Brings in a core of useful functions
library(gt)        ## Tables

## Specialized Packages
library(webshot2)  ## Output PNG or PDF files from gt
library(gtExtras)  ## gt table formatting and highlighting
library(RcppBDT)   ## Used for the getNthDayOfWeek function
library(timeDate)  ## Used for the Easter function

4.2 Recode with Lubridate

You will almost always need to convert data input values of dates into a proper Date data format.

R has a data type of date. The data used in all the timeline functions requires that the date values be in this date data type. Note that inside R, values of the Date data type are stored as the number of days from January 1, 1970. This is just an internal anchor that allows R to do the proper calculations and formatting.

The R library lubridate has functions that convert text and numeric representations of dates into true Date values.

Example are shown in the next few chunks.

Note that the dates in the example data are a mixture of formats. The important thing is that the order of the date elements (month, day, and year) are consistent.

4.2.1 Converting data input

The following chunk uses the tidyverse function read_csv to input some example data.

Note that all of the raw_date values are converted to the same format (Ymd). The date column holds Date values. This is possible because all the date components are in the same order (year, month, day).

## Read data in a data frame format using the CSV reader.
## All the date values are in the same order (ymd).
x1 <- read_csv(col_names = TRUE, show_col_types = FALSE, file =
             "raw_date,       comment
              2020-3-15,      standard Ymd with dashes
              2020/4/2,       slash delimeters
              2021 5 16,      blank delimeters
              2021-dec-30,    month abbrev
              20 July 15,     year abbrev and month name")

## Convert from text to the Date data type.
x1$date <- ymd(x1$raw_date)

## Print a neat table.
gt(x1) |>
  tab_source_note(source_note = "Source: Example data")
Various formats entered as ymd
raw_date comment date
2020-3-15 standard Ymd with dashes 2020-03-15
2020/4/2 slash delimeters 2020-04-02
2021 5 16 blank delimeters 2021-05-16
2021-dec-30 month abbrev 2021-12-30
20 July 15 year abbrev and month name 2020-07-15
Source: Example data

4.2.2 Additional date format challenges

This set of data, which also has format challenges, is arranged in a month, day, year (mdy) format. The result of the processing is a data frame with a proper date variable (date).

## Read data in a tibble format using tribble.
## The order of elements (month, day, year) is the same in each row.
x2 <- read_csv(col_names = TRUE, show_col_types = FALSE, file = 
  "raw_date,           comment
  2/15/22,             slash separators
  2-15-2022,           minus-sign separators
  Feb 15 2022,         month abbrev and blank separators
  February 15 2022,    full month name
  Feb 15 22,           two-digit year
  sep 1 22,            three letter month
  sept 2 22,           four letter month")

## Use lubridate mdy function to convert to the Date data type.
x2$date <- mdy(x2$raw_date)

## Print a neat table.
gt(x2)|>
  tab_source_note(source_note = "Source: Example data")
Various formats entered as mdy
raw_date comment date
2/15/22 slash separators 2022-02-15
2-15-2022 minus-sign separators 2022-02-15
Feb 15 2022 month abbrev and blank separators 2022-02-15
February 15 2022 full month name 2022-02-15
Feb 15 22 two-digit year 2022-02-15
sep 1 22 three letter month 2022-09-01
sept 2 22 four letter month 2022-09-02
Source: Example data

4.3 Roman numerals

Some researchers try to avoid the ambiguity of months versus days by using a Roman numeral for the month. This notation can be converted as shown in the following example.

## Roman numeral entry for months.
x3 <- read_csv(col_names = TRUE, show_col_types = FALSE, file = 
  "day, rmonth, year
  15,   X,      2020
  1,    VI,     2020
  12,   I,      2020")

## Convert Roman numeral to numeric, 
## paste parts together, and convert to a date.
x3$date <- dmy(paste(x3$day,
                     as.numeric(as.roman((x3$rmonth))),
                     x3$year))

## Print the results in a neat table.
gt(x3) |>
  tab_source_note(source_note = "Source: Example data")
Dates entered with months as Roman numerals
day rmonth year date
15 X 2020 2020-10-15
1 VI 2020 2020-06-01
12 I 2020 2020-01-12
Source: Example data

This conversion test assumed that each of the date components is in a separate variable. That’s not always the case. It is possible to do some programming to account for other formats, but let’s just give the problem to ChatGPT (or, likely, any of the other large-language models).

Note that the data are really messy. There is a variety of delimeters. Several dates are only two digits for the year.

Here is the ChatGPT-4 request.

Here are some dates that are in the format of day, month and year. The month is Roman numerals. Make these dates into numeric values in the format "y-m-d", please. Here are the dates:

25XII1947

5V2012

12 VIII 87

8 iv 2022

11-VI-77

I then asked for the result as a CSV table. That’s entered into the following code chunk.

roman_test <- read_csv(col_names=TRUE, show_col_types=FALSE, file=
        "before,     after
         25XII1947,  1947-12-25
         5V2012,     2012-05-05
         12 VIII 87, 1987-08-12
         8 iv 2022,  2022-04-08
         11-VI-77,   1977-06-11")

gt(roman_test) |>
  tab_source_note(source_note = "Source: Example data")
Messy dates with Roman numerals
before after
25XII1947 1947-12-25
5V2012 2012-05-05
12 VIII 87 1987-08-12
8 iv 2022 2022-04-08
11-VI-77 1977-06-11
Source: Example data

This might be a demonstration that large-language models can have considerable utility in rescuing data that are messy.

4.3.1 Putting pieces of the date together

A previous example (the first with Roman numerals) shows how to put separate values for the day, month and year together to form a date. It is simply a combination of pasting the three variables together and then converting the resulting string to the Date type with one of the lumbridate functions.

4.3.2 Reordering the data

Sometimes you need to do some wrangling to get the pieces you need and then arrange the dates in the proper format.

Each date entry must be a separate row in the data frame.

An example shows how this can be done. This is a fairly complex problem as several things are being done.

Here are some things to note.

  • There are nine events in the data. They are arranged as three rows.
  • The year is separate from each event’s date. The two values need to be combined.
  • The events (and their dates) need to be made into a data frame that has one event per row.
  • The year is an important element as it can be used to group events later.
## Read the data.
x4 <- read_csv(col_names = TRUE, show_col_types = FALSE, file = 
  "year, planning,  event,      review
  2017,  March 23,  April 25,   May 30
  2018,  January 4, February 9, February 25
  2019,  April 17,  May 12,     June 20")

## Show the original data in a table.
gt(x4) |>
  tab_source_note(source_note = "Source: Example data")
Data needing to have the arrangement changed
year planning event review
2017 March 23 April 25 May 30
2018 January 4 February 9 February 25
2019 April 17 May 12 June 20
Source: Example data

4.3.3 Reconfigure the data

The date data come from different columns. Each date is a different event. All this must be brought together and reformatted.

The dplyr package provides the tools for manipulating the data.

  • The mutate function creates new variables (columns).
  • The pivot_longer function shifts data from columns to rows.
  • The rename function gives columns new names.
  • The select function keeps just the named columns.
## Wrangle using data from the previous chunk.
x4_new <- x4 %>% 
  ## combine cols, make proper dates
  mutate(planning = ymd(paste(year,planning))) %>%      
  mutate(event    = ymd(paste(year,event))) %>% 
  mutate(review   = ymd(paste(year,review))) %>% 
  ## Reconfigure cols to rows
  pivot_longer(cols= c(planning, event, review)) %>% 
  ## Rename the columns
  rename(Event  = name, Date = value)

## Show the newly formatted data.
gt(x4_new) |>
  tab_source_note(source_note = "Source: Example data")
Reconfigured data (version 1)
year Event Date
2017 planning 2017-03-23
2017 event 2017-04-25
2017 review 2017-05-30
2018 planning 2018-01-04
2018 event 2018-02-09
2018 review 2018-02-25
2019 planning 2019-04-17
2019 event 2019-05-12
2019 review 2019-06-20
Source: Example data
## Alternative table presentation
gt(x4_new,
  groupname_col = "year")|>
  tab_source_note(source_note = "Source: Example data")
Reconfigured data (version 2)
Event Date
2017
planning 2017-03-23
event 2017-04-25
review 2017-05-30
2018
planning 2018-01-04
event 2018-02-09
review 2018-02-25
2019
planning 2019-04-17
event 2019-05-12
review 2019-06-20
Source: Example data

4.4 Calculate Holidays

Sometimes, it’s useful to have a way to calculate the date of a holiday for a future (or, maybe, past) date. There are two functions that accomplish this task.

  • getNthDayOfWeek
  • Easter

These functions come from the libraries RcppBDT and timeDate. The other functions come from lubridate, part of the tidyverse package.

## Set a year for the calculations
yr <- 2024

## Dates are not yet known, so they are entered as NA
date_table <- read_csv(col_names=TRUE, show_col_types=FALSE, file= 
    "Holiday,          Rule,                  date
     Hawaii Statehood, Third Friday in August, NA
     Memorial Day,     Last Monday in May,     NA
     Easter Sunday,    Complex definition,     NA
     Good Friday,      Two days before Easter, NA")

## Hawaii Statehood Day (3rd Friday in August)
hawaii_statehood <- getNthDayOfWeek(third, Fri, 8, yr)
date_table$date[1] <- as.character(hawaii_statehood)

## Memorial day is the Last Monday in May
## There isn't a "last" option for this function (must calculate)
mon5 <- getNthDayOfWeek(fifth, Mon, 5, yr)

if(mon5 < 23){memorial_day <- 
  getNthDayOfWeek(sixth, Mon, 5, yr)} else {memorial_day <- mon5}

date_table$date[2] <- as.character(memorial_day)

## Easter Sunday
easter_sunday <- as.Date(Easter(yr))
date_table$date[3] <- as.character(easter_sunday)

## Good Friday
good_friday <- as.Date(easter_sunday - 2)
date_table$date[4] <- as.character(good_friday)

## Put the results in a table
gt(date_table) |>
  tab_source_note(source_note = "Source: Rule calculations")
Calculated dates (year = 2024)
Holiday Rule date
Hawaii Statehood Third Friday in August 2024-08-16
Memorial Day Last Monday in May 2024-05-27
Easter Sunday Complex definition 2024-03-31
Good Friday Two days before Easter 2024-03-29
Source: Rule calculations