3  Data Input

All of the input strategies build on the ideas that the data needs to be observable, entered as easily as possible, and checked and updated with minimal effort.

Sometimes, the source of the data encourages a particular type of data entry. The quantity of data is important, too.

One data are entered and verified, it is likely the data structure will be changed in some way. A simple modification would be the addition of a new column or two. A more extreme change would be making a “long” structure from one that is “wide” (something explained later).

It is probably best to master a few basic strategies and stick to them. The reason is that there are a few gotcha things that can appear in your data structure if you make a slight change in your strategy. For example, one package can treat character strings as factors while a similar appearing package will keep strings as characters.

3.1 Data Frames and Tibbles

Tibbles (Müller and Wickham 2023) are a modern re-imagining of the data frame. Both of these structures hold data as rows and columns. A tibble is considered a more efficient and easier to use than a data frame, especially when working in the tidyverse ecosystem (Wickham et al. 2019).

As a result, data frames will be converted to tibbles in the examples shown here.

## Activate the Core Packages
library(tidyverse) ## Brings in a core of useful functions
library(gt)        ## Tables

## Specialized Packages
library(lubridate)     ## Date functions
library(googlesheets4) ## Read Google Sheets
library(Hmisc)         ## Comprehensive data summaries

3.2 Input Strategies

There are four basic input strategies shown here. Each is directed at a different input problem.

3.2.1 In-Code Table

The structure of the data resembles a square table. Each column has a descriptive header. This is the favored strategy for smaller sets of data.

The idea here is to use or emulate the structure of a CSV file inside the code.

One benefit is having the data stored inside a code chunk where the data are close to the processing steps.

This strategy works well when you don’t have too much data (perhaps < 50 observations). It is simple to use and allows early inspection of the data. The data are also easily updated either by adding new observation, adding a new variable (i.e., data column), or making a correction.

## Read the data
data <- read_csv(col_names = TRUE, file =
      "Name,      Age, Gender
       Jean Paul, 23,  male
       Micah,     28,  male
       Sarah,     24,  female
       Rose,      26,  female")

## Confirm the data
data
# A tibble: 4 × 3
  Name        Age Gender
  <chr>     <dbl> <chr> 
1 Jean Paul    23 male  
2 Micah        28 male  
3 Sarah        24 female
4 Rose         26 female

There are a number of things to note.

  • read_csv: This is part of the readr package.

  • Parameters: The col_headers = TRUE and file = are likely to be the same for all you uses of read_csv.

  • Characters: Character strings (like the names in the example) are kept as character and not converted into factors.

  • Parentheses & Quotation marks: The data are part of the parameter stet for read_csv and therefore inside the parentheses. Quotation marks occur before and after the data. Data that come as character strings don’t need to be enclosed in quotation marks unless there is an embedded comma.

  • Spacing: You can use spaces with the data so that data columns line up with the headers.

You can see that this is a simple and logical way to enter data.

3.2.2 In-Code Assignment as Vectors

The data are arranged in rows (i.e., vectors), with each row having a header (i.e., column name) followed by the data values.

This is the format shown in many R code examples. It is considered a poor choice here.

Here, data are entered as concatenated strings.

## Build the data frame
data <- data.frame(
   name   = c("Jean Paul", "Micah", "Sarah", "Rose"), 
   age    = c(23,28,24,26),
   gender = c("male","male","female","female"),
   stringsAsFactors = FALSE)

## Convert the data frame to a tibble
data <- as_tibble(data)

## Confirm the data
data
# A tibble: 4 × 3
  name        age gender
  <chr>     <dbl> <chr> 
1 Jean Paul    23 male  
2 Micah        28 male  
3 Sarah        24 female
4 Rose         26 female

There are a few things to note.

  • Shape: The visual reminder that the data (i.e., tibble) is square is obscured by the vector structure. Generally, each of the vectors (i.e., columns) will need to be the same length.

  • Character strings: Character strings need to be enclosed in quotation marks.

  • Data editing: If there are quite a few data values, it’s quite difficult to find places in the long strings of value where a correction (e.g., insertion of a dropped value) needs to be made.

  • Repeated values: Lists of numbers (or other values) can be created quite easily. For example, a sequence number column can be created with code like this inserted with the other vectors:

    seq = c(1:4),

  • Character vs Factor: The stringsAsFactors = FALSE statement is generally desirable if some of the data are character variables. There seems to be a tendency to define character strings as factors if you don’t use this statement.

3.2.3 External Spreadsheet

This looks like an In-Code Table as it is a square table. The difference is that data are stored outside the R code.

Entering data into an on-line spreadsheet, like Google Sheets, has an advantage of allowing access to other people working on a project.

Spreadsheet data entry is often quite efficient with facilities like auto-complete.

The example here uses Google Sheets. Getting access to a Google Sheet spreadsheet requires a bit of setup using Google Sheets (think: giving some permissions). Once done, you use the link that’s provided. The link looks something like this (which is fake):

 1z-gPNmPE-_2cBiZgsVB84uRWWa5yWqmIrHxshlg8k7Y

You can store this link in a text file (named Glink.txt here) and then use the stored link by importing it with a read_file function.

## Retrieve the stored Google Sheet link
link <- read_file("data/Glink.txt")

## Read the Google Sheet contents
data3 <- read_sheet(link)

## Confirm the data
data3
# A tibble: 4 × 3
  Name        Age Gender
  <chr>     <dbl> <chr> 
1 Jean Paul    23 male  
2 Micah        28 male  
3 Sarah        24 female
4 Rose         26 female

There are some ways to handle special situations.

  • Multiple Sheets: If you have more than one spreadsheet in the set, modify the read_sheet as in the following example (to use sheet 2):

    read_sheet(link, sheet = 2)

  • Specific Range of Cells: This is important when you want to eliminate superfluous information like title rows and cells that have equations.

If all of the values in a spreadsheet column are not the same type (e.g., number, character), the imported results may not be very satisfactory.

Here is an example with a range trim:

 read_sheet(link, range = "A1:C4")

3.2.4 External CSV File

This type of file has comma separated values (hence CSV). The file is stored somewhere on the computer, somewhere near (i.e., in the same folder structure as the R code) where they will be used.

Usually, the first row consists of header names.

Note that spreadsheets are often have their data saved in this format.

There are a lot of good reasons to store data in a file that is outside the R code.

  • Imported data: If you’ve been supplied with a CSV-formatted file, then you’re likely going to keep the data external to the code.

  • Lots of data: If you have hundreds of observations or more, it’s probably good to keep the data in a separate file.

  • Data spans many columns: If your data cover more real estate than fits on a line (e.g., > 50 characters), it may be better to store the data in a separate file.

  • Data security: You want to keep the data separate so that you are sure it won’t be changed accidentally as you build analysis code. Also, there may be data in the file that you don’t want other people to see.

The following code chunk show how to read an external CSV file.

## Define the file location
file_directory <- "data"

## Define the file name
file_name <- "external.csv"

## Combine the location and name
data_file <- paste0(file_directory,"/",file_name)

## Read the data
data <- read_csv(file = data_file)

## Confirm the data
data
# A tibble: 4 × 3
  Name        Age Gender
  <chr>     <dbl> <chr> 
1 Jean Paul    23 male  
2 Micah        28 male  
3 Sarah        24 female
4 Rose         26 female

3.3 Entering Dates

Dates can be a little tricky. Fortunately, there is Gromelund’s (2011) R package, lubridate , that is very helpful and quite straight forward to use (Grolemund 2021).

In the following example, the dates (i.e., Start, End) are read as character variables. After data confirmation, the values are converted to date variables.

Later on, a new variable is created (Months) using the lubridate functions interval and time_period. Note that there are several different values available for time_period, including "days", "weeks" and "years".

## Read the data
data <- read_csv(col_names = TRUE, file = 
      "Name,      Start,     End
       Jean Paul, 2021-4-15, 2022-7-13
       Micah,     2019-2-7,  2021-12-31
       Sarah,     2020-1-3,  2022-1-19
       Rose,      2022-1-15, 2022-10-5")

## Confirm the data
data
# A tibble: 4 × 3
  Name      Start     End       
  <chr>     <chr>     <chr>     
1 Jean Paul 2021-4-15 2022-7-13 
2 Micah     2019-2-7  2021-12-31
3 Sarah     2020-1-3  2022-1-19 
4 Rose      2022-1-15 2022-10-5 
## Convert to dates
data$Start <- ymd(data$Start)
data$End   <- ymd(data$End)

## Derive date-related values
period      <- interval(data$Start,data$End)
data$Months <- time_length(period,"months")
data$Days   <- time_length(period,"days")

## Confirm the data
data
# A tibble: 4 × 5
  Name      Start      End        Months  Days
  <chr>     <date>     <date>      <dbl> <dbl>
1 Jean Paul 2021-04-15 2022-07-13  14.9    454
2 Micah     2019-02-07 2021-12-31  34.8   1058
3 Sarah     2020-01-03 2022-01-19  24.5    747
4 Rose      2022-01-15 2022-10-05   8.67   263

3.4 Verifying the Data

Each of the examples has a simple statement that prints at least the start of the data tibble. This provides a first look at how R has interpreted the data. The size of the tibble (e.g., A tibble: 4 x 3), characteristics of each column (e.g., <chr>, <dbl>) and the first rows of data.

It is useful to do more.

3.4.1 Summary Function

The summary statement does what is needed as it is simple and intuitive.

data <- read_csv(col_names = TRUE, file = 
    "Group, Season, Score
     Alpha, Spring, 2
     Alpha, Summer, 3
     Alpha, Fall,   3
     Beta,  Spring, 2
     Beta,  Summer, 4
     Beta,  Fall,   3
     Beta,  Winter, 1")

## Basic statistics of the data
summary(data)
    Group              Season              Score      
 Length:7           Length:7           Min.   :1.000  
 Class :character   Class :character   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :3.000  
                                       Mean   :2.571  
                                       3rd Qu.:3.000  
                                       Max.   :4.000  

3.4.2 Describe Function

If you need more details, the Hmisc package (https://hbiostat.org/R/Hmisc/) has a describe function. This returns a much more comprehensive and statistically-oriented summary data.

Here is an example applied to the data in the previous chunk.

## Use the Hmisc Package function
describe(data)
data 

 3  Variables      7  Observations
--------------------------------------------------------------------------------
Group 
       n  missing distinct 
       7        0        2 
                      
Value      Alpha  Beta
Frequency      3     4
Proportion 0.429 0.571
--------------------------------------------------------------------------------
Season 
       n  missing distinct 
       7        0        4 
                                      
Value        Fall Spring Summer Winter
Frequency       2      2      2      1
Proportion  0.286  0.286  0.286  0.143
--------------------------------------------------------------------------------
Score 
       n  missing distinct     Info     Mean      Gmd 
       7        0        4    0.911    2.571    1.143 
                                  
Value          1     2     3     4
Frequency      1     2     3     1
Proportion 0.143 0.286 0.429 0.143
--------------------------------------------------------------------------------

3.5 Saving Data in a File

Often, you do a lot of work to input and configure a set of data so that it can be used in R for analyses.

There are cases where you’re very unlikely to change the data and you don’t want the “overhead” of running all the data configuration steps as you incrementally develop your analyses. This calls for generating an external version of your data. Such a data set is then easily entered into R for subsequent analyses.

There are several format in which you can store a single tibble object (i.e., your data).

3.5.1 RDS File

The saveRDS function lets you save data in a compressed form that maintains the data properties. This is an R-unique format that can’t be used with other programs. You can only open the file using R.

## Read some example data
data <- read_csv(col_names = TRUE, file = "data/external.csv")

## Save Data as an RDS file
saveRDS(data,"data/Data_RDS.RDS")

## Read the RDS file for confirmation
stored_data <- readRDS("data/Data_RDS.RDS") 

## Show the file contents
stored_data
# A tibble: 4 × 3
  Name        Age Gender
  <chr>     <dbl> <chr> 
1 Jean Paul    23 male  
2 Micah        28 male  
3 Sarah        24 female
4 Rose         26 female

3.5.2 CSV File

The write_csv function creates a familiar CSV-formatted file. Data in this format can be shared with other people and used in other programs (e.g., Excel or Sheets)

## Read some example data
data <- read_csv(col_names = TRUE, file = "data/external.csv")

## Save the data as a CSV file
write_csv(data,"data/Data_CSV.CSV")

## Read the CSV file for confirmation
stored_data <- read_csv("data/Data_CSV.CSV")

## Show the file contents
stored_data
# A tibble: 4 × 3
  Name        Age Gender
  <chr>     <dbl> <chr> 
1 Jean Paul    23 male  
2 Micah        28 male  
3 Sarah        24 female
4 Rose         26 female

3.6 Column Names

There are times when you’ll need to change the name of a column in your data. The dplyr package is very useful.

3.6.1 Renaming a Column

Use the dplyr function ## Confirm the data

data. In this example, we’re changing the names of two columns.

## Read the data
data <- read_csv(col_names = TRUE, file = 
     "Name,    Institution, Class
      Jasmine, North High,  Freshman
      Harriet, North High,  Senior
      Milton,  South High,  Junior
      Xavier,  South High,  Senior")

## Rename two columns
data <- data |> 
  dplyr::rename(School = Institution) |> 
  dplyr::rename(Student = Name)

## Confirm the data
data
# A tibble: 4 × 3
  Student School     Class   
  <chr>   <chr>      <chr>   
1 Jasmine North High Freshman
2 Harriet North High Senior  
3 Milton  South High Junior  
4 Xavier  South High Senior  

3.6.2 Shifting to Lower Case

There is a dplyr function for this. The example shows how it works.

## Use the data from the previous chunk

## Shift all the column names to lowercase
data <- data |> 
  dplyr::rename_with(tolower)

## Confirm the data
data
# A tibble: 4 × 3
  student school     class   
  <chr>   <chr>      <chr>   
1 Jasmine North High Freshman
2 Harriet North High Senior  
3 Milton  South High Junior  
4 Xavier  South High Senior  

References

Grolemund, Garrett. 2021. “Do More with Dates and Times in r.” https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Müller, Kirill, and Hadley Wickham. 2023. Tibble: Simple Data Frames.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.