The Discovery Process

Many studies collect basic demographic data about the people in the study area, often about the participants who are interviewed. Data on how those people fit into the community can be collected in the same way, which gives later discussions a firmer factual basis.

In this scenario, simple narrative statements were recorded based on descriptive comments made by a senior farmer as he described the other farmers who were gathered for a meeting.

The goal here is to begin the data extraction and interpretation process. Obviously, more summary data can be extracted from the narratives. This is just a start.

Input Data

Two files are supplied. One has narrative data about the farmers and the other contains a few statements that help interpret the data.

farmers.txt (excerpt)

John grows soybeans on just 10 acres. He’s single and works at the factory as a senior manager. He goes along with the traditional farming techniques as he mostly contracts out the work to other people. He’s 60 years old.

Philip believes it is easier to grow mustard on his 40 acres than soybeans, his previous crop. He uses traditional methods. He’s 41 and is married with 1 child. He’s a college graduate and he works at the factory.

Ben learned about organic farming in college and he’s applying this knowledge to successfully grow soybeans on his 50 acre farm. He and his wife have 3 children. He is 39.

Richard is a high-school dropout. None the less, he’s a talented farmer as he grows soybeans on his organic-certified farm. He says he barely supports himself on 30 acres. He has no wife or children. He’s 53.

Don is known locally as the “mustard king” as he was the first farmer to grow this crop in the area. He’s 59 and has a wife and 2 children. After graduating from the local high school, he took over the family’s 60 acres as his father was seriously ill. There were no organic methods when he started and he’s seen no reason to change his farming methods.

farmers_guidelines.txt

Here are some rules to help interpret the data:

If factory work is not explicitly stated, assume they don’t work at the factory.

College and University are equivalent and can be called “Tertiary”.

Call high school completion “Secondary”.

Call high school dropout “Primary”.

Someone who did not complete college and is not a high-school dropout is a high-school graduate (i.e., “Secondary”).

The order for education is “Primary”, “Secondary” and “Tertiary”.
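
The rules above can be sketched as a pair of small R functions. This is only an illustration: the helper names classify_education and classify_factory_work, and the idea of reading two yes/no flags from each narrative by hand, are assumptions and do not come from the project files.

```r
## Guideline order for education, lowest to highest.
education_levels <- c("Primary", "Secondary", "Tertiary")

## Hypothetical helper: apply the education rules to two logical flags.
classify_education <- function(completed_college, high_school_dropout) {
  level <- ifelse(
    completed_college, "Tertiary",
    ifelse(high_school_dropout, "Primary", "Secondary")
  )
  factor(level, levels = education_levels, ordered = TRUE)
}

## Hypothetical helper: factory work is "no" unless explicitly stated.
## NA stands for "not mentioned in the narrative".
classify_factory_work <- function(mentions_factory) {
  ifelse(mentions_factory %in% TRUE, "yes", "no")
}

## Flags read by hand from the excerpt, in the order
## John, Philip, Ben, Richard, Don.
## Assumption: Ben's "learned about organic farming in college" counts
## as completed college.
education <- classify_education(
  completed_college   = c(FALSE, TRUE, TRUE, FALSE, FALSE),
  high_school_dropout = c(FALSE, FALSE, FALSE, TRUE, FALSE)
)
factory_work <- classify_factory_work(c(TRUE, TRUE, NA, NA, NA))

## Display the coded values
education
factory_work
```

Storing education as an ordered factor keeps the “Primary”, “Secondary”, “Tertiary” order available for later sorting and plotting, which plain character values would not.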

Project Files

There are three project files. The contents of these files are shown in the Appendix.

input_guidelines.txt
bar_chart_guidelines.txt
table_guidelines.txt

Custom Instructions

Create custom R language code that has the following characteristics:

Uses two hash marks (##) at the start of each comment.

Uses comments in the code to describe each action.

Formats the code so it is straightforward to see the logical structure.

Follow other rules provided in separate guideline documents.

Carefully double-check to make sure all the guidelines, both here and in the other documents, are followed to the extent possible.

Prompt

Extract the data from the “farmers” narrative and then create a formal table showing the input data. The data source is “test data”. Use the “farmers_guidelines” to help with the interpretation. Carefully follow the “input_guidelines” and “table_guidelines”, double checking to make sure you’ve got them right.

Response

Certainly, I’ll extract the data from the farmers narrative, create a formal table, and follow the provided guidelines. Let’s start by creating the input data table and then we’ll format it using the gt package.

## Set read_csv function option to suppress warning messages.
options(readr.show_col_types = FALSE)

## Use the package for the read_csv function
library(tidyverse)
library(readr)

## Data input.
data <- read_csv(col_names=TRUE, file=
"name,  age, marital_status, children, education,   farm_size, crop,     organic, factory_work
Bill,   62,  married,        2,        Secondary,   128,       corn,     no,      no
Harry,  23,  single,         0,        Tertiary,    5,         mustard,  yes,     yes
Fred,   52,  single,         0,        Tertiary,    180,       corn,     no,      no
Sam,    34,  married,        5,        Secondary,   45,        soybeans, yes,     yes
Jim,    51,  married,        2,        Secondary,   50,        corn,     yes,     no
Frank,  29,  married,        1,        Tertiary,    30,        soybeans, yes,     yes
Carl,   73,  widower,        2,        Secondary,   300,       corn,     no,      no
Pete,   33,  married,        3,        Secondary,   40,        mustard,  no,      no
John,   60,  single,         0,        Secondary,   10,        soybeans, no,      yes
Philip, 41,  married,        1,        Tertiary,    40,        mustard,  no,      yes
Ben,    39,  married,        3,        Tertiary,    50,        soybeans, yes,     no
Richard,53,  single,         0,        Primary,     30,        soybeans, yes,     no
Don,    59,  married,        2,        Secondary,   60,        mustard,  no,      no
James,  51,  married,        4,        Tertiary,    3,         soybeans, no,      yes
Chris,  43,  married,        2,        Secondary,   23,        corn,     no,      yes
Bob,    56,  divorced,       3,        Tertiary,    40,        soybeans, no,      no
Joe,    64,  married,        2,        Secondary,   120,       soybeans, yes,     no
Max,    32,  married,        0,        Tertiary,    30,        corn,     no,      yes
Dave,   36,  married,        4,        Tertiary,    10,        corn,     yes,     yes
Erik,   29,  single,         0,        Tertiary,    30,        soybeans, yes,     no
Ed,     81,  single,         0,        Secondary,   50,        soybeans, no,      no
Roger,  53,  married,        3,        Secondary,   60,        corn,     yes,     no")

## Save the input data as a CSV file
write_csv(data, "farmers_data.csv")

## Display the data
data

# A tibble: 22 × 9
   name     age marital_status children education farm_size crop     organic
   <chr>  <dbl> <chr>             <dbl> <chr>         <dbl> <chr>    <chr>  
 1 Bill      62 married               2 Secondary       128 corn     no     
 2 Harry     23 single                0 Tertiary          5 mustard  yes    
 3 Fred      52 single                0 Tertiary        180 corn     no     
 4 Sam       34 married               5 Secondary        45 soybeans yes    
 5 Jim       51 married               2 Secondary        50 corn     yes    
 6 Frank     29 married               1 Tertiary         30 soybeans yes    
 7 Carl      73 widower               2 Secondary       300 corn     no     
 8 Pete      33 married               3 Secondary        40 mustard  no     
 9 John      60 single                0 Secondary        10 soybeans no     
10 Philip    41 married               1 Tertiary         40 mustard  no     
# ℹ 12 more rows
# ℹ 1 more variable: factory_work <chr>
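
Because the rows come from interpreting free text, a quick consistency check on the extracted values can catch codes that fall outside the guideline categories. The following is a minimal sketch in base R that uses only the five farmers from the farmers.txt excerpt; the function name check_extracted_rows and the specific checks are assumptions made for illustration.

```r
## Rows for the five farmers in the farmers.txt excerpt, as extracted above.
excerpt <- data.frame(
  name         = c("John", "Philip", "Ben", "Richard", "Don"),
  age          = c(60, 41, 39, 53, 59),
  education    = c("Secondary", "Tertiary", "Tertiary", "Primary", "Secondary"),
  farm_size    = c(10, 40, 50, 30, 60),
  organic      = c("no", "no", "yes", "yes", "no"),
  factory_work = c("yes", "yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

## Hypothetical helper: return a character vector describing any problems.
check_extracted_rows <- function(rows) {
  problems <- character(0)

  ## Education must use one of the three guideline levels.
  if (!all(rows$education %in% c("Primary", "Secondary", "Tertiary"))) {
    problems <- c(problems, "education has a value outside the guideline levels")
  }

  ## Flag columns must be coded as yes or no.
  for (column in c("organic", "factory_work")) {
    if (!all(rows[[column]] %in% c("yes", "no"))) {
      problems <- c(problems, paste(column, "is not coded as yes/no"))
    }
  }

  ## Ages and farm sizes must be present and positive.
  for (column in c("age", "farm_size")) {
    if (any(is.na(rows[[column]]) | rows[[column]] <= 0)) {
      problems <- c(problems, paste(column, "has a missing or non-positive value"))
    }
  }

  problems
}

## An empty result means no problems were found.
check_extracted_rows(excerpt)
```

A check like this does not confirm that the interpretation is right, only that the coded values are internally consistent; the narrative itself still has to be compared by eye.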

Now, let’s create a formal table using the gt package, following the table guidelines:

## Load required libraries
library(gt)
library(tidyverse)

## Initialize the identification information.
table_title <- "Farmers' Characteristics and Farming Practices"
data_source <- paste0("Source: test data (", Sys.Date(), ")")

## Use table stripes
table_stripes <- TRUE

## Create the table.
table <- gt(data) %>%
  tab_header(
    title = table_title
  ) %>%
  tab_source_note(
    source_note = data_source
  ) %>%
  tab_style(
    style = cell_text(v_align = "top"),
    locations = cells_body()
  ) %>%
  opt_row_striping(row_striping = table_stripes) %>%
  cols_label(
    marital_status = "Marital Status",
    farm_size = "Farm Size",
    factory_work = "Factory Work"
  ) %>%
  fmt_number(
    columns = c(age, children, farm_size),
    decimals = 0
  ) %>%
  cols_align(
    align = "left",
    columns = everything()
  ) %>%
  tab_footnote(
    footnote = "In acres",
    locations = cells_column_labels(columns = farm_size)
  )

## Output the table
table

Farmers' Characteristics and Farming Practices
name	age	Marital Status	children	education	Farm Size¹	crop	organic	Factory Work
Bill	62	married	2	Secondary	128	corn	no	no
Harry	23	single	0	Tertiary	5	mustard	yes	yes
Fred	52	single	0	Tertiary	180	corn	no	no
Sam	34	married	5	Secondary	45	soybeans	yes	yes
Jim	51	married	2	Secondary	50	corn	yes	no
Frank	29	married	1	Tertiary	30	soybeans	yes	yes
Carl	73	widower	2	Secondary	300	corn	no	no
Pete	33	married	3	Secondary	40	mustard	no	no
John	60	single	0	Secondary	10	soybeans	no	yes
Philip	41	married	1	Tertiary	40	mustard	no	yes
Ben	39	married	3	Tertiary	50	soybeans	yes	no
Richard	53	single	0	Primary	30	soybeans	yes	no
Don	59	married	2	Secondary	60	mustard	no	no
James	51	married	4	Tertiary	3	soybeans	no	yes
Chris	43	married	2	Secondary	23	corn	no	yes
Bob	56	divorced	3	Tertiary	40	soybeans	no	no
Joe	64	married	2	Secondary	120	soybeans	yes	no
Max	32	married	0	Tertiary	30	corn	no	yes
Dave	36	married	4	Tertiary	10	corn	yes	yes
Erik	29	single	0	Tertiary	30	soybeans	yes	no
Ed	81	single	0	Secondary	50	soybeans	no	no
Roger	53	married	3	Secondary	60	corn	yes	no
Source: test data (2024-09-04)
¹ In acres

Prompt

Please use the saved data and make bar charts for the following data: distribution of crop types, distribution of farmer ages, distribution of farm sizes. Be sure to follow the “bar_chart_guidelines” and double check that they are followed.

Response

Certainly! I’ll create bar charts for the distribution of crop types, farmer ages, and farm sizes using the saved data. I’ll make sure to follow the bar chart guidelines provided.

First, let’s load the necessary libraries and the saved data:

I’ve created an artifact containing the R code to generate the three requested bar charts using the saved data. Let’s go through how this code adheres to the bar chart guidelines:

Bar outline: Each bar has a black outline (color = "black" in geom_bar()).
Bar fill colors: We use “cornsilk3” as the fill color for all bars, as each chart represents a single variable.
Panel background color: The panel background is set to darkslategray2 (panel.background = element_rect(fill = "darkslategray2")).
Axis labels and scale values: The axis labels and scale values are set to black and size 12 (axis.text = element_text(size = 12, color = "black"), axis.title = element_text(size = 12, color = "black")).
Grid lines: We use y-axis grid lines (default in theme_minimal()) and removed x-axis grid lines (panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank()).
Start and end of the y axis: The y-axis starts at zero (default behavior) and has a small extension at the top (scale_y_continuous(expand = expansion(mult = c(0, 0.1)))).
Source information: We’ve added the source information as a caption for each chart (caption = source).
Box plot outliers: This guideline is not applicable to these bar charts.
Box plot data points: This guideline is not applicable to these bar charts.

These bar charts provide a clear visualization of the distribution of crop types, farmer ages, and farm sizes in the dataset. The charts are consistent in style and follow the provided guidelines.

Is there anything you’d like me to modify or explain further about these bar charts?

## Load required libraries
library(tidyverse)
library(ggplot2)

## Read the saved CSV file
data <- read_csv("farmers_data.csv")

## Set the source information
source <- "Source: test data"

## 1. Distribution of crop types
crop_distribution <- data %>%
  count(crop) %>%
  ggplot(aes(x = crop, y = n, fill = "cornsilk3")) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = n), vjust = -0.5, size = 4) +
  scale_fill_manual(values = "cornsilk3") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "darkslategray2"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 12, color = "black"),
    axis.title = element_text(size = 12, color = "black")
  ) +
  labs(
    title = "Distribution of Crop Types",
    x = "Crop",
    y = "Count",
    caption = source
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  guides(fill = "none")

print(crop_distribution)

## 2. Distribution of farmer ages
age_bins <- cut(data$age, breaks = seq(20, 90, by = 10))

age_distribution <- data %>%
  count(age_group = age_bins) %>%
  ggplot(aes(x = age_group, y = n, fill = "cornsilk3")) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = n), vjust = -0.5, size = 4) +
  scale_fill_manual(values = "cornsilk3") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "darkslategray2"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 12, color = "black"),
    axis.title = element_text(size = 12, color = "black")
  ) +
  labs(
    title = "Distribution of Farmer Ages",
    x = "Age Group",
    y = "Count",
    caption = source
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  guides(fill = "none")

print(age_distribution)
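
The age chart relies on cut() to form ten-year bins. By default cut() builds intervals that are open on the left and closed on the right, so a farmer aged exactly 60 is counted in (50,60] rather than with the farmers in their sixties. The sketch below uses the ages from the input table to compare the default with the right = FALSE alternative. The guidelines do not say which convention is preferred, so this is only an illustration of the difference.

```r
## Ages of the 22 farmers, in the order of the input table.
ages <- c(62, 23, 52, 34, 51, 29, 73, 33, 60, 41, 39,
          53, 59, 51, 43, 56, 64, 32, 36, 29, 81, 53)

## Same breaks as the age chart.
age_breaks <- seq(20, 90, by = 10)

## Default: intervals closed on the right, e.g. (50,60].
right_closed <- cut(ages, breaks = age_breaks)
table(right_closed)

## Alternative: intervals closed on the left, e.g. [60,70).
left_closed <- cut(ages, breaks = age_breaks, right = FALSE)
table(left_closed)

## Where does a 60-year-old (John) land under each convention?
bin_default <- as.character(cut(60, breaks = age_breaks))
bin_left    <- as.character(cut(60, breaks = age_breaks, right = FALSE))
bin_default  ## "(50,60]"
bin_left     ## "[60,70)"
```

With these data the choice moves one farmer between the two neighboring bars, which is worth knowing when the bin labels are read as decades.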

## 3. Distribution of farm sizes
size_bins <- cut(data$farm_size, breaks = c(0, 10, 50, 100, 200, 300))

size_distribution <- data %>%
  count(size_group = size_bins) %>%
  ggplot(aes(x = size_group, y = n, fill = "cornsilk3")) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = n), vjust = -0.5, size = 4) +
  scale_fill_manual(values = "cornsilk3") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "darkslategray2"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 12, color = "black"),
    axis.title = element_text(size = 12, color = "black")
  ) +
  labs(
    title = "Distribution of Farm Sizes",
    x = "Farm Size (acres)",
    y = "Count",
    caption = source
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  guides(fill = "none")

print(size_distribution)
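
The three chart blocks above repeat the same styling. One possible way to keep the guideline settings in a single place is a small helper function, sketched below. The name guideline_bar_chart and its arguments are hypothetical and are not part of the response or of bar_chart_guidelines.txt; the styling values are copied from the code above.

```r
## Load the plotting package
library(ggplot2)

## Hypothetical helper: build one bar chart from a data frame of counts
## that has a category column and a count column named n.
guideline_bar_chart <- function(counts, category, title, x_label, source_note) {
  ggplot(counts, aes(x = .data[[category]], y = n)) +
    ## A constant fill is set outside aes(), so no fill scale or
    ## legend suppression is needed.
    geom_col(fill = "cornsilk3", color = "black") +
    geom_text(aes(label = n), vjust = -0.5, size = 4) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    theme_minimal() +
    theme(
      panel.background = element_rect(fill = "darkslategray2"),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank(),
      axis.text = element_text(size = 12, color = "black"),
      axis.title = element_text(size = 12, color = "black")
    ) +
    labs(title = title, x = x_label, y = "Count", caption = source_note)
}

## Crop counts tallied from the input table.
crop_counts <- data.frame(
  crop = c("corn", "mustard", "soybeans"),
  n    = c(8, 4, 10)
)

## Build and display the crop chart
crop_chart <- guideline_bar_chart(
  counts      = crop_counts,
  category    = "crop",
  title       = "Distribution of Crop Types",
  x_label     = "Crop",
  source_note = "Source: test data"
)
print(crop_chart)
```

The age and farm size charts could call the same helper with their binned counts, so a change to the guidelines would only need to be made once.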

Prompt

Very good. Please use the saved data to create a box plot of the farm size for each of the crops. As before, be sure to use the “bar_chart_guidelines” and double check that they are being used.

Response

Certainly! I’ll create a box plot of the farm size for each of the crops using the saved data, following the bar chart guidelines. Although these guidelines were primarily designed for bar charts, we’ll adapt them as best we can for the box plot.

Now, let’s go through how this code adheres to the bar chart guidelines, adapting them for our box plot:

Bar outline: We’ve used a black outline for the boxes (color = "black" in geom_boxplot()).
Bar fill colors: We’ve used different colors for each crop type, starting with “cornsilk3” for the first category, then “darkgoldenrod2”, “lightsalmon2”, and “lightblue” for the remaining categories.
Panel background color: The panel background is set to darkslategray2 (panel.background = element_rect(fill = "darkslategray2")).
Axis labels and scale values: The axis labels and scale values are set to black and size 12 (axis.text = element_text(size = 12, color = "black"), axis.title = element_text(size = 12, color = "black")).
Grid lines: We use y-axis grid lines (default in theme_minimal()) and removed x-axis grid lines (panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank()).
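Start and end of the y axis: The y-axis has no padding below the lowest plotted value and a small extension at the top (scale_y_continuous(expand = expansion(mult = c(0, 0.1)))). Unlike the bar charts, whose bars are drawn from zero, a box plot’s axis begins at the smallest farm size plotted, so starting at zero would also require limits = c(0, NA).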
Source information: We’ve added the source information as a caption for the chart, including the current date (caption = source).
Box plot outliers: As per the guidelines, we’ve set outlier.shape = NA to not show outliers in any special way.
Box plot data points: We’ve added individual data points using geom_jitter(), colored them red, and set their size to 3, as specified in the guidelines.

This box plot provides a clear visualization of the farm size distribution for each crop type in the dataset. The plot is consistent with the style of the previous bar charts and follows the provided guidelines as closely as possible, adapting them for a box plot format.

Is there anything you’d like me to modify or explain further about this box plot?

## Load required libraries
library(tidyverse)
library(ggplot2)

## Read the saved CSV file
data <- read_csv("farmers_data.csv")

## Set the source information
source <- paste0("Source: test data (", Sys.Date(), ")")

## Create the box plot
farm_size_boxplot <- ggplot(data, aes(x = crop, y = farm_size, fill = crop)) +
  geom_boxplot(outlier.shape = NA, color = "black") +
  ## Jitter horizontally only so the points keep their true farm sizes
  geom_jitter(color = "red", size = 3, width = 0.2, height = 0) +
  scale_fill_manual(values = c("cornsilk3", "darkgoldenrod2", "lightsalmon2", "lightblue")) +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "darkslategray2"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 12, color = "black"),
    axis.title = element_text(size = 12, color = "black"),
    legend.position = "none"
  ) +
  labs(
    title = "Farm Size Distribution by Crop Type",
    x = "Crop",
    y = "Farm Size",
    caption = source
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

## Output the plot
print(farm_size_boxplot)
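
A numeric summary can be read alongside the box plot. The sketch below recomputes the number of farms and the median farm size per crop in base R from the values in the input table; the object names are assumptions chosen for this example.

```r
## Crop and farm size for the 22 farmers, in the order of the input table.
farm_sizes <- data.frame(
  crop = c("corn", "mustard", "corn", "soybeans", "corn", "soybeans",
           "corn", "mustard", "soybeans", "mustard", "soybeans", "soybeans",
           "mustard", "soybeans", "corn", "soybeans", "soybeans", "corn",
           "corn", "soybeans", "soybeans", "corn"),
  farm_size = c(128, 5, 180, 45, 50, 30, 300, 40, 10, 40, 50,
                30, 60, 3, 23, 40, 120, 30, 10, 30, 50, 60)
)

## Number of farms per crop
farms_per_crop <- table(farm_sizes$crop)
farms_per_crop

## Median farm size per crop, in acres
median_by_crop <- tapply(farm_sizes$farm_size, farm_sizes$crop, median)
median_by_crop
```

For these data the medians are 55 acres for corn, 40 for mustard and 35 for soybeans, which should match the center line of each box.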