The Background chapter explores a practical use case: converting unstructured text in a Google Doc into a BibTeX file compatible with RStudio. This integration simplifies the process of inserting and managing citations within markdown documents, offering a streamlined workflow for researchers.
This section describes some activities that need to be done in preparation for the use case.
1.2 General Disclaimer
This is early exploratory work, so there are likely to be a number of "gotcha" situations. It was developed in a Windows 11 environment; other platforms may behave differently.
1.3 API Keys
You need to obtain an API key from the service provider. For example, OpenAI provides keys to clients who meet a set of requirements, such as having a paid account.
It is necessary to have your own OpenAI API key to run the functions.
1.3.1 Store the API Key
It is important to keep the OpenAI API key secret, especially if using the key costs money (as is the case for most keys).
Generally, API keys are stored in a system file on the local computer. This is convenient and quite easy to do.
It is often said that you should use the command Sys.setenv() to store the key, but that only sets the value for the current R session. Here, it is recommended that you instead use the following command inside an R chunk:
file.edit("~/.Renviron")
This opens the .Renviron file in an RStudio tab so that you can edit its entries.
The format is:
KEY1 = keyvalue1
KEY2 = keyvalue2
Here is an example (with a fake key value):
OPENAI_API_KEY = 3Y8jPQ34v772I24Lk9
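For a quick experiment, a value can also be set for the current R session only. The following is a minimal sketch of that alternative; the variable name DEMO_API_KEY is made up for illustration, and the value is the same fake key shown above. Because the setting disappears when the session ends, the .Renviron entry remains the better choice for regular use.

```r
## Set a variable for the current R session only (nothing is written to disk).
## DEMO_API_KEY is a hypothetical name used purely for illustration.
Sys.setenv(DEMO_API_KEY = "3Y8jPQ34v772I24Lk9")

## Read it back; Sys.getenv() returns "" when a variable is not set.
demo_key <- Sys.getenv("DEMO_API_KEY")
nchar(demo_key) > 0

## Remove the variable again so the demo leaves no trace in the session.
Sys.unsetenv("DEMO_API_KEY")
```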
1.3.2 Access the API Key
Once you save the .Renviron file, you must restart RStudio so that the file is read again at startup. You can then access the key value with the following R statement:
ChatAPI_key <- Sys.getenv("OPENAI_API_KEY")
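A missing key shows up as an empty string rather than an error, so it can be useful to check for it before making any request. The helper below is a sketch, not part of any package; the name get_api_key is hypothetical.

```r
## Hypothetical helper: stop early with a clear message if the key is missing.
get_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var)   # returns "" when the variable is not set
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is empty. ",
         "Add it to ~/.Renviron and restart RStudio.", call. = FALSE)
  }
  key
}

## Usage (uncomment once the key is stored):
## ChatAPI_key <- get_api_key()
```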
1.4 Set Up the Libraries
There is a set of libraries that provide the functions needed to handle the citation processing. These are listed, along with a few default values, in this code chunk.
## Standard Libraries
library(tidyverse)   ## Lots of useful functions
library(gt)          ## Create tables
library(gtExtras)    ## Add a few useful functions to gt
library(ggplot2)     ## Create charts
library(devtools)    ## Load packages from GitHub

## Specialized libraries
library(openai)      ##
library(httr)        ## Send requests and receive responses
library(jsonlite)    ## Handle the request formatting
library(pdftools)    ## Handle PDF files
library(base64enc)   ## Convert image files to base64 encoding
library(googledrive) ## Download Google Doc files
library(curl)        ## The force behind the httr functions
library(DiagrammeR)  ## Make flowchart diagrams

## Get the accessOAI package (do this just once)
## install_github("kimbridges/accessOAI")

## Initialize the accessOAI library
library(accessOAI)

## Initialize some things that rarely change.
LLM     <- "gpt-4o"
LLM_alt <- "gpt-3.5-turbo" ## (text only)
temp    <- 1
apiKey  <- Sys.getenv("OPENAI_API_KEY")
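Before running the chunk for the first time, it may help to see which of these packages are already installed. This is a small sketch using base R only; the package names are copied from the library() calls above, and the variable names are illustrative.

```r
## Package names taken from the library() calls above.
needed <- c("tidyverse", "gt", "gtExtras", "ggplot2", "devtools",
            "openai", "httr", "jsonlite", "pdftools", "base64enc",
            "googledrive", "curl", "DiagrammeR", "accessOAI")

## requireNamespace() checks availability without attaching the package.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

## Names of any packages that still need to be installed.
missing_pkgs <- needed[!is_installed]
missing_pkgs
```

Packages on CRAN can then be installed with install.packages(); accessOAI comes from GitHub, as noted in the chunk.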
1.5 The analyzeTXT function
The analyzeTXT function is part of the accessOAI package. It sends a body of text (here, the free-form citation notes) to the LLM, which then analyzes the text in an assigned role while following a set of directions.
There are three character strings that are basic to this function:
data: This defines the location of a text file to be used for processing. In the example used here, this is a local file called “notes.txt”.
role: The LLM (i.e., ChatGPT) can be asked to play a specific role. This emphasizes how it should approach a request. Here is how the LLM is instructed in this example:
You are a bibliographic expert. You know the various formats that are used in the scientific literature. You are able to convert from one format to another. You can also find the DOI data and abstracts for an entry.
prompt: You want the LLM to do some specific tasks within the scope defined by the role. Here is where you formulate the task. The following text is used in this example:
The input file has bibliographic citations, along with some comment lines. Some of the bibliographic citations reference web pages. Include these. Ignore the comment lines. Convert the bibliographic citations into a standard BibTex format. If possible, include the DOI data. Don’t include any comments in your response.
These three character strings, along with some request settings and formatting keywords, are sent to the OpenAI server in an HTTP POST request. The response is filtered and saved as the references.bib file.
Note that this three-element structure (data, role and prompt) is very general. Many useful and interesting things can be done by customizing these character strings.
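To make the three-element structure concrete, here is a rough sketch of how such strings could be packed into a request for OpenAI's chat completions endpoint. This is not the analyzeTXT source code: the helper names build_chat_body and send_chat are hypothetical, the role and prompt strings are shortened, and the mapping of role to a "system" message and prompt plus data to a "user" message is an assumption about one reasonable layout.

```r
## The three character strings (shortened versions of the example text).
data_file <- "notes.txt"   # location of the text to process
role      <- "You are a bibliographic expert."
prompt    <- "Convert the bibliographic citations into a standard BibTeX format."

## Hypothetical helper: assemble the request body as a plain R list.
build_chat_body <- function(text, role, prompt,
                            model = "gpt-4o", temperature = 1) {
  list(
    model = model,
    temperature = temperature,
    messages = list(
      list(role = "system", content = role),
      list(role = "user", content = paste(prompt, text, sep = "\n\n"))
    )
  )
}

## Hypothetical helper: send the body with POST and return the reply text.
## Needs the httr package and a valid key; it is defined here but not called.
send_chat <- function(body, api_key) {
  resp <- httr::POST(
    url = "https://api.openai.com/v1/chat/completions",
    httr::add_headers(Authorization = paste("Bearer", api_key)),
    body = body,
    encode = "json"
  )
  httr::stop_for_status(resp)
  httr::content(resp, as = "parsed")$choices[[1]]$message$content
}

## Build a body from a short in-memory example instead of reading a file.
## With a real file: text <- paste(readLines(data_file), collapse = "\n")
body <- build_chat_body("Smith J (2020) A title. Journal 1:1-10.", role, prompt)
length(body$messages)
```

Keeping the body construction separate from the network call means the role and prompt can be changed and inspected without spending any API credit.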
1.6 drive_download Function
The drive_download function is part of the googledrive package. The purpose is to download files from Google Drive (in the cloud) to the local drive. The native Google file types (Docs, Sheets and Slides) are supported.
In this workflow, the download converts the Google Docs format to a plain-text (.txt) file.
This is a key step in creating the BibTeX file, because the text file can then be passed to the OpenAI API for data extraction and formatting.
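A download of this kind might look like the sketch below. The wrapper name download_notes and the document name "Citation Notes" are made up for illustration; the call itself uses drive_download() with its file, path, type and overwrite arguments, and assumes that the Google account has already been (or will be) authorized.

```r
## Hypothetical wrapper around googledrive::drive_download().
## doc_name is the name of a Google Doc in your Drive (made up for this sketch).
download_notes <- function(doc_name, path = "notes.txt") {
  googledrive::drive_download(
    file = doc_name,   # a Drive file name or path, or a file id marked with as_id()
    path = path,       # where to save the local copy
    type = "txt",      # export the Google Doc as plain text
    overwrite = TRUE   # replace an earlier download
  )
}

## Usage (asks for Google authorization the first time):
## download_notes("Citation Notes")
```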