Probably a better way: Get the data

I have recently been moving away from accessing data via CSVs (comma-separated values) to APIs (application programming interface). An API is “a set of programming instructions and standards for accessing a Web-based software application or Web tool” [1]. Why not just download the CSV then suck it into your analytics tool of choice (mine being R)?

I am developing a new reporting system at work (got me a new job). Automation is a priority. I want users to double click an icon to generate the latest report. Full automation will be achieved by automating the separate process that link one form of data to another. Raw data to tidy data is achieved via my custom R programs. Analysed data to report data could be achieved via a web app accessible by users. These data forms and the process that produce each build the “data pipeline”. I’m not sure if I made up that term myself or I read it online and I’m not attributing the source.

What about just getting the data? The first process in the pipeline it to get the data. Traditionally, I would save the CSV file in a directory, write the script to read the data into R, then run the script. With this approach I have to:

Visit the location of the data (such as a website)
Save the data as CSV into a directory
Run the R script.

OMG – visit the location AND save it! So much work. And what if the CSV is like, lots of megabytes? The API approach would be as follows:

Download the data directly into R.

Boom!

However, the setup to download API data took a bit more fiddling-around than I expected.

The survey website API I’m accessing data from permits two flavours – I could download the data as JSON files or XML files. I will not go into detail about how these differ. Look it up yourself. I chose JSON. I actually can’t remember why.

I knew that R had packages to handle JSON files, namely, jsonlite. This is what I needed the package to do:

Download the JSON file
Convert the JSON file format to a data frame (something that resembles a spreadsheet).

It took a bit of trial and error to get the download part to work. Eventually the webpage data appeared in R in something that I assumed was JSON format. Here’s a snippet:

I’ll come back to the red box later. The jsonlite package includes a function that converts the JSON format to a data frame. However, after applying it I received an error message indicating a “lexical error”. Something in the JSON format was not in proper JSON format. Unless the data was correctly formatted, I was unable to get my data frame.

Thankfully, I have a colleague who is a JSON guru. His name happens to be Jason. No joke. I’m thinking of changing my name to “R” for consistency. He looked at the format and said it was the “\r\n”. “\r” indicates that there is a new line in the text. The backslash is an “escape character”. The escape character lets R know that the letters “r” and “n” or not actual strings – with the escape characters they become a symbol for a new line.

Jason said I could get past the error by replacing “\r\n” with “\\r\\n”. The double backslash escapes the escape characters. R will read “\\r\\n” as “\r\n\” - a new line. The function gsub finds the old and replaces it with the new

theJSON <- gsub("\r\n", "\\\\r\\\\n", theJSON)

“theJSON” is the string of JSON data. What’s with all the extra escape characters “\\\\r\\\\n”? I require one escape character for each escape character I’m inserting in. Since there are two backslashes before the r and n in “\\r\\n”, I’ll need two more for each - "\\\\r\\\\n".

I re-ran the script and got past the error message. Success! Then I ran into another lexical error. Oh no! Upon each error, I updated the substitutions including:

theJSON <- gsub("\t", "", theJSON) # Replace tab symbol
theJSON <- gsub("[0-9]*\\\\[0-9]*\\\\[0-9]*", 'DATE', theJSON) # Replace date with string "DATE"

theJSON <- gsub("\\\\\"", "'", theJSON) # Replace backslashes with single quotation mark

The updated process:

Download the JSON file
Convert the downloaded data to correct JSON format
Convert to a data frame.

Success!

References and notes

1. http://money.howstuffworks.com/business-communications/how-to-leverage-an-api-for-conferencing1.htm

Probably a better way

Sunday, 7 June 2015

Get the data – JSON APIs

No comments:

Post a Comment