Saturday 27 June 2015

Probably a better way – Year one

It’s been a year since my first post, and a lot has changed. Looking over my early posts, I’m amused that I started with VBA in Excel. At the time I had completed a VBA training course and was writing a program at work. I saw a future of working exclusively in VBA – writing VB scripts where my customers would use my awfully clunky workbooks to manage and analyse their data. I suspect that if I searched job ads for “VBA” today, the number of hits would be lower than a year ago.

Soon I commenced the Data Science courses on Coursera. I learnt to program in R via these courses (aided by prior Matlab knowledge), then applied this knowledge at work. I managed my data with R and visualised it with R. I even built a web app (using the Shiny package). R is great, and it’s free. Not like how Facebook is free, where one hands over personal data. R is proper free.

I am regularly using R in my new role. I do wonder whether I should learn Python. My understanding is that Python excels over R in web applications and text analytics. I wish I knew JavaScript – I would like to create custom interactive displays.

I’d love to know more about statistics and multiple regression. Not entirely sure how I would apply this knowledge. At the heart of it, I’d like to be in a position to receive a large amount of data and simply know what statistical methodology I should be applying towards uncovering insights. I have commenced reading my thick stats book mentioned in this post.

At the moment I am being exposed to different data types and methodologies at work. JSON, XML, MySQL. I’ll take them as they come. I’m fortunate that I am surrounded by developers who have advice when I get stumped. I just have to be more comfortable asking for help.

Summary: Year one was moving away from VBA to R. Here’s to year two!

Sunday 7 June 2015

Get the data – JSON APIs

I have recently been moving away from accessing data via CSVs (comma-separated values) to APIs (application programming interface). An API is “a set of programming instructions and standards for accessing a Web-based software application or Web tool” [1]. Why not just download the CSV then suck it into your analytics tool of choice (mine being R)?

I am developing a new reporting system at work (got me a new job). Automation is a priority. I want users to double click an icon to generate the latest report. Full automation will be achieved by automating the separate processes that link one form of data to another. Raw data to tidy data is achieved via my custom R programs. Analysed data to report data could be achieved via a web app accessible by users. These data forms, and the processes that produce each of them, build the “data pipeline”. I’m not sure if I made up that term myself or I read it online and I’m not attributing the source.

What about just getting the data? The first process in the pipeline is to get the data. Traditionally, I would save the CSV file in a directory, write the script to read the data into R, then run the script (a minimal version of that read step is sketched after the list). With this approach I have to:
  • Visit the location of the data (such as a website)
  • Save the data as CSV into a directory
  • Run the R script.
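
The read step itself is a one-liner. A minimal version, with a made-up file path:

survey_data <- read.csv("data/survey_export.csv", stringsAsFactors = FALSE) # Read the manually saved CSV export into a data frame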

OMG – visit the location AND save it! So much work. And what if the CSV is like, lots of megabytes? The API approach would be as follows:
  • Download the data directly into R. 

Boom!

However, the setup to download API data took a bit more fiddling around than I expected.

The survey website API I’m accessing data from permits two flavours – I could download the data as JSON files or XML files. I will not go into detail about how these differ. Look it up yourself. I chose JSON. I actually can’t remember why.

I knew that R had packages to handle JSON files, namely jsonlite. This is what I needed the package to do (a rough sketch follows the list):

  • Download the JSON file
  • Convert the JSON file format to a data frame (something that resembles a spreadsheet). 
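
A minimal sketch of those two steps, assuming the jsonlite package and a made-up URL for the survey API (fromJSON will take a URL, a file path or a raw JSON string):

library(jsonlite)

url <- "https://api.example-survey.com/v1/responses" # Placeholder survey API URL
responses <- fromJSON(url) # Downloads the JSON and converts it (a data frame if the JSON is a simple array of records)

In practice, depending on how the API nests things, fromJSON may hand back a list of lists rather than a flat data frame.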

It took a bit of trial and error to get the download part to work. Eventually the webpage data appeared in R in something that I assumed was JSON format. Here’s a snippet:

[Screenshot: a snippet of the downloaded JSON text as it appeared in R, with part of it marked in a red box]



I’ll come back to the red box later. The jsonlite package includes a function that converts the JSON text to a data frame. However, when I applied it I received an error message indicating a “lexical error”: something in the downloaded text was not in proper JSON format. Until the data was correctly formatted, I was unable to get my data frame.

Thankfully, I have a colleague who is a JSON guru. His name happens to be Jason. No joke. I’m thinking of changing my name to “R” for consistency. He looked at the format and said it was the “\r\n”. “\r\n” indicates that there is a new line in the text. The backslash is an “escape character”. The escape character lets R know that the letters “r” and “n” are not ordinary characters – with the escape characters they become a symbol for a new line.

Jason said I could get past the error by replacing “\r\n” with “\\r\\n”. The double backslash escapes the escape characters: R reads “\\r\\n” as the literal characters “\r\n”, which the JSON parser then recognises as a new line rather than choking on it. The function gsub finds the old text and replaces it with the new:

theJSON <- gsub("\r\n", "\\\\r\\\\n", theJSON) # Swap real new line characters for the escaped text "\r\n"

“theJSON” is the string of JSON data. What’s with all the extra escape characters in “\\\\r\\\\n”? Each backslash I actually want in the text has to be escaped twice: once because the backslash is special inside an R string literal, and once more because gsub treats backslashes in its replacement text specially. So each single backslash I want ends up written as four, giving “\\\\r\\\\n”.
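
A tiny sanity check of that layering, on a made-up string rather than the real survey data:

x <- "first line\r\nsecond line" # Contains an actual carriage return and line feed
y <- gsub("\r\n", "\\\\r\\\\n", x) # Swap them for the two-character escape sequences
cat(x) # Prints across two lines
cat(y) # Prints on one line: first line\r\nsecond line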

I re-ran the script and got past the error message. Success! Then I ran into another lexical error. Oh no! With each new error, I added more substitutions, including:

  • theJSON <- gsub("\t", "", theJSON) # Remove tab characters 
  • theJSON <- gsub("[0-9]*\\\\[0-9]*\\\\[0-9]*", 'DATE', theJSON) # Replace dates with the string "DATE" 
  • theJSON <- gsub("\\\\\"", "'", theJSON) # Replace escaped double quotes with single quotes

The updated process:
  • Download the JSON file 
  • Convert the downloaded data to correct JSON format 
  • Convert to a data frame. 
Success!
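
To wrap up, a rough sketch of how the whole get-the-data step hangs together. The URL is a placeholder, httr::GET is just one way of pulling the response down as a single string, and the substitutions are the ones listed above; yours would depend on what your API returns.

library(httr)
library(jsonlite)

get_survey_data <- function(url) {
  theJSON <- content(GET(url), as = "text", encoding = "UTF-8") # 1. Download the raw JSON text
  theJSON <- gsub("\r\n", "\\\\r\\\\n", theJSON) # 2. Escape real new lines
  theJSON <- gsub("\t", "", theJSON) # Remove tab characters
  theJSON <- gsub("[0-9]*\\\\[0-9]*\\\\[0-9]*", "DATE", theJSON) # Replace dates with "DATE"
  theJSON <- gsub("\\\\\"", "'", theJSON) # Escaped double quotes become single quotes
  fromJSON(theJSON) # 3. Convert to a data frame
}

responses <- get_survey_data("https://api.example-survey.com/v1/responses")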


References and notes