Wednesday 12 November 2014

Smart dumb for loops – R

A few weeks ago, I created a monster of a for loop with a bunch of if statements using R. The loop was not doing what I wanted it to do, namely, copy some dates into a data frame at indexed positions. And so began my hours of troubleshooting.

I fixed some errors (referencing the wrong variables), rearranged the placement of if statements, fiddled with parentheses and did some general tidying-up. I highlighted the code and pressed Ctrl+Enter (which runs the selected code in RStudio). The errors were gone, yet the dates still had not been copied to where they should have been.

I started to question the fundamental way that R for loops operated. Were they so different from the for loops I had used in Matlab and VBA? How deep into the help file would I need to venture to find a solution? Would I need to submit a query on Stack Overflow for the first time?

Here's a quick example of what I was dealing with. Take this data frame that has 2134 rows of random numbers between 0 and 100. The start of the data frame is shown.
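The original screenshot of the data frame is not reproduced here, so the following is only a sketch of how a comparable one could be built. The column name `value`, the use of `runif()` and the seed are assumptions; only the 2134 rows and the 0 to 100 range come from the text.

```r
# Assumed reconstruction: one column of 2134 random numbers between 0 and 100.
set.seed(42)  # arbitrary seed so the example is repeatable
df <- data.frame(value = runif(2134, min = 0, max = 100))

# Show the start of the data frame (first six rows)
head(df)
```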


The code shows a simple for loop that is to iterate through each row and turn each value in an even row to zero. This is achieved using the modulo operation of row number (i) %% 2. If a remainder remains, then the row is odd. No remainder, the row is even.
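The loop itself was shown as an image that is missing here. Based on the description, it probably looked something like the sketch below; the data frame, the column name `value` and the exact layout of the if statement are assumptions.

```r
# Assumed stand-in for the data frame described above
set.seed(42)
df <- data.frame(value = runif(2134, min = 0, max = 100))
before <- df$value  # keep a copy for comparison

# The loop as first written: set every even row to zero
for (i in nrow(df)) {
  if (i %% 2 == 0) {   # no remainder: the row number is even
    df$value[i] <- 0
  }
}

# The start of the data frame looks exactly as it did before
head(df)
```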


After running the loop, there is no change to the values in the data frame. WHY?

I don’t know what neuron fired in my head, but it set off a cascade of action potentials that made me want to slap palm against forehead (my palm, my forehead, not someone else's). I had forgotten this.


That's right. A "1" followed by a ":" before nrow(df). Previously I had "i in nrow(df)", assuming that this was sufficient for R to know I wanted it to iterate through each row of the data frame. However, nrow(df) equals the number of rows in df, which is 2134. Before, my code read "i in 2134", restricting any changes to the last row alone.
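A quick way to see the difference is to compare what the two expressions evaluate to and to record which values of i the loop actually visits. This is a minimal sketch with an assumed one-column data frame; `visited` is a helper name introduced only for illustration.

```r
# Assumed stand-in data frame with 2134 rows
df <- data.frame(value = runif(2134, min = 0, max = 100))

nrow(df)            # a single number, 2134
length(nrow(df))    # 1: there is only one value to loop over
length(1:nrow(df))  # 2134: one value per row

# Record every value that i takes when looping over nrow(df)
visited <- c()
for (i in nrow(df)) {
  visited <- c(visited, i)
}
visited             # just 2134, so the loop body runs once
```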

With the correct "i in 1:nrow(df)" the code now reads "i in 1, 2, 3, 4, 5… 2134". The sequential rows of the data frame are specified. 
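With that change the loop behaves as intended. The sketch below again assumes a one-column data frame with a column called `value`. As a side note, `seq_len(nrow(df))` is often suggested in place of `1:nrow(df)` because it yields an empty sequence for a data frame with zero rows, whereas `1:0` counts down from 1 to 0.

```r
# Assumed stand-in data frame with 2134 rows
set.seed(42)
df <- data.frame(value = runif(2134, min = 0, max = 100))
before <- df$value  # keep a copy for comparison

# Corrected loop: i takes the values 1, 2, 3, ..., 2134
for (i in 1:nrow(df)) {
  if (i %% 2 == 0) {   # even row number
    df$value[i] <- 0
  }
}

head(df)               # rows 2, 4 and 6 are now zero
sum(df$value == 0)     # 1067, i.e. every even row
```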

I felt rather stupid and shared this with a friend. His reply:

Ah yes, I know those moments. And I'm never sure if I should feel smart that I figured it out, or dumb that I made the mistake in the first place :)   

I settled for a bit of both.

Earlier today a new loop was failing. Turns out I had this.


For the love of pizza – I hadn't even included the "nrow" part around the data frame. That one tipped the scales to the dumb-side.
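The failing snippet is not reproduced here, so what follows is a guess at one form it could have taken, namely "i in df". In that case R treats the data frame as a list of columns and loops over the columns rather than the rows. The data frame and the `iterations` counter are assumptions for illustration.

```r
# Assumed stand-in data frame with a single column
df <- data.frame(value = runif(2134, min = 0, max = 100))

iterations <- 0
for (i in df) {
  # Here i is an entire column (a numeric vector), not a row number
  iterations <- iterations + 1
}

iterations  # 1: one pass per column, and df has one column
length(i)   # 2134: i holds the whole column after the loop
```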