Sunday 20 July 2014

Call a procedure to get the first empty column number – Excel VBA

You have a spreadsheet and you want to add some data in the first empty column. How do you know which column to use? You simply use your eyes and look. Guide your mouse to move the cursor to a cell in the empty column, enter a formula, then autofill down. But how does one automate this for a spreadsheet when you do not know beforehand which column is empty (eg column B, column AB, etc)? Your computer does not have eyes to simply check.

Or does it?

A webcam is an eye, but your webcam will not assist you with automation. It will assist you with a fine evening on Chatroulette.


Below is code to determine the location of the first empty column in a spreadsheet. As a bonus, it includes code that will determine the last row where data exists on the spreadsheet [1].


Take the following sheet which holds episode details of Breaking Bad season 1 [2].


After running the code, it returns the next empty column (emptyCol) as column 5 (column E), and the last row with data as row 8. 

Now I want to populate column 5 (column E) with data. Using the Original air date, I want to list the year. I can use something like this:

Range(Cells(2, emptyCol), Cells(rowLast, emptyCol)).Formula = "=Year(Originalairdate)"

Where emptyCol = 5, rowLast = 8. The “Range(Cells(), Cells())” part are the coordinates of the range where I wish to populate cells: from rows 2 to 8, restricted to column 5. I have used the year formula on the data in the “Original air date” column [3]. 


Next scenario. What happens when you have to repeat this action? I now want to have a column with the month of the original air date (using the Month() formula), and then populate the next empty column with the day (using the Day() formula). This would be populated in column F and G respectively. Before populating data in the next empty column, the software needs to know where the empty column is. Like this:
  • Find the column number of the next empty column
  • Remember the column number
  • Point to empty column using the remembered column number
  • Populate with data (eg year)
  • Find the column number of the next empty column
  • Remember the column number
  • Point to empty column using the remembered column number
  • Populate with data (eg month)
  • Find the column number of the next empty column
  • Remember the column number
  • Point to empty column using the remembered column number
  • Populate with data (eg day).


Each time a column is populated with data, the empty column number changes by one (year in column 5, month in column 6, day in column 7). You could copy and paste the whole code each time to get the next empty column (and last row with data if this changes). Or, you could set the code aside in a procedure. When you need to run the code, call the procedure. Here I have placed the code in a procedure called "EmptyCol_LastRow". 


I can call this procedure (EmptyCol_LastRow) as required to get the new emptyCol values, as follows:
  • Call procedure to get emptyCol and lastRow
    • emptyCol = 8, lastRow = 8
  • Populate with year
    • In column 8 
  • Call procedure to get emptyCol and lastRow
    • emptyCol = 9, lastRow = 8
  • Populate with month
    • In column 9
  • Call procedure to get emptyCol and lastRow
    • emptyCol = 10, lastRow = 8
  • Populate with day
    • In column 10.
The corresponding code that populates the columns with data is shown below.



Rather than copy paste lines of code to get updated column and row numbers, simply call the procedure as required!


References and notes
1. Code modified from ozgrid.com VBA/ExcelRanges page. 
2. Data from Breaking Bad Episodes Wiki page.
3. I manually created a named range for the Original air date column called “Originalairdate”.

Sunday 13 July 2014

Online R programming course

If I had my time again, I would have taken a gap year before commencing University. Rather than travel the world whilst supporting myself working data entry jobs in windowless offices, I would be particularly nerdy and enrol in a whole bunch of free online courses.

There were very few structured online courses back at the turn of the century. Google was barely a thing.

I didn't know what I wanted to do at University at the age of 17. Yet whilst completing my final year exams, I filled out my list of preferred degrees informed by short blurbs from a phone book-thick course guide. I ended up in a molecular biology/genetics/biochemistry degree and proceeded to hate if for the three year duration. Good choice me [1].

The availability of massive open online courses (MOOCs) allows me to say, "Hey, what's a course in Marine Biology like?", then find one and enrol. In the comfort of my home, I watch videos, attempt quizzes and submit assignments. If I don't understand any of the course content, I can engage with other student on discussion forums [2].

I enrolled and completed two Coursera courses to date: The Data Scientist's Toolbox and R Programming. The courses are really well-structured and the range of topics covered have been broad. R programming included the following:
  • Data types
  • Subsetting
  • Reading and writing data
  • Control structures
  • Functions
  • Scoping rules
  • Vectorized operations
  • Debugging.

Course assignments drove the requirement to learn programming in R. One assignment provided hospital mortality data from hundreds of hospitals across the United States. The assignment was to write a function whereby users could enter the name of a US State and request a list that ranked the best (lowest mortality) or worst hospitals (higher mortality). Since I have programming experience, I could list the logical steps required to fulfil the task: 
  • Read the spreadsheet data
  • Point to the State and mortality columns
  • Get the user's input parameters (State and rank request)
  • Organise the data as specified by the user's parameters (subsetting and sorting)
  • Return the result as a list.

I do know how to code the logical steps via VBA in Excel. However the assignment required coding in R. I had to learn the R syntax required for each step. That is, I had to correctly use the R syntax for a For loop, for an If Else statement, etc. The screen-grab shows a simple For loop with an If Else statement as written in VBA (background) and R (foreground using the R Studio IDE). One step at a time, putting it all together, I had a functioning function.



Different programming languages can all perform the same basic tasks (I assume). Efficiency, tools and packages, time, usability, cost, experience, popularity – such factors influence the choice of a programming language. I can see where I would use R for some projects at work over VBA. Then I can enter the debate of whether one should use SAS/SPSS/Python/R for business needs. I think the first two have a greater market share. However the last two are open source and have a dedicated users with various Meetup groups. For now, I'm happy to sample different programming languages via MOOCs. But life still owes me a gap year.


References and notes
1. I believe that it is a tad bit unfair for a 17 or 18 year old to make a choice that significantly dictates the course of one's career for (most of) life. I had had very few major life experiences at 17; I was living at home, was not working, had barely travelled, and only knew how to make plain omelettes and cheese pizzas. 
2. A disadvantage to "real life" courses is the lack of direct access to lecturers and tutors. Recognition of qualifications upon completing an online course is under development.

Sunday 6 July 2014

Find word matches across two sheets – Excel VBA

We have an Incidents registry where staff members can note down any work, health and safety (WHS)/ occupational health and safety (OHS) events that occur on-site. These can be innocuous (paper cut) to the severe. Here is a simple list of incidents with dates on a worksheet called “Incidents”.


I was asked to provide a subset of this sheet containing any incidents that made mention of "WHS" (red text). I could also highlight any mention of "OHS" (magenta text), though this was secondary. The real list I worked off had many thousands of rows (and no coloured text). What did I do to find the targets "WHS" and "OHS" (including "WH&S" and "OH&S")? I used Ctrl F then marked the cell next to the incident with "1". At the end, I could sort by these numbers and remove rows that contained none – hence subsetting the data as required.

As I Ctrl F'ed and contemplated my life in general, I knew I would have to perform a similar task on a different spreadsheet in the future. Did someone say coding solution?

I built something! After running the code to search for "WHS", "WH&S", "OHS" and "OH&S", I get the below results.


First I created the Target sheet which contains the list of words/terms that I wish to find on the Incidents sheet. Column A lists the primary target words to search for. If either "WHS" or "WH&S" are found, the incident row will be marked with "1".Column B lists the secondary targets. If either "OHS" or "OH&S" are found on the incident row AND the row IS NOT marked with "1", then the row will be marked with "2" (I do not want to overwrite any existing marks).


I shall pause here. For those who wish to see the code [1], I have (poorly) screen-grabbed it at the end of this post [2]. For those still with me – I will attempt to translate the code steps into words. I'll do this is two parts - Part 1: Sections of code, and, Part 2: Details of code.

Part 1: Sections of code

The code is composed of three key sections: 
1. For c = 1 To colLast (red vertical line).
2. For r = 1 To rowLast (blue vertical line).
3. With Worksheets(“Incidents”).UsedRange (green vertical line).

For c = 1 To colLast iterates through each column on the Target sheet.

For r = 1 To rowLast iterates through each row within each column on the Target sheet.

With Worksheets(“Incidents”).UsedRange interrogates the cells that contain data on the Incidents sheet.

The order of what happens is:
1. First column on the Target sheet is selected (column A).
2. First row in the column on the Target sheet is selected (row 1, column A). A target word is selected ("WHS").
3. The target word is used to find all matches on the Incidents sheet (rows 4, 5, 9, 11, 12) and marks these with "1".
4. The next row in the column on the Target sheet is selected (row 2, column A). The next target word is selected (“WH&S”).
5. The target word is used to find all matches on the Incidents sheet (rows 5, 12) and marks these with "1".
6. The next column on the Target sheet is selected (column B). 
7. The next row in the column on the target sheet is selected (row 1, column B). A target word is selected ("OHS").
8. The target word is used to find all matches on the Incidents sheet and marks these with "2". However, if a row already contains a match value, then it will not overwrite this value (only row 7 will be marked with "2").
9. The next row in the column on the Target sheet is selected (row 2, column B). The next target word is selected ("OH&S").
10. The target word is used to find all matches on the Incidents sheet and marks these with "2" if there is no prior marked value (no additional rows will be marked with "2").

And that's it! All of the words on the Target sheet have been grabbed and used to find a match on the Incidents sheet.

Part 2: Details of code (I'm impressed you're still reading)

Remember: the Target sheet contains the words I am trying to find on the Incidents sheet. The below steps follow the lines of code from top to bottom.
  • Determine the last column with data on the Target sheet. Why? I need to know how many columns there are to search across for target words – otherwise I would search endlessly and freeze the program. There are 2 columns in the Target sheet, A and B.
  • For each column on the Target sheet, determine the last row with data. Starting with column A, the last row with data is row 2.
  • Since I now know the column number (1) and the last row with data (2) in the column (column A), I can get the range of the column. I will select my target words from this range to begin. If I used the whole worksheet range, I would inadvertently start selecting blank cells to find target words. Since each row of the Incidents data contains blank spaces, each row would be marked with "1", which I do not want. Setting ranges involve knowing exactly where the data you wish to use is located.
  • Activate the Incidents sheet. A user can activate a sheet simply by selecting that sheet. Once the code has activated the sheet, any action by the code will apply to the Incidents sheet (unless specified otherwise).
  • Get the first empty column of Incidents sheet. Why? Because this is where the mark values will be populated each time a target word is found. I do not wish to populate the values in a column that contains data.
  • Set initial values for variables. I have two – rowLoc and firstRowLoc. Before describing these, I should note the code is driven by a range called Loc (I have underlined in red where this appears in the code below). Loc gets the cell location (range) where the target word match was found. I can get the row number of where Loc exists via rowLoc = Loc.row. firstRowLoc is the row where the first instance of a target word match was found. Before I begin to use rowLoc and firstRowLoc, I have to set initial values (they cannot be blank). rowLoc = 0 and firstRowLoc = -1 and I'll explain why I used these values below.
  • Specify the range to use when searching for target words in the Incident sheet. The code will not search outside of this range.
  • Select a target word from the target range. Recall above that the range of column A listed words has been determined. The first target word “WHS” is chosen.
  • Set Loc to find the row where the target word exists on the Incidents sheet. The first instance of where “WHS” is found is on row 4. At this point, the values of rowLoc and firstRowLoc are updated to both equal "4" (no longer 0 and -1, respectively). 
  • Enter the Do Until loop. The Do Until loop will continue until the rowLoc = firstRowLoc. In other words, when the Loc returns to the first row where a target word match was found, the condition of rowLoc = firstRowLoc is met, and the loop ends. Had rowLoc and firstRowLoc both been initialised with 0 at the start, then this condition (rowLoc = firstRowLoc; 0 = 0) would have been met and the Do Until loop would not have commenced. Hence prior to the loop, rowLoc = 0 and firstRowLoc = -1 (unequal values). But wait! The previous step indicated that both rowLoc and firstRowLoc equalled 4. If this rowLoc = firstRowLoc condition is now met, shouldn't the loop end? No, because there is a code further on that will find the new target word match and update the rowLoc value (to "5"). Thus when entering the loop again, rowLoc (5) will not equal firstRowLoc (4) and the loop will continue. 
  • Still in the loop, when a match is found, mark with "1" if target words are from column A. Otherwise, mark with the column number for column B onwards. That is, if the word is "OHS" or "OH&S", this arises from column B, which is the second column. Hence, the mark value is "2". BUT – only mark with "2" if there is no prior value of "1".
  • Set Loc again to get the location of the cell where the target word is found next. Get the row number of where the next word will be found. This is where the rowLoc value is updated (to "5"). Upon entering the loop again, this next rowLoc value will be compared against the firstRowLoc value. If the values match, that means the FindNext code has pointed back to the row where the first match was found, and the loop will end. This allows the next target word to be selected and the code re-enters the loop to interrogate the Incidents data to find a match.
  • The process repeats until all words on the Target sheet have been selected and assessed for a match on the sheet.

I probably should have used more screen-grabs for Part 2. My bad.

The code



References and notes
1. I modified code I found on the stackoverflow thread Find all matches in workbook using Excel VBA by Nanashi.
2. The code in the Excel IDE ran over the page. I screen-grabbed two separate sections, pasted it in Word to recombine, made some annotations, then screen-grabbed the combined image. A copy of a copy made is a poor copy. I assume there is a better way to have produced this.