Random Forests Brain Dump

Edit 8/7/2020:

In the process of resurrecting this site, I came upon this blog post that’s over eight years old, and I semi-cringe.

My current understanding of “how we got to” random forest is this:

  • Bagging, short for Bootstrap Aggregation, is a technique for taking low-bias, high-variance methods, e.g., decision trees, and lowering their variance.
  • This is done simply by drawing B bootstrap samples of the original data, fitting a tree to each, and then averaging the B predictions. The decrease in variance is similar to var(\bar{x}) = \frac{var(x)}{n}.
  • The challenge is you don’t get to that level of decreased variance since there’s correlation amongst the trees, e.g., a particularly dominant feature is in every bootstrapped tree.
  • We attempt to mitigate this with Random Forest, where for each iteration of fitting the tree, we fit on some random subset of the features. For that matter, we can do this on each split.
  • Finally, when the “random subset” is all of the features, then we get back to Bagging.
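
To make the correlation point concrete, here is a small sketch. It uses the standard result for the variance of an average of B identically distributed predictors, each with variance sigma^2 and pairwise correlation rho: rho * sigma^2 + (1 - rho) * sigma^2 / B. The function name and the example numbers are mine, purely for illustration.

```python
def variance_of_average(sigma2: float, rho: float, b: int) -> float:
    """Variance of the mean of b identically distributed predictors.

    sigma2: variance of a single predictor (e.g., one tree)
    rho:    pairwise correlation between predictors
    b:      number of predictors averaged
    """
    # The first term does not shrink as b grows; only the second term does.
    return rho * sigma2 + (1 - rho) * sigma2 / b


# Illustrative numbers only: uncorrelated trees get the full 1/B reduction,
# while correlated trees are floored at rho * sigma2.
print(variance_of_average(1.0, 0.0, 100))  # uncorrelated
print(variance_of_average(1.0, 0.5, 100))  # correlated
```

With rho = 0 this reduces to the var(x)/n formula above; with rho > 0 the variance can never drop below rho * sigma^2 no matter how many trees are averaged, which is why decorrelating the trees helps.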

Original Post:

Revisiting Kaggle, a site and service which hosts multiple data-mining competitions, I found a new competition that looked potentially interesting. It’s been a while since I’ve fully downloaded any competition’s data, so my interest was piqued by the inclusion of R code in a file named sample_code.R, which didn’t exist before.

Analyzing the code, it was clear that its purpose was to provide two submittable benchmark solutions. One was the naive approach, using the mean of the dependent variable as your predictor. The second used Random Forests, a machine learning algorithm. In this particular competition, you were asked to predict n variables, so there were n Random Forest predictors, one for each variable.

Not knowing much about Random Forests, I spent a portion of that day trying to see what it was all about and understand its mechanism.

Wikipedia, as usual, gave me the practitioner’s definition. In short, it “is an ensemble classifier consisting of many decision trees and outputs the class that is the mode of the classes output by the individual trees.” It helps to understand ensemble in this context as an averaging over a set of sub-models, which happen to be decision trees in this case. It then classifies your particular example by seeing how the sub-models (decision trees) each classified it, and then takes the most frequent classification as the final classification for your particular example. Interestingly, the name has been trademarked.

I later found a presentation by Albert A. Montillo going over Random Forests, breaking it down in more digestible bits than Wikipedia with more examples. Following is a brief summary of some points I found useful from his presentation.

Random Forest’s first randomization is through bagging
A bootstrap sample is a training set of N' ≤ N examples drawn from the original N by random sampling with replacement.
Bootstrap aggregation is a parallel combination of learners (decision trees for Random Forests) independently trained on distinct bootstrap samples.
Bagging refers to bootstrap aggregation (independent training on learners with distinct bootstrap samples).

Final prediction is either the mean prediction of the independent learners (for regression) or the most-picked classification (for classification).
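
As a rough sketch of the two bagging ingredients above (drawing a bootstrap sample, then combining the learners' outputs), assuming the common choice N' = N and using only the Python standard library. The function names are hypothetical, not from Montillo's presentation.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_indices(n: int, rng: random.Random) -> list[int]:
    """Draw n row indices with replacement (assumes N' = N)."""
    return [rng.randrange(n) for _ in range(n)]


def aggregate(predictions: list, task: str):
    """Combine the per-learner predictions for one example."""
    if task == "regression":
        return mean(predictions)  # average response
    # classification: the most-picked class
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
sample = bootstrap_indices(10, rng)  # some indices repeat, some are missing
print(sorted(sample))
print(aggregate([1.0, 2.0, 3.0], "regression"))
print(aggregate(["a", "b", "a"], "classification"))
```

Each learner would be trained on the rows picked out by its own call to `bootstrap_indices`; how ties in the vote are broken is left unspecified here.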

Random Forest’s second randomization is through predictor subsets
You select a random subset (m_{try}) of predictors from the total set (k) for each split. Bagging is a special case of Random Forest where m_{try} = k.

With these two features understood, the Random Forest algorithm is easier to follow:

Random Forest Algorithm
For each tree t_{i} you’re building, you first select a bootstrap sample from the original training set to learn on. You then grow an unpruned tree from this bootstrap sample: at each internal node, randomly select m_{try} predictors (from the total set of predictors) and determine the best split using only these predictors. Because the tree is left unpruned, no cost-complexity pruning is performed.

Your overall prediction is the average response (for regression) or majority vote (classification) from all the individually trained trees.
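
To tie the steps together, here is a deliberately tiny regression sketch. One big simplification, and it is mine rather than the algorithm's: each "tree" is a one-split stump instead of a fully grown unpruned tree, so the random predictor subset is drawn only once per tree. All names are hypothetical and the data is a made-up toy set.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature_ids):
    """Best single split (lowest squared error) using only feature_ids."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue  # not a real split
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no usable split: predict the sample mean
        m = mean(targets)
        return lambda row: m
    _, f, threshold, lm, rm = best
    return lambda row: lm if row[f] <= threshold else rm


def fit_forest(rows, targets, n_trees, m_try, seed=0):
    rng = random.Random(seed)
    n, k = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature_ids = rng.sample(range(k), m_try)    # random predictor subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx],
                                feature_ids))
    return forest


def predict(forest, row):
    """Average response over all trees (regression)."""
    return mean(tree(row) for tree in forest)


# Toy data: the target depends only on feature 0; feature 1 is constant.
rows = [[x, 0] for x in range(1, 11)]
targets = [0.0 if x <= 5 else 10.0 for x in range(1, 11)]
forest = fit_forest(rows, targets, n_trees=25, m_try=1)
print(predict(forest, [1, 0]), predict(forest, [9, 0]))
```

A real implementation would recurse on each side of the split, redrawing the m_{try} predictors at every internal node, until the leaves are pure or too small to split.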

Montillo then goes into some practical considerations, e.g., how splits are chosen (squared error, Gini index), how many trees to build (build trees until the error no longer decreases), and how to select m_{try} (use the recommended defaults of \sqrt{k} for classification and \frac{k}{3} for regression).
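
For reference, a small helper that computes m_{try} the way I understand the R randomForest package's defaults to work: the floor of \sqrt{k} for classification, and the floor of \frac{k}{3} (but at least 1) for regression. The function name is hypothetical.

```python
import math


def default_m_try(k: int, task: str) -> int:
    """Default number of predictors tried at each split, given k predictors."""
    if task == "classification":
        return max(math.isqrt(k), 1)   # floor(sqrt(k))
    if task == "regression":
        return max(k // 3, 1)          # floor(k / 3), never below 1
    raise ValueError(f"unknown task: {task!r}")


print(default_m_try(16, "classification"))
print(default_m_try(9, "regression"))
```

These are only starting points; m_{try} is a tuning parameter and the best value depends on the data.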

Additional Information for Free
Random Forests are able to estimate the test error during the building stage.

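For each tree grown, roughly 37% of the samples are not selected in the bootstrap (the probability that a given sample is missed in N draws with replacement is (1 - \frac{1}{N})^N \approx e^{-1}); we call these out-of-bootstrap (OOB) samples, more commonly known as out-of-bag samples. We’re then able to take the OOB samples and feed them in as input to the corresponding tree, treating the predictions made as if they were truly out-of-sample.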
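
A quick way to check the size of the OOB fraction, assuming each bootstrap draws N samples with replacement from N. The function names are hypothetical.

```python
import math
import random


def expected_oob_fraction(n: int) -> float:
    """Probability that a given sample is never drawn in n draws."""
    return (1 - 1 / n) ** n


def simulated_oob_fraction(n: int, seed: int = 0) -> float:
    """Fraction of samples left out of one simulated bootstrap."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


print(expected_oob_fraction(1000))   # close to 1/e
print(math.exp(-1))
print(simulated_oob_fraction(10_000))
```

Both numbers land near 1/e, i.e. a bit over a third of the samples are left out of any one tree's bootstrap.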

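With some bookkeeping, each sample’s OOB prediction is computed as the majority vote (classification) or average response (regression) over only those trees for which that sample was OOB; comparing these predictions against the true values gives the error estimate.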
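
The bookkeeping might look something like the following for regression. This is a sketch under my own assumptions: each tree is represented as a pair of (set of in-bag row indices, prediction function), and the error is mean squared error. Samples that were never OOB are simply skipped.

```python
from collections import defaultdict
from statistics import mean


def oob_error(rows, targets, trees):
    """Mean squared OOB error.

    trees: list of (in_bag, predict) pairs, where in_bag is the set of row
           indices used to train that tree and predict maps a row to a number.
    """
    oob_preds = defaultdict(list)
    for in_bag, predict in trees:
        for i, row in enumerate(rows):
            if i not in in_bag:  # this tree never saw row i
                oob_preds[i].append(predict(row))
    # Average each sample's OOB predictions, then score against the truth.
    return mean((mean(preds) - targets[i]) ** 2
                for i, preds in oob_preds.items())


# Hand-made example: two fake "trees" over three rows.
rows = [[0], [1], [2]]
targets = [0, 1, 2]
trees = [
    ({0, 1}, lambda r: r[0]),      # row 2 is OOB, predicted exactly
    ({1, 2}, lambda r: r[0] + 1),  # row 0 is OOB, predicted as 1
]
print(oob_error(rows, targets, trees))  # → 0.5
```

For classification, the inner `mean(preds)` would become a majority vote and the squared error a misclassification indicator.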

Supposedly this test error estimate is very accurate in practice, provided you build a reasonable number of trees.

After going over Montillo’s presentation a couple times, the Wikipedia article and other assorted reading (from Google) made much more sense. I found the original paper and “official” documentation both informative and useful.

Circling back, in the end, it was nice to actually understand what the benchmark solution was doing. The R manual for the randomForest package was much more useable (if only because you’re more confident what the parameters for the functions actually mean), giving me the ability to modify the benchmark solution as I saw fit.

Keeping Track of Your Finances (The Little Things Add Up)

Over time, I’ve realized that while big ticket items (car, education) and lesser big ticket items (TV, new bike) hit the pocketbook visibly, in that you can point to a credit card statement and say “oh, it’s so high since I just paid off my tuition,” it’s the small items that hit the pocketbook invisibly. Symptoms include, but are not limited to, feeling that you should have had more to save at the end of each month and wondering where all your money went.

The reason you can’t identify where your money went is not that you’re daft; it’s just much harder to keep track of 20 items that add up to $500 than a few items that add up to $500. And even if you got more bang for your buck with 20 items, you still need to ask yourself: of those 20 items you bought totaling $500, how many were necessary?

It’s for this reason that it’s critical to keep track of your finances. Keeping track of your finances is not loading up your four credit card websites and your two bank websites and seeing if there were any peculiar charges. Keeping track of your finances means that you have buckets for your spending. You have your fast-food-splurge bucket. You have your grocery bucket. You have your car payment bucket. You have your going-out-to-drink bucket. You have your seeing-movies-with-friends bucket. You have your in-your-underwear-shopping-on-amazon-at-2am bucket. Keeping track of your finances means you must be able to categorize every single transaction you perform into one of your existing buckets.

Sitting down and coming up with every conceivable bucket you need a priori is a Sisyphean task. However, what is doable, and needs to be done, is going through every single credit card and bank statement you have for the past three months (if not more) and categorizing each transaction. You can use paper and pencil, but I recommend your preferred spreadsheet program for this. A free one you can use is the spreadsheet program by Google.

Having gone through this process (I anticipate it taking at minimum five hours and more likely a couple days), you will get a much better idea of how your finances actually are. You will notice, for example, that you’re spending upwards of $500 on just going out for drinks in a given month. You most likely didn’t think you were spending this much, since $10 here and $14 there don’t really seem to add up to $500. But, $10 here and $14 there, extrapolated to the rest of the month, does add up to $500. At this point, I would take a deep breath. When I did this exercise before and got to the “how much I spent in each bucket” stage, I was having mini-anxiety attacks about how my money was just bleeding out. That being said, you did a hard step and deserve some congratulations.
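
If you’d rather script the tallying than do it in a spreadsheet, the “how much I spent in each bucket” step is just a sum per category. A minimal sketch, assuming you’ve already labeled each transaction with a bucket; the function name and the sample amounts are made up for illustration.

```python
from collections import defaultdict


def totals_by_bucket(transactions):
    """Sum amounts per bucket.

    transactions: iterable of (bucket_name, amount) pairs.
    """
    totals = defaultdict(float)
    for bucket, amount in transactions:
        totals[bucket] += amount
    return dict(totals)


# Made-up example transactions, not real data.
example = [
    ("drinks", 10.0),
    ("drinks", 14.0),
    ("groceries", 52.5),
]
for bucket, total in sorted(totals_by_bucket(example).items()):
    print(f"{bucket}: ${total:.2f}")
```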

As the PSA from G.I. Joe would say, “Now you know, and knowing is half the battle.” With this knowledge, you can now take proactive steps in curtailing your spending on the little items, e.g., you will repeatedly tell yourself “I don’t need this {random $10 item}” knowing that those $10 items will easily add up to a much larger tab in the end.

Bonus for reading this far: My personal workflow is to use mint.com, which is effectively an aggregator for all your financial accounts. I won’t get into the details of the security of their approach, but suffice it to say that I feel comfortable enough to use it. It will download all the transactions you have and attempt to categorize them for you. You can easily go in and change the categories, add new categories, etc., saving you a lot more time than doing the spreadsheet approach by hand. Better still, it can show you how your budget has evolved month by month. It’s a great time-saving tool for keeping track of your finances, especially the little things.

Random Readings 0001 – Investment Related

Kiplinger provided a list of four companies that are similar to Berkshire Hathaway, the company chaired by Warren Buffett. Specifically, they highlighted Markel, Fairfax Financial, Loews, and Leucadia National. The common thread among such companies is that they are cash-rich businesses from underwriting insurance and need to do something with the cash. At least for “Berkshire Hathaway”-like companies, they leverage the cash to build large stock portfolios and/or acquire value/distress-based companies.

Continuing the theme of taking insurance premiums and investing them, Greenlight Capital Re is a reinsurer that takes its premiums and invests them in David Einhorn’s hedge fund, Greenlight Capital.

Jeffrey was profiling Annaly Capital Management and incidentally highlighted the downside risks of all the high-dividend-yield REITs we see. Specifically, the strategies typically involve borrowing at low interest rates and investing the proceeds in various sorts of mortgage securities, which typically earn a higher rate of return. The risks are that 1) interest rates rising from the all-time lows we have now will shrink the obtainable yields, and 2) if homeowners become able to refinance at the current lower interest rates (although if you’re underwater, it will be difficult to refinance), the yields will shrink.

Why a 529 Savings Plan is Superior

This is an article demonstrating the benefit of the 529 savings plan compared against other potential choices for your child’s college savings.

Having a newborn, I recently began investigating options for saving money for the kid’s college fund. Of course, there are many options, including (but not limited to) investing directly yourself, UGMA/UTMA, and the 529 plan.

I think the natural choice arises when you begin with the right set of questions. E.g., “Is your kid going to turn into a twat at 18” or “How do you think your kid will handle having a sudden influx of money?” Not that this is directly correlated, but more than 50% of NBA and NFL players experience bankruptcy or financial duress post retirement – leading me to believe that if you don’t have a good handle on how to use money and debt, a sudden influx of money isn’t going to fix that.

Of the options I mentioned, only “investing directly yourself” and the 529 plan allow you to be in control of the account at all times (including how you handle distributions). With UGMA/UTMA, the kid inherits all of the money at the moment they turn 18 and can spend it all on baseball cards if they so desire. And of “investing directly yourself” and the 529 plan, only the latter is tax-advantaged.

With a 529 plan, when applying for financial aid, it’s more advantageous since it is counted much less so towards total expected family contribution. Additionally, you’re able to transfer the beneficiary to other people (including yourself) if there’s unused money. Really, I’m not seeing any downsides here with the 529 plan.

In short, even though we hope our kids don’t turn out to be financially irresponsible, they might anyway due to inexperience. We need to remember that we’ve had the hard lessons already, in addition to years on them, and they haven’t had the chance to learn these things on their own. We want to help our kids through college but let’s not give them enough rope to hang themselves.

Personally, I’m not going to tell my kid(s) about the 529’s existence, and I’ll have them work under the assumption that they had better do well in school now to attain merit scholarships. And when they ask for residual money not covered by scholarships for additional school material, I’ll just tell them “Ugh. I’ll dig in my wallet and see what I can come up with” when I’m really digging into their 529.

Note that when I reference the 529 plan, I’m referring to the self-directed investing (instead of say, locking in a state tuition rate). Of all the 529 plans I saw, I ended up going with Vanguard (the 529 plan being “based” in Nevada), since Vanguard funds had the lowest expenses I’ve seen out of all the plans.

Cost of Replacing a 2012 Ford Edge Key

Replacing a 2012 Ford Edge key can be very expensive if you’re not prepared.

Recently, I misplaced (read: placed on the roof of my car) my 2012 Ford Edge key. Not knowing anything about keys, I assumed it was a relatively straightforward process to get a new copy. Not so much.

First and foremost, the key is a smart key, which implies there’s a chip embedded in the key that effectively talks to the car. I.e., if the key isn’t programmed to your car specifically, it’s useless. This also implies that most likely your local Walmart won’t be able to help you.

Feeling panicky, I got some quotes from the Ford dealer, with all sorts of prices that were effectively at least $500. I ended up finding a parts dealer and ordering the fob ($150) and the key ($25) and felt pretty proud of myself, since I was under the assumption that I could program the second key myself with at least one working key. That came quickly crashing down once I actually pulled out the manual and realized that I needed at least two already programmed keys in order to program another key. So, my second key really wasn’t a spare at all, and a spare would have been the third key had the dealer given me three keys, which obviously they did not.

I ended up calling around various Ford dealers to get quotes for “I have one programmed key and one unprogrammed key and need to program the unprogrammed one.” I got various prices ranging from $100 and upwards but eventually I got a dealer who quoted me a price of $50 to reprogram that one key.

During all of this, I did more research on keys. Apparently, I don’t really need the $175 magical key from Ford. On eBay, I found various sellers selling blank uncut transponder keys for around $15. As such, I would be able to take my two programmed keys and start programming the cheapo-deapo keys as backups. It didn’t matter that the keys were uncut, since the Ford Edge I had couldn’t be started with the physical key anyway; the most the mechanical key could do was lock and unlock the door, but once in, you couldn’t start the car until the programmed key was within range.

The moral of the story as such is that for these new fancy keys, you really don’t have a spare and that you should immediately begin the process of programming a third (if not more) key. And, for the purposes of a true spare, it doesn’t need to be the official Ford key and can be a blank uncut transponder key.

Cost of replacing a “spare”: $250
Cost of replacing a true spare: $15