Long time no see – 2017

It has been a while since my last post here, in which I wrote about my plans to study neural networks and shared some resources. Unfortunately, I have not been able to make much progress in that area due to a job change and a move to a new state.

Of late, I have been focusing on being more active on Cross Validated (the Stack Exchange website for statistics). Some of the questions there result in really insightful answers that are a lesson in themselves. I am thinking of doing a blog post series on such questions so that I can discuss them in much more detail. In doing so, I hope to improve my statistics fundamentals and also keep my learning process going.

Meanwhile, I have been reading Rohit Brijnath’s column fairly regularly and would recommend it to any sports fan. His writing is some of the best I have come across of late, and I cannot recommend it enough.

On the reading front, I have completed books 5 and 6 of the Malazan series (more on these in a different post, hopefully). I am currently making my way through ‘String Theory’ by David Foster Wallace, which is a collection of his essays on tennis.

Going back to the series of posts dealing with questions on Cross Validated (CV from here on), I plan to deal with one question in each post to make sure I cover the material in detail. I plan to cover at least one question a week, and more when I can.

My first question will be the following:

On training, validation and testing datasets

Today, as I was going through the top answers of interesting users on Cross Validated, I came across a question that struck me with its fundamental nature.

The question asks:

Cross-validation including training, validation, and testing. Why do we need three subsets?

This seemed to be a very interesting question. In essence, it concentrates on why three datasets are needed and how this differs from cross-validation.

Assuming clarity on why we do cross-validation (to be discussed in detail, probably in another post), the top answer to the linked question says we need three subsets for the following purposes:

  • The training set is used to choose the optimum parameters for a given model. Note that evaluating some given set of parameters using the training set should give you an unbiased estimate of your cost function – it is the act of choosing the parameters which optimise the estimate of your cost function based on the training set that biases the estimate they provide. The parameters were chosen which perform best on the training set; hence, the apparent performance of those parameters, as evaluated on the training set, will be overly optimistic.

  • Having trained using the training set, the validation set is used to choose the best model. Again, note that evaluating any given model using the validation set should give you a representative estimate of the cost function – it is the act of choosing the model which performs best on the validation set that biases the estimate they provide. The model was chosen which performs best on the validation set; hence, the apparent performance of that model, as evaluated on the validation set, will be overly optimistic.

  • Having trained each model using the training set, and chosen the best model using the validation set, the test set tells you how good your final choice of model is. It gives you an unbiased estimate of the actual performance you will get at runtime, which is important to know for a lot of reasons. You can’t use the training set for this, because the parameters are biased towards it. And you can’t use the validation set for this, because the model itself is biased towards those. Hence, the need for a third set.
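
The three roles above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of my own, not something taken from the answer: the data is made up (y = 2x plus noise), the "models" are ridge-penalised slopes that differ only in their penalty, and the names split_three_ways, fit_slope and mse, as well as the 60/20/20 split, are hypothetical choices.

```python
import random


def split_three_ways(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_slope(rows, penalty):
    """Training step: closed-form ridge estimate of the slope in y ~ slope * x."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + penalty)


def mse(slope, rows):
    """Cost function: mean squared error of the fitted line on the given rows."""
    return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)


# Made-up data for illustration: y = 2x plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.3)))

train, val, test = split_three_ways(data)

# 1. Training set: choose the parameter (the slope) for each candidate model.
candidate_penalties = [0.0, 1.0, 10.0, 100.0]
fitted = {p: fit_slope(train, p) for p in candidate_penalties}

# 2. Validation set: choose the model (the penalty) with the lowest error.
best_penalty = min(candidate_penalties, key=lambda p: mse(fitted[p], val))
best_slope = fitted[best_penalty]

# 3. Test set: touched once, only to report the final model's error.
print("chosen penalty:  ", best_penalty)
print("training error:  ", mse(best_slope, train))
print("validation error:", mse(best_slope, val))
print("test error:      ", mse(best_slope, test))
```

The point of the sketch is the order of operations: the slope never sees the validation or test rows, the penalty is picked without looking at the test rows, and the test error is computed only after both choices are fixed. With so few candidate models the three errors will usually be close; the optimism the answer describes grows as more parameters and more candidate models are tried.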

These outline some of the very important ideas that anyone who works with statistical modeling should have embedded into their skulls. Often, we build models and test them without knowing the complete picture. Doing this well takes both business acumen and statistical acumen, for one without the other is pointless.

That is why, during assignments or exercises, I always ask myself why we are doing this. Why? Why? Why? But stopping once I get the answer would not help; I should also be asking what-if questions. There can be two types of what-if questions:

  1. What-if it is not done this way?
  2. What-if I do it in an alternate way?

Now, these two questions might seem to be essentially the same, but they aren’t. And in my opinion, understanding the difference between them makes a remarkable difference to one’s learning process.

  1. What-if it is not done this way?

This mostly applies to techniques that have a set methodology and leave little room for choice. In such cases, knowing the repercussions of not using the technique, or of misusing it, can often help with diagnostics over the course of a project. An unwanted result or behavior that you encounter during your project can be diagnosed by asking this question and knowing its answer(s).

  2. What-if I do it in an alternate way?

This question lets you discover alternatives that you can learn about. It also gives you vital information about the technique you are trying to learn, namely how important a role it plays in answering the question it deals with. If there are few or no alternatives, this mostly implies that the technique is either redundant or all-ruling (which is rarely the case). As for how this question differs from the first: the first is about understanding what happens when a technique is not used, whereas the second is about knowing the choices you have when such a situation arises. A combination of these two questions gives you a clarity and technical mastery that are, in my opinion, utterly essential.