On training, validation and testing datasets

Today, as I was going through the top answers of interesting users on Cross Validated, I came across a question that struck me with its fundamental nature.

The question asks:

Cross-validation including training, validation, and testing. Why do we need three subsets?

This seemed to be a very interesting question. On one level, it concentrates on the need for three datasets and, essentially, on how that differs from cross-validation.

Assuming clarity on why we do cross-validation (to be discussed in detail, probably in another post), we need three subsets for the following purposes, as per the top answer to the question:

  • The training set is used to choose the optimum parameters for a given model. Note that evaluating some given set of parameters using the training set should give you an unbiased estimate of your cost function – it is the act of choosing the parameters which optimise the estimate of your cost function based on the training set that biases the estimate they provide. The parameters were chosen which perform best on the training set; hence, the apparent performance of those parameters, as evaluated on the training set, will be overly optimistic.

  • Having trained using the training set, the validation set is used to choose the best model. Again, note that evaluating any given model using the validation set should give you a representative estimate of the cost function – it is the act of choosing the model which performs best on the validation set that biases the estimate they provide. The model was chosen which performs best on the validation set; hence, the apparent performance of that model, as evaluated on the validation set, will be overly optimistic.

  • Having trained each model using the training set, and chosen the best model using the validation set, the test set tells you how good your final choice of model is. It gives you an unbiased estimate of the actual performance you will get at runtime, which is important to know for a lot of reasons. You can’t use the training set for this, because the parameters are biased towards it. And you can’t use the validation set for this, because the model itself is biased towards those. Hence, the need for a third set.
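
The three roles above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not something from the answer: the data is synthetic (y = 2x plus noise), the "model" is a one-dimensional ridge regression without an intercept, and the candidate models differ only in the penalty `lam`. The names `make_data`, `fit_ridge` and `mse` are made up for this example.

```python
import random

def make_data(n, rng):
    # Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 0.5)) for x in xs]

def fit_ridge(data, lam):
    # Closed-form 1-D ridge regression without intercept:
    # w = sum(x*y) / (sum(x*x) + lam). This is the "choose parameters" step.
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(w, data):
    # Mean squared error of the prediction w*x on the given data.
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(300, rng)

# Split once into three disjoint subsets: 60% / 20% / 20%.
train, val, test = data[:180], data[180:240], data[240:]

# 1. Training set: fit the parameter w for every candidate model.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
fits = {lam: fit_ridge(train, lam) for lam in lambdas}

# 2. Validation set: pick the model (here, the penalty) that does best.
best_lam = min(lambdas, key=lambda lam: mse(fits[lam], val))
best_w = fits[best_lam]

# 3. Test set: touched only once, to estimate the chosen model's error.
print("chosen lambda:", best_lam)
print("train MSE:", mse(best_w, train))
print("validation MSE:", mse(best_w, val))
print("test MSE:", mse(best_w, test))
```

The training and validation errors printed at the end were both used to make a choice (of `w` and of `lam` respectively), which is why only the test error can be read as an estimate of performance on new data.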

These outline some very important ideas that anyone who works with statistical modeling should have embedded in their skull. Often, we build models and test them without knowing the complete picture. Modeling involves both our business acumen and our statistical acumen, for one without the other is pointless.

That is why, during assignments or exercises, I always ask myself why we are doing this. Why? Why? Why? But stopping once I get the answer would not help; I should also be asking what-if questions. There can be two types of what-if questions:

  1. What-if it is not done this way?
  2. What-if I do it in an alternate way?

Now, these two questions might seem to be essentially the same, but they aren’t. And understanding the difference between them makes a remarkable difference to one’s learning process, in my view.

  1. What-if it is not done this way?

This mostly applies to techniques that have a set methodology without much of a choice. In such cases, knowing the repercussions of not using the technique, or of misusing it, often helps with diagnostics during a project. Some unwanted result or behavior that you encounter can be diagnosed by asking this question and knowing its answer(s).

  2. What-if I do it in an alternate way?

This question lets you discover possible alternatives worth learning about. It also gives you vital information about the technique you are trying to learn, namely how important a role it plays in answering the question it deals with. If there are few or zero alternatives, this mostly implies that the technique is either redundant or all-ruling (which is rarely the case). As for how this question differs from the first: the first is about understanding what happens when a technique is not used, while the second is about knowing the choices you have when such a situation arises. A combination of these two questions gives you a clarity and technical mastery that are, in my view, utterly essential.


The elements of statistical learning

Having just finished exams for the first flex of the spring semester, I am looking at the last two months of study in the master’s program I am in. The courses I will be taking are already going to make this time a lot more interesting (not to mention the invisible, never-ending job search :p). I have, however, recently seen that a lot of machine learning can be implemented in R and need not require expertise in Python. Although I am all for learning and advancing my Python knowledge, I have come to face the fact that my coursework will not allow me to do so.

Coming to the machine learning part, I will be using the following book for this purpose:

  1. The Elements of Statistical Learning (2nd edition)

I have numbered this list because I plan to add more books if I come across any good enough to be added.

I will be posting my notes, thoughts, doubts and links from the internet on this blog. I want them to serve as references for myself in the future. Hence, I will not cut any corners while doing so and will try to make them as rigorous as possible.

Also, as and when time permits, I will try using LaTeX.


Sachin Tendulkar

Today I chanced upon an article on Cricinfo. It was five questions to Brijnath, my favorite sports writer, and the questions were about Sachin Tendulkar.


The answer to the last question particularly stayed with me. Brilliant indeed.

Fifteen years from now, if a young boy or girl were to ask you about Tendulkar what would you tell them?
Even as a writer, I wouldn’t be able to. Not sufficiently. No numbers suffice. No quotes from his peers will do. I have about seven-eight books on Muhammad Ali on my bookshelf. He fascinates me. I will read everything on him. But I wish I lived in his time, through Vietnam and his ban, I wish I had experienced him. And it’s the same with Tendulkar. He was an experience. You were either there or you were not.

The first complete match I watched that involved a splendid Tendulkar innings was India vs Pakistan in the 2003 World Cup. I remember that innings so vividly. Pakistan batted first, and I saw Anwar making merry with our bowlers; some good batting got them to a 270+ score, setting a decent target for India.

More often than not, Pakistan is a team that cherishes its bowling and getting batsmen into trouble. This time was no different. Their bowling attack included Akram, Waqar, Akhtar and a young Umar Gul. This was probably the best attack in that World Cup, comparable only with the Aussie bowling lineup led by McGrath.

So there was the Indian opening pair, Sachin and Sehwag, walking out to the middle to chase down this target. The Indian batting lineup could definitely give the Pak bowlers a hard day. One of the Indian batsmen especially revelled in challenges. Just as you would love to do well in an exam on a subject you are passionate about, and nothing gives more happiness than doing amazingly well in a paper set by a difficult professor, so was Sachin’s appetite for challenging totals and fierce bowling lineups. Except for that one chance where Razzaq dropped him off Akram, he was flawless for the rest of the innings.

Such was the beauty of the shots he was conjuring that it felt as if one was listening to Bach or Mozart through the sounds the ball made when Sachin sent it to the boundary. I wouldn’t say this was a brash innings, if you can call any of Sachin’s innings that. Except for that one six off Akhtar, all the other shots seemed to be creating a symphony, one that an avid cricket fan would watch again and again, one that would bring a smile to all of his admirers, one that the Pak bowling lineup would remember forever.

As he went on with his innings, one could see the concentration in him. It was as if the entire cacophony of sounds in the ground wasn’t audible to him, as if he knew he had to win this no matter what, as if he wanted the one trophy that had eluded him and his country. Once again, he got out at 98, due to a combination of cramps and a bouncer by Akhtar. He walked off, yet again, without a century, but having done almost what he set out to do in the first place. Thankfully, it was not a repeat of Chennai 136, though I very much believe this was a good possibility given the bowling lineup of Pakistan. The remaining Indian batsmen came through, especially Dravid.

Dravid’s innings was no less important than Sachin’s in this chase. Every additional minute he batted reduced Pakistan’s hopes of victory. And he made sure India did win that match. But his innings, as has happened many a time, stayed and to this day stays in the shadow of Sachin’s innings.

It was a victory that was all the sweeter for the level of the competition. I do not think there is an Indian victory over Pakistan that I would rank higher than this one (among the ones I have seen live on TV).

A decade later, I would be in the stadium for his final outing for the Indian team. Little did I know that the master would continue this for 10 years more, for which I am thankful.

Like Rohit Brijnath says in the article, Sachin was an experience. You were either there or not there.