On training, validation and testing datasets

Today as I was going through top answers of interesting users on Cross Validated, I came across a question that stuck me with its fundamental nature.

The question asks:

Cross-validation including training, validation, and testing. Why do we need three subsets?

This seemed to be a very interesting question. One one fold it was concentration on the need for 3 datasets and essentially its difference with cross validation.

Assuming clarity on why we do cross validation (to be discussed in detail probably in another post), we need 3 subsets for the following purpose as per the top answer in the link:

  • The training set is used to choose the optimum parameters for a given model. Note that evaluating some given set of parameters using the training set should give you an unbiased estimate of your cost function – it is the act of choosing the parameters which optimise the estimate of your cost function based on the training set that biases the estimate they provide. The parameters were chosen which perform best on the training set; hence, the apparent performance of those parameters, as evaluated on the training set, will be overly optimistic.

  • Having trained using the training set, the validation set is used to choose the best model. Again, note that evaluating any given model using the validation set should give you a representative estimate of the cost function – it is the act of choosing the model which performs best on the validation set that biases the estimate they provide. The model was chosen which performs best on the validation set; hence, the apparent performance of that model, as evaluated on the validation set, will be overly optimistic.

  • Having trained each model using the training set, and chosen the best model using the validationset, the test set tells you how good your final choice of model is. It gives you an unbiased estimate of the actual performance you will get at runtime, which is important to know for a lot of reasons. You can’t use the training set for this, because the parameters are biased towards it. And you can’t use the validation set for this, because the model itself is biased towards those. Hence, the need for a third set.

These outline some of the very important ideas that anyone who works with statistical modeling should have embedded into their skulls. Often, we build models and test them without knowing the complete picture. It involves using both our business acumen and statistical acumen for one without another is  pointless.

That is why during the assignments or exercises, I always use the concept of asking myself ‘Why’ are we doing this? Why? WhY? Why? But stopping once I get the answer to that would not help, I should also be asking, what-if questions. There can be two types of what-if questions:

  1. What-if it is not done this way?
  2. What-if I do it in an alternate way?

Now, these two questions might seem to be essentially the same, but they aren’t. And understanding the difference between these two makes a remarkable difference in one’s learning process according to me.

  1. What-if it is not done this way?

This mostly applies to techniques which have a methodology set in place for them without much of a choice. In such cases, knowing the repercussions of not using this technique or misusing this technique can often allow us in diagnostics during the journey of a project. Some unwanted result or behavior that you might encounter during your project can be diagnosed by asking this question and knowing its answer(s).

  1.   What-if I do it in an alternate way?

This question lets you discover any possible alternatives that can be learnt about. This also gives you vital information about the technique/aspect you are trying to learn. And that aspect is, the important role played by the given technique in answering the question it is dealing with. If there are few or zero alternatives, then this might mostly imply that the technique is either redundant or all-ruling (which is rarely the case). Now coming to how this question differs from the 1st question, it is understanding the behavior in case of not using a technique vs knowing the choices you have when such a situation arises. A combination of these two questions will give you a certain clarity and technical mastery that are utterly essential according to me.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s