
Should I split my data into train/test subsets, or train/validation/test subsets?

Let's say you have a dataset with 10,000 instances. As you mentioned, there are two options:

1) Cross-validation: Divide the data into train and test sets, say of sizes 8,000 and 2,000. Then do 5-fold cross-validation on the training set, so each run trains on 6,400 points and validates on the remaining 1,600. Your final model is the one with the minimum average validation error.

2) Fixed validation set: Divide the data a priori into three parts of sizes 6,400, 1,600 and 2,000 respectively, train on the first part, and do model selection using the second.

The first option is more robust. Suppose that for a given pair of hyperparameter settings, two models give the following errors on the 5 folds:

Model 1: {5.1%, 4.9%, 5.2%, 5.0%, 4.8%} => average: 5.0%
Model 2: {4.1%, 6.4%, 6.7%, 6.5%, 6.3%} => average: 6.0%

Model 1 does better on almost every fold, so it is clearly the one that should be selected. Model 2 wins only on fold 1, perhaps because it accidentally fits the noise in that particular split. But if you use a fixed validation set, and that set happens to resemble fold 1, you will end up selecting the worse model.

The other advantage of cross-validation is that every training point gets used both for fitting and for validation. With a fixed validation set, the validation data is never used for learning, so rare patterns that appear only in the validation set will not be learnt at all.

The downside of cross-validation is that it takes roughly 5 times longer (one training run per fold) to evaluate each candidate model.

So: if you can tolerate the minor loss in reliability caused by the idiosyncrasies of a single fixed validation set, and compute time is more crucial, go with option 2; otherwise go with option 1.
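To make the comparison concrete, here is a minimal plain-Python sketch using the split sizes and the fold errors from the example above. No real training happens; the model names and the "fixed validation set resembles fold 1" scenario are just illustrative assumptions:

```python
from statistics import mean

# Split sizes from the answer: 10,000 total -> 8,000 train / 2,000 test,
# and 5-fold CV on the training set -> 6,400 fit / 1,600 validate per fold.
n_total, n_test, k = 10_000, 2_000, 5
n_train = n_total - n_test
fold_size = n_train // k
assert (n_train - fold_size, fold_size) == (6_400, 1_600)

# Per-fold validation errors (%) from the example above.
fold_errors = {
    "model_1": [5.1, 4.9, 5.2, 5.0, 4.8],
    "model_2": [4.1, 6.4, 6.7, 6.5, 6.3],
}

# Cross-validation: pick the model with the lowest *average* error
# across all k folds.
avg = {name: mean(errs) for name, errs in fold_errors.items()}
best_cv = min(avg, key=avg.get)

# Fixed validation set: if that single set happens to behave like fold 1,
# selection compares only the first entry of each list and picks the
# model that merely got lucky on that split.
best_fixed = min(fold_errors, key=lambda name: fold_errors[name][0])

print(best_cv)     # model_1 (average 5.0% beats 6.0%)
print(best_fixed)  # model_2 (4.1% beats 5.1% on that one split)
```

Averaging over the folds washes out the one lucky split, which is exactly why the cross-validated choice is the more reliable one.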