Overfitting: Data Augmentation, Feature Selection, and Ensemble Methods

The machine learning process in data and analytics can face an array of challenges. Among them is overfitting, which results from a reduction in bias and an increase in variance (Provost and Fawcett, 2013). In other words, overfitting is “a concept in data science, which occurs when a statistical model fits exactly against its training data” (IBM Cloud Education, 2021, para. 1). In general, the problem can arise when a model becomes flexible enough to memorize the data available to it rather than learn general patterns (Delua, 2021). Notably, overfitting can cease to be an issue after the interpolation threshold is passed, a phenomenon manifested in ‘double descent’ (Belkin et al., 2019). Thus, “overfitting refers to the condition when the model completely fits the training data but fails to generalize the unseen testing data” (Kumar, 2021, para. 2). The problem therefore needs to be prevented or mitigated by deploying various techniques.
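To make the failure mode concrete, the following minimal sketch (a hypothetical example, not drawn from the cited sources; it assumes Python with NumPy and scikit-learn and an invented sine-wave dataset) fits polynomials of increasing degree to noisy data and compares training and test error:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy sine data: 60 points, one feature
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # The degree-15 model nearly interpolates the training points yet
    # generalizes worse than the moderate fit: overfitting in miniature.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

The degree-15 run typically shows a near-zero training error paired with a larger test error than the degree-3 run, which is exactly the gap between fitting and generalizing described above.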

Besides early stopping, training with more data, and regularization, there are three further methods to combat overfitting: data augmentation, feature selection, and ensemble methods (IBM Cloud Education, 2021). Data augmentation is a way practitioners “can improve the performance of their models and expand limited datasets to take advantage of the capabilities of big data” (Shorten and Khoshgoftaar, 2019, p. 1). Its key limitation is that it must be applied sparingly, since excessive or unrealistic transformations can distort the underlying signal. In feature selection, redundancy among the input features is addressed by removing irrelevant ones (Meyer et al., 2018). However, this can oversimplify the model if informative features are eliminated along the way. Ensemble methods, such as boosting or bagging, create multiple data samples and train models on them independently (Ying, 2019). Their disadvantage is the reliance on extrinsic coordination, such as consensus among multiple models (Salman and Liu, 2019). Therefore, all three methods are effective when used properly, but their limitations should always be considered.
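As a rough illustration of how the three techniques fit together in practice, the sketch below (a hypothetical Python example using scikit-learn on synthetic data; none of it comes from the sources above) applies Gaussian-noise augmentation, univariate feature selection, and a bagged tree ensemble in sequence:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with many redundant features, inviting overfitting
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Data augmentation: jittered copies of the training rows (sparingly!)
noise = np.random.default_rng(0).normal(scale=0.05, size=X_train.shape)
X_aug = np.vstack([X_train, X_train + noise])
y_aug = np.concatenate([y_train, y_train])

# 2) Feature selection: keep only the k features most associated with the label
selector = SelectKBest(f_classif, k=5).fit(X_aug, y_aug)
X_sel, X_test_sel = selector.transform(X_aug), selector.transform(X_test)

# 3) Ensemble (bagging): many trees on bootstrap samples, predictions combined
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0).fit(X_sel, y_aug)
print("held-out accuracy:", bagged.score(X_test_sel, y_test))

The noise scale, the choice of k, and the ensemble size are all illustrative; each embodies the corresponding limitation noted above, since too much noise, too few features, or too few models would each hurt the result.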

A critical examination of data augmentation, feature selection, and ensemble methods reveals that they can be reconciled with the bias-variance trade-off in machine learning. As noted above, overfitting is a problem of reduced bias and increased variance. Data augmentation adds some noise to the data, loosening the perfect fit and thereby increasing bias. Feature selection is similar in principle, because removing redundant parameters increases bias as well. Ensemble methods, by contrast, reduce variance: several models are trained on independent samples, and combining their outputs yields more stable and accurate estimates.
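The variance-reduction claim can be checked numerically. The sketch below (an illustrative experiment assuming Python with NumPy and scikit-learn; the dataset and query point are invented) fits either a single bootstrap tree or a bagged average of 25 such trees on many independent training draws and measures how much the prediction at a fixed point fluctuates:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x0 = np.array([[0.5]])  # fixed query point at which prediction variance is measured

def prediction_at_x0(n_trees):
    # Draw one noisy training set, then bag n_trees bootstrap trees over it
    X = rng.uniform(0, 1, size=(50, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=50)
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(x0)[0])
    return np.mean(preds)

# Repeat over many independent training sets: the spread of the averaged
# prediction shrinks as more bootstrap trees are combined (variance reduction)
for n in (1, 25):
    draws = np.array([prediction_at_x0(n) for _ in range(200)])
    print(f"{n:2d} tree(s): prediction variance {draws.var():.4f}")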

An example of a banking organization is the Bank of India. The ensemble method is further illustrated by the finding that “the gradient boosting model outperforms the base decision tree learner, indicating that ensemble model works better than individual models” (Chopra and Bhilare, 2018, p. 129). In other words, overfitting, alongside other issues, was avoided or minimized when these measures were applied to credit scoring models. Such an approach yields a competitive advantage for the organization, since it exploits the volume, velocity, and variety of data to score credit applicants properly (McAfee and Brynjolfsson, 2012). All three methods can be used effectively, but the ensemble method is likely to be the most effective because it runs several independent samples. In other words, variance decreases because the small divergences among the unconnected runs, all drawn from the same pool of external data, average out.
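The cited comparison can be reproduced in spirit, though not on the authors’ data, with a sketch like the following (a hypothetical Python example using scikit-learn and a synthetic, class-imbalanced stand-in for a credit scoring table):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a credit table: 20% "default" minority class
X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

for name, model in (("single tree", tree), ("gradient boosting", gbm)):
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")

On such data the boosted ensemble usually posts a visibly higher held-out AUC than the unpruned tree, mirroring the pattern Chopra and Bhilare (2018) report for credit scoring.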

Reference List

Belkin, M. et al. (2019) ‘Reconciling modern machine-learning practice and the classical bias–variance trade-off’, PNAS, 116(32), pp. 15849-15854. Web.

Chopra, A., and Bhilare, P. (2018) ‘Application of ensemble models in credit scoring models’, Business Perspectives and Research, 6(2), pp. 129-141. Web.

Delua, J. (2021) Supervised vs. unsupervised learning: what’s the difference? Web.

IBM Cloud Education (2021) Overfitting. Web.

Kumar, S. (2021) ‘3 techniques to avoid overfitting of decision trees’, Towards Data Science. Web.

McAfee, A., and Brynjolfsson, E. (2012) ‘Big data: the management revolution’, Harvard Business Review. Web.

Meyer, H. et al. (2018) ‘Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation’, Environmental Modelling & Software, 101, pp. 1-9.

Provost, F., and Fawcett, T. (2013) Data science for business: what you need to know about data mining and data-analytic thinking. 1st edn. Sebastopol: O’Reilly Media.

Salman, S., and Liu, X. (2019) ‘Overfitting mechanism and avoidance in deep neural networks’, arXiv, 1901, pp. 1-8. Web.

Shorten, C., and Khoshgoftaar, T. M. (2019) ‘A survey on image data augmentation for deep learning’, Journal of Big Data, 6(60), pp. 1-48.

Ying, X. (2019) ‘An overview of overfitting and its solutions’, Journal of Physics: Conference Series, 1168(2), pp. 1-6.
