Introduction
Information gain is a useful method for predicting customer behavior and allows products to be targeted and marketed more effectively. Nevertheless, the method has several limitations.
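For reference, the method can be stated formally. Using the standard decision-tree notation, the entropy of a labeled set S and the information gain from splitting it on an attribute A are:

\[ H(S) = -\sum_i p_i \log_2 p_i \qquad \mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \]

where p_i is the proportion of class i in S and S_v is the subset of S on which A takes the value v.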
Discussion
Data cleanliness, and the possibility of quality issues recurring, is one of the limitations of using information gain in the future. The organization has previously encountered such a problem, and there is no guarantee that it will not happen again. The data had to be cleaned as a result, which meant extra work and additional time spent on the information gain analysis. Further data collection should therefore ensure that the information is clean, properly separated, and categorized so that it can be easily accessed and used in the future.
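As a minimal sketch of what such preparation might look like, assuming the data arrives as a CSV export and that the file name and column names (segment, purchased) are purely illustrative:

```python
import pandas as pd

# Hypothetical raw export; file name and columns are illustrative only.
raw = pd.read_csv("customer_export.csv")

# Drop exact duplicates and rows missing the target label.
clean = raw.drop_duplicates().dropna(subset=["purchased"])

# Normalize an inconsistently coded categorical column.
clean["segment"] = clean["segment"].str.strip().str.lower()

# Keep numerical and categorical attributes separated for later encoding.
numeric_cols = clean.select_dtypes(include="number").columns.tolist()
categorical_cols = clean.select_dtypes(include="object").columns.tolist()
```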
Another issue concerns the complexity of the data, which stems primarily from the presence of both categorical and numerical values in the information collected by the organization. This can create problems in correctly categorizing and assessing the data (Lutes, 2020). Moreover, the data set contains independent attributes whose values may not affect the actual relevance of the prediction, and these can strongly distort the results of the analysis (Santini, 2015). The analysis procedures related to information gain will nevertheless use this data, so the processes must be adapted to these potential problems.
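One common way to bring numerical and categorical attributes onto the same footing before an entropy-based split is to discretize the numerical values and encode the categorical ones; the sketch below uses made-up column names, values, and bin boundaries purely for illustration:

```python
import pandas as pd

# Hypothetical frame with one numerical and one categorical attribute.
df = pd.DataFrame({
    "age": [23, 35, 47, 52, 29, 61],
    "segment": ["web", "store", "web", "store", "web", "store"],
    "purchased": [0, 1, 1, 1, 0, 1],
})

# Numerical attribute: discretize into bands so it can be handled
# like a categorical attribute in an entropy-based split.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "senior"])

# Categorical attribute: one-hot encode for learners that expect numbers.
encoded = pd.get_dummies(df[["segment"]], prefix="segment")
```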
Overfitting is a common issue in information gain and should be considered in every case. It arises when a prediction model adapts too closely to the particular information being assessed, which often results in wrong predictions on new data (Bramer, 2007). It is worth noting that this is not the fault of any particular model; it is largely a human factor. A model should be chosen and adapted to the data set with care to avoid excessive complexity in the modifications made to it (Provost and Fawcett, 2013). Such modifications are usually the main cause of eventual overfitting and must be made carefully and for relevant reasons. The main safeguard against overfitting is to test the developed model on a holdout set.
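A minimal sketch of holdout testing, using scikit-learn and a synthetic data set as a stand-in for the organization's data; a large gap between training and holdout accuracy signals overfitting, and limiting tree depth is one way of constraining model complexity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the organization's customer data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a portion of the data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree can effectively memorize the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constraining depth trades training accuracy for generalization.
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", deep), ("depth-limited", shallow)]:
    gap = model.score(X_train, y_train) - model.score(X_test, y_test)
    print(name, "train-test accuracy gap:", round(gap, 3))
```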
Information gain is also prone to favoring attributes with many possible values, and such risk-prone attributes are almost unavoidable. This is the case in the organization under review, where the attribute in question is the customer ID. IDs are independent of any other factors and take a very large number of distinct values (Tang, Alelyani, and Liu, 2014). Splitting on such an attribute produces subsets whose entropy is close to or equal to zero, so the attribute receives maximal information gain and the model becomes biased toward evaluating information on the basis of it (Buscemi, Das, and Wilde, 2016). Issues of this kind make the results of the analysis unusable and should be avoided.
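A small illustration of why a unique identifier dominates a plain information-gain criterion: splitting on it leaves every subset with zero entropy, so its gain equals the entropy of the whole set. The values below are invented for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain from splitting the labels by the given attribute values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [0, 1, 1, 0, 1, 0]
ids = [101, 102, 103, 104, 105, 106]      # unique per row, like an ID column
segment = ["a", "a", "b", "b", "a", "b"]  # an attribute with shared values

print(information_gain(ids, labels))      # equals entropy(labels): maximal
print(information_gain(segment, labels))  # smaller, despite being meaningful
```

Criteria such as gain ratio penalize splits of this kind by dividing the gain by the entropy of the split itself (Santini, 2015).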
Conclusion
The limitations concerning the usability of information are illustrated by a case involving Amazon's data. This case emphasizes the issues faced when preparing data for information gain and the prediction of customer behaviors (Zdravevski et al., 2020). As the article demonstrates, it is necessary to develop and adopt methods that allow data to be properly categorized and transformed before any knowledge can be extracted from it. Another case, concerning clothing sales, highlights the issues of model modification and its proneness to overfitting, and develops an algorithm free of these problems (Sun et al., 2015). It is notable for showing how a model has to be developed to suit the needs of businesses in properly predicting customer behavior.
Reference List
Bramer, M. (2007) ‘Avoiding overfitting of decision trees’, in Principles of Data Mining, pp. 119-134.
Buscemi, F., Das, S. and Wilde, M.M. (2016) ‘Approximate reversibility in the context of entropy gain, information gain, and complete positivity’, Physical Review A, 93(6), p.062314.
Lutes, J. (2020) Entropy and Information Gain in Decision Trees. Web.
Provost, F. and Fawcett, T. (2013) Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. Sebastopol, CA: O’Reilly Media.
Santini, M. (2015) ‘Lecture 4 decision trees (2): Entropy, information gain, gain ratio’ [PowerPoint presentation]. Web.
Sun, F., Liu, Y., Xurigan, S. and Zhang, Q. (2015) ‘Research of clothing sales prediction and analysis based on ID3 decision tree algorithm’, International Symposium on Computers & Informatics.
Tang, J., Alelyani, S. and Liu, H. (2014) ‘Feature selection for classification: a review’, Data Classification: Algorithms and Applications.
Zdravevski, E., Lameski, P., Apanowicz, C. and Ślęzak, D. (2020) ‘From Big Data to business analytics: the case study of churn prediction’, Applied Soft Computing, 90, p.106164.