Data mining is the process of sorting through data in order to identify patterns and establish relationships. It uses sophisticated data analysis tools to uncover previously unknown patterns and relationships in large data sets. These tools mainly consist of statistical models, mathematical algorithms and machine learning methods. The process is carried out as a sequence of activities that includes data collection, data management, analysis and prediction. Data mining is becoming an increasingly important tool for converting data into information in the many areas where it is applicable. Most industries use data mining to cut costs, to support innovative activities such as research that may improve their businesses, and to boost productivity. For example, firms may use data mining to detect malicious activity such as theft. The process is also used to assess risks in organizations. In most cases data mining can reveal patterns in data, but only in the data samples on which it is performed. The process therefore works poorly, or fails, when the samples used do not reflect the whole population of the original data. Data mining techniques cannot bring out patterns that exist in the wider population if those patterns are absent from the specific sample that is mined. Conversely, the discovery of a pattern in a specific set of data does not necessarily mean that the same pattern exists in the larger data from which that sample was drawn. For this reason, an essential part of the procedure is the verification and validation of patterns on additional samples of data (Seifert 2004, p.1).
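The need to confirm a pattern on additional data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records are invented, the fields (night shift, theft reported) are hypothetical, and the tolerance is an arbitrary choice that would have to be set per application.

```python
def confidence(records, antecedent, consequent):
    """Share of records matching `antecedent` that also match `consequent`."""
    matching = [r for r in records if antecedent(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent(r)) / len(matching)


def is_night_shift(record):
    return record[0]


def theft_reported(record):
    return record[1]


# Hypothetical records: (night_shift, theft_reported)
mining_sample = [(True, True), (True, True), (True, False),
                 (False, False), (False, False)]
validation_sample = [(True, False), (True, False), (True, True),
                     (True, False), (False, False), (False, True)]

# Pattern "night shift -> theft" as seen in the sample that was mined ...
found = confidence(mining_sample, is_night_shift, theft_reported)
# ... and in a supplementary sample that was not used to discover it.
checked = confidence(validation_sample, is_night_shift, theft_reported)

TOLERANCE = 0.15  # assumed acceptable drop between the two samples
confirmed = checked >= found - TOLERANCE

print(f"mining sample: {found:.2f}, validation sample: {checked:.2f}, "
      f"confirmed: {confirmed}")
```

In this invented example the pattern looks strong in the mined sample but does not hold up in the supplementary one, so it is not confirmed.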
The main reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are highly susceptible to collinearity because of unknown interrelations among the variables. An unavoidable fact of data mining is that the subset of data being analyzed may not be representative of the whole domain. As a result, the data may not contain examples of some relevant and essential relationships and characteristics that are present in other parts of the domain. This problem can be mitigated with experiment-based approaches such as Choice Modeling for human-generated data, in which inherent correlations are either controlled for or eliminated altogether during the construction of the experimental design. Data mining generally involves four classes of tasks. It can also be done in two different ways, directed or undirected. In directed mining there is a prior view of what one is trying to find, defined by a target variable, while in undirected mining there is no such prior view (Seifert 2004, p.1).
Application area of data mining
Although data mining is applied in several areas, this report concentrates on its application in the medical community, where it is used to help forecast the effectiveness of a procedure or medication. Medical data mining has great potential for exploring the hidden patterns in the data sets of the medical domain, and these patterns can be used for clinical diagnosis. However, the available raw clinical data are widely distributed and in most cases heterogeneous and voluminous. These data therefore need to be collected in an organized form and integrated to form a hospital information system (Krishan 2006, p.119). In medicine, data mining is commonly used to solve diagnostic problems and to support decisions that follow from a diagnosis. The following case illustrates how data mining can support the decision on whether treatment should be carried out.
Case study 1
A solitary pulmonary nodule (SPN) is a lung abnormality that may be malignant. It is estimated that more than 160,000 people in the United States suffer from lung cancer, and approximately 90% of them die from it. It is therefore essential that solitary pulmonary nodules are detected at an early stage and diagnosed precisely. The clinical diagnosis of SPN using information from non-invasive tests is 40-60% accurate (Kusiak 2000, p.7). These statistics indicate that the majority of patients with an SPN have to undergo biopsy, which carries substantial risks and costs. In a typical SPN scenario, a nodule is detected on a patient's chest radiograph, and because it may be malignant or benign, further testing is needed to establish its exact nature. The diagnosis relies on several features such as the SPN diameter, border character, calcification and the patient's age. As a result, numerous medical disciplines are involved in collecting large volumes of clinical data at various times and locations and with varying precision and consistency. A method that merges information from a variety of sources and intelligently processes large volumes of data is therefore required. In this case study, fifty patients with a known SPN diagnosis were considered. For every patient, 18 feature values (medical examination results) were gathered, and the features were divided into two groups: patient information (such as age) and test results (such as computed tomography). Using these features, two prediction algorithms were applied in the decision making phase. The study reports a prediction accuracy of 100%. This indicates that data mining can support accurate diagnostic decisions without using invasive methods on patients (Kusiak 2000, p.7).
Case study 2
Another case is the use of a data mining technique for the diagnosis of posterior uveal melanoma. In this case, 89 patients with intraocular pathology were investigated in the eye clinic of Kaunas University of Medicine. Intraocular tumors were diagnosed in 46 patients, 8 patients had a metastatic tumor of the eye, and 35 had conditions that were clinically and echoscopically similar to tumor cases. All patients were examined with an ultrasound diagnostic imaging system using the A/B ultrasonic investigation mode. The diagnostic parameters of the tumors were calculated from the A/B ultrasound images and used in the synthesis of a decision tree. In this way a pilot decision tree for the differential diagnosis of intraocular tumors, based on parameters from eye images obtained by A/B ultrasound examination, was created (Darius 2002 p. 1).
In data mining, the greatest chance of success comes from combining experts' knowledge with advanced analysis techniques in which the computer itself identifies the relationships and features in the data. The analytical technique used in this case study was the decision tree. The results of data mining were represented as a decision tree because it is a convenient form: a tree structure made up of decision rules. Decision trees are often used in classification to predict the group to which a specific case belongs. They can also be used for regression to predict a specific value. The decision tree was chosen as the model in this case study because the main purpose of the study was to differentiate the tumors correctly. To build the model, data with diagnostic parameters and known diagnoses were used. The resulting tree was then applied to new data, and its output was a specific diagnosis. New data with a known and confirmed diagnosis were used to rebuild and improve the decision tree model on the basis of the new experience. Once the amount of data used for modeling became sufficiently large, the knowledge discovery system reached its potentially best decision support performance. The decision tree model indicated the importance of the diagnostic parameters, and this importance varied with the data size and the parameters used. The technique was appropriate because it saves time and effort in the acquisition of data. There were also no limits on including potentially informative parameters and evaluating their usefulness for decision making. Although the decision tree was synthesized fully automatically, it was easy to read, which is a further advantage of this approach over neural networks, logistic regression and other methods.
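How a decision tree is synthesized from data with known diagnoses and then read as decision rules can be sketched as follows. This is a minimal illustration and not the method of the cited study: the splitting criterion (Gini impurity) is an assumption, the two parameters `param_a` and `param_b` are hypothetical placeholders for ultrasound-derived values, and the six training cases are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) pair that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(rows, labels, depth=0, max_depth=3):
    """Grow the tree recursively; a leaf is simply the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return (f, t,
            build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    """Follow the decision rules from the root down to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def rules(tree, names, conditions=()):
    """Flatten the tree into readable IF ... THEN ... decision rules."""
    if not isinstance(tree, tuple):
        return ["IF " + " AND ".join(conditions or ("always",)) + " THEN " + tree]
    f, t, left, right = tree
    return (rules(left, names, conditions + (f"{names[f]} <= {t}",))
            + rules(right, names, conditions + (f"{names[f]} > {t}",)))


# Invented training cases: (param_a, param_b) with a known diagnosis.
rows = [(2.1, 30.0), (2.5, 35.0), (3.0, 40.0), (6.2, 70.0), (5.8, 65.0), (7.1, 80.0)]
labels = ["other", "other", "other", "tumor", "tumor", "tumor"]

tree = build(rows, labels)
for rule in rules(tree, ("param_a", "param_b")):
    print(rule)
print(predict(tree, (6.5, 50.0)))
```

Retraining on an enlarged set of confirmed cases, as described above, would simply mean calling `build` again with the extended `rows` and `labels`.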
This model was intended to support the physician's clinical decision as a second opinion. The main errors appeared in the classification of metastatic tumors because of the small amount of data available for that group. The reliability of the decision support increased with every additional learning case. With the knowledge discovery scheme there was no need to intervene in the process or to change the programs, because the physician's only duty was to feed the system with reliable data and diagnoses so that the algorithm could keep learning. A performance issue normally associated with the use of decision trees in medical diagnosis is the occurrence of false positives and false negatives, which cause misdiagnosis. With a false negative an illness that is present is missed, while with a false positive an illness is reported although it is not present (Darius 2002 p. 1).
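False positives and false negatives can be counted directly by comparing predicted diagnoses with confirmed ones. The sketch below uses invented labels, and the summary measures it derives (sensitivity and specificity) are standard definitions rather than figures reported in the cited study.

```python
def confusion_counts(actual, predicted, positive="tumor"):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1  # illness reported although it is absent
        else:
            if a == positive:
                fn += 1  # illness present but missed
            else:
                tn += 1
    return tp, fp, tn, fn


# Invented example: confirmed diagnoses versus the model's output.
actual = ["tumor", "tumor", "tumor", "other", "other", "other", "other", "tumor"]
predicted = ["tumor", "other", "tumor", "other", "tumor", "other", "other", "tumor"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # share of real tumors that were detected
specificity = tn / (tn + fp)  # share of non-tumor cases correctly cleared

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```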
Case study 3
Another example is a heart attack prediction system that uses data mining and an artificial neural network. Prediction here relies on information gathered from past experience and present conditions to anticipate what may happen in the future. Neural networks are among the most widely recognized artificial intelligence learning models, and a great deal has already been written about them. Given a data set, the patterns significant to heart attack prediction were extracted as follows. The heart disease data set was first pre-processed for mining by removing duplicate records and supplying missing values. The cleaned data set that resulted from pre-processing was then clustered by means of the K-means algorithm, with K set to 2. Next, the frequent patterns were mined efficiently from the cluster relevant to heart disease using the MAFIA algorithm. The significant patterns were then selected as those whose significance weightage exceeded a predefined threshold. The values corresponding to each attribute in the significant patterns were as follows: blood pressure above 140/90 mmHg, cholesterol above 240 mg/dl, maximum heart rate above 100 beats/minute, and abnormal and unstable angina. In addition to these significant parameters, some further parameters relevant to heart attack were used, with their weightage and priority levels advised by medical experts. With the help of the resulting prediction system, it was possible to predict the different risk levels of heart attack. The study used the Multi-Layer Perceptron Neural Network (MLPNN) with back propagation as the training algorithm.
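The clustering step in the pipeline above, with K set to 2, can be sketched as follows, assuming the clustering algorithm meant is K-means. The records are invented (systolic blood pressure, cholesterol), the starting centroids are chosen by hand so that the run is reproducible, and real data would normally be rescaled first so that no single attribute dominates the distance.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]


def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and moving centroids until stable."""
    assignment = assign(points, centroids)
    for _ in range(max_iterations):
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        centroids = new_centroids
        new_assignment = assign(points, centroids)
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment


# Invented records: (systolic blood pressure in mmHg, cholesterol in mg/dl)
records = [(120, 180), (125, 190), (118, 175), (150, 250), (160, 260), (155, 245)]

# K = 2: start from the first and last record (an arbitrary, fixed choice).
centroids, assignment = k_means(records, [records[0], records[-1]])
print(assignment)
print(centroids)
```

The cluster whose centroid shows the higher values would then be the one passed on to frequent pattern mining.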
The MLPNN can be described as a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs, while the back propagation algorithm can be used to train such networks efficiently. This data mining technique helped to predict which patients were at high, medium and low risk of heart attack (Patil 2009, p.642).
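A feed-forward network trained with back propagation can be sketched in a few lines. This is an illustration under stated assumptions and not the network of the cited study: it has one hidden layer of three sigmoid units and a single output score instead of three risk levels, the class name `TinyMLP` is hypothetical, and the four training records are invented and already scaled to the range 0 to 1.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """Feed-forward network: inputs -> sigmoid hidden layer -> one sigmoid output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row ends with the bias term of its unit.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h)) + self.output[-1])
        return h, y

    def train_step(self, x, target, lr=0.5):
        h, y = self.forward(x)
        # Backward pass for the squared error 0.5 * (y - target) ** 2.
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j in range(len(h)):
            self.output[j] -= lr * delta_out * h[j]
        self.output[-1] -= lr * delta_out
        for j, row in enumerate(self.hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]


def mean_loss(net, data):
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Invented, normalised inputs: (blood pressure, cholesterol);
# target 1.0 marks a high-risk record and 0.0 a low-risk one.
data = [((0.2, 0.3), 0.0), ((0.3, 0.2), 0.0), ((0.8, 0.9), 1.0), ((0.9, 0.7), 1.0)]

net = TinyMLP(n_inputs=2, n_hidden=3)
loss_before = mean_loss(net, data)
for _ in range(2000):
    for x, t in data:
        net.train_step(x, t)
loss_after = mean_loss(net, data)

risk_high = net.forward((0.85, 0.8))[1]
risk_low = net.forward((0.25, 0.25))[1]
print(f"loss {loss_before:.4f} -> {loss_after:.4f}")
print(f"score for a high-value record: {risk_high:.2f}, low-value record: {risk_low:.2f}")
```

A three-level output (high, medium, low) could be obtained by thresholding the score or by giving the network three output units; either choice would be a further assumption.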
Other data mining techniques
Apart from the data mining techniques discussed in the case studies above, there are others that are equally useful. One example is association rule learning, a popular and well-researched technique for discovering interesting associations among variables in large databases. For example, a company may gather data on customer purchasing habits and, by means of association rule learning, determine which goods are commonly purchased together. The resulting information may be useful for marketing purposes. This task is sometimes called market basket analysis. Association rules are used in many application areas, including web usage mining, intrusion detection, bioinformatics and medicine. The technique involves listing the association rules that pass chosen thresholds, together with their support, confidence and lift, ordered by decreasing support, confidence or lift. One limitation of the standard approach to discovering associations is that, by searching a massive number of possible associations for items that appear to be linked, there is a great risk of finding many spurious associations. These are collections of items that co-occur with unexpected frequency in the data but do so only by chance. For example, if a collection of 10,000 items is considered and one looks for rules containing two items on the left-hand side and one item on the right-hand side, there is an enormous number of such rules. If a statistical test of independence is applied with a significance level of 0.05, there is only a 5% chance of accepting any single rule when there is no association, but across so many candidate rules a very large number of spurious rules would still be expected.
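The scale of the problem in the 10,000-item example can be worked out directly. The short calculation below counts the candidate rules with two items on the left-hand side and one on the right, and then estimates how many would pass a 0.05-level test purely by chance if no real association existed.

```python
from math import comb

n_items = 10_000

# Choose 2 items for the left-hand side, then 1 of the remaining items for the right.
n_candidate_rules = comb(n_items, 2) * (n_items - 2)

alpha = 0.05  # significance level of the independence test

# If no real association existed, about alpha * n_candidate_rules rules
# would still be expected to pass the test by chance.
expected_false_rules = alpha * n_candidate_rules

print(f"candidate rules: {n_candidate_rules:,}")
print(f"expected spurious rules at alpha={alpha}: {expected_false_rules:,.0f}")
```

With roughly 5 x 10^11 candidate rules, a 5% error rate per rule still leaves on the order of 10^10 rules accepted by chance alone.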
Association rules have the following properties: each individual rule is easy to interpret, the technique requires effective pruning, and it can be applied to very large data sets.
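Support, confidence and lift, and the ranking of rules that pass chosen thresholds, can be sketched as follows. The baskets are invented, the thresholds are arbitrary illustrative values, and only rules with a single item on each side are considered in order to keep the example short.

```python
from itertools import permutations


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_consequent = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / n                                 # P(A and C)
    confidence = n_both / n_antecedent if n_antecedent else 0.0  # P(C | A)
    lift = confidence / (n_consequent / n) if n_consequent else 0.0  # P(C | A) / P(C)
    return support, confidence, lift


# Invented shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

MIN_SUPPORT = 0.3     # assumed thresholds, chosen only for illustration
MIN_CONFIDENCE = 0.6

items = sorted(set().union(*baskets))
ranked = []
for left, right in permutations(items, 2):
    support, confidence, lift = rule_metrics(baskets, {left}, {right})
    if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        ranked.append(((left, right), support, confidence, lift))

# List the surviving rules by decreasing lift.
ranked.sort(key=lambda entry: entry[3], reverse=True)
for (left, right), support, confidence, lift in ranked:
    print(f"{left} -> {right}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the items occur together more often than independence would predict, while a lift below 1 suggests the opposite; on a data set this small neither reading would be reliable.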
In conclusion, the application of data mining in the health field provides several benefits. These include identifying patterns in patients' behaviour, which helps medical practitioners to anticipate their clinic visits. Data mining also helps to identify the different therapies suited to various types of illness.
Darius, J. (2002). Application of data mining technique for diagnosis of posterior uveal melanoma. Informatica, 13(4), 464-4455.
Krishan, S. (2006). Effect of data mining techniques on medical diagnostics. Journal of Data Science, 5(2006), 119-126.
Kusiak, A. (2000). Data mining: medical and engineering case studies. Web.
Patil, S. (2009). Data mining & artificial neural network. European Journal of Scientific Research, 31(4), 642-656.
Seifert, J. (2004). CRS report for Congress: data mining. Web.