Predictive Models for Microbiome Data

Table of Contents

Background
Data Classification and Analysis
Importance of Machine Learning in the Analysis of Microbial Data
Methods
Results
Discussions
Works Cited

Background

Human health and disease control are among the oldest and most complicated fields of study in human history. Over the years, scholars have published extensively on the relationship between different microbial communities and their influence on diseases and infections. Microbial communities exist inside and outside the human body and significantly influence overall human health and the prevalence of diseases and infections. The communities outside the human body are found on the skin, nails, and hair and are associated with communicable and non-communicable diseases.

The study of microbial communities is resource-intensive and has historically been challenging. However, with the adoption of computers and artificial intelligence techniques such as machine learning, the processes have been simplified and produce more reliable results. Traditionally, microbial studies relied on blood, urine, stool, and other samples, which were separated manually to detect any microbes in the specimen. These processes were labor-intensive and highly susceptible to error due to human fatigue. This paper presents the application of machine learning algorithms to the study of microbial communities. The experiments are a stepping stone toward the study, interpretation, and understanding of microbial communities found in and on the human body and their role in diseases and infections.

Data Classification and Analysis

Data classification and analysis are among the most in-demand techniques of the modern computing era. The volume of data collected, stored, and processed each day has outgrown traditional data processing tools and techniques. Today, data is stored in various dimensions and formats, calling for sophisticated methods and tools for analysis. Data analysis and processing play a crucial role in interpreting naturally occurring phenomena that would otherwise have been considered meaningless. Over the years, data collection has become so diverse that many existing analysis techniques and tools have become obsolete. Scholars and engineers developed techniques to extract data from different sources and formats and combine them into single, manageable data sets. This process is known as data mining and is commonly used by large corporate organizations to summarize data from different operations, departments, and processes.

Usually, different data collection points use varying collection techniques, storage techniques, data formats, and complicated tools that are not necessarily compatible with one another. As a result, it becomes challenging for data analysts to process data from such varying sources. Data mining is an essential practice in large corporate institutions as it helps them gain meaningful insight into customers, suppliers, and other stakeholders. The results are used to make decisions that affect the company’s performance, future, and success (Ge et al. 20590). Data mining relies heavily on existing tools. For instance, it depends on data warehousing and the computing power of the information systems. The process is also affected by the effectiveness of the data collection methods and tools, as they dictate the type and amount of data collected and stored in the data warehouses.

The development of data processing and analysis tools has accelerated since the advent of the internet. Today, the amount of data to be processed exceeds human capabilities, limiting the efficiency of manual data collection, storage, processing, and presentation. Data analysis and processing tools powered by artificial intelligence have therefore been on the rise. They have helped improve accuracy and efficiency and reduce operational costs in the corporate sector. The different fields of artificial intelligence have been adopted in many areas of human life. Machine learning is one of the youngest yet most widely adopted branches of artificial intelligence. This branch deals with the development of models and computer algorithms that can learn by themselves from existing data sets. In an operating organization, data warehouses contain the data sets needed to train machine learning models suited to its business.

Machine learning models study patterns and relationships in the stakeholder data stored in data warehouses. Based on the results, the models can then make human-like intelligent decisions such as predicting sales, customer needs, the product development process, the success of a particular marketing strategy, or the mutation of a particular disease-causing organism. Machine learning is widely adopted in the military, education, finance, banking, medicine and microbiology, agriculture, and meteorology to help experts analyze large data sets with minimal effort (Zou and Liu 1182). The models and algorithms have improved in performance and accuracy. Today, artificial intelligence has surpassed human performance in specific tasks, including X-ray analysis, gaming, and image processing.

Data mining is always driven by company needs, which also dictate the kind of software used. However, the data mining process remains the same across organizations, and the main goal remains to establish links between different data sets. Machine learning and data warehousing are of great importance to the scientific community as they help extract meaningful insights from very large data sets (Zou and Liu 1182). Also, the data collection tools used in modern research experiments have displaced the labor-intensive tools and techniques used before the widespread adoption of artificial intelligence. Since machine learning models rely on large data sets for training and testing, data collection and sampling techniques have been updated to match the processing power of the computing resources and artificial intelligence models.

Importance of Machine Learning in the Analysis of Microbial Data

Although human intelligence still beats artificial intelligence in general applications, artificial intelligence has surpassed human performance in specific tasks such as gaming and the analysis of X-ray data. This is an encouraging sign of the benefits of artificial intelligence in real-life scenarios. Health care is one of the most skill-demanding disciplines and challenges human accuracy. The adoption of machine learning has shown that models can outperform human experts on some tasks as long as they are trained adequately. Training helps the algorithms adapt and improve their accuracy (Qu et al. 827). Microbes are tiny and remarkably similar from one community to another. Human experts might not be able to establish the differences even with the help of a powerful microscope. However, properly trained machine learning models equipped with high-precision sensors and other data collection tools can establish the differences, similarities, and potential impact of these microbial communities with high levels of accuracy.

Methods

The choice of data processing techniques plays an essential role in determining the results obtained. Data pre-processing entails importing the required libraries, loading the data set, handling missing data, and encoding measures of central tendency. In our case, the study used the random forest algorithm to mine data from multiple sources. It is one of the most influential and widely used machine learning models for data mining, as its performance improves with more training. Random forest is a popular algorithm used in artificial intelligence to establish the relationship between grouped items in a data set. In a nutshell, the algorithm can be used for both classification and regression to model the relationship between data features.
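
The pre-processing steps named above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: it assumes that missing values are filled with the column mean (one measure of central tendency) and that class names are encoded as integers. The toy abundance values and the helper names `impute_missing` and `encode_labels` are hypothetical.

```python
from statistics import mean

def impute_missing(rows):
    """Replace None entries in each column with that column's mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

def encode_labels(labels):
    """Map each distinct class name to an integer code (sorted for stability)."""
    mapping = {name: code for code, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping

# Hypothetical toy data: 4 samples x 3 abundance columns (not from the study).
abundances = [
    [0.10, None, 0.00],
    [0.30, 0.20, 0.05],
    [None, 0.40, 0.01],
    [0.20, 0.00, None],
]
sites = ["palm", "forehead", "palm", "axilla"]

clean = impute_missing(abundances)
codes, mapping = encode_labels(sites)
print(clean)
print(codes, mapping)
```

Mean imputation is only one option; a median or a constant fill would slot into the same place.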

This algorithm is significant because, when splitting a node, it searches for the best feature among a random subset of features rather than the most important feature overall. Before each split, the algorithm picks a random subset of features and chooses the one that produces the best split. Random forest builds on the decision tree, except that it combines many trees and adds randomness to how each one is grown, which makes the ensemble less dependent on the bias of any single tree. Besides, random forest ranks high compared to other machine learning algorithms in measuring the relative importance of every feature in the data set. With the help of feature importance, an analyst can choose which features to drop or keep for the analysis process, which helps focus the analysis on the most useful features. The correctly classified instances were used to measure the model’s accuracy, and the usefulness of the algorithm was assessed using the confusion matrix. The results of the model are presented in the section below.
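
The split step described above can be illustrated as follows. This is a sketch under stated assumptions rather than Weka's implementation: it scores splits with Gini impurity (Weka's trees may use a different criterion), and it sizes the random feature subset as int(log2(n) + 1), which is how the Weka documentation describes the default `-K 0` setting. The toy rows and all function names are hypothetical.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels, candidate_features):
    """Return (feature, threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for f in candidate_features:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best

def random_feature_subset(n_features, rng):
    """Sample int(log2(n_features) + 1) feature indices without replacement."""
    k = min(n_features, int(math.log2(n_features) + 1))
    return rng.sample(range(n_features), k)

# Hypothetical toy data: 4 samples x 4 features; feature 2 separates the classes.
rows = [
    [0.10, 0.50, 0.01, 0.30],
    [0.20, 0.40, 0.02, 0.10],
    [0.15, 0.45, 0.09, 0.20],
    [0.12, 0.50, 0.08, 0.25],
]
labels = ["forehead", "forehead", "plantar foot", "plantar foot"]

rng = random.Random(1)                      # fixed seed so the run is repeatable
subset = random_feature_subset(len(rows[0]), rng)
print("features considered:", subset)
print("best split in subset:", best_split(rows, labels, subset))
```

Because each node sees only a random subset, the strongest feature is sometimes excluded, which is what makes the individual trees differ from one another.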

Results

This section presents the results obtained after training and testing the models. It presents graphical and tabular views of the results as obtained after running the models. Machine learning models are data-dependent, and their accuracy generally depends on the size and quality of the data set used in the training process. The larger the data set, the better the model adapts and teaches itself how to identify and classify input in the future. The model classified samples into twelve body sites: Axilla, Volar Forearm, Plantar Foot, Forehead, Palmar Index Finger, Popliteal Fossa, Labia Minora, Umbilicus, External Nose, Lateral Pinna, Palm, and Glans Penis.

These body sites, which appear as the classes in the output below, are the locations in or on the human body from which the microorganisms were sampled.

Table 1. Model training results (Weka output; the Scheme line reports AdaBoostM1 with a DecisionStump base learner).

=== Run information ===
Scheme: weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 10 -W weka.classifiers.trees.DecisionStump
Relation: HSS_otus-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8
Instances: 401
Attributes: 2151
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
AdaBoostM1: No boosting possible, one classifier used!
Decision Stump
Classifications
X2245 <= 0.0375411033978809 : forehead
X2245 > 0.0375411033978809 : plantar foot
X2245 is missing : plantar foot
Class distributions
X2245 <= 0.0375411033978809
axilla volar forearm plantar foot forehead palmar index finger popliteal fossa labia minora umbilicus external nose lateral pinna palm glans penis
0.07309941520467836 0.17543859649122806 0.049707602339181284 0.1871345029239766 0.08187134502923976 0.06432748538011696 0.017543859649122806 0.03216374269005848 0.04093567251461988 0.07894736842105263 0.18421052631578946 0.014619883040935672
X2245 > 0.0375411033978809
axilla volar forearm plantar foot forehead palmar index finger popliteal fossa labia minora umbilicus external nose lateral pinna palm glans penis
0.03389830508474576 0.0 0.7966101694915254 0.0 0.0 0.0847457627118644 0.0 0.01694915254237288 0.0 0.0 0.01694915254237288 0.05084745762711865
X2245 is missing
axilla volar forearm plantar foot forehead palmar index finger popliteal fossa labia minora umbilicus external nose lateral pinna palm glans penis
0.06733167082294264 0.14962593516209477 0.1596009975062344 0.1596009975062344 0.06982543640897755 0.06733167082294264 0.014962593516209476 0.029925187032418952 0.034912718204488775 0.06733167082294264 0.1596009975062344 0.0199501246882793
Time taken to build model: 0.04 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 108 26.9327 %
Incorrectly Classified Instances 293 73.0673 %
Kappa statistic 0.1306
Mean absolute error 0.1332
Root mean squared error 0.2586
Relative absolute error 90.699 %
Root relative squared error 95.4622 %
Total Number of Instances 401

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.000    0.000    ?          0.000    ?          ?        0.484     0.063     axilla
                 0.000    0.000    ?          0.000    ?          ?        0.564     0.167     volar forearm
                 0.719    0.039    0.780      0.719    0.748      0.703    0.805     0.575     plantar foot
                 0.891    0.742    0.186      0.891    0.307      0.129    0.562     0.177     forehead
                 0.000    0.000    ?          0.000    ?          ?        0.551     0.076     palmar index finger
                 0.000    0.000    ?          0.000    ?          ?        0.459     0.065     popliteal fossa
                 0.000    0.000    ?          0.000    ?          ?        0.401     0.014     labia minora
                 0.000    0.000    ?          0.000    ?          ?        0.489     0.029     umbilicus
                 0.000    0.000    ?          0.000    ?          ?        0.495     0.034     external nose
                 0.000    0.000    ?          0.000    ?          ?        0.545     0.072     lateral pinna
                 0.078    0.089    0.143      0.078    0.101      -0.014   0.558     0.174     palm
                 0.000    0.000    ?          0.000    ?          ?        0.462     0.025     glans penis
Weighted Avg.    0.269    0.139    ?          0.269    ?          ?        0.577     0.194

=== Confusion Matrix ===

Table 2. Random Forest model test results.

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation: HSS_otus-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8-weka.filters.unsupervised.attribute.Remove-R2,3,4,5,6,7,8
Instances: 401
Attributes: 2214
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
RandomForest
Bagging with 100 iterations and base learner
weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities
Time taken to build model: 1.31 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 271 67.581 %
Incorrectly Classified Instances 130 32.419 %
Kappa statistic 0.6312
Mean absolute error 0.1039
Root mean squared error 0.2119
Relative absolute error 70.703 %
Root relative squared error 78.2312 %
Total Number of Instances 401

=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.556    0.032    0.556      0.556    0.556      0.523    0.914     0.703     axilla
                 0.683    0.053    0.695      0.683    0.689      0.635    0.928     0.746     volar forearm
                 0.922    0.021    0.894      0.922    0.908      0.890    0.985     0.942     plantar foot
                 0.766    0.065    0.690      0.766    0.726      0.672    0.917     0.853     forehead
                 0.536    0.083    0.326      0.536    0.405      0.362    0.920     0.367     palmar index finger
                 0.741    0.011    0.833      0.741    0.784      0.771    0.980     0.762     popliteal fossa
                 1.000    0.000    1.000      1.000    1.000      1.000    0.998     0.842     labia minora
                 0.333    0.000    1.000      0.333    0.500      0.572    0.895     0.485     umbilicus
                 0.214    0.008    0.500      0.214    0.300      0.312    0.946     0.496     external nose
                 0.667    0.051    0.486      0.667    0.563      0.533    0.937     0.576     lateral pinna
                 0.578    0.042    0.725      0.578    0.643      0.590    0.888     0.721     palm
                 0.500    0.000    1.000      0.500    0.667      0.704    0.986     0.675     glans penis
Weighted Avg.    0.676    0.041    0.704      0.676    0.677      0.644    0.933     0.734
=== Confusion Matrix ===
  a  b  c  d  e  f  g  h  i  j  k  l   <-- classified as
 15  1  2  1  2  0  0  0  0  5  1  0 |  a = axilla
  1 41  1  3  7  2  0  0  0  2  3  0 |  b = volar forearm
  3  0 59  0  1  1  0  0  0  0  0  0 |  c = plantar foot
  1  3  0 49  2  0  0  0  3  4  2  0 |  d = forehead
  1  2  0  3 15  0  0  0  0  1  6  0 |  e = palmar index finger
  1  4  1  0  0 20  0  0  0  0  1  0 |  f = popliteal fossa
  0  0  0  0  0  0  6  0  0  0  0  0 |  g = labia minora
  2  1  0  2  1  1  0  4  0  1  0  0 |  h = umbilicus
  0  0  0  7  1  0  0  0  3  3  0  0 |  i = external nose
  1  0  0  4  3  0  0  0  0 18  1  0 |  j = lateral pinna
  0  7  3  2 14  0  0  0  0  1 37  0 |  k = palm
  2  0  0  0  0  0  0  0  0  2  0  4 |  l = glans penis
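
The summary figures in Table 2 can be rederived from its confusion matrix. The sketch below assumes rows are the actual body sites and columns the predicted ones, which is the usual Weka convention; the function names are illustrative and not part of Weka. It computes overall accuracy, Cohen's kappa, and per-class precision and recall.

```python
# Confusion matrix copied from Table 2 (rows = actual site, columns = predicted site).
CLASSES = [
    "axilla", "volar forearm", "plantar foot", "forehead",
    "palmar index finger", "popliteal fossa", "labia minora", "umbilicus",
    "external nose", "lateral pinna", "palm", "glans penis",
]
MATRIX = [
    [15, 1, 2, 1, 2, 0, 0, 0, 0, 5, 1, 0],
    [1, 41, 1, 3, 7, 2, 0, 0, 0, 2, 3, 0],
    [3, 0, 59, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 3, 0, 49, 2, 0, 0, 0, 3, 4, 2, 0],
    [1, 2, 0, 3, 15, 0, 0, 0, 0, 1, 6, 0],
    [1, 4, 1, 0, 0, 20, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0],
    [2, 1, 0, 2, 1, 1, 0, 4, 0, 1, 0, 0],
    [0, 0, 0, 7, 1, 0, 0, 0, 3, 3, 0, 0],
    [1, 0, 0, 4, 3, 0, 0, 0, 0, 18, 1, 0],
    [0, 7, 3, 2, 14, 0, 0, 0, 0, 1, 37, 0],
    [2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 4],
]

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (observed - expected) / (1 - expected)

def precision_recall(matrix, i):
    """Precision and recall for class index i (NaN when undefined)."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)   # column sum
    actual = sum(matrix[i])                     # row sum
    precision = tp / predicted if predicted else float("nan")
    recall = tp / actual if actual else float("nan")
    return precision, recall

print(f"accuracy = {accuracy(MATRIX):.5f}")
print(f"kappa    = {cohen_kappa(MATRIX):.4f}")
for i, name in enumerate(CLASSES):
    p, r = precision_recall(MATRIX, i)
    print(f"{name:20s} precision={p:.3f} recall={r:.3f}")
```

The matrix also shows where the errors concentrate: 14 palm samples were classified as palmar index finger, which is the largest single off-diagonal count.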

Discussions

This section discusses the results presented above and compares them with the general behavior of other machine learning model training experiments. The data set contained 401 instances (rows of data). The run in Table 2 used 2214 attributes (the run in Table 1 reports 2151), all of which were used in the analysis. The body parts from which the microbes were sampled serve as the classes in the output above. They included axilla, volar forearm, plantar foot, forehead, palmar index finger, popliteal fossa, labia minora, umbilicus, external nose, lateral pinna, palm, and glans penis. In the first run (Table 1), 108 instances were classified correctly (26.9327%), while 293 were classified incorrectly (73.0673%). In the second run (Table 2), 271 instances were classified correctly (67.581%), while 130 were classified incorrectly (32.419%).
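
The percentages quoted above follow directly from the counts in Tables 1 and 2. The short sketch below recomputes them and adds a majority-class baseline for context. The baseline is an addition for illustration, not part of the original analysis: it takes the largest class size, 64 instances, from the row sums of the Table 2 confusion matrix.

```python
TOTAL = 401                      # instances reported in both tables
CORRECT = {"Table 1": 108, "Table 2": 271}
LARGEST_CLASS = 64               # row sum for plantar foot, forehead, or palm in Table 2

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

for name, correct in CORRECT.items():
    print(f"{name}: {percent(correct, TOTAL):.4f}% correct, "
          f"{percent(TOTAL - correct, TOTAL):.4f}% incorrect")

# Always predicting the single most frequent site would score this much.
print(f"Majority-class baseline: {percent(LARGEST_CLASS, TOTAL):.2f}%")
```

With twelve classes, both runs should be read against this baseline rather than against 50%.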

The trend in the model’s performance is consistent with that of most supervised machine learning models. Conventionally, machine learning models improve as they are given more labeled data to learn from. The ability of the model to classify the microbial samples according to their respective communities is a promising step in the study and classification of disease-causing organisms, diseases, and infections.

Works Cited

Ge, Zhiqiang, et al. “Data Mining and Analytics in the Process Industry: The Role of Machine Learning.” IEEE Access 5 (2017): 20590-20616. Web.

Qu, Kaiyang, et al. “Application of Machine Learning in Microbiology.” Frontiers in Microbiology 10 (2019): 827. Web.

Zou, Quan, and Qi Liu. “Advanced Machine Learning Techniques for Bioinformatics.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 16.4 (2019): 1182-1183. Web.