This paper focuses on the implementation of Indian Liver Patient Dataset classification using the Intel® Distribution for Python* on the Intel® Xeon® Scalable processor. Various preprocessing steps were applied to analyze their effect on this machine learning classification problem. Using the available features, liver patient classification aims to predict whether or not a person has liver disease. Early detection of the disease without manual effort could be of great support to people in the medical field. Good results were obtained by using SMOTE as the preprocessing method and the Random Forest algorithm as the classifier.
The liver, the largest solid organ in the human body, performs several important functions, including manufacturing essential proteins and blood clotting factors and metabolizing fats and carbohydrates.
Excessive consumption of alcohol, viruses, the intake of contaminated food and drugs and so on are the major causes of liver diseases. The symptoms may or may not be visible in the early stages. If not diagnosed in the initial stages, liver diseases can lead to life-threatening conditions.
In India, according to the latest data published by WHO in May 2014, liver disease accounts for 2.44% of total deaths. Around 10 lakh (1 million) people are diagnosed with liver disease every year in India. With an increasing percentage of the population falling prey to this medical condition, it is imperative to look at ways to detect these disorders in their early stages. Early detection not only saves lives but, with the advent of technology, also makes medical treatment faster and more accurate. Additionally, the growing number of cases contributes to building the databases needed for applying such technology.
Classification is an effective technique for handling such problems in the medical field. Using the available feature values, the classifier can predict whether or not a person has liver disease, helping doctors identify the disease in advance. It is always recommended to reduce the Type I error (rejecting the null hypothesis as false when it is actually true), as a false diagnosis could lead to fatal conditions.
In this experiment, various preprocessing methods were tried prior to model building and training for comparison. Computational libraries like scikit-learn*, numpy, and scipy* from the Intel Distribution for Python on the Intel Xeon Scalable processor were used for predictive model creation.
Table 1 describes the environment setup that was used to conduct the experiment.
Table 1. Environment setup.
| Component | Details |
| --- | --- |
| Processor | Intel® Xeon® Gold 6128 processor, 3.40 GHz |
| Core(s) per socket | 6 |
| Anaconda* with Intel channel | 4.3.21 |
| Intel® Distribution for Python* | 3.6.3 |
The Indian Liver Patient dataset was collected from the northeast area of the Andhra Pradesh state in India. This is a binary classification problem, with the classes labeled as liver patient (represented as 1 in the dataset) and non-liver patient (represented as 2). There are 10 features, which are listed in table 2.
Table 2. Dataset description.
| Attribute Name | Attribute Description |
| --- | --- |
| V1 | Age of the patient. Any patient whose age exceeded 89 is listed as being age 90. |
| V2 | Gender of the patient |
| V3 | Total bilirubin |
| V4 | Direct bilirubin |
| V5 | Alkphos (alkaline phosphatase) |
| V6 | Sgpt (alanine aminotransferase) |
| V7 | Sgot (aspartate aminotransferase) |
| V8 | Total proteins |
| V9 | Albumin |
| V10 | A/G ratio (albumin and globulin ratio) |
| Class | Liver patient or not |
The methodology depicted in the following figure was adopted for conducting the liver patient dataset classification experiment.
Figure 1. Methodology.
Before performing any processing on the available data, a data analysis is recommended. This includes visualizing the data and identifying outliers and skewed predictors. These tasks help to inspect the data and thereby spot missing values and irrelevant information in the dataset. A data cleanup process is performed to handle these issues and to ensure data quality. Gaining a better understanding of the dataset helps to identify useful information and supports decision making.
The Indian Liver Patient dataset consists of 583 records, of which 416 are records of people with liver disease and the remaining are records of people without liver disease. The dataset has 10 features, of which only one is categorical (V2, the gender of the patient). The last column of the dataset represents the class to which each sample belongs (liver patient or not): a value of 1 indicates the person has liver disease, and a 2 indicates the person does not. There are no missing values in the dataset.
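The inspection described above can be sketched with pandas. The column names V1…V10 and Class follow table 2; the rows below are synthetic stand-ins, since the path to the real CSV is not given in the article.

```python
import pandas as pd

# Column names V1..V10 and Class follow table 2; the rows are synthetic
# stand-ins for the real dataset, whose file path is not given here.
cols = ["V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "Class"]
rows = [
    [65, "Female", 0.7, 0.1, 187, 16, 18, 6.8, 3.3, 0.90, 1],
    [62, "Male", 10.9, 5.5, 699, 64, 100, 7.5, 3.2, 0.74, 1],
    [20, "Male", 1.1, 0.5, 128, 20, 30, 3.9, 1.9, 0.95, 2],
]
df = pd.DataFrame(rows, columns=cols)

# Inspect class balance and missing values before any preprocessing
class_counts = df["Class"].value_counts().to_dict()
missing = int(df.isnull().sum().sum())
print(class_counts)  # {1: 2, 2: 1}
print(missing)       # 0
```

On the real data the same two lines report the 416:167 class split and confirm the absence of missing values.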
Figure 2. Visualization: liver patient dataset class.
Figure 3. Visualization: male and female population.
Figure 2 shows a visualization of the number of patients with liver disease and patients with no liver disease, whereas figure 3 represents the male and female population in the dataset. Histograms of the numerical variables are shown in figure 4.
Figure 4. Visualization of numerical variables in the dataset.
Some datasets contain irrelevant information, noise, missing values, and so on. These datasets should be handled properly to get a better result for the data mining process. Data preprocessing includes data cleaning, preparation, transformation, and dimensionality reduction, which convert the raw data into a form that is suitable for further processing.
The major objective of the experiment is to show the effect of various preprocessing methods on the dataset prior to classification. Different classification algorithms were applied to compare the results.
Some of the preprocessing methods used include:
- Normalization: This process scales each feature into a given range. The `preprocessing.MinMaxScaler()` function in the sklearn package is used to perform this action.
- Assigning quantile ranges: The `pandas.qcut` function is used for quantile-based discretization. Based on the sample quantiles or rank, the variables are discretized and assigned categorical values.
- Oversampling: This technique handles an unbalanced dataset by generating new samples in the under-represented class. SMOTE is used for oversampling the data; it creates synthetic samples between existing minority-class samples, and several variants exist that differ in which samples are selected. The `imblearn.over_sampling` module is used to implement this.
- Undersampling: Another technique to deal with unbalanced data is undersampling, which reduces the number of samples in the targeted (over-represented) class. ClusterCentroids is used for undersampling; it applies the K-means algorithm to reduce the number of samples. The `ClusterCentroids()` function from the `imblearn.under_sampling` package is used.
- Binary encoding: This method converts the categorical data into a numerical form. It is used when the feature column has a binary value. In the liver patient dataset, column V2 (gender) has the values male/female, which is binary encoded into “0” and “1”.
- One hot encoding: Categorical features are mapped onto a set of columns that have values “1” or “0” to represent the presence or absence of that feature. Here, after assigning the quantile ranges to some features (V1, V3, V5, V6, V7), one hot encoding is applied to represent the same in the form of 1s and 0s.
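The scaling, discretization, and encoding steps above can be sketched as follows. The data is a small synthetic stand-in using three column names from table 2 (V1, V2, V5), not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for three ILPD columns (names as in table 2)
df = pd.DataFrame({
    "V1": [65, 62, 20, 45, 33, 72],                              # age
    "V2": ["Female", "Male", "Male", "Female", "Male", "Male"],  # gender
    "V5": [187, 699, 128, 310, 210, 450],                        # alkaline phosphatase
})

# Binary encoding: map the two gender values onto 0/1
df["V2"] = (df["V2"] == "Male").astype(int)

# Normalization: scale each numeric feature into the [0, 1] range
df[["V1", "V5"]] = MinMaxScaler().fit_transform(df[["V1", "V5"]])

# Quantile-based discretization of V5 into 3 ranked bins,
# followed by one hot encoding of the resulting categories
v5_bins = pd.qcut(df["V5"], q=3, labels=False)
onehot = pd.get_dummies(v5_bins, prefix="V5_q")
print(onehot.shape)  # (6, 3): one indicator column per quantile bin
```

Each quantile bin becomes its own 0/1 indicator column, which is the combination of quantile assignment and one hot encoding described above.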
Feature selection is mainly applied to large datasets to reduce high dimensionality. It helps identify the most important features in the dataset to be given for model building. In the Indian Liver Patient dataset, an ensemble of randomized trees is applied in order to visualize feature importance. The `ExtraTreesClassifier()` function from the `sklearn.ensemble` package is used for the calculation. Figure 5 shows the feature importance computed with forests of trees. From the figure, it is clear that the most important feature is V5 (alkphos alkaline phosphatase) and the least important is V2 (gender).
Removing the least significant features helps to reduce the processing time without significantly affecting the accuracy of the model. Here, V2 (gender of the patient), V8 (total proteins), V9 (albumin), and V10 (A/G ratio, albumin and globulin ratio) are dropped in order to reduce the number of features for model building.
Figure 5. Feature importance with forests of trees.
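A minimal sketch of ranking features with `ExtraTreesClassifier` follows. The data is synthetic rather than the real dataset, with the label constructed to depend only on column 4 so that it should rank as the most important feature.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the 10 ILPD features; the label is constructed
# to depend only on column 4, so that column should rank highest.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 4] > 0.5).astype(int)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # sums to 1 across features
ranking = np.argsort(importances)[::-1]     # indices, most important first
print(ranking[0])                           # column 4, by construction
```

The same `feature_importances_` attribute, plotted as a bar chart, produces figures like figure 5, and the lowest-ranked indices are the candidates for dropping.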
A list of classifiers was used to create various classification models, which can then be used for prediction. Part of the dataset was used for training the models and the rest for testing. In this experiment, 90 percent of the data was used for training and 10 percent for testing. Since `StratifiedShuffleSplit` (a function in scikit-learn) was applied to split the train and test data, the percentage of samples from each class was preserved; that is, 90 percent of the samples from each class was taken for training and the remaining 10 percent from each class was used for testing. Classifiers from the scikit-learn package were used for model building.
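The stratified 90/10 split can be sketched as below; the class sizes (416 and 167) come from the dataset description, while the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 416 liver patients (class 1) and 167 non-patients (class 2), as in the dataset
y = np.array([1] * 416 + [2] * 167)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# The 10% test set keeps roughly the same 416:167 class ratio as the full data
test_counts = np.bincount(y[test_idx])[1:]
print(len(test_idx), test_counts)
```

Unlike a plain random split, the stratified split guarantees that the minority class is proportionally represented in the small test set.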
The label of a new input can be predicted using the trained model. The accuracy and F1 score were analyzed to understand how well the model has learned during training.
Evaluation of the Model
Several methods can be used to evaluate the performance of a model. Cross validation, the confusion matrix, accuracy, precision, recall, and so on are some of the popular performance evaluation measures.
The performance of a model cannot be assessed by considering only the accuracy, because accuracy alone can be misleading. Therefore, this experiment considers the F1 score along with the accuracy for evaluation.
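Why accuracy alone misleads can be shown with a hypothetical degenerate model that predicts "liver patient" for every sample: its accuracy looks tolerable on imbalanced labels, but the F1 score for the non-patient class collapses to zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 1 = liver patient, 2 = non-patient.
# The "model" degenerately predicts class 1 for every sample.
y_true = [1, 1, 1, 1, 2, 2, 2, 1, 2, 1]
y_pred = [1] * len(y_true)

acc = accuracy_score(y_true, y_pred)
f1_patient = f1_score(y_true, y_pred, pos_label=1)
f1_healthy = f1_score(y_true, y_pred, pos_label=2)

print(acc)         # 0.6  - accuracy alone looks tolerable...
print(f1_patient)  # 0.75
print(f1_healthy)  # 0.0  - ...but the model never identifies a non-patient
```

This is exactly the failure mode the per-class F1 scores in the tables below are meant to expose.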
Observation and Results
In order to find out the effect of feature selection on the liver patient dataset, accuracy and F1 score were analyzed with and without feature selection (see table 3).
After analyzing the results, it was inferred that removing the least significant features produced no remarkable change, except in the case of the Random Forest Classifier. Because feature selection helps to reduce the processing time, it was applied before the further processing techniques.
Table 3. Performance with and without feature selection (accuracy in percent; F1 scores are reported separately for the patient and non-patient classes).
| Classifier | Accuracy (without) | F1 patient (without) | F1 non-patient (without) | Accuracy (with) | F1 patient (with) | F1 non-patient (with) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 71.1186 | 0.81 | 0.37 | 74.5762 | 0.84 | 0.44 |
| Ada Boost Classifier | 74.5762 | 0.83 | 0.52 | 72.8813 | 0.82 | 0.43 |
| Decision Tree Classifier | 66.1016 | 0.76 | 0.41 | 67.7966 | 0.77 | 0.49 |
| Multinomial Naïve Bayes | 47.4576 | 0.47 | 0.47 | 49.1525 | 0.5 | 0.48 |
| Gaussian Naïve Bayes | 62.7118 | 0.65 | 0.61 | 61.0169 | 0.62 | 0.6 |
After feature selection, some preprocessing techniques were applied, including normalization: each feature was scaled and translated on the training set so that it lies in a given range. Another preprocessing step assigned quantile ranges to some of the feature values, followed by one hot encoding to represent each resulting column in terms of 1s and 0s. The classification results after performing normalization and quantile assignment are given in table 4. After analysis, it was clear that this preprocessing did not improve the performance of the model, although one hot encoding of the columns did make model building and prediction faster.
Table 4. Performance with normalization and quantile ranges.
| Classifier | Accuracy (normalization) | F1 patient (normalization) | F1 non-patient (normalization) | Accuracy (quantile ranges) | F1 patient (quantile ranges) | F1 non-patient (quantile ranges) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 72.8813 | 0.82 | 0.43 | 71.1864 | 0.82 | 0.32 |
| Ada Boost Classifier | 72.8813 | 0.82 | 0.43 | 76.2711 | 0.85 | 0.36 |
| Decision Tree Classifier | 67.7966 | 0.77 | 0.49 | 74.5762 | 0.84 | 0.35 |
| Multinomial Naïve Bayes | 71.1864 | 0.83 | 0 | 67.7966 | 0.75 | 0.56 |
| Gaussian Naïve Bayes | 57.6271 | 0.58 | 0.58 | 37.2881 | 0.21 | 0.48 |
Another inference is that the F1 score for non-patients is zero in some cases, which is a major problem: the accuracy may be high, yet the model is unreliable because the classifier assigns the entire data to one class. The major reason for this is likely data imbalance. To address the issue, undersampling and oversampling techniques were introduced: cluster centroids for undersampling and the SMOTE algorithm for oversampling. The results are shown in table 5.
Table 5. Performance with undersampling (cluster centroids) and oversampling (SMOTE).
| Classifier | Accuracy (cluster centroids) | F1 patient (cluster centroids) | F1 non-patient (cluster centroids) | Accuracy (SMOTE) | F1 patient (SMOTE) | F1 non-patient (SMOTE) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 67.7966 | 0.73 | 0.6 | 86.4406 | 0.91 | 0.75 |
| Ada Boost Classifier | 66.1016 | 0.71 | 0.58 | 74.5762 | 0.81 | 0.63 |
| Decision Tree Classifier | 57.6271 | 0.65 | 0.47 | 72.8813 | 0.79 | 0.6 |
| Multinomial Naïve Bayes | 45.7627 | 0.41 | 0.5 | 49.1525 | 0.5 | 0.48 |
| Gaussian Naïve Bayes | 59.3220 | 0.6 | 0.59 | 62.7118 | 0.65 | 0.61 |
Table 5 shows that undersampling and oversampling could handle the data imbalance problem. Using cluster centroids as the undersampling technique did not improve the accuracy, whereas SMOTE gave a substantial improvement. The best accuracy was obtained with the Random Forest Classifier and the Ada Boost Classifier. Processing was further sped up by running the machine learning workload on the Intel® Xeon® Scalable processor, making use of the computational libraries from the Intel Distribution for Python.
Figure 6. ROC of Random Forest 5-fold cross validation.
Figure 7. ROC curve for various classifiers.
Figure 6 shows the ROC curve of the best classifier (Random Forest Classifier) under 5-fold cross validation. Higher accuracy was obtained during cross validation because the validation samples were drawn from the training set, which had been oversampled with SMOTE. The accuracy expected from cross-validation was not attained during testing because the test data was isolated from the training data before SMOTE was applied.
The ROC curves for the various classifiers are given in figure 7; they can be used to compare the output quality of the different classifiers.
Normalization and quantile assignment did not improve the accuracy of the model, but handling the data imbalance using SMOTE gave better accuracy for the Random Forest and Ada Boost classifiers. A good model was created using the computational libraries from the Intel Distribution for Python on the Intel Xeon Scalable processor.
- Intel® Optimized Packages for the Intel® Distribution for Python*
- Intel® Distribution for Python
Aswathy C is a Technical Solutions Engineer working on behalf of the Intel® AI Developer Program.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.