HEPATITIS DISEASES PREDICTION USING MACHINE-LEARNING TECHNIQUES

Research that contributes to the early diagnosis and management of lethal diseases is critical to society, and hepatitis is one such disease. Hepatitis is a life-threatening condition that develops when the liver becomes inflamed and injured. The primary goal of this article is therefore to analyze the hepatitis dataset and to predict patient outcomes accurately and dependably. Six machine learning classification methods (Support Vector Machines, Gaussian Naive Bayes, Logistic Regression, Decision Tree, K Nearest Neighbors, and Multilayer Perceptron) were tested on the hepatitis dataset, and a confusion matrix was plotted for each classification model. The models were compared on accuracy, precision, and recall. For each model, prediction error was assessed using the root mean square error and the mean absolute error. The selected algorithms, particularly the Multilayer Perceptron (87%) and Logistic Regression (87%), showed high accuracy rates. Furthermore, with a minimal root mean square error of 0.35 and minimal mean absolute errors of 0.12 and 0.13, these two algorithms are the most dependable of all the methods.


INTRODUCTION
Hepatitis is a potentially fatal disease that occurs when the liver becomes inflamed and injured. It is a viral disease that has resulted in a high death rate worldwide (Nilashi, 2019). Hepatitis is transmitted by sewage pollution or direct contact with contaminated bodily fluids (Al-Thaqafy et al., 2013). Viruses, bacteria, medicines, or drugs can also cause this condition (Trishna et al., 2019). Tattoos and piercings, drug abuse, sexual contact with an infected person, hemodialysis, and blood transfusions are further routes by which the disease can be transmitted (Metwally et al., 2018). Hepatitis may be acute or chronic (Metwally et al., 2018). Acute hepatitis causes intense and painful symptoms at the onset of the disease, but it lasts only a month or two (Trishna et al., 2019). Consequently, there is only minor liver cell disruption and no effect on immune system function. Chronic hepatitis is a form of hepatitis that lasts more than six months and leads to cirrhosis, a condition in which the parenchymal cells of the liver are damaged (Metwally et al., 2018). Hepatitis A, B, C, D, and E are five distinct forms of hepatitis (Ahmad et al., 2019). Hepatitis A and E are acute, while Hepatitis B, C, and D are chronic. Despite continuing studies into a treatment for hepatitis C, there is currently no available vaccine for the disease (Bhargav & Kumari, 2018). Early detection, together with proper diagnosis and treatment, can cure the disease (Yarasuri et al., 2019). Health workers are most at risk of hepatitis (Polat & Günes, 2006), because the disease is diagnosed mostly through routine blood tests, which expose medical personnel to the associated risks during diagnosis. Medical diagnosis of hepatitis is difficult, since a specialist must weigh several aspects before reaching a diagnosis (Nilashi, 2019).
This situation calls for automated and reliable diagnostic systems that can aid in the identification of hepatitis and support physician decision-making. Machine learning is a valuable technique that clinicians can use for this purpose. Machine Learning (ML) is a technique for teaching a system to learn by finding patterns and associations in captured data using various algorithms (AtifKhan et al., 2012). ML therefore allows an illness to be predicted and diagnosed, taking into account two essential factors: the collection of parameters and the method used to analyze these parameters. This study compares six ML algorithms that are useful for diagnosing hepatitis: Support Vector Machines (SVM), Gaussian Naive Bayes, Logistic Regression, Decision Tree, K Nearest Neighbors (KNN), and Multilayer Perceptron (MLP). The main objective of this paper is to analyze the hepatitis dataset and correctly predict the outcome using the six ML methods. The study contributes in the following areas: a) improving the classification accuracy and reliability of hepatitis prediction; b) comparing six ML classification algorithms on the hepatitis data set; c) determining the most effective ML algorithm for predicting hepatitis.

Data collection
The hepatitis dataset was retrieved from the University of California, Irvine (UCI) Repository. The database contains 155 samples and 20 attributes, including the class label attribute. The specifics can be found in Table 1 below. To diagnose and identify hepatitis, six machine learning algorithms were trained and tested on this dataset, and they were compared on accuracy, precision, and recall. The critical processes involved in this study are loading the data, attributing and preprocessing the data, classifying the data, implementing the ML methodology, and predicting the disease. Figure 2 depicts a model method for hepatitis diagnosis, with the phases of the procedure explained in the sections below:

Loading Data
The data came from the UCI repository and comprise 155 instances with 20 different attributes. Because machine learning models learn from samples, they require a sufficiently large amount of data to produce reliable results. Data imputation was therefore performed on the available dataset to obtain a satisfactory amount of data.

Attributing and Preprocessing of Data
Missing data were handled by imputation: each missing value was replaced with a global constant, so that adequate data remained for training, validation, and testing. In this hepatitis data set, 75 of the 155 instances have missing values. If missing data in any field are not properly treated, prediction errors can result and the quality of the model's performance degrades.
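The constant-substitution step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the "?" missing-value marker, the fill value "-1", and the function names are assumptions introduced for the example, and the three rows are made-up records.

```python
# Assumption: missing entries are marked with "?" in the raw records.
MISSING = "?"


def count_rows_with_missing(rows):
    """Count the records that contain at least one missing entry."""
    return sum(1 for row in rows if MISSING in row)


def impute_constant(rows, fill_value):
    """Return a copy of rows with every missing entry replaced by fill_value."""
    return [
        [fill_value if cell == MISSING else cell for cell in row]
        for row in rows
    ]


# Made-up records for illustration only (not taken from the hepatitis dataset).
rows = [
    ["30", "2", "1.0", "?"],
    ["50", "?", "0.9", "4.0"],
    ["78", "1", "0.7", "3.5"],
]

# Hypothetical global constant; the paper does not state which value was used.
filled = impute_constant(rows, "-1")
print(count_rows_with_missing(rows), count_rows_with_missing(filled))
```

The original list is left untouched, so the number of affected records can still be reported after imputation.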

Classifying the Data
The data for this analysis were divided using stratified splitting. Before modeling, the data were partitioned into training and testing sets by 10-fold cross-validation: for each fold, the model is trained on 90% of the data and evaluated against the remaining 10%. The scores collected for the folds are then averaged and used as a single score. Stratification keeps the class proportions similar in every fold, which avoids the bias of training the model primarily on negative or positive data.
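A stratified 10-fold split of the kind described above can be sketched in plain Python. The round-robin assignment, the function names, and the class counts in the example are assumptions made for illustration; the paper does not say which implementation was used.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k=10, seed=0):
    """Return k lists of sample indices with each class spread evenly over the folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin, continuing where the
        # previous class stopped so that fold sizes stay balanced.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def train_test_indices(folds, test_fold):
    """Use one fold for testing and the remaining folds for training."""
    test = list(folds[test_fold])
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    return train, test


def mean_score(scores):
    """Average the per-fold scores into a single score."""
    return sum(scores) / len(scores)


# Illustrative labels: 155 samples with an assumed class imbalance.
labels = [1] * 123 + [0] * 32
folds = stratified_k_folds(labels, k=10)
train, test = train_test_indices(folds, test_fold=0)
print(len(train), len(test))
```

Each fold receives roughly one tenth of every class, so no training set is dominated by a single class more than the full dataset already is.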
Using ML Tools to Diagnose the Disease
Training, prediction, and testing are the three basic steps of the machine learning implementation. During the training phase, the classifier algorithm creates the model from the training dataset. The trained model is then used to predict hepatitis. The testing data set was used to validate the performance of the prediction by determining its accuracy, precision, and recall. The techniques used in this analysis were the SVM, Gaussian Naive Bayes, Logistic Regression, Decision Tree, KNN, and MLP classifiers.
SVM is a widely used and practical method for dealing with data classification, interpretation, and prediction issues (Saangyong et al., 2009). SVM maps the input variables to an n-dimensional feature space. For labeled training data, it generates a hyperplane that divides the feature space by class, which helps prevent overfitting (Xiao & Leedham, 2002).
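To make the role of the separating hyperplane concrete, the sketch below classifies points by the sign of w·x + b and reports the smallest distance of any point to the hyperplane. The weights, bias, and points are hand-picked assumptions for illustration; no SVM training is performed here.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from the labeled points to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) for x, y in zip(points, labels)) / norm


# Hypothetical hyperplane x1 + x2 = 3 and four toy points.
w, b = (1.0, 1.0), -3.0
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
labels = [-1, -1, 1, 1]

predictions = [classify(w, b, x) for x in points]
margin = geometric_margin(w, b, points, labels)
print(predictions, margin)
```

A trained SVM chooses w and b so that this margin is as large as possible, which is the property usually credited with limiting overfitting.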
KNN is one of the most fundamental classification algorithms. It classifies a sample according to the classes of its k nearest neighbors in the training data, and it is a common machine learning algorithm because of this ability to select neighbors. A value of k that is too low or too high will not give the right results, so an optimal k value must be selected for the algorithm.
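The effect of k can be seen in a small sketch of majority voting among the k nearest neighbors. The data points and the use of Euclidean distance are assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    # Sort (distance, label) pairs so that the nearest neighbors come first.
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy data: four samples of class 0 and two samples of class 1.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.2), (5.0, 5.0), (5.2, 4.9)]
train_labels = [0, 0, 0, 0, 1, 1]
query = (5.1, 5.0)

for k in (1, 3, 6):
    print(k, knn_predict(train_points, train_labels, query, k=k))
```

With k = 1 or k = 3 the query is assigned to class 1, its actual neighborhood; with k = 6 every training point votes and the larger class 0 wins, which is the failure mode of an overly large k.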
Gaussian Naive Bayes (GNB) assumes that the presence of one feature in a class has no effect on the presence of any other feature. The term "naive" reflects this assumption, which reduces the computation to a simple multiplication of probabilities. The primary advantage of GNB is its speed, as it is a simple algorithm in comparison to other classification algorithms. Due to this simplicity, GNB can efficiently process datasets with a large number of dimensions.
MLP is a form of feed-forward artificial neural network that maps sets of input data to a set of suitable outputs. An MLP is made up of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a processing unit with a nonlinear activation function. Backpropagation, a supervised learning technique, was employed to train the network on the classification dataset. MLP is a modification of the standard linear perceptron and can classify data that are not linearly separable.
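The claim that an MLP can handle data that are not linearly separable can be illustrated with the XOR problem. The sketch below shows only the forward pass of a network with one hidden layer; the weights are set by hand for the example (an assumption), whereas in practice they would be learned by backpropagation.

```python
def relu(value):
    """Nonlinear activation applied in the hidden layer."""
    return max(0.0, value)


def identity(value):
    """Linear activation used for the output node."""
    return value


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


# Hand-picked weights (hypothetical, not trained) that make the network compute XOR.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [[1.0, -2.0]]
OUTPUT_BIASES = [0.0]


def mlp_predict(x):
    """Forward pass: input -> hidden layer (ReLU) -> output score -> class."""
    hidden = dense(x, HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    (score,) = dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)
    return 1 if score > 0.5 else 0


xor_outputs = [mlp_predict(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```

No single straight line separates the XOR classes, so a linear perceptron cannot reproduce these outputs; the hidden layer with its nonlinear activation is what makes the separation possible.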
Logistic Regression is a statistical method for evaluating a data set in which the outcome is determined by one or more independent variables. The aim of logistic regression is to determine the optimal model that describes the relationship between a collection of predictor variables and an observed dichotomous outcome.
Decision tree algorithms are among the most frequently used classification algorithms (Karthikeyan & Thangaraju, 2013; Twa et al., 2005). A decision tree is a straightforward modeling technique that employs a tree structure to construct classification or regression models. It builds the tree incrementally as the data set is subdivided into smaller and smaller subsets. The result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, a leaf node represents a classification or decision, and the topmost decision node in a tree is referred to as the root node, which represents the best predictor (Soofi & Awan, 2017).
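As a rough illustration of how logistic regression relates a predictor to a dichotomous outcome, the following sketch fits a one-feature model by batch gradient descent. The data, learning rate, and number of epochs are assumptions chosen for the example and are not taken from the study.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and actual label per sample.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the outcome is 1 for predictor value x."""
    return sigmoid(w * x + b)


# Toy data: small predictor values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(predict_proba(w, b, 1.0), predict_proba(w, b, 3.5))
```

A sample is assigned to the positive class when the estimated probability exceeds a threshold, commonly 0.5.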

Classification performance measures
The following metrics were used to evaluate the classifiers mentioned above.
The Mean Absolute Error (MAE) is a measure of the difference between two continuous variables, here the actual and the predicted values:
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]
On a plot of predicted against actual values, the MAE is the average vertical distance between each point and the line of perfect agreement; it is equally the average horizontal distance between each point and that line.
e) Root Mean Square Error (RMSE) is defined as the square root of the average squared distance between the actual score and the predicted score, as shown in Equation 5:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \]
The true score for the i-th data point is denoted by $y_i$, the predicted value by $\hat{y}_i$, and the number of data points by $n$.
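A minimal sketch of the two error measures follows, using made-up 0/1 labels. The function names and data are assumptions for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up labels: ten samples with exactly one wrong prediction.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
print(mae, rmse)
```

When both the labels and the predictions are 0 or 1, every error has magnitude 1, so the MAE equals the fraction of misclassified samples and the RMSE is the square root of that fraction.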
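f) Area Under Curve (AUC) is the likelihood that the classifier would score a randomly chosen positive example higher than a randomly chosen negative example. The AUC is based on a plot of the true positive rate (TPR) against the false positive rate (FPR) and ranges between 0 and 1. The two rates are defined as shown in Equations 6 and 7:
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \]
Here TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives.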
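The probabilistic reading of the AUC can be checked directly by comparing every positive example with every negative example. The scores, labels, threshold, and function names below are assumptions for illustration; ties are counted as one half, a common convention.

```python
def auc_by_ranking(scores, labels):
    """Fraction of positive/negative pairs in which the positive scores higher."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def rates_at_threshold(scores, labels, threshold):
    """True positive rate and false positive rate at one decision threshold."""
    tp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label == 1)
    fn = sum(1 for s, label in zip(scores, labels) if s < threshold and label == 1)
    fp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label == 0)
    tn = sum(1 for s, label in zip(scores, labels) if s < threshold and label == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Made-up classifier scores and true labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]

auc = auc_by_ranking(scores, labels)
tpr, fpr = rates_at_threshold(scores, labels, threshold=0.5)
print(auc, tpr, fpr)
```

Sweeping the threshold from high to low traces out the ROC curve point by point, and the area under that curve equals the pairwise fraction computed by `auc_by_ranking`.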

RESULTS AND DISCUSSION
The classification techniques were implemented in Python. The dataset includes a number of health-related attributes as well as the class label, which corresponds to a patient's hepatitis status. The data were separated into two categories: training data and validation data. Using the training data, we trained the six models: SVM, Gaussian Naive Bayes, Logistic Regression, Decision Tree, KNN, and MLP. The models were tested using the validation data, and a confusion matrix was plotted for each of the models. Table 2 shows the confusion matrices of the SVM, Gaussian Naive Bayes, Logistic Regression, Decision Tree, KNN, and MLP models on the hepatitis dataset, respectively.
The classifier's accuracy in making correct predictions is measured using the confusion matrix. The counts in the confusion matrix represent the numbers of accurate and inaccurate classifier predictions. The upper row of the confusion matrix lists the predicted positive events, including the true positives, while the lower row lists the predicted negative events, including the true negatives. The diagonal elements denote the number of samples whose predicted class equals the actual class. The misclassified, or wrongly predicted, samples belong to the off-diagonal elements.
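The way accuracy, precision, and recall follow from the confusion-matrix counts can be sketched as below. The labels are made up for illustration and the function names are assumptions; the formulas are the standard definitions of the three measures.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct (the diagonal of the matrix)."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up actual and predicted labels for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
print(accuracy(tp, fp, tn, fn), precision(tp, fp), recall(tp, fn))
```

The diagonal counts (TP and TN) drive accuracy, while the off-diagonal counts (FP and FN) lower precision and recall, respectively.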

From the matrix, the true positives, true negatives, false positives, and false negatives, along with the true positive rate and false positive rate, were used to calculate the recall, precision, accuracy, and AUC by implementing the specified modules. The recall, precision, and accuracy of the various classification algorithms on the hepatitis dataset are displayed in the charts in Figures 2 and 3, together with the ROC graph. The following conclusions may be drawn from the findings: the MLP and Logistic Regression algorithms have the highest accuracy, 87 percent, followed by the Decision Tree algorithm with an accuracy of 85 percent. KNN comes next, with an accuracy of 82 percent, followed by the Gaussian Naive Bayes algorithm, with an accuracy of 72 percent. The ROC curve reveals that the AUC of the MLP model beat all other models on the validation data set, with substantially higher and steadier performance.
To further verify the accuracy of the models, the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) were determined for each model. Figure 4 and Figure 5 show the MAE and RMSE comparisons for all the models. The MAE states the average difference between the actual data value and the value predicted by the model. The lower the MAE and RMSE for a given model, the more closely the model predicts the actual values. The lowest MAE, 0.13, was achieved by MLP and Logistic Regression, along with an RMSE of 0.35. The RMSE and MAE were thus also used to validate the algorithms' predictability.

CONCLUSION
In this paper, selected Machine Learning (ML) algorithms were evaluated using classification performance measures. Accuracy, precision, and recall were used to assess how well each model determines, from the various independent attributes, whether or not an individual has hepatitis. According to the results of this analysis, the selected algorithms achieved good accuracy, especially MLP (87%), Logistic Regression (87%), Decision Tree (85%), and KNN (82%). These algorithms can be applied to determine whether or not hepatitis is present in a person. MLP is the most dependable, with a Mean Absolute Error of 0.13 and a minimum Root Mean Square Error of 0.35.
In the future, the data set used to build the model will be enlarged, which should result in more unique rules and better accuracy. Different weighting techniques are suggested to enhance the accuracy. Other classification methods can also be employed to extend the research further.