Enhancing Cardiovascular Disease Prediction Using Hard Voting Technique in Machine Learning

Cardiovascular disease (CVD), or heart disease, is one of the leading causes of death around the globe. Early identification of the disease can save lives, but identifying heart-related diseases is challenging because it depends on a wide range of factors. Machine Learning algorithms have strong potential in prediction-related domains. In this paper, we use an ensemble model, the Hard Voting Ensemble Model, to detect heart disease. A dataset containing 13 features is taken from the UCI repository via Kaggle. Seven different algorithms are trained and tested, their accuracy is measured, and the models with the best accuracy are ensembled together. The ensemble model achieved higher accuracy than every individual model.


Introduction
Cardiovascular diseases (CVDs), the leading cause of death worldwide, claim 17.9 million lives annually. CVD refers to a range of diseases that affect the heart, such as coronary artery disease, heart failure, and arrhythmias. The most prevalent kind of heart illness is coronary artery disease, also known as atherosclerosis, which occurs when fatty substances narrow or block the arteries that carry blood to the heart. Heart failure occurs when the heart is unable to pump blood effectively. Arrhythmias are abnormal heartbeats. More than four out of every five CVD fatalities are attributable to heart attacks and strokes. Risk factors for heart disease include a poor diet, a lack of exercise, cigarette use, and alcohol abuse. Identifying heart disease is extremely difficult because many factors have to be taken into consideration to predict the result, which makes such a prediction a challenging task for humans. In recent times, Machine Learning has emerged in many fields, especially in making predictions, and it is well suited to this kind of scenario. In our solution, a Machine Learning technique called the Hard Voting Ensemble method is used to predict CVD with high accuracy.

Volume 5, Issue 3, May-June 2023

Literature Survey
Machine Learning has already made a great impact in this field. Various algorithms have been trained and tested, and several have achieved high accuracy. TABLE 1 shows the algorithms and the accuracy achieved in various works.

Existing System
The performance of any algorithm depends on the dataset fed into it. A noisy dataset reduces the accuracy of the algorithm, and so do features that contribute very little to the target class. It is therefore crucial to select the features through a thorough examination before feeding them into the model. The authors of [1] used the UCI dataset consisting of 13 features and 1 target class, which indicates whether the instance had heart disease. All 13 features were used for training and testing the model. Supervised learning was carried out with four algorithms, namely Support Vector, Decision Tree, Linear Regression, and K-Nearest Neighbor. The highest accuracy was 87%, produced by the KNN algorithm. Considering the above facts, the accuracy of the algorithms can be improved by selecting the optimum features and eliminating the others. Ensembling might also have yielded even higher accuracy.

Proposed System
The proposed system uses the Hard Voting Ensemble technique to predict CVD effectively. The dataset is collected from the UCI repository and has been verified by several researchers and authorities of UCI. It consists of 13 features. Preprocessing is done to remove noisy, empty, and null values from the dataset. Feature selection is carried out using the filter method with the help of the correlation matrix. The final dataset, consisting of 12 features (with the FBS feature removed), is used to build the model instead of all the features. The dataset is split in an 8:2 ratio for training and testing the model. Supervised learning is carried out with seven algorithms, namely Naive Bayes, Logistic Regression, Random Forest, Extreme Gradient Boosting, K-Neighbors, Decision Tree, and SVM; each is trained and tested, and its accuracy is recorded. Among those models, Random Forest, Decision Tree, and SVM gave promising accuracies of 93.65%, 91.95%, and 92.68% respectively. Hard Voting Ensembling is applied to those three models and its accuracy is tested. The ensembled model produced an accuracy of 95.12%, which is higher than the accuracy of every individual model.

Proposed System Modules

Data Preprocessing:
The dataset of 13 features, originally from the UCI repository, is downloaded in CSV format from the Kaggle website. TABLE 2 shows the features of the dataset and their descriptions; the target column indicates whether the person has CVD or not. The CSV file containing the dataset is imported and cleaned. Empty spaces and duplicate entries are eliminated, and null values are substituted with the column's mean. Data preprocessing is a crucial step because removing noisy data keeps the model from learning wrong patterns.
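The cleaning step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name clean_rows and the toy rows are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import mean


def clean_rows(rows):
    """Drop exact duplicate rows, then replace None with the column mean."""
    # Remove duplicates, keeping the first occurrence and the original order.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))

    # Compute each column's mean over the values that are present.
    # (Assumes every column has at least one non-missing value.)
    n_cols = len(unique[0])
    col_means = []
    for j in range(n_cols):
        present = [row[j] for row in unique if row[j] is not None]
        col_means.append(mean(present))

    # Substitute each missing entry with its column's mean.
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in unique
    ]


# Hypothetical toy rows in the order (age, chol, target); values are made up.
raw = [
    [63, 233, 1],
    [63, 233, 1],   # exact duplicate of the first row
    [45, None, 0],  # missing cholesterol value
    [54, 201, 0],
]
cleaned = clean_rows(raw)
```

Here the duplicate row is dropped and the missing value is filled with the mean of the remaining values in that column.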

Feature Selection:
Feature selection is a process that helps identify the optimum set of features essential for prediction [6]. The filter method is used in our solution to select the best features. Using a correlation matrix and some trial and error, we found that one of the 13 features, FBS, contributed significantly less to the target class than the others. That feature is removed from the dataset. All other features are kept as they were because they had a good correlation with the target class. The final dataset consisting of 12 features is fed into the model. Figure 1 shows the correlation matrix, which indicates how the features are related to the target class.
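A filter-style selection of this kind can be sketched as below. This is an illustrative sketch only: the paper does not state a numeric cut-off, so the threshold of 0.3 is an assumption, the helper names pearson and select_features are hypothetical, and the column values are made-up toy data chosen so that the second column is uncorrelated with the target.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (std_x * std_y)


def select_features(columns, target, threshold):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Made-up toy data; the column names are used only for illustration.
columns = {
    "thalach": [150, 160, 170, 120, 110, 100],
    "fbs": [0, 1, 0, 0, 1, 0],
}
target = [1, 1, 1, 0, 0, 0]

# The 0.3 threshold is an assumed value, not one reported in the paper.
kept, scores = select_features(columns, target, threshold=0.3)
```

In this toy case only the first column is kept, since the second has essentially zero correlation with the target.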

Selecting the Models:
Seven supervised learning algorithms, namely Naive Bayes, Logistic Regression, Random Forest, Extreme Gradient Boosting, K-Neighbors, Decision Tree, and SVM, are trained using the training dataset. They are then evaluated on the test dataset to calculate their accuracy.
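The training and test datasets come from the 8:2 split of the cleaned data. A minimal sketch of such a split is given below; the function name split_8_2 and the seed value are assumptions made for illustration, and the rows are placeholders rather than real records.

```python
import random


def split_8_2(rows, seed=42):
    """Shuffle the rows reproducibly and split them 80% / 20%."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    cut = len(shuffled) * 8 // 10          # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows standing in for dataset records.
rows = list(range(10))
train, test = split_8_2(rows)
```

Every model is then fitted on the training part only and scored on the held-out test part, so that the accuracies of the seven algorithms are comparable.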

Accuracy Calculation:
Accuracy is the fraction of test instances classified correctly, (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives. The confusion matrix obtained by testing a model on the test dataset supplies these values. The accuracy scores of the models are compared to find the best-performing ones. Random Forest, Decision Tree, and Support Vector Machine have higher accuracy than the others, at 93.65%, 91.95%, and 92.68% respectively.

In the Hard Voting Ensemble method [7], classification is based on the models' majority vote. The same input is fed into all the models, and each model votes for a target class, i.e. makes a prediction. The target class with the majority vote is the final output. Hard Voting Ensembling was tried on various combinations of models. Of those, ensembling the Random Forest, Decision Tree, and Support Vector Machine models produced the highest accuracy of 95.12%. That model is finalized.
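The two ideas in this section, accuracy computed from confusion-matrix counts and majority (hard) voting, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the prediction lists are made-up toy data constructed so that the three models err on different instances. They are not outputs of the Random Forest, Decision Tree, or SVM models reported above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return (tp + tn) / (tp + tn + fp + fn)


def hard_vote(predictions):
    """Return the majority label per instance across several models.

    `predictions` is a list with one prediction list per model. With an odd
    number of models and binary labels, a tie cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Made-up toy labels and predictions; each model is wrong on two instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 1, 1, 0, 0, 0, 1]
model_b = [1, 0, 0, 1, 0, 1, 1, 0]
model_c = [0, 1, 1, 1, 0, 0, 1, 0]

ensemble = hard_vote([model_a, model_b, model_c])
individual = [accuracy(y_true, m) for m in (model_a, model_b, model_c)]
combined = accuracy(y_true, ensemble)
```

Because no two toy models are wrong on the same instance, the majority vote corrects every individual error. Real models usually share some errors, so the gain from voting is typically smaller.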

Conclusion
In this paper, the Hard Voting Ensemble method is used to predict heart disease. The accuracy of the model was greatly increased by careful feature selection. Of all the combinations tried, the ensemble of the SVM, Decision Tree, and Random Forest models produced the best accuracy. the accuracy of the ensembled