The Effectiveness of Machine Learning Systems' Accuracy in Predicting Heart Stroke Using Socio-Demographic and Risk Factors - A Comparative Analysis of Various Models

Background: Accurate diagnosis and prognosis allow cardiologists to classify patients' cardiovascular diseases correctly and to administer the most appropriate care. Because machine learning can identify patterns in data, its applications in the medical sector have grown. Classifying the incidence of cardiovascular illness with machine learning can help diagnosticians avoid mistakes. To lower the fatality rate caused by cardiovascular disorders, our research developed a model that can accurately forecast these conditions. Methods: We applied four well-known machine learning classification algorithms: K-nearest neighbour, logistic regression, artificial neural network, and decision tree. Results: The proposed models were evaluated using performance metrics. Logistic regression achieved the highest AUC (0.955; 95% CI 0.872-0.965), followed by the artificial neural network (AUC 0.864; 95% CI 0.826-0.912). Conclusion: Machine learning may be used to predict individuals' risk of having a cardiac event and to identify those who are most at risk. Predictive models may be developed via machine learning to pinpoint those who have a high chance of suffering a heart attack.


INTRODUCTION
Cardiovascular disorders (CVD), which affect the heart and blood vessels, are becoming a global burden, accounting for more than 70% of mortality. 1 According to the WHO, CVD claims up to 17.9 million lives per year. CVD comprises an array of disorders, including angina, stroke, heart failure, carditis, heart attack, rheumatic heart diseases, venous thrombosis, peripheral artery disease, and numerous other conditions. 2 CVD-associated risk factors include hypertension, smoking, hyperlipidemia, obesity, stress, a poor diet, and a family history of CVD. 3 Electrocardiograms (ECG), echocardiograms, magnetocardiography, and magnetic resonance imaging are all used to diagnose CVD. Although various diagnostic options exist, each has limitations. The ECG cannot offer a conclusive diagnosis of congestive heart failure. One of the most significant drawbacks of echocardiography is that it does not reveal blockages in the coronary arteries. Although magnetocardiography produces high-quality signals, it is a time-consuming procedure. 4 Furthermore, these diagnostic options are sometimes excessively expensive and impracticable for patients in middle- and low-income countries. 5 A stroke is a medical emergency caused by an interruption in blood flow to the brain that results in cell death and loss of brain function. Strokes are classified as either ischemic or hemorrhagic. Both forms of stroke can cause considerable brain damage and result in a variety of physical and cognitive symptoms. 6 Stroke symptoms often appear suddenly, within seconds to minutes, and do not progress further in the majority of instances. Dysarthria, aphasia, ptosis, and altered taste and smell are some of the indications. The risk factors for heart stroke include heart disease, high RBC count, high cholesterol, high blood pressure, diabetes, unhealthy diet, and secondhand smoking.
Delayed medical presentation and non-adherence to medication are some of the primary issues that must be resolved. 7 Over time, an assortment of clinical procedures has been designed to assist in determining the existence of stroke. Whilst these procedures can help with the first triage of acute neurological patients, they cannot match both the specificity and sensitivity of an imaging evaluation, nor is there a clinical test that can distinguish between ischemic and hemorrhagic stroke. 8 Heart stroke can be diagnosed using a CT scan, MRI, and electrocardiography. During a CT scan, a patient is exposed to ionizing radiation, which can cause long-term harm and potentially raise the risk of cancer. According to the WHO, the worldwide number of CVD deaths is expected to rise to 23.6 million by 2030, with cardiovascular disease and stroke being the main culprits. 9 Machine learning (ML) algorithms can detect early warning signals of heart disease and stroke, allowing for earlier prevention and therapy. 10 These algorithms can scan massive amounts of data and uncover patterns that specialists may overlook, potentially extending lives while improving outcomes. This may lead to more precise diagnoses and treatment recommendations. Moreover, ML algorithms can assess patient data and recommend individualized treatment strategies based on criteria such as age, gender, medical history, and lifestyle. 11

METHODOLOGY
The data set used for this research was obtained from the Kaggle website (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) and is openly available. The data comprise 5110 observations with 11 characteristics. 12 Kaggle, a data-sharing website, provides authentic and reliable secondary data for data scientists and researchers for research purposes. The data include 10 features, which are described in Table 1. The outcome of the research is heart stroke, a binary variable coded yes (1) or no (0). k-fold cross-validation was used when fitting the machine learning models.
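The k-fold procedure mentioned above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code. The function name `k_fold_indices`, the choice k = 10, and the random seed are assumptions, since the text does not state the value of k.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 5110 observations as in the data set; k = 10 is an assumed value.
folds = list(k_fold_indices(n_samples=5110, k=10))
```

Each observation appears in exactly one test fold, so every record is used for validation once and for training in the remaining k - 1 rounds.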

Proposed models
In supervised machine learning, a classification problem is one where the objective is to predict a categorical label or class variable from a given collection of features or input variables. The class variable is referred to as the dependent variable or the target variable, whilst the input variables may be referred to as predictors or independent variables. The objective of a classification problem is to find a model or algorithm that can accurately predict the class variable for new, unseen data points. Usually, a labelled dataset with known class labels for each data point is used to train the model. The model uses this labelled data to discover patterns and connections between the predictors and the class variable. Decision trees, random forests, support vector machines (SVM), logistic regression, k-nearest neighbours (KNN), and neural networks are just a few of the machine learning techniques that may be employed for classification problems. The choice of algorithm depends on the particular problem and data, as well as the required level of accuracy, interpretability, and computing efficiency.
The dataset was divided into a training set (80%) and a testing set (20%). All computations and machine learning algorithms were implemented in R version 4.3.0. The training dataset is used to train a model, and the testing dataset is used to assess the model's performance. The effectiveness of four classifiers (K-nearest neighbour, logistic regression, artificial neural network, and decision tree) was assessed using the dataset. The performance of each classifier was then assessed using its recall, precision, accuracy, and F-measure.
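The four evaluation measures named above can all be derived from the counts of a binary confusion matrix. The sketch below is a minimal Python illustration under stated assumptions: the function names and the toy label vectors are hypothetical and are not taken from the study's data or R code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F-measure (F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical labels: 1 = stroke, 0 = no stroke.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With a rare outcome such as stroke, accuracy alone can look high even when few positive cases are detected, which is one reason to report precision and recall alongside it.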

K Nearest Neighbour Classification
K-nearest neighbour (KNN) is a supervised machine learning approach for regression and classification analysis. It is a non-parametric method; hence it makes no assumptions about the distribution of the data. Instead, it memorizes the full training dataset and uses it to make predictions for new, unseen data points. The fundamental concept underpinning KNN classification is to identify the K data points in the training set that are closest to the new data point and to classify the new point based on the majority class of those K neighbours. 13,14 Although different distance metrics can be used, Euclidean distance is typically used to compute the distance between data points. Both binary and multiclass classification problems can be solved with KNN classification. Although it is a straightforward and understandable approach, it may become computationally expensive as the size of the dataset increases. KNN classification has the benefit of being a lazy learning algorithm, which eliminates the need for training time. Instead, predictions for new data points are made using the full training dataset. This implies, however, that prediction times can be lengthy, particularly for sizable datasets. A few KNN classification hyperparameters, including the value of K, the distance metric employed, and the procedure for identifying the majority class, can be tuned to enhance performance. KNN classification is a flexible and reliable technique that may be applied to a wide variety of classification problems. 15 However, to attain the best performance, it is crucial to carefully choose the value of K and the distance metric.
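The idea described above (find the K closest training points by Euclidean distance, then take a majority vote) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the value K = 3, and the toy two-feature data are assumptions and do not come from the study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # No training step: the stored data are scanned at prediction time.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], query))
    k_labels = [label for _, label in neighbours[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical toy data with two well-separated classes.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
           (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_X, train_y, (1.2, 1.1), k=3)
pred_high = knn_predict(train_X, train_y, (6.8, 7.0), k=3)
```

Because every prediction scans the whole training set, the cost grows with the number of stored observations, which is the prediction-time expense noted above. In practice, features on different scales would usually be standardized before distances are computed.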

Logistic Regression
Machine learning employs the statistical approach of logistic regression to solve classification problems. 16,17 This kind of supervised learning approach is employed when the target variable is categorical, meaning that it can only take a finite number of values. The aim of logistic regression is to find a relationship between the input features and the probability that the target variable will take a particular value. This is accomplished by estimating the parameters of a logistic function, which converts the input features into the probability of the target variable. The logistic regression algorithm iteratively adjusts the parameter values to minimize a cost function, such as the cross-entropy loss, which measures the gap between the predicted probabilities and the actual target values. Once the parameters have been estimated, the model can be used to make predictions on new data by computing the probability that the target variable will take a particular value given the input features.
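The iterative fitting described above can be illustrated with batch gradient descent on the mean cross-entropy loss. The sketch below is a minimal Python example under stated assumptions: the learning rate, number of epochs, function names, and the one-feature toy data are all hypothetical, and R's built-in fitting routines use a different optimizer.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """Probability that the target equals 1 for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cross_entropy(X, y, w, b, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, w, b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Estimate weights and bias by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of loss w.r.t. z
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical one-feature data: larger values belong to class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
initial_loss = cross_entropy(X, y, [0.0], 0.0)
w, b = fit_logistic(X, y)
final_loss = cross_entropy(X, y, w, b)
```

The loss starts at log 2 (all predicted probabilities equal 0.5) and falls as the parameters are adjusted; `predict_proba` then returns the estimated probability for any new observation.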

Neural Network
Artificial neural network (ANN) algorithms are frequently employed for classification problems. 18,19 ANNs are made up of interconnected nodes (sometimes referred to as neurons) organized in layers and are modelled on the structure and operation of the human brain. An ANN's objective in a classification problem is to learn how to map the input features to the output class labels. 20 Each data point in the labeled dataset used to train the network has a label for a particular target class. During training, the network adjusts the weights and biases of the neurons in each layer to minimize a loss function, which measures the discrepancy between predicted and actual class labels. An input layer, one or more hidden layers, and an output layer are the common components of an ANN's architecture for classification. 21,22 The input layer receives the input features, which are then passed to the output layer via the hidden layers. Based on the input features, the output layer generates the predicted class labels. Feedforward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) are some of the ANN types that can be employed for classification. The most basic kind of ANN, called a feedforward neural network, is made up of several layers of interconnected neurons. CNNs use filters to extract features from input images and are designed for image recognition tasks. RNNs can account for the temporal dependencies between input features and are used for sequential data. In general, ANNs are an effective tool for classification problems because they can handle enormous quantities of data and learn complex non-linear relationships between the input variables and the output class labels.
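The flow of information through a feedforward network (input layer, hidden layer, output layer) can be sketched as a forward pass in Python. This is an illustration only: the layer sizes, the sigmoid activation, the function names, and the weight values are arbitrary assumptions, and no training step is shown.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the weights of neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Pass the input features through each layer in turn."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

# Hypothetical network: 3 input features -> 2 hidden neurons -> 1 output.
hidden = ([[0.2, -0.4, 0.1],
           [-0.3, 0.5, 0.7]], [0.0, 0.1])
output = ([[0.6, -0.8]], [0.05])

features = [0.5, 1.0, -1.5]
probability = forward(features, [hidden, output])[0]
predicted_class = 1 if probability >= 0.5 else 0
```

Training would adjust the numbers in `hidden` and `output` to reduce a loss function, as described above; here they are fixed, so the example only shows how an input is turned into a class prediction.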

Decision tree
A decision tree is a machine learning approach employed for classification and regression problems. 23,24 It works by recursively splitting the data according to the values of the input features, creating a tree-like model that can be used for prediction. 25 Decision trees are widely used because they are simple to understand and can capture intricate non-linear relationships between the input data and the target values. They can handle both categorical and numerical input variables. However, decision trees may be sensitive to the choice of splitting criterion and may overfit the training data if the tree is not sufficiently regularized. 26 Ensemble approaches such as random forests and gradient boosting can improve the performance of decision trees and reduce their variance.
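A single step of the recursive splitting described above can be sketched as a search for the threshold on one numerical feature that minimizes the weighted Gini impurity. The Gini criterion is an assumption here, since the text does not name the splitting criterion used, and the function names and toy age and outcome values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split on one feature."""
    best_threshold, best_score = None, gini(labels)
    unique = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical data: age in years and stroke outcome (1 = yes, 0 = no).
ages = [25, 32, 47, 51, 62, 70, 75, 81]
stroke = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, impurity = best_split(ages, stroke)
```

A full tree repeats this search over all features within each resulting subset until a stopping rule is met; limiting depth or minimum node size is one way to regularize the tree against overfitting.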

DISCUSSION
ML techniques are showing efficacy in many different kinds of healthcare applications, most notably in cardiovascular care. Researchers have the chance to design and test new algorithms to identify risk factors and early indicators of heart ailments, which are still among the top causes of death in developing countries. These approaches provide promising prospects for the early detection and mitigation of heart disease. 27 Alotalibi's (2019) study sought to evaluate the use of ML methods for forecasting heart failure.
To construct prediction models, the researchers used a dataset from the Cleveland Clinic Foundation and applied several ML methods, such as support vector machine (SVM), logistic regression, naive Bayes, random forest, and decision trees. During model building, a 10-fold cross-validation strategy was applied. The decision tree method had the highest accuracy (93.19%), followed by the SVM algorithm (92.30%). The study highlights the decision tree algorithm as a feasible choice for subsequent studies and underlines the potential of ML approaches as a useful tool for forecasting cardiac disease. 32

CONCLUSION
Machine learning can forecast a person's likelihood of experiencing a cardiac episode and identify the individuals who are most susceptible to it. Predictive models created using machine learning can identify those who have a high risk of having a heart attack. These models can evaluate several parameters, such as age, gender, family history, medical history, lifestyle, and other risk factors, to predict the chance of suffering a heart attack. Based on the results of the prediction model, patients may receive tailored recommendations for reducing their likelihood of having a cardiac event. A balanced diet, frequent exercise, maintaining a healthy body weight, and quitting smoking are just a few examples of lifestyle changes that may be advised. Machine learning also makes it possible to monitor people's health status and notify them of any change that would increase their risk of suffering a heart attack. For more advanced use of these methodologies, health data may be collected and analyzed by wearable technology, smartphone applications, and other technologies.