Wart Classification Using Logistic Regression: Analysis Based on Data Partitioning, Error Rate Comparison and Feature Selection

Accurate classification of cancerous warts is pivotal for effective medical intervention, and logistic regression serves as a promising tool for this purpose. This study delves into the realm of wart classification using logistic regression, with a specific focus on three key aspects: data partitioning, error rate comparison, and feature selection. Logistic regression demonstrates commendable accuracy during training, but an observed disparity between training and testing accuracy prompts a critical examination of potential overfitting. Data partitioning unveils mixed results, enhancing overall testing accuracy while diminishing performance on partitioned datasets, emphasizing the importance of meticulous dataset splitting. Furthermore, the impact of feature selection on the model's performance is explored, underscoring the need for a detailed analysis of influential features. The study concludes by proposing future work, including addressing overfitting through regularization, investigating feature importance, exploring alternative classification algorithms, optimizing accuracy through ensemble methods, and expanding the dataset for enhanced generalization. This research contributes to the advancement of wart classification methodologies, providing insights into logistic regression's application and paving the way for refined diagnostic tools in dermatological practice.


Introduction
Wart classification is a crucial aspect of dermatological diagnostics, and accurate differentiation between benign and malignant warts is essential for effective medical intervention [20]. With advancements in medical imaging technology, the analysis of wart features, including the number, type, and area, has become instrumental in enhancing diagnostic precision. This research delves into the realm of wart classification using logistic regression, a statistical modeling technique renowned for its effectiveness in binary outcome predictions. Warts, caused by the human papillomavirus (HPV), manifest in various forms, and their potential malignancy necessitates a meticulous classification process [2]. Traditional diagnostic methods often rely on visual inspection alone, lacking the quantitative depth required for nuanced decision-making. In response to this gap, the present study employs logistic regression to analyze critical wart features, specifically the numerical count, the type of wart, and its spatial area, extracted from high-resolution medical images [23].
The significance of each feature in distinguishing between benign and malignant warts is a key focus of this investigation. The choice of logistic regression is motivated by its ability to model the probability of a binary outcome, making it well-suited for this classification task. The inclusion of numerical features, such as the count and area, alongside categorical information about wart types, aims to provide a comprehensive representation of the wart characteristics relevant to diagnosis [3]. Moreover, the study incorporates advanced feature selection and preprocessing techniques to refine the logistic regression model's predictive capabilities. By optimizing the model's performance based on these crucial wart features, the research seeks to develop a robust and interpretable tool for accurate wart classification. In summary, this exploration into wart classification using logistic regression capitalizes on the quantitative insights derived from features like number, type, and area. The methodology aims to contribute to the advancement of dermatological diagnostics, offering a refined approach for distinguishing between benign and malignant warts and potentially improving patient outcomes through informed and timely medical interventions.

Background
Warts, caused by various strains of the human papillomavirus (HPV), are common dermatological conditions that manifest in diverse forms. While many warts are benign and self-limiting, some present a potential risk of malignancy, necessitating accurate classification for appropriate medical management. Traditional diagnostic approaches, often reliant on visual inspection alone, may lack the quantitative precision required to discern between benign and malignant wart types. With the advent of advanced medical imaging technologies, there is an opportunity to leverage quantitative features such as number, type, and area to enhance the accuracy of wart classification [34]. The classification of warts is a challenging task due to the variability in their appearance and the potential overlap of features between benign and malignant cases [4]. Researchers and clinicians alike recognize the need for more sophisticated tools that integrate quantitative data to aid in the precise identification of malignant warts. In this context, logistic regression emerges as a powerful statistical technique capable of modeling the probability of a binary outcome, which matches the binary nature of wart classification (benign or malignant).
The choice of features for this study (number, type, and area) derives from their clinical relevance and potential to offer a comprehensive representation of wart characteristics. The number of warts may signify patterns of viral activity, the type of wart provides insights into its histological characteristics, and the area offers a quantitative measure of its spatial extent. By integrating these features, the research seeks to develop a robust classification model that not only distinguishes between benign and malignant warts but also provides valuable insights into their characteristics for clinical decision-making. Furthermore, the inclusion of logistic regression in wart classification studies has gained traction in recent years, showcasing its applicability in medical diagnostics. Its capacity for probabilistic modeling, simplicity, and interpretability makes it an attractive choice, especially when dealing with binary outcomes such as wart classification.

Dataset Description
A total of 180 patient records with common warts or plantar warts treated with immunotherapy or cryotherapy were gathered between September 2013 and February 2015 from the dermatology department of Ghaem Hospital in Mashhad. These two wart types and two treatment modalities were chosen because they are the most common. Ninety of the 180 records were gathered after patients received immunotherapy. Eight features were present in these records: induration diameter of the initial test, age, gender, number of warts, categories of warts, surface area of the warts, time passed before treatment, and response to therapy. An additional 90 records were gathered from patients treated with cryotherapy. Seven features were present in these records: gender, age, number of warts, categories of warts, surface area of the warts, time passed before treatment, and response to therapy.
1. The purpose of this study was to assess how well data mining and logistic regression techniques could be used to evaluate the progress of post-immunotherapy treatment. The effectiveness of immunotherapy treatment for warts was assessed using classification analysis with Weka and bilateral logistic regression analysis with WTA. The discussion of decision tree construction also includes identifying the factors that influence the effectiveness of categorization.
2. In order to assist patients in selecting between immunotherapy and cryotherapy, this work develops a classification and regression tree (CART) model based on particle swarm optimization. The patients' reactions to the two techniques can be reliably predicted by the suggested model. A more condensed and precise model is produced by optimizing the model's parameters using an enhanced particle swarm optimization (PSO) algorithm instead of the conventional pruning method.
3. In order to create precise prediction models that may be used to analyze how patients with common or plantar warts respond to cryotherapy and/or immunotherapy, this study uses the classification and regression tree (CART) algorithm. The patient's age and gender, the number and kind of warts, their surface area, and the amount of time that has passed before treatment are among the independent characteristics used to create a CART classifier for the cryotherapy technique.
4. This work offers medical experts a way to read guidelines more easily and accurately. The success rate of treatment approaches can now be predicted in percentage terms. As a result, a decision tree-based approach was employed in this study to identify the guidelines for forecasting the efficacy of wart treatment strategies. The results show that, depending on the treatment approach, the success rate ranged from 90 to 95%; these rates are greater than previously reported.
5. The objective of this research is to create a dependable machine learning model that can precisely forecast a patient's prognosis for immunotherapy and cryotherapy based on their clinical and demographic attributes. We used a dataset of 180 patients who had different kinds of warts and were treated with immunotherapy or cryotherapy to train a support vector machine (SVM) classifier.

IJFMR230611374, Volume 5, Issue 6, November-December 2023. Email: editor@ijfmr.com

Methodology
The flow diagram in Figure 1 outlines the process used to apply machine learning to the detection of malignant tumors within the cryotherapy data. The process begins with loading and exploring the data to understand its structure. Next, the most valuable features are selected, with a decision tree algorithm guiding the selection. The data is then divided into two distinct sets: one for training the model and the other for testing it.
In the training phase, a logistic regression algorithm builds a model intended to detect potential errors within the medical data. Once the model has been trained, it enters the testing phase, where it predicts outcomes for the designated testing dataset. The model's predictions are compared against the actual results to evaluate its accuracy. The diagram concludes with the overarching purpose of the process: the early identification of malignant tumors, which may allow timely interventions and improved patient outcomes.

Data Splitting
We begin by loading the data into RStudio and partitioning the dataset in order to select high-value features.

Figure 2 Dataset decision Tree
This decision tree selects features from our Cryotherapy dataset and uses them to recursively split the data into subsets based on the feature values. The prediction at each leaf node is based on the majority class or average value in that node. For our assignment, we have taken two leaf nodes from the above decision tree and applied various models to the two partitions. In the first partition, time>=8.3 and age>=20. After applying the filter to our data file, we get 45% of the samples in the first partition (from the leaf node) where time is less than 8.3.
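The partitioning step can be sketched in Python as follows. This is only an illustration under stated assumptions: the study itself works in RStudio, the data here is synthetic (the real Cryotherapy records are not reproduced), the label rule is made up, and the variable names are hypothetical. Only the thresholds time 8.3 and age 20 come from the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the 90 cryotherapy records (values are made up).
rng = np.random.default_rng(0)
n = 90
time = rng.uniform(0.25, 12.0, n)   # time elapsed before treatment
age = rng.integers(15, 68, n)       # patient age in years
X = np.column_stack([time, age])
y = ((time < 8.3) | (age < 20)).astype(int)  # hypothetical label rule

# A shallow tree recursively splits the data on feature thresholds.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["time", "age"]))

# Filter the records into partitions using the thresholds quoted in the text.
mask_short_time = time < 8.3
mask_long_time_adult = (time >= 8.3) & (age >= 20)
partition_short_time = X[mask_short_time]
partition_long_time_adult = X[mask_long_time_adult]
print(f"share with time < 8.3: {mask_short_time.mean():.0%}")
```

Boolean masks keep the two partitions disjoint by construction, so a record can never be counted in both leaf nodes.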

Figure 3 Partitioned Dataset 1 and 2 (time>=8.3 and age>=20)
Now that we have our partitioned datasets, we identify which model is best suited for the analysis of our data. For this purpose, we apply several different models and compare the error rate obtained from each of them. The model with the lowest error rate is taken as the best suited for the partitioned dataset.
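A comparison of this kind might look like the sketch below. It is not the study's own procedure: the data is synthetic, the 70/30 split is an assumption, and scikit-learn's `GradientBoostingClassifier` stands in for the "boost" model, whose exact implementation the text does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for one partition.
X, y = make_classification(n_samples=90, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
}

# Error rate = 1 - accuracy on the held-out records.
test_error = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_error[name] = 1.0 - model.score(X_test, y_test)

best = min(test_error, key=test_error.get)
for name, err in sorted(test_error.items(), key=lambda item: item[1]):
    print(f"{name}: {err:.1%}")
print("lowest error:", best)
```

Measuring the error on held-out records rather than on the training records matters here, because a model that reports 0% error on the data it was fitted to may simply have memorized it.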

Applying different ML models to check error rate
Applying the models to partitioned dataset 1 gives the following results:

Model-Linear logistic regression model:
This model uses a linear relationship between input variables to make predictions or infer patterns. The error rate with this model is 0%. Applying various models and comparing their overall and averaged class errors leads us to conclude that the linear logistic regression model, the boost model, and the random forest model are the best suited to this dataset: each has an error rate of 0%, which indicates that these models are very accurate on this partition. Now that we have identified the suitable models for our first partition, we take our second partition from the leaf node with the condition time>=8.3 and age>=20 and apply various models to find the most accurate one.

Model-decision tree:
Decision trees can handle both classification and regression tasks. They are also interpretable: because they can be visualized and easily understood by humans, they are useful for explainable machine learning. The overall error is 12.2% and the averaged class error is 12.5% with this model.

Neural network Model:
This is a complex computational model inspired by the human brain's structure and processes, capable of learning and making predictions from large data, commonly used in deep learning for various tasks like image recognition, natural language processing and speech recognition. The overall error is 6.7% and the averaged class error is 6.65%.

Linear logistic regression Model:
The overall error is 7.8% and the Averaged class error is 7.6% while using this model for our dataset.

Figure 15 Error matrix for logistic regression
Applying various models and comparing their overall and averaged class errors leads us to conclude that the boost model and the random forest model are the best suited to this dataset. Each has an error rate of 2.2%, the lowest among the models compared, which indicates that these models provide the most accurate results.
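The two error figures quoted for each model can be derived from an error (confusion) matrix. The sketch below assumes that "overall error" is the share of all records that are misclassified and that "averaged class error" is the unweighted mean of the per-class error rates; the source does not define either term. The counts in `example` are hypothetical and are not taken from the study's figures.

```python
import numpy as np

def error_rates(confusion):
    """Return (overall error, averaged class error).

    `confusion` is a square matrix of counts with actual classes in rows
    and predicted classes in columns.
    """
    confusion = np.asarray(confusion, dtype=float)
    # Overall error: misclassified records divided by all records.
    overall = 1.0 - np.trace(confusion) / confusion.sum()
    # Per-class error: misclassified records of a class divided by that class's total.
    per_class = 1.0 - np.diag(confusion) / confusion.sum(axis=1)
    return overall, per_class.mean()

# Hypothetical counts for a two-class problem.
example = [[50, 4],
           [7, 29]]
overall, averaged = error_rates(example)
print(f"overall error: {overall:.1%}, averaged class error: {averaged:.1%}")
```

The two numbers differ whenever the classes are unequal in size, because the averaged class error gives each class the same weight regardless of how many records it contains.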

Model Fitting (Logistic Regression)
A logistic regression model is defined and optimized using grid search CV.The hyperparameters of the model, such as the regularization strength and the type of penalty, are tuned to maximize the accuracy of the model on the training set.
The predictive model is built with logistic regression, using grid search cross-validation (CV) as the optimization technique. Logistic regression is a binary classification algorithm that models the likelihood of an event occurring based on a set of input variables or characteristics. In this case, the event is whether the wart is malignant or benign. Logistic regression works by using a logistic function to model the relationship between the input features and the output variable (that is, the binary class label). The logistic function maps the input features to a probability value between 0 and 1, which represents the probability that an input belongs to the positive class (malignant in this case) rather than the negative class (benign). Mathematically, logistic regression can be formulated as

$$p(y=1 \mid x) = \frac{1}{1 + \exp(-z)}$$

where p(y=1|x) is the probability of the input x belonging to the positive class, z is a linear combination of the input features, and exp(-z) is the exponential function applied to -z.
A logistic regression model is trained by optimizing the parameters (that is, the coefficients) of the linear function z so that the predicted probabilities are as close as possible to the actual class labels in the training data. This is usually done using a maximum likelihood estimation approach, in which parameter values are found that maximize the probability of observing the training data given the model. Once the logistic regression model is trained, it can be used to predict the probability of inputs belonging to the positive class using the logistic function. An input is classified as belonging to the positive class if the predicted probability is greater than a certain threshold (such as 0.5). Otherwise, it is classified as belonging to the negative class.
We have used grid search CV as our optimization technique. Grid search CV finds the best hyperparameters for a given model by systematically searching a set of hyperparameter values and evaluating their performance using cross-validation. This helps reduce overfitting and makes it more likely that the model generalizes well to new data.
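The prediction step described above (linear combination, logistic function, threshold) can be written out in a few lines. This is a minimal sketch: the weights, bias, and feature values are hypothetical and are not fitted coefficients from the study.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    # z is the linear combination of the input features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = logistic(z)
    # Positive class only if the probability exceeds the threshold.
    return p, int(p > threshold)

# Hypothetical coefficients for three features (not taken from the study).
weights = [0.8, -0.05, 0.3]
bias = -1.0
p, label = predict([2.0, 30.0, 1.0], weights, bias)
print(f"p(y=1|x) = {p:.3f}, predicted class = {label}")
```

Here z is about -0.6, so the probability falls below 0.5 and the input is assigned to the negative class.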

Hyperparameter Tuning
After a set of model hyperparameters is defined, the model is trained and scored on different combinations of these hyperparameters. This is done by systematically searching a grid of hyperparameters, with each point in the grid representing a different combination. The grid search CV algorithm uses cross-validation to evaluate model performance for each hyperparameter combination. In our case, we have used 10-fold cross-validation. Cross-validation splits the training data into multiple subsets, with one subset used as the validation set and the remaining subsets used as the training set. The model is trained on the training set and scored on the validation set. This process is repeated so that each subset serves once as the validation set. We then average the performance of the model across the validation sets to get an estimate of its overall performance.
We have taken C and penalty as the hyperparameters for tuning the logistic regression model for wart classification. The C hyperparameter is the inverse of the regularization strength: smaller values give stronger regularization, while larger values give weaker regularization. We have tuned C over the values [0.001, 0.01, 0.1, 1, 10, 100, 1000] to find the value that maximizes the cross-validated accuracy of the logistic regression model on the training data. The goal of this adjustment is to find the best balance between fitting the training data well (low bias) and not overfitting the training data (low variance).
The penalty hyperparameter controls the type of regularization used in the logistic regression model. The two options are 'l1' and 'l2', corresponding to L1 and L2 regularization respectively. L1 regularization adds a penalty term proportional to the absolute value of each coefficient. This tends to produce sparse models in which many of the coefficients are zero. L2 regularization adds a penalty term proportional to the squared value of each coefficient. This tends to produce models in which all coefficients are small but non-zero. We have tested both 'l1' and 'l2' regularization to find the method that maximizes the cross-validated accuracy of the logistic regression model on the training data.
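The search described above could be set up with scikit-learn roughly as follows. The C values, the two penalties, and the 10 folds come from the text; everything else is an assumption for illustration, including the synthetic data and the `liblinear` solver, which is chosen here only because it supports both L1 and L2 penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (the real records are not reproduced).
X, y = make_classification(n_samples=90, n_features=6, n_informative=3, random_state=0)

# Grid of hyperparameters: 7 values of C times 2 penalties = 14 combinations.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2"],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    scoring="accuracy",
    cv=10,  # 10-fold cross-validation for every combination
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

The reported score is the accuracy averaged over the 10 validation folds, so it is a less optimistic estimate than accuracy measured on the same records the model was fitted to.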

Model Evaluation
For wart classification using logistic regression, the model was evaluated using accuracy, which measures the overall correctness of predictions in distinguishing between benign and malignant warts. Accuracy is calculated by dividing the number of correctly classified warts (either benign or malignant) by the total number of warts in the dataset:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

The confusion matrices for the logistic regression model are depicted in Figure 17 (training data) and Figure 18 (test data).
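The sketch below shows how a confusion matrix and the accuracy can be computed separately for training and test data, which is one way to make a gap between the two visible. The data is synthetic and the 70/30 split is an assumption; the source does not state the split used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data and an assumed 70/30 split.
X, y = make_classification(n_samples=90, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

results = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    predicted = model.predict(X_part)
    matrix = confusion_matrix(y_part, predicted)  # rows: actual, columns: predicted
    results[name] = {
        "matrix": matrix,
        "accuracy": accuracy_score(y_part, predicted),
    }
    print(name, matrix.tolist(), f"accuracy={results[name]['accuracy']:.3f}")

# A large positive gap between training and test accuracy suggests overfitting.
gap = results["train"]["accuracy"] - results["test"]["accuracy"]
print(f"train minus test accuracy: {gap:+.3f}")
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the sum of all its entries, which matches the formula given above.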

Results and Discussion
Logistic regression demonstrated notable accuracy in wart classification, particularly during training; however, a discernible accuracy gap between training and testing phases raises concerns about potential overfitting. Dataset partitioning yielded mixed results, improving overall testing accuracy but diminishing performance on partitioned datasets, emphasizing the importance of careful dataset splitting. Hyperparameter tuning marginally improved testing accuracy, suggesting room for further optimization. Notably, details on data distribution and feature selection are absent, and exploring alternative algorithms like support vector machines or decision trees could enhance performance. Future recommendations include addressing overfitting through techniques such as regularization, investigating feature importance, exploring ensemble methods, and collecting more data for improved generalization.

Conclusion and Future Work
In conclusion, the study on wart classification using logistic regression, with a focus on data partitioning, error rate comparison, and feature selection, has provided valuable insights into the performance and limitations of the model. Logistic regression demonstrated commendable accuracy in classifying warts, particularly in the training phase. However, the observed disparity between training and testing accuracy raises concerns about potential overfitting, highlighting the need for careful model evaluation strategies. Data partitioning yielded mixed results, enhancing overall testing accuracy but reducing performance on partitioned datasets. The study underscored the significance of meticulous dataset splitting for robust model evaluation. Feature selection, while impactful, requires careful consideration, yet details on the specific features used were not provided, leaving room for further exploration. Future work is warranted to address overfitting through techniques like regularization and data augmentation, investigate feature importance to identify influential factors, explore alternative classification algorithms for potential performance improvement, and further optimize the model's accuracy through ensemble methods. Moreover, collecting a more extensive dataset could enhance generalization and contribute to a more robust wart classification model.
The findings from this study open avenues for future research in wart classification using logistic regression. To address overfitting, implementing techniques like regularization and exploring data augmentation methods could be investigated. Conducting an in-depth analysis of feature importance is crucial to identify the most influential factors in wart classification, and a detailed exploration of feature selection methods is warranted. Future work should also involve experimenting with alternative classification algorithms, such as support vector machines or decision trees, to ascertain the best fit for the wart classification data. Additionally, the potential benefits of ensemble methods, which combine multiple models, should be explored to enhance accuracy and robustness. Lastly, expanding the dataset size is recommended to improve generalization and mitigate risks associated with overfitting. Overall, future research endeavors should focus on refining the logistic regression model and its associated methodologies for wart classification, contributing to advancements in dermatological diagnostics.

Declarations
Conflict of interest: The author declares no competing interests.

Figure 4 Error matrix for linear logistic regression model

Figure 5 Error matrix for neural net model

Figure 6 Error matrix for tree

Figure 7 Error matrix for boost

Figure 8 Error matrix for SVM

Figure 9

Figure 10 Error matrix for decision tree

Figure 11 Error matrix for boost

Figure 12 Error matrix for random forest

Figure 13 Error matrix for SVM