Semi-Supervised Machine Learning Approaches for DDoS Attack Detection

Network infrastructures are the target of many attacks, including attacks on the confidentiality, integrity, and availability of the network. A distributed denial-of-service (DDoS) attack is a persistent threat to network availability and is typically carried out using a command and control (C&C) technique. To detect these attacks, numerous researchers have proposed machine learning-based solutions. In this paper, we detect different DDoS attacks using several methods and evaluate their performance. The experiments use the KDD99 dataset. Normal and attack samples were classified using the random forest technique, which classified 99.76% of the samples correctly. By strategically selecting clusters and incorporating the insights gained from the small labelled dataset, a portion of the unlabelled clusters can be assigned labels, effectively converting raw data into useful training examples. This enriched dataset is then used to train an improved classifier that can better generalize and adapt to the dynamic nature of DDoS attacks.


Introduction
In the ever-evolving landscape of cybersecurity, Distributed Denial of Service (DDoS) attacks pose a substantial threat to the availability and integrity of online services. Traditional detection methods struggle to keep pace with the increasing sophistication and scale of these attacks. To enhance DDoS detection capabilities, this research investigates semi-supervised machine learning techniques. DDoS attacks leverage the distributed power of compromised systems to overwhelm a target, making traditional signature-based detection systems insufficient against novel or highly skilled attacks. The proposed semi-supervised machine learning approach combines labeled and unlabeled data for model training, offering a dynamic solution to these challenges.
Our study introduces a novel semi-supervised machine learning paradigm for DDoS detection that leverages both labeled and unlabeled data. A classifier is first trained on a limited labeled dataset, from which it gains foundational insights into normal and DDoS traffic. A clustering algorithm is then employed to group extensive unlabeled traffic into coherent clusters. By selectively labeling clusters using the existing labeled data, the model is refined further, enhancing detection accuracy. This approach significantly reduces manual labeling requirements, providing an efficient defense against DDoS attacks that adapts to the evolving cyber threat landscape.
The primary objective of this study is to assess the effectiveness of semi-supervised machine learning in identifying DDoS attacks. By combining labeled data containing instances of known attack patterns with unlabeled data representing typical network behavior, these methods aim to identify subtle variations indicative of DDoS activity. Our semi-supervised machine learning paradigm proves to be a robust defense against the dynamic DDoS landscape by harnessing the strengths of both labeled and unlabeled data. This approach not only strengthens cybersecurity protocols but also streamlines the laborious process of human data tagging. The subsequent sections describe our methodology, present empirical findings, and illustrate how semi-supervised learning can augment DDoS mitigation tactics.
TCP_SYN floods are a DDoS attack in which the attacker inundates the network with SYN packets, initiating a three-way handshake but never responding with the final ACK packet. This overwhelms the victim server, causing network slowdowns and denial of service. ICMP floods overwhelm a server with ICMP echo requests sent from fabricated IP addresses, rendering it unable to handle legitimate requests. UDP floods bombard random ports on a server with UDP packets, preventing it from responding to valid applications and causing system disruption.

Related Work
The performance of machine learning algorithms for the detection of DDoS attacks in SDN is impressive. In these studies, the control plane of the SDN controller was attacked, and the ML approaches successfully detected the attack. This section briefly covers the use of machine learning techniques to identify DDoS attacks in SDN. It also analyzes feature-selection-based ML models and strategies that researchers have recently introduced.
A technique based on statistics and machine learning is suggested in [1]. In [2], a hybrid model based on K-means and K-nearest neighbors (KNN) is proposed. DDoS detection in SDN using a support vector machine (SVM) was carried out in [3]. In [4], an approach based on a genetic algorithm (GA), kernel principal component analysis (KPCA), and an SVM is provided. An entropy-based method for traffic classification using flow samples is provided in [5]; it concentrates solely on the traffic's standard distribution. The COFFEE model in [6] extracts features from the flow for attack detection; the hypothesized flow is transmitted to the controller in order to extract more features. The machine learning algorithms in [7] make use of a variety of factors to find the attack. Additionally, [8] presents a simple DDoS attack detection algorithm based on traffic features: a self-organizing map (SOM) is used to analyze and extract traffic data, and an Artificial Neural Network is used to detect DDoS attacks after the features are extracted. In [9], the researchers propose a k-nearest neighbor-based technique that identifies attacks based on the amorphous distance between traffic features. This method provides accurate results for the identification of anomalous flows while lowering the number of false alarms.
Although researchers have suggested a number of machine learning-based approaches for identifying DDoS attacks, these approaches have several drawbacks in terms of optimal feature selection, poor accuracy, and ineffectiveness. A technique based on Naive Bayes (NB) and K-means clustering for identifying DDoS attacks is suggested in [10]: the K-means method first groups traffic data that exhibit similar behaviors, and the Naive Bayes algorithm then classifies the clustered data as normal or attack traffic. Artificial Neural Network-based techniques are put forth in [11] to identify both known and unknown DDoS attacks. To identify DDoS attacks, the researchers in [12] use, in the controller, a dynamic Multilayer Perceptron (MLP) with a feedback mechanism. They employ a few particular features that are unable to differentiate between normal and attack traffic flows.

Methodology
Our study uses a systematic and reliable methodology with the goal of improving DDoS attack detection through machine learning, taking into account the complexity and dynamic character of this important subject. Our strategy is motivated by the knowledge that attackers frequently create new techniques, calling for proactive measures that adapt and evolve. The primary procedures and approaches used in our study are summarized here, together with crucial statistics that highlight the importance of our work.

A. Obtaining and processing data:
We start by obtaining a sizable dataset, the KDD99 dataset, comprising over 300,000 records.

B. Preprocessing the data:
Our dataset's integrity is of utmost importance. We therefore preprocess it carefully to make sure there are no errors or missing values. This stage entails locating and addressing missing data, removing unnecessary columns, and removing sparse features. We use data imputation techniques to deal with missing values, substituting either the average or the most frequent value. We also convert discrete variables into continuous ones, which makes further analysis easier.
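The imputation and encoding steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used in the study; the helper names `impute_column` and `one_hot`, the toy values, and the choice of one-hot encoding for discrete variables are assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def impute_column(values, strategy="mean"):
    """Replace None entries with the column mean or the most frequent value."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a discrete column as numeric indicator vectors (assumed encoding)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


# Toy columns (hypothetical values, named after KDD99-style attributes).
src_bytes = [100, None, 300]
protocol_type = ["tcp", None, "tcp", "udp"]

filled_bytes = impute_column(src_bytes, "mean")
filled_protocol = impute_column(protocol_type, "most_frequent")
encoded_protocol = one_hot(filled_protocol)
```

Mean imputation suits numeric columns, while the most frequent value is the natural choice for categorical columns such as the protocol type.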

C. Data splitting and cross-validation:
We split the dataset into training and testing subsets with an 80:20 ratio in order to analyze our models thoroughly. Because cross-validation is crucial for evaluating model performance, we adopt a 5-fold cross-validation strategy in the early stages of analysis.
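A minimal sketch of the 80:20 split and 5-fold cross-validation is given below. It operates on sample indices only and uses the Python standard library. The function names, the fixed seed, and the choice to run cross-validation on the training portion are assumptions for illustration; the text does not state which portion the folds were drawn from.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
```

Each sample in the training portion appears in exactly one validation fold, so every model is validated on data it did not see during that round of training.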

D. Selecting a model:
To find the best strategy for improving DDoS attack detection, our study examines a range of machine learning techniques. We choose the decision tree algorithm after a thorough examination that includes a review of related research papers. To offer effective DDoS attack detection capabilities, this ensemble machine learning model includes components of KNN, Decision Trees, Multilayer Perceptron (MLP), and Logistic Regression.

i) K-nearest neighbors:
K-nearest neighbors (KNN) is a simple supervised machine learning method that can be used for both regression and classification problems. The class of a new data point is determined from its k nearest neighbors. The Euclidean and Manhattan distance functions are commonly used to measure the distance between two data points; in this paper, the Euclidean distance function is used. The similarity between the sample to be classified and the samples already assigned to classes is measured as follows: the Euclidean distance function calculates the distance between the new data point and each point in the training set individually. The classification set is then formed by selecting the k training points with the smallest distances. The number of neighbors k is chosen according to the classification task.
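The KNN procedure described above can be sketched in a few lines. This is an illustrative implementation with the Python standard library, not the code used in the study; the two-dimensional toy points, the labels, and the default k = 3 are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, already-scaled feature vectors (hypothetical values).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
           (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

prediction_a = knn_predict(train_X, train_y, (0.82, 0.88), k=3)
prediction_b = knn_predict(train_X, train_y, (0.12, 0.18), k=3)
```

Because the distance is computed against every training point, features should be brought to comparable scales before KNN is applied; otherwise large-valued attributes dominate the distance.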

ii) Decision tree:
A decision tree is used in machine learning for classification. It is an efficient method that follows a divide-and-conquer strategy to construct the tree recursively. Like a tree, it has a root, internal nodes, branches, and leaves. Each path from the root to a leaf represents a rule based on the data attributes, and each leaf is labeled with the classification decision. Let the classes be denoted by C1, C2, ..., Cn; each leaf of the decision tree identifies a specific class Ci.
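To make the root, internal node, and leaf structure concrete, the sketch below classifies a sample by walking a small hand-built tree. The tree, its attribute thresholds, and the `classify` helper are hypothetical and chosen only for illustration; they are not rules learned in this study, and the sketch covers prediction only, not the recursive construction of the tree.

```python
def classify(node, sample):
    """Follow attribute tests from the root until a leaf (a class label) is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# Hypothetical tree: internal nodes test one attribute, leaves carry a class Ci.
toy_tree = {
    "feature": "src_bytes",
    "threshold": 50,
    "left": {
        "feature": "duration",
        "threshold": 1,
        "left": {"label": "attack"},
        "right": {"label": "normal"},
    },
    "right": {"label": "normal"},
}

result_short_small = classify(toy_tree, {"src_bytes": 10, "duration": 0})
result_long_small = classify(toy_tree, {"src_bytes": 10, "duration": 5})
result_large = classify(toy_tree, {"src_bytes": 500, "duration": 0})
```

Each root-to-leaf path in `toy_tree` reads as one rule, for example "if src_bytes <= 50 and duration <= 1 then attack".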

iii) Logistic regression:
Logistic regression is one of the most effective classification approaches. After feature extraction, application-layer DDoS attacks can be identified from the effective features. In this paper, we have used logistic regression; however, its performance is not suitable for our dataset. Logistic regression can be explained as follows: suppose there are k independent features $x_1, x_2, x_3, \ldots, x_k$; then, in the standard logistic form, the probability of a DDoS attack is expressed as
$$P(\text{attack} \mid x_1, \ldots, x_k) = \frac{1}{1 + e^{-(\gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \cdots + \gamma_k x_k)}}$$
where $\gamma_0$ is the intercept coefficient, $\gamma_1, \ldots, \gamma_k$ are the coefficients of the features, and $x_1, x_2, x_3, \ldots, x_k$ are the features.
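As a minimal sketch, the standard logistic model maps a weighted sum of the k features to a probability through the sigmoid function. The function name `attack_probability` and the example coefficient values are assumptions for illustration; in practice the intercept and coefficients are fitted from the training data.

```python
import math


def attack_probability(features, coefficients, intercept):
    """Standard logistic model: sigmoid of the intercept plus the weighted features."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; real values come from model fitting.
coefficients = [1.5, -0.5]
intercept = 0.0

p_neutral = attack_probability([0.0, 0.0], coefficients, intercept)
p_high = attack_probability([4.0, 0.0], coefficients, intercept)
p_low = attack_probability([0.0, 4.0], coefficients, intercept)
```

A sample is typically labeled as an attack when the returned probability exceeds a chosen threshold such as 0.5.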

iv) Random forest:
This section describes the general framework of the random forest (RF) model. The RF classifier consists of 1000 trees, and the minimum leaf node size is 1. Every weak learner was grown to its maximum depth, unpruned. Each tree was trained on a bootstrap sample covering about 63% of the observations, with a feature subset of size √m, where m represents the number of features, and all optimal features are used by the RF model.
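The 63% figure follows from sampling with replacement: drawing n observations out of n leaves each one out with probability (1 - 1/n)^n, which approaches 1/e, so about 63.2% of distinct observations appear in a bootstrap sample. The sketch below checks this numerically and draws one feature subset of size √m. The sample size, the seed, and m = 16 are assumptions for illustration; in the usual random forest formulation the feature subset is redrawn at every split.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def feature_subset(n_features, rng):
    """Pick floor(sqrt(m)) distinct feature indices out of m."""
    size = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), size))


rng = random.Random(0)
n_samples = 10_000

sample = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(sample)) / n_samples
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

features = feature_subset(16, rng)
```

The observations left out of a given bootstrap sample (roughly 37%) are the out-of-bag samples, which can serve as a built-in validation set for that tree.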

E. Performance assessment:
We thoroughly evaluate our models' performance by applying a variety of measures. F1 Score, Precision, Recall, and Support are the main classification metrics that we examine in depth. These statistics act as crucial yardsticks for evaluating the effectiveness of our models.
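The metrics named above can be computed directly from the true and predicted labels. The following is a small sketch for the binary case using the Python standard library; the function name `binary_metrics` and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive="attack"):
    """Precision, recall, F1, and support for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    support = sum(1 for t in y_true if t == positive)  # true positives-class count
    return {"precision": precision, "recall": recall, "f1": f1, "support": support}


# Toy labels (hypothetical): 3 true positives, 1 false negative, 1 false positive.
y_true = ["attack", "attack", "attack", "normal", "normal", "attack"]
y_pred = ["attack", "attack", "normal", "normal", "attack", "attack"]

metrics = binary_metrics(y_true, y_pred)
```

Precision penalizes false alarms on normal traffic, recall penalizes missed attacks, and F1 balances the two; support is the number of true samples of the class and gives context for how reliable the other figures are.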

F. Model comparison and selection:
The decision tree algorithm comes out on top in our comparative examination, showing stronger DDoS attack detection abilities. On a number of performance metrics, it outperforms KNN, Logistic Regression, and MLP.

G. Data Visualization:
We use data visualization tools to produce graphical representations of our results as a complement to our quantitative analyses. These visualizations make the complexity of DDoS attack detection easier to understand.

Results and Discussion
In this study, we aimed to improve the detection of Distributed Denial of Service (DDoS) attacks by applying semi-supervised machine learning techniques. Using the KDD99 dataset, we were able to categorize normal and attack samples with 99.76% accuracy by applying the random forest technique. We proposed a novel semi-supervised machine learning paradigm to tackle the problem of new or highly skilled attackers. This method efficiently turns unlabeled data into useful training examples by carefully choosing clusters and applying knowledge from a small labeled dataset to classify some of the unlabeled clusters. An enhanced classifier that can more effectively generalize and adjust to the dynamic nature of DDoS attacks is then trained using the enriched dataset.
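The cluster-labeling step can be sketched as follows, assuming cluster assignments have already been produced by some clustering algorithm. A cluster receives a label only when the few labeled samples inside it agree strongly, and that label is then propagated to the unlabeled members. The function names, the purity threshold of 0.9, the minimum of two labeled samples per cluster, and the toy data are all assumptions for illustration, not the selection rule used in the study.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, known_labels, min_purity=0.9, min_labeled=2):
    """Assign a label to each cluster whose labeled members mostly agree."""
    votes = defaultdict(Counter)
    for idx, label in known_labels.items():
        votes[cluster_ids[idx]][label] += 1

    cluster_label = {}
    for cluster, counter in votes.items():
        label, count = counter.most_common(1)[0]
        total = sum(counter.values())
        if total >= min_labeled and count / total >= min_purity:
            cluster_label[cluster] = label
    return cluster_label


def propagate(cluster_ids, known_labels, cluster_label):
    """Give unlabeled samples the label of their cluster, if it has one."""
    return {
        idx: cluster_label[cluster]
        for idx, cluster in enumerate(cluster_ids)
        if idx not in known_labels and cluster in cluster_label
    }


# Toy data: 11 samples in 3 clusters, 6 of them labeled (hypothetical).
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
known_labels = {0: "normal", 1: "normal", 4: "attack", 5: "attack",
                8: "normal", 9: "attack"}

cluster_label = label_clusters(cluster_ids, known_labels)
pseudo_labels = propagate(cluster_ids, known_labels, cluster_label)
```

Here cluster 2 contains conflicting labels and is left unlabeled, so sample 10 does not enter the enriched training set; only clusters with consistent evidence contribute new training examples.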
We used a methodical and trustworthy approach in our methodology, taking into account the dynamic and intricate nature of DDoS attacks. KDD99 provided us with an extensive dataset that included a number of different attributes, including duration, protocol type, service, and more. Preprocessing entailed filling in missing values and eliminating superfluous columns in order to ensure data integrity. Comprehensive model analysis was made easier by data splitting and cross-validation, with a particular emphasis on decision tree techniques. Hyperparameter tuning was performed on the chosen models, which included random forest, logistic regression, K-nearest neighbors, and decision trees, to achieve the best results. The evaluation of our models' performance used criteria such as F1 Score, Precision, Recall, and Support, which demonstrated their efficacy. On a number of performance criteria, decision tree algorithms fared better than KNN, logistic regression, and MLP. Graphical representations were produced using data visualization tools, which improved comprehension of the difficulties involved in detecting DDoS attacks.
Using confusion matrices and accuracy graphs for the various machine learning methods, results were shown and analyzed with an emphasis on ICMP, TCP_SYN, and UDP attacks. The comparative analysis demonstrated how well the decision tree algorithm detects DDoS attacks.
In summary, our research provides a methodical and empirical approach to semi-supervised machine learning for the identification of DDoS attacks. The outcomes show how successful the suggested models are, setting the stage for further developments in cybersecurity procedures and the ongoing defense against DDoS attacks.

Conclusion
The study "Semi-Supervised Machine Learning Approaches for DDoS Attack Detection" demonstrates a thorough and organized approach to the difficult task of DDoS attack detection. The study follows a systematic procedure that includes a number of steps, from data preprocessing through model selection, and places a strong emphasis on the significance of hyperparameter tuning and evaluation to obtain the best accuracy. The use of the real-world KDD99 dataset, containing attributes such as "duration", "protocol_type", "service", "flag", "src_bytes", and "dst_bytes" for identifying attacks, emphasizes the usefulness of this research. The dataset has been rigorously prepared for machine learning using data preprocessing techniques such as cleaning, addressing missing values, and text processing. The decision to use K-nearest neighbors, Decision tree, Multilayer Perceptron (MLP), Random Forest, and Logistic Regression, among other machine learning models, for DDoS attack detection is the result of a thorough investigation to determine which model performs best on this particular dataset. The models are fine-tuned using hyperparameter tuning to obtain their best performance. The study's results, including the AUC, support, F1, accuracy, and recall scores for each model, are noteworthy; K-nearest neighbors, Decision tree, Multilayer Perceptron (MLP), Random Forest, and Logistic Regression show their effectiveness in DDoS attack detection.
Robust DDoS attack detection methods have numerous potential applications. Social media sites, which frequently function as vital conduits for communication, stand to gain from the application of sophisticated detection algorithms to guarantee continuous operations in the event of cyberattacks. Journalistic organizations, which depend on the secure and accessible transmission of information, can benefit from the protection provided by state-of-the-art detection systems that shield their platforms from disruptive attacks. Moreover, governmental agencies tasked with safeguarding the nation's infrastructure can make use of these developments to reinforce the robustness of vital systems, guaranteeing the continuous provision of vital services.
The carefully outlined project implementation procedure, which includes data preprocessing, model selection, hyperparameter tuning, and evaluation, emphasizes the project's stringent approach to ensuring reliable findings. This project's future potential looks bright because it aims to improve application use and accuracy, potentially opening it up to people of all ages. The need to enlarge the dataset for real-world application demonstrates a dedication to the continuous advancement of DDoS attack detection methods.
In conclusion, our project uses a real-world dataset and machine learning models to detect DDoS attacks in a scientific and systematic manner. Its results and promise for the future highlight its contribution to the crucial goal of reducing the effects of DDoS attacks and working toward an attack-free sector.

Figure 1: Visual representation of the comprehensive methodology, seamlessly blending labeled and unlabeled data for robust cybersecurity against evolving threats

Figure 2: This flowchart illustrates a comprehensive approach, from obtaining and processing data to final model selection and visualization, providing a robust strategy against evolving cybersecurity threats

Figure 3A: ICMP Attack Confusion Matrix of LR and KNN algorithms