From Data to Impact: Machine Learning Models for Sustainable Development Region Classification-Accelerating the Achievement of Sustainable Development Goals through Predictive Analytics

This research paper presents a classification model for predicting the region to which a country belongs based on its sustainability scores and other related features. The dataset used in this study comprises comprehensive data on sustainability and progress towards achieving the Sustainable Development Goals (SDGs) for various countries. The primary objective is to understand regional trends in sustainability and assess countries' progress in sustainable development. The research begins with data preparation and preprocessing steps, including merging datasets, handling missing values, and standardizing features. Exploratory data analysis is performed to visualize the distribution of the target variable (region) and the distributions of numeric features related to SDG scores. Additionally, relationships between these features are explored using correlation matrices and pair plots. Several machine learning models are employed to classify countries into their respective regions. The models used include Random Forest, Support Vector Machine (SVM) with a linear kernel, K-Nearest Neighbors (KNN), Logistic Regression, Decision Tree, and SVM with a radial basis function (RBF) kernel. Each model is trained on the dataset, and their performance is evaluated in terms of accuracy, precision, recall, and F1-score. The results demonstrate the effectiveness of these models in accurately classifying countries into regions based on sustainability scores and other attributes. The findings reveal that Random Forest, K-Nearest Neighbors, Decision Tree, and SVM with RBF kernel achieve exceptionally high accuracy, suggesting their suitability for regional classification based on sustainability metrics. Logistic Regression and SVM with a linear kernel also provide competitive results. 
In conclusion, this research contributes to understanding regional trends in sustainability by utilizing machine learning models to predict the regions of countries based on their sustainability scores and associated features.


Introduction
The pursuit of sustainable development has emerged as a global imperative, transcending borders and encompassing diverse regions of the world. In the wake of the adoption of the Sustainable Development Goals (SDGs) by 193 United Nations Member States in 2015, assessing progress towards these goals has become paramount. The Sustainable Development Report 2023 provides a comprehensive dataset, offering insights into countries' sustainability scores and their journey towards achieving the SDGs. This research paper endeavors to harness this invaluable dataset to address a critical question: Can machine learning models predict the regions to which countries belong based on their sustainability scores and associated features?
Understanding regional disparities in sustainability is a fundamental step towards addressing global challenges, such as poverty alleviation, environmental conservation, and economic development. Regional trends in sustainability not only highlight areas of progress but also shed light on regions that may require targeted interventions to accelerate their sustainable development journey. As such, the ability to predict a country's region based on sustainability metrics holds immense potential for policymakers, international organizations, and researchers seeking to foster equitable and sustainable development across the world.
This research aims to bridge the gap between sustainability assessment and regional classification by applying a diverse set of machine learning algorithms to a rich dataset. By doing so, we seek to offer a comprehensive understanding of the predictive capabilities of these models and their potential utility in regional sustainability assessment. Our investigation encompasses exploratory data analysis, data preprocessing, and the deployment of machine learning models, including Random Forest, Support Vector Machine, K-Nearest Neighbors, Logistic Regression, Decision Tree, and SVM with RBF kernel. Through this research, we aim to provide a robust framework for classifying countries into their respective regions based on sustainability scores. The findings of this study have the potential to inform evidence-based policy decisions, prioritize resources, and drive targeted interventions towards achieving the SDGs. Ultimately, our research contributes to the broader dialogue on sustainable development by leveraging the power of machine learning to uncover regional insights within a global context.

Dataset Description
The dataset used in this research is derived from the Sustainable Development Report 2023, which reviews the progress made towards achieving the Sustainable Development Goals (SDGs) since their adoption in 2015 by 193 United Nations Member States. This dataset provides comprehensive information related to sustainability, allowing for a nuanced assessment of countries' progress in sustainable development. Below are the key components and attributes of the dataset:

1. Country Information:
• country_code: A unique identifier for each country.
• country: The name of the country under consideration.
• year: The year for which sustainability data is recorded.

2. Sustainability Scores:
• sdg_index_score: The overall sustainability score for a country, representing its progress towards achieving the SDGs. This score provides an aggregate measure of sustainability performance.
• goal_1_score through goal_17_score: Individual scores for each of the 17 Sustainable Development Goals (SDGs). These scores assess a country's progress towards specific goals, such as poverty reduction, quality education, clean energy, and more.

3. Regional Classification:
• region: The region to which a country belongs. This attribute serves as the target variable for classification, and the goal is to predict a country's region based on its sustainability scores and other features.

The dataset is designed to facilitate the analysis of global sustainability efforts, offering valuable insights into countries' performances in various aspects of sustainable development. It covers a range of years, allowing researchers to track progress over time and identify trends in different regions of the world. Additionally, the dataset has undergone data preparation and preprocessing steps, including the handling of missing values and the standardization of features, to ensure its suitability for machine learning analysis. Overall, this dataset serves as a valuable resource for exploring regional trends in sustainability and developing predictive models to classify countries into their respective regions based on sustainability metrics.
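The preparation steps mentioned here (merging on country_code and removing incomplete rows) might look roughly like the following pandas sketch. The two small tables, their values, and the country and region names are hypothetical stand-ins, since the paper does not specify the source files or their exact layout; only the column names come from the dataset description.

```python
import pandas as pd

# Hypothetical stand-ins for the two source tables; values are illustrative
# and are not taken from the Sustainable Development Report 2023.
scores = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "CCC"],
    "country": ["Country A", "Country A", "Country B", "Country C"],
    "year": [2021, 2022, 2022, 2022],
    "sdg_index_score": [61.2, 62.0, 74.5, None],  # one missing value
    "goal_1_score": [55.0, 56.1, 98.3, 40.2],
})
regions = pd.DataFrame({
    "country_code": ["AAA", "BBB", "CCC"],
    "region": ["Region X", "Region Y", "Region X"],
})

# Merge on the shared identifier, then remove rows with any missing value.
merged = scores.merge(regions, on="country_code", how="inner")
clean = merged.dropna().reset_index(drop=True)

# Drop the identifier columns and separate the target from the features.
X = clean.drop(columns=["country_code", "country", "region"])
y = clean["region"]

print(X.shape)
print(sorted(y.unique()))
```

Note that only country_code and country are dropped as identifiers, so year remains among the features in this sketch, matching the feature selection described in the Methodology section.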

Methodology
In this research, a structured methodology was employed to predict the regions to which countries belong based on their sustainability scores and associated features. The study commenced with data acquisition, involving the retrieval and loading of two primary datasets from the Sustainable Development Report 2023. These datasets were merged on a common identifier, 'country_code', to facilitate further analysis. To ensure data quality, rows with missing values were removed.
Exploratory Data Analysis (EDA) was a critical step in understanding the dataset's characteristics. Visualizations, such as count plots and histograms, were used to gain insights into the distribution of the target variable 'region' and the numeric features related to Sustainable Development Goals (SDGs) scores. Correlation matrices and heatmaps were employed to explore relationships between these features.
To prepare the data for machine learning, the identifier features 'country_code' and 'country' were dropped, and the dataset was split into training and testing sets. Feature scaling was applied to standardize the numeric features so that they share a consistent scale. A diverse set of machine learning models was selected for the classification task: Random Forest, Support Vector Machine (SVM) with a linear kernel, K-Nearest Neighbors (KNN), Logistic Regression, Decision Tree, and SVM with a Radial Basis Function (RBF) kernel. Each model was trained on the training data and evaluated on the testing data, with accuracy serving as the primary evaluation metric. Classification reports and confusion matrices were generated to provide comprehensive insights into the models' performance.
The results were interpreted to identify the most effective model for regional classification based on sustainability metrics. The research's conclusions and implications were drawn from the findings, highlighting the potential utility of predictive models in guiding policy decisions and targeted interventions for sustainable development. Additionally, recommendations for further research and real-world applications of the models were discussed, underscoring the significance of this methodology in bridging the gap between sustainability assessment and machine learning.
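The modeling steps described in this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data is synthetic (generated with make_classification as a stand-in for the 18 score features), and the number of classes, the 80/20 stratified split, the random seeds, and the default hyperparameters are all assumptions that the paper does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 18 numeric features (index score plus
# 17 goal scores) and an assumed 4-class target playing the role of 'region'.
X, y = make_classification(
    n_samples=600, n_features=18, n_informative=10, n_redundant=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Assumed 80/20 split, stratified so each class appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The six classifiers named in the paper, with default hyperparameters.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test_s))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")

# Confusion matrix and per-class report for the most accurate model.
best = max(accuracies, key=accuracies.get)
y_best = models[best].predict(X_test_s)
print(best)
print(confusion_matrix(y_test, y_best))
print(classification_report(y_test, y_best))
```

Accuracies obtained on this synthetic data say nothing about the results reported in the paper; the sketch only illustrates the order of operations, in particular fitting the scaler on the training split before transforming the test split.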

Observations
In the context of our sustainability analysis, the bar plot shown in Figure 2 offers a visual representation of countries' progress toward achieving sustainable development. This plot provides a clear overview of each country's Sustainable Development Index Score, with the x-axis representing the scores and the y-axis denoting the country names. Each bar corresponds to a specific nation, and its length reflects the magnitude of its Sustainable Development Index Score. The use of a pastel color palette and a white grid background enhances the plot's visual appeal and clarity. Upon careful examination of the plot, it becomes evident that there are notable variations in the sustainability performance of different countries. These variations highlight the disparities and trends in global sustainable development efforts. This visualization serves as a valuable reference point for our analysis, enabling us to gain insights into the diverse trajectories of sustainability achievement among nations.

Figure 2: Sustainable Development Index Score vs. Country
In Figure 3, we present a visual representation of the distribution of regions, a crucial component of our sustainability analysis. The vertical bar plot provides insights into the frequency of countries belonging to each specific region.

Figure 3: Distribution of Regions
In our exploration of the dataset, we delved into the distributions of various numeric features, as depicted in Figure 4. This figure showcases a series of histograms, each representing a specific numeric feature related to sustainability. The selected features include 'sdg_index_score' and scores associated with the Sustainable Development Goals (SDGs) from 1 to 17. This visual exploration aids in our understanding of the distribution patterns and variations within these critical sustainability metrics, contributing to a more informed analysis of our research objectives.

Figure 4: Distribution of Numeric Features Related to Sustainable Development
In our quest to uncover meaningful insights within the dataset, we embarked on an examination of the relationships between numeric features, as visualized in Figure 5. This figure presents a correlation heatmap, a powerful tool for quantifying and visualizing the interdependencies among these sustainability-related metrics.
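As a rough illustration of what such a heatmap encodes, the dependency-free sketch below computes pairwise Pearson correlation coefficients for a few score columns. Pearson correlation is assumed here because the paper does not name the coefficient used, the helper names pearson and correlation_matrix are hypothetical, and the numbers are invented for illustration rather than taken from the report.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Illustrative values only; these are not real SDG scores.
data = {
    "sdg_index_score": [60.0, 65.0, 70.0, 75.0, 80.0],
    "goal_1_score": [50.0, 58.0, 71.0, 80.0, 95.0],
    "goal_13_score": [90.0, 85.0, 70.0, 72.0, 60.0],
}

corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, [round(row[other], 2) for other in data])
```

Each cell of a correlation heatmap is one such coefficient: values near 1 indicate scores that rise together, values near -1 indicate scores that move in opposite directions, and the diagonal is always 1.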

Result
The results of the analysis showcase the performance of various machine learning models in predicting the regions to which countries belong based on their sustainability scores. Below is a summary of the key findings:
1. Random Forest Classifier:

Conclusion
Our study assessed various machine learning models for their efficacy in classifying countries into regions based on sustainability metrics. Notably, Random Forest, K-Nearest Neighbors (KNN), Decision Tree, and SVM with a Radial Basis Function (RBF) kernel exhibited remarkable performance, achieving accuracies of 99% to 100%. This underscores the potential of machine learning in regional classification tasks.
In closing, this research demonstrates the potential of machine learning in the realm of sustainability assessment and regional classification. The models showcased here offer a promising path towards more targeted and effective sustainable development efforts, fostering a future where global sustainability goals are not merely aspirations but attainable realities. As we continue to bridge the gap between data science and sustainable development, the pursuit of a more equitable and sustainable world remains within our reach.

Figure 1: Flow Chart of Methodology

Table 1: Classifier and corresponding accuracy
The predictive models developed in this research offer actionable insights for policymakers and organizations engaged in sustainable development initiatives. By accurately classifying countries into regions, these models can guide the allocation of resources and interventions to areas where they are needed most. This targeted approach can expedite progress towards achieving the Sustainable Development Goals (SDGs) on a global scale.
The study sets the stage for future research in the field of sustainability assessment and machine learning. Further exploration could involve the integration of additional data sources, temporal analysis to track sustainability trends over time, and the development of predictive models tailored to specific SDGs or subregions. Moreover, the models developed herein can find practical applications in guiding policy decisions, aid distribution, and sustainable development planning.