International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


A Comparative Study of Different Data Pre-processing Methods for Machine Learning

Author(s) Ms. Ramya S, Dr. B Kumaraswamy, Mr. Vishal Agarwal, Dr. Anushka Gkl Jain
Country India
Abstract In machine learning, the quality of data often determines the success of predictive models, and data pre-processing is a crucial step in ensuring reliability, accuracy, and generalizability. This study presents a comparative evaluation of common pre-processing methods, including missing value imputation, feature scaling and normalization, categorical encoding, outlier detection, and feature engineering techniques. Using three benchmark datasets across classification, regression, and multiclass tasks, we applied these methods in combination with machine learning models such as logistic regression, decision tree, support vector machine (SVM), random forest, and gradient-boosted trees. Results show that imputation methods like iterative multivariate imputation improve predictive performance in datasets with moderate to high missingness, while scaling significantly enhances linear and gradient-based models but remains unnecessary for tree-based models. Target encoding proves most effective for high-cardinality categorical features, though it requires careful leakage prevention. Outlier handling benefits linear models but has limited impact on tree-based algorithms. Feature engineering techniques such as polynomial expansion and principal component analysis (PCA) provide gains in specific contexts but involve trade-offs in interpretability and runtime. Overall, the study underscores the importance of tailoring pre-processing strategies to both data characteristics and model families, offering practical guidelines for optimizing machine learning pipelines.
Keywords Data Pre-processing; Machine Learning; Missing Value Imputation; Feature Scaling; Categorical Encoding; Outlier Detection; Feature Engineering; Principal Component Analysis (PCA); Model Performance; Comparative Study
Field Computer
Published In Volume 7, Issue 4, July-August 2025
Published On 2025-08-04
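
The abstract notes that target encoding requires careful leakage prevention. One common way to achieve this is out-of-fold encoding, sketched below in plain Python. This is an illustrative sketch only and not necessarily the procedure used in the study: the function name `out_of_fold_target_encode`, the interleaved fold assignment, and the `smoothing` parameter are all assumptions introduced here for the example.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=3, smoothing=1.0):
    """Hypothetical helper: encode each row's category with a smoothed mean
    target computed only from rows in the OTHER folds, so that a row's own
    label never contributes to its encoded value.

    Assumes 2 <= n_folds <= len(categories).
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Rows with index % n_folds == fold are held out; the rest are "training".
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_total, train_n = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                train_total += targets[i]
                train_n += 1
        # Prior is the mean target over the training folds only.
        prior = train_total / train_n
        for i in range(fold, n, n_folds):
            c = categories[i]
            denom = counts[c] + smoothing
            # Categories unseen in the training folds fall back to the prior.
            encoded[i] = (sums[c] + smoothing * prior) / denom if denom > 0 else prior
    return encoded


if __name__ == "__main__":
    cats = ["a", "a", "b", "b", "a", "c"]
    ys = [1, 0, 1, 1, 0, 1]
    print(out_of_fold_target_encode(cats, ys))
```

In this toy input, the category "c" appears only once, so its row is encoded with the prior of the other folds rather than with its own label. A naive encoder that averages targets over the full dataset would instead assign that row its own target value, which is the kind of leakage the abstract warns about.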
