Placement Wizard: Recruitment Made Easier Using Statistics

The recruitment process for jobs is often challenging and time-consuming. It entails multiple stages, including screening resumes, conducting interviews, and assessing candidates through tests. Recently, machine learning algorithms have gained popularity in the recruitment process, as they can predict the likelihood of a candidate's placement based on various factors. In this study, we aimed to develop and compare four supervised machine learning algorithms, namely logistic regression (LR), decision tree (DT), Naïve Bayes (NB), and k-nearest neighbours (KNN), together with principal component analysis (PCA), for predicting the placement status of students in a job recruitment process.

These are just a few examples of common data cleaning tasks. The specific steps and techniques employed in data cleaning will depend on the nature of the data, the analysis goals, and the domain knowledge of the analyst.
In the dataset that we are considering, there were initially several other areas of specialization, such as Digital Marketing, Business Analytics, and Operations Management. We decided to consider only two specializations, namely "Marketing and HR" and "Marketing and Finance", in order to simplify the computations and formulations.
There were several NA values in the dataset, so we had to omit those values whenever required. For example, the salaries of unplaced students were initially "NA", but we replaced them with "0" so that the data became computable without changing its meaning.
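The replacement of missing salaries described above can be sketched as follows. The study itself worked in R; this Python version is only an illustration, and the records, field names, and salary figures are made up rather than taken from the dataset.

```python
# Hypothetical records; None stands in for an NA salary.
records = [
    {"status": "Placed", "salary": 270000},
    {"status": "Not Placed", "salary": None},
    {"status": "Placed", "salary": 250000},
]


def fill_unplaced_salary(rows):
    """Return copies of the rows with a missing salary set to 0 for unplaced students."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the original records stay unchanged
        if row["salary"] is None and row["status"] == "Not Placed":
            row["salary"] = 0
        cleaned.append(row)
    return cleaned


cleaned_records = fill_unplaced_salary(records)
print([row["salary"] for row in cleaned_records])
```

Only unplaced students are touched, so a missing salary for a placed student would remain visible as a genuine gap.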

EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and summarizing the main characteristics of a dataset. It aims to gain insights, detect patterns, and uncover relationships within the data. EDA techniques include data visualization, summary statistics, and data transformation. By exploring the data, we can understand the distribution, identify outliers, assess missing values, and determine the appropriate pre-processing steps. EDA helps in formulating hypotheses, selecting modelling techniques, and making informed decisions. It serves as a foundation for further statistical analysis, model building, and drawing meaningful conclusions from the data.

❖ Summary statistics
Summary statistics provide a concise summary of the key features of a dataset, helping to understand its distribution and properties. Measures like mean, median, and mode provide insights into the central tendency of the data, while measures of dispersion such as standard deviation and range indicate the variability and spread. Skewness and kurtosis describe the shape of the distribution. Minimum and maximum values highlight the range of the data. Quartiles offer information about the data's spread and can help detect outliers. Overall, summary statistics provide a quick overview of the dataset's characteristics, enabling initial insights and informing subsequent analysis and decision-making processes.
The summary() function provides a summary of the central tendency, dispersion, and distribution of each variable in the dataset. The output includes the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values for numeric variables. For factor or character variables, it shows the frequency counts of each unique value.
• The maximum variance is seen in the "entrance test percentage", although its mean is observed at 72. By observing the coefficient of variation, we compare the variability of all the academic marks. The variances of "SSC percentage" and "HSC percentage" are almost the same, while the variance of "Entrance test" is the maximum.
Here, a low coefficient of variation (less than 10%) indicates that the variable has relatively low variability or dispersion with respect to the mean.
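As a small illustration of the coefficient of variation used above, the following Python sketch computes it as the sample standard deviation expressed as a percentage of the mean. The study used R; the marks below are made-up values, and the 10% cut-off is the one quoted in the text.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)


# Made-up percentages, not values from the placement dataset.
marks = [60.0, 65.0, 70.0, 75.0, 80.0]
cv = coefficient_of_variation(marks)
label = "low variability" if cv < 10 else "higher variability"
print(round(cv, 2), label)
```

Because the coefficient of variation is unit-free, it allows variables with different means to be compared on the same scale.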

❖ Data Visualisation:
• The histogram of SSC percentages appears to be normally distributed, as the shape of the histogram seems symmetric and bell-shaped.
• Also, the peak (mode) is seen near the middlemost observations for most of the histograms.
• Hence, it can be interpreted that the majority of the population scores about 60 percent on average. This trend is seen in all 4 histograms.
• The above graph shows the histogram of entrance exam percentage. It can be seen that the graph is not symmetric; rather, it is positively skewed. In all the earlier graphs, the frequency peaked at 60-70% and then the frequencies dropped drastically at higher percentages. But in this case, there are quite a few students at high percentages.
• No. of male students = 139
• No. of female students = 76
• No. of males placed = 100 and no. of unplaced males = 39
• No. of females placed = 48 and no. of unplaced females = 28
• The sample proportion of males getting placed (71.9%) is slightly higher than that of females getting placed (63.2%).
• Also, the overall proportion of placements is 68.8% (148 of 215 students).
• The above histogram shows the HSC board stream vs status of placements.
• It can be seen that the proportion of placements for arts is almost 50 percent, but we cannot draw any conclusion because we have very little data on arts stream students.
• From the above graph, we can say that there are candidates from all streams opting for MBA.
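The placement proportions quoted in the list above follow directly from the reported counts. A minimal Python sketch of that arithmetic is given below (the study used R; the dictionary layout and function name are illustrative choices).

```python
# Counts as reported in the text.
counts = {
    "male": {"placed": 100, "not_placed": 39},
    "female": {"placed": 48, "not_placed": 28},
}


def placement_rate(group):
    """Percentage of students in a group who were placed."""
    return 100 * group["placed"] / (group["placed"] + group["not_placed"])


male_rate = placement_rate(counts["male"])
female_rate = placement_rate(counts["female"])

total_placed = sum(g["placed"] for g in counts.values())
total_students = sum(g["placed"] + g["not_placed"] for g in counts.values())
overall_rate = 100 * total_placed / total_students

print(round(male_rate, 1), round(female_rate, 1), round(overall_rate, 1))
```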

❖ Correlation Analysis:
• Here, the correlation matrix shows the correlation between all the numeric variables, that is, the percentages of marks in various exams.
• When r lies in [-0.3, 0.3], the observations are uncorrelated or have negligible correlation. This generally occurs due to randomness in the data.
• There is considerable correlation among all the academic exams, whereas there is negligible correlation between the academic exams and the entrance test.
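The correlation coefficient r and the [-0.3, 0.3] rule of thumb above can be sketched in Python as follows. The study used R; the three short series here are made-up and only borrow the dataset's column names for readability.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def is_negligible(r, threshold=0.3):
    """Apply the rule of thumb from the text: |r| <= 0.3 is treated as negligible."""
    return abs(r) <= threshold


# Made-up values for illustration only.
ssc_p = [1, 2, 3, 4, 5]
hsc_p = [2, 4, 6, 8, 10]   # perfectly linear in ssc_p
etest_p = [2, 1, 3, 1, 2]  # no linear relationship with ssc_p

r_academic = pearson_r(ssc_p, hsc_p)
r_entrance = pearson_r(ssc_p, etest_p)
print(r_academic, r_entrance, is_negligible(r_entrance))
```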

MACHINE LEARNING MODELS
ML model building is the process of creating a computer program that can make predictions or provide insights based on patterns it learns from data. We first choose an algorithm, i.e., select a method or technique that suits the problem we want to solve. Then, we prepare the data: we organize and clean it, making sure it is in a format that the algorithm can understand. After that, we train the model by feeding the algorithm with labelled data, allowing it to learn patterns and relationships between the input (features) and the output (labels). Then, we evaluate the model by testing it on a separate set of data that it has not seen before.
Similarly, we may practice unsupervised learning as well, wherein the goal is to discover patterns, structures, or relationships in data without labelled examples. The ultimate goal of ML model building is to create a reliable model that can make accurate predictions or provide useful insights when given new, unseen data. Supervised machine learning models are trained on labelled data, where the inputs and corresponding outputs are provided, enabling the model to learn patterns and make predictions or classifications. In contrast, unsupervised machine learning models work with unlabelled data and aim to discover inherent structures, relationships, or clusters within the data without any predefined target outputs.
The unsupervised ML model that we use in this project is principal component analysis (PCA). Similarly, the supervised ML models that we use are logistic regression, Naïve Bayes, k-nearest neighbours, and decision tree.

PRINCIPAL COMPONENT ANALYSIS
Here, we need to know whether there is a grouping structure between the data of placed students and unplaced students. We now have data containing all the numeric variables (percentages), where the first 148 observations are of placed students and the remaining observations are of unplaced students.
• Email: editor@ijfmr.com

IJFMR23056296
Volume 5, Issue 5, September-October 2023

Now, we draw a scree plot to check the point at which the elbow appears. This elbow point helps us to identify the number of principal components. Here, the elbow is at k=3, hence we consider the first 2 principal components.
Here, we find that most of the variation is explained by the first 2 principal components. Now, by converting the dataset into a matrix and multiplying it by the eigenvectors, we get all the linear combinations, i.e., all the principal component values. The 2 clusters overlap, hence we fail to differentiate the data into distinct clusters. Hence, there is no grouping structure between recruited students and non-recruited students.
Here, we use 5 numeric variables, namely ssc_p, hsc_p, degree_p, etest_p, and mba_p, mostly related to scores in academic exams. We can say that scores in the academic exams do not help us to classify students into the 2 groups, Placed and Not Placed.
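To make the idea of "variation explained by a principal component" concrete, the sketch below works through PCA for just two variables, where the eigenvalues of the covariance matrix have a closed form. This is a simplification for illustration: the study applied PCA in R to five variables, and the data here are made-up.

```python
from math import sqrt


def covariance_matrix_2d(x, y):
    """Sample variances and covariance of two variables: (sxx, sxy, syy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return sxx, sxy, syy


def eigenvalues_sym_2x2(sxx, sxy, syy):
    """Eigenvalues of [[sxx, sxy], [sxy, syy]], largest first."""
    half_trace = (sxx + syy) / 2
    spread = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return half_trace + spread, half_trace - spread


# Made-up scores for two variables.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]

sxx, sxy, syy = covariance_matrix_2d(x, y)
lambda_1, lambda_2 = eigenvalues_sym_2x2(sxx, sxy, syy)

# Share of the total variance captured by the first principal component.
explained_ratio = lambda_1 / (lambda_1 + lambda_2)
print(round(explained_ratio, 3))
```

A scree plot is simply these eigenvalues plotted in decreasing order; with more variables, the same ratios are computed for each component.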

LOGISTIC REGRESSION MODEL
Logistic regression is a statistical method used for binary classification problems. It uses the logistic (sigmoid) function to model the probability of a binary outcome. The model assumes a linear relationship between the input variables and the log-odds of the positive outcome.
The model is used to make predictions on new data by computing the logit and applying the logistic function to obtain the probability of the positive outcome.
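The prediction step just described, computing the logit and mapping it to a probability, can be sketched in Python as follows. The study fitted its model in R; the intercept, coefficients, and feature values below are hypothetical and are not the fitted estimates.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps a logit (log-odds) to a probability in (0, 1)."""
    return 1 / (1 + exp(-z))


def predict_probability(intercept, coefficients, features):
    """Probability of the positive outcome for one observation."""
    logit = intercept + sum(c * f for c, f in zip(coefficients, features))
    return sigmoid(logit)


# Hypothetical coefficients and features, for illustration only.
p = predict_probability(intercept=0.0, coefficients=[1.0, -1.0], features=[2.0, 2.0])
print(p)
```

A logit of 0 corresponds to a probability of 0.5, which is the usual cut-off for assigning the class "1" or "0".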
In order to fit the logistic regression model to our data, we prepare the data first. We remove the salary and serial number columns for now, since we do not wish to study them. Then, we transform the "Status" column to a numeric format by assigning the value "1" for "Placed" and the value "0" for "Not Placed".
We define the logistic regression model using the formula interface, with all required variables defined beforehand. Now, we fit the logistic regression model using the pre-decided training dataset. We use the "glm" command in R, which is a direct command to fit generalized linear models.
Testing of hypothesis using the chi-square test for significance of regressors:
H0: All the regressors are insignificant.
H1: At least one of the regressors is significant.
Here, H0 is rejected, since G = null deviance - residual deviance exceeds the chi-square critical value with 12 degrees of freedom at the 5 percent level of significance. Since at least one of the regressors is significant, we now need to check which of them are significant; a regressor is taken as significant when its p-value is less than 0.05.
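The decision rule for this test can be sketched as below. The critical value 21.026 is the tabulated upper 5% point of the chi-square distribution with 12 degrees of freedom; the deviance values are hypothetical, since the fitted deviances are not reported in the text, and the study itself used R rather than Python.

```python
# Tabulated upper 5% point of the chi-square distribution with 12 degrees of freedom.
CHI2_CRITICAL_12DF_5PCT = 21.026


def deviance_test(null_deviance, residual_deviance,
                  critical_value=CHI2_CRITICAL_12DF_5PCT):
    """Return the statistic G and whether H0 (all regressors insignificant) is rejected."""
    g = null_deviance - residual_deviance
    return g, g > critical_value


# Hypothetical deviances, for illustration only.
g_stat, reject_h0 = deviance_test(null_deviance=190.0, residual_deviance=80.0)
print(g_stat, reject_h0)
```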
We conclude that the regressors "ssc_b", "hsc_p", "degree_p", "workex", and "mba_p" are significant for the logistic model. Now, we need to evaluate the performance of the model by applying the fitted model to our testing data. We get the confusion matrix of the test, which gives us the true positives, true negatives, false negatives, and false positives.
We use these 4 values to calculate the accuracy, specificity, sensitivity, precision, and F1 score of our model.
We can see that our logistic regression model is about 90.76% accurate.

NAÏVE BAYES ALGORITHM
Naïve Bayes is a machine learning algorithm used for classification problems. It calculates the probability of each class given a set of input features using Bayes' theorem. It assumes that the features are independent of each other given the class, which simplifies the calculations. The algorithm works by calculating the prior probability of each class and the likelihood of each feature given each class. It then combines these probabilities to compute the posterior probability of each class given the input features. The class with the highest posterior probability is then chosen as the predicted class.
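The prior, likelihood, and posterior steps described above can be sketched for categorical features as follows. This is a minimal illustration in Python, not the naiveBayes() routine the study used in R: it handles only categorical features, applies no smoothing, and runs on a made-up toy dataset.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def posterior(row, class_counts, feature_counts):
    """Posterior probability of each class for one observation (no smoothing)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(row):
            score *= feature_counts[(label, i)][value] / count  # likelihood
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Made-up toy data: (work experience, stream) -> placed (1) or not placed (0).
rows = [("Yes", "Sci"), ("Yes", "Comm"), ("No", "Comm"),
        ("No", "Sci"), ("Yes", "Sci"), ("No", "Comm")]
labels = [1, 1, 0, 0, 1, 1]

class_counts, feature_counts = train_naive_bayes(rows, labels)
post = posterior(("No", "Sci"), class_counts, feature_counts)
predicted = max(post, key=post.get)
print(post, predicted)
```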
To perform the Naïve Bayes algorithm, we first remove the unnecessary columns (Sr. No. and Salary). After cleaning the data, we convert the Status column to 0 (for Not Placed) and 1 (for Placed).

This is the cleaned data:
The command for fitting the model is naiveBayes().
After fitting the model, we test its fitness. The accuracy, sensitivity, and specificity of the model are calculated on the basis of the confusion matrix.
The Naïve Bayes algorithm is 71.21% accurate for the given data.

K-NEAREST NEIGHBOURS (KNN) ALGORITHM

FITTING OF KNN ALGORITHM:
We consider column numbers 2, 4, 7, 10, 12, and 13 only for this model, as we wish to work only on the scores of students and their impact on the placement status (1/0). We now fit the KNN model to our training dataset using the "knn" command in R. This is a direct command in R to fit the KNN model to any data.
We can see that, based on what it learned from the training data, our model has classified the observations in the testing data into "1" and "0". Now, we need to see how accurately this has been done. To do that, we find the confusion matrix of the test and, from its values, calculate the accuracy, specificity, sensitivity, precision, and F1 score. We observe that our KNN algorithm has an accuracy of about 77.27%.

DECISION TREE
A decision tree is a supervised learning algorithm that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Decision trees are commonly used in machine learning to solve classification and regression problems. They are also used in decision analysis, alongside closely related tools such as influence diagrams. For classification problems, the decision tree is used to predict the class of a new data point. For regression problems, the decision tree is used to predict the value of a continuous variable.
To make a prediction, the algorithm starts at the root of the tree and follows the branches down to the leaf that corresponds to the values of the features for the data point being predicted. The outcome at the leaf is then used as the prediction.
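The root-to-leaf traversal described above can be sketched as follows. The tree below is hand-written for illustration, with hypothetical split features and thresholds; it is not the tree fitted in the study, which was built in R.

```python
# A hand-written toy tree: internal nodes test one feature against a threshold,
# leaves carry the predicted class.
tree = {
    "feature": "ssc_p", "threshold": 60.0,
    "left": {"label": "Not Placed"},
    "right": {
        "feature": "workex", "threshold": 0.5,
        "left": {"label": "Not Placed"},
        "right": {"label": "Placed"},
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf is reached, then return its label."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(tree, {"ssc_p": 70.0, "workex": 1}))
```

Each prediction only evaluates the conditions along one path, which is why a fitted tree can be read as a set of nested if/else rules.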
Decision tree algorithms are relatively easy to understand and interpret, and they can be used to solve a wide variety of problems. However, they can also be prone to overfitting, which is when the algorithm learns the training data too well and is not able to generalize to new data. The following is the decision tree classification for the data; the data used here is the training data obtained earlier.

INFERENCE
After fitting the various machine learning models and testing them for their efficiencies, we get the following output to compare them with each other. We hereby infer that, amongst all the fitted ML models, the logistic regression prediction model has the highest accuracy, at 90.76%.

CONCLUSION
From the above table of inferences, we can declare that our logistic regression model is the best model for handling and predicting data for placement purposes. This model is meant to assist any recruiter in finding the ideal candidates for their jobs based on the candidates' educational and professional qualifications.
In general, if the relationship between the input features and the output variable is complex and non-linear, a random forest may perform better than logistic regression. However, if the relationship is simple and linear, as in this case, logistic regression may be sufficient and more interpretable. For our dataset, logistic regression works best because the sample size is small and the classes appear to be close to linearly separable. A logistic regression model tends to perform well when the dataset is small, while a decision tree tends to work better when the dataset is large.
In future, this work can be extended to create a job profile selector that helps interviewers reduce subjectivity, control the rush of students at campus drives, and shortlist the best candidates directly. That can save considerable time and money for the company.
Machine learning models are a powerful tool that can be used to make predictions from data. However, it is important to remember that machine learning models are only as good as the data that they are trained on. It is also important to carefully consider the limitations of machine learning models, such as their potential to overfit or underfit the data.
There are some obvious drawbacks to this process of using ML assistance, such as the inability to judge the inter-personal skills and other human values of the candidate. Still, this model can help us cut down the most tedious part of the recruitment process and give us the option to interview and screen the shortlisted candidates at the end.
In conclusion, although Artificial Intelligence cannot completely replace the current way of doing things, it can certainly assist us in increasing efficiency and reducing effort, and this project is a good demonstration of that.

STATISTICAL TOOLS USED

PROCEDURE
1. Performing principal component analysis to control dimensions of data.
2. Fitting of logistic regression model to given data.
3. Fitting "k-nearest neighbours" ML model to data.
4. Fitting "Naïve Bayes" ML model.
5. Fitting Decision Tree Classifier Model.

CLEANING THE DATA
Data cleaning, also known as data cleansing or data pre-processing, is the process of identifying and correcting or removing errors, inconsistencies, missing values, and irrelevant or noisy data from a dataset. It is an essential step in data analysis and machine learning tasks to ensure the accuracy, reliability, and quality of the data. Data cleaning involves several operations, which can vary depending on the characteristics and requirements of the dataset. Here are some common data cleaning tasks:
1. Handling Missing Values
2. Removing Duplicates
3. Correcting Inconsistent Values
4. Handling Outliers
5. Standardizing and Normalizing Data
6. Dealing with Inconsistent or Incomplete Data Structures
7. Handling Irrelevant or Noisy Data

Supervised ML models used:
1. LOGISTIC REGRESSION MODEL
2. NAÏVE BAYES ALGORITHM
3. K-NEAREST NEIGHBOURS ALGORITHM
4. DECISION TREE ALGORITHM

Training and testing data: To prepare the dataset for analysis, we split it into two parts, a training set and a testing set, by creating separate datasets for each. During the training phase, the machine learning model attempts to learn the various relationships present in the training dataset; the accuracy of its predictions is then assessed on the testing dataset. To split the dataset, we used a 70-30 ratio, where 70% of the data was reserved for training and the remaining 30% was used for testing.

PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a statistical technique used to identify patterns in data by reducing the number of variables while retaining the essential information. It does this by finding the principal components, which are linear combinations of the original variables that capture the most variance in the data. PCA is often used for data visualization, dimensionality reduction, and feature extraction. PCA can help in several ways, including:
a) Reducing the dimensions of data.
b) Classifying and grouping the data.
c) Detecting and avoiding multicollinearity.
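The 70-30 train/test split described in this section can be sketched in Python as follows. The study performed the split in R; the shuffling approach, the seed, and the use of row indices in place of actual records are assumptions made for this illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and cut them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 215 student records.
train_rows, test_rows = train_test_split(range(215))
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so that every model is trained and tested on the same partitions.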
In a single sentence, nearest neighbour classifiers are defined by their characteristic of classifying unlabelled examples by assigning them the class of the most similar labelled examples. Despite the simplicity of this idea, nearest neighbour methods are extremely powerful. They have been used successfully for several complicated prediction projects. The k-NN algorithm is a non-parametric method used for classification tasks, where the class membership of a new observation is determined by the majority vote of its k nearest neighbours in the feature space.
KNN Algorithm:
• The KNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labelled by a nominal variable.
• Assume that we have a test dataset containing unlabelled examples that otherwise have the same features as the training data.
• For each record in the test dataset, KNN identifies the k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance.
• The unlabelled test instance is assigned the class of the majority of the k nearest neighbours.
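The steps listed above can be sketched in a few lines of Python. This is a minimal illustration using Euclidean distance and a simple majority vote, not the "knn" routine the study used in R; the points and labels are made-up.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query the majority class among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance to the query
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up (score 1, score 2) pairs with placement labels 1/0.
train_points = [(60, 62), (62, 65), (80, 78), (82, 85), (85, 80)]
train_labels = [0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (81, 80)))
```

In practice the features are usually scaled first, since a variable with a larger range would otherwise dominate the distance.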
From this data, only classifier variables are considered. After considering the data, we fit the Decision Tree Classifier algorithm. After fitting, we check the fitness of the model by obtaining the confusion matrix. The accuracy of the Decision Tree Classifier is 81.81%.

COMPARISON OF MODELS
Here, the algorithms used by us for classification and prediction are LR, KNN, NB and DT. The comparison of algorithms is done on the basis of accuracy, recall, F1 score and precision. There are several terms that are commonly used along with the description of sensitivity, specificity and accuracy. They are true positive (TP), true negative (TN), false negative (FN), and false positive (FP).
Sensitivity: In machine learning, sensitivity, also known as recall or true positive rate, measures the proportion of actual positive instances that are correctly identified by the model, indicating its ability to minimize false negatives.
Precision: Precision in machine learning measures the proportion of correctly predicted positive instances out of all instances predicted as positive, highlighting the model's ability to minimize false positives.
Specificity: In machine learning, specificity measures the proportion of actual negative instances that are correctly identified by the model, indicating its ability to minimize false positives and correctly identify true negative instances.
Accuracy: In machine learning, accuracy is a metric that measures the overall correctness of the model's predictions by calculating the ratio of correctly classified instances to the total number of instances in the dataset.
F1 Score: In machine learning, the F1 score is a metric that combines precision and recall (sensitivity) into a single value. It provides a balanced measure of a model's performance by considering both the true positive rate and the positive prediction accuracy.
All these measures are described in terms of TP, TN, FN and FP as follows:
• Sensitivity = TP / (TP + FN)
• Specificity = TN / (TN + FP)
• Accuracy = (TN + TP) / (TN + TP + FN + FP)
• Precision = TP / (TP + FP)
• F1 score = 2 * Precision * Sensitivity / (Precision + Sensitivity)
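The five formulas above translate directly into code. The Python sketch below is only an illustration (the study computed these measures in R), and the confusion-matrix counts are hypothetical rather than the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five measures defined above from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, not taken from the study's confusion matrices.
metrics = classification_metrics(tp=40, tn=15, fp=5, fn=5)
for name, value in metrics.items():
    print(name, round(value, 4))
```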

• Measures of Symmetry
• A skewness of -0.131 means that the distribution is slightly negatively skewed. This indicates that the tail of the distribution is longer on the left-hand side and that the majority of the data values are clustered towards the right-hand side of the distribution. All the remaining variables are slightly positively skewed.

SSC percentage HSC percentage Degree percentage Entrance test percentage MBA percentage
HSC percentage and Degree percentage have kurtosis greater than 3, from which we can infer that their distributions are leptokurtic. For the others, the distribution is platykurtic, since the kurtosis is less than 3.
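Skewness and kurtosis as interpreted above can be computed from central moments. The Python sketch below uses the moment-based definitions in which a normal distribution has kurtosis 3, matching the threshold used in the text; whether the study's R routine used exactly these formulas is an assumption, and the data are made-up.

```python
def central_moment(values, order):
    """Central moment of the given order, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Third standardized moment; negative means a longer left tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (not excess); equals 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

shape = "leptokurtic" if kurtosis(symmetric) > 3 else "platykurtic"
print(skewness(symmetric), kurtosis(symmetric), shape)
print(skewness(right_tailed) > 0)
```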