A Statistical Analysis of Credit and Debit Card Usage Patterns

This analysis examines the relationship between encountering credit and debit card fraud and factors such as age group, gender, area of residence, monthly income, credit card limit and several others. It also involves fitting a prediction model to the credit card usage patterns of individuals. Finally, it studies a time series model for credit and debit card usage over the last decade and develops a forecasting model that predicts future usage from the previously available data.


INTRODUCTION
The world is turning cashless, and debit and credit cards are two of the most commonly used payment cards in the world. Both have a series of numbers embossed or printed on the front along with the cardholder's name. Each has a magnetic stripe on the back, a special security code, and an embedded microchip on the front that encrypts key personal and financial information related to the cardholder and the associated account. A credit card gives access to a line of credit issued by a bank and thus offers the flexibility to make purchases and pay for them later, which is the biggest reason people turn to credit cards. With a debit card, by contrast, the money is debited from one's account at the moment of purchase. With credit and debit cards, people no longer have to worry about carrying cash everywhere, which would otherwise limit their transaction amount.
With the introduction of UPI, cashless transactions have become even more flexible. All transactions can be made from a smartphone, at one's fingertips. Even the smallest transactions are made through UPI, from paying an auto driver to buying groceries or shopping online. But with the increasing number of online transactions there is also an increased risk of encountering fraud. Fraud in card transactions is the unauthorized and unwanted usage of an account by someone other than the owner of that account.
The first universal credit card, which could be used at a variety of establishments, was introduced by the Diners' Club, Inc., in 1950. Another major card of this type, known as a travel and entertainment card, was established by the American Express Company in 1958, whereas the first debit card was introduced in 1982 in Canada by Saskatchewan Credit Unions. Card transactions were not very popular back then, but with digitalization and the world on the verge of turning cashless, the use of plastic money (credit and debit cards) has increased considerably. Through this paper I would like to shed light on whether there is any relation between the age groups using credit and debit cards and their chances of encountering fraud, and whether fraud is related to a specific age group, a specific area of residence or other factors.

DATA COLLECTION
This project is executed using two datasets. The primary data was collected through a Google Form. It covers aspects such as age, educational level, job profile, use of credit and debit cards, area of residence, monthly income, frauds encountered and their types, etc. Importantly, the volunteers were not asked to disclose personal information such as phone number, email ID, bank account number, CVV or credit/debit card number; even the names of the volunteers were not recorded. This confidentiality reassured the volunteers that their data would not be misused.
Out of the total 514 observations, 466 were legitimate transactions while 47 were fraud transactions.
In the logistic regression model,
$$\pi(x) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_9 X_9}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_9 X_9}}$$
and $\varepsilon \sim B(\pi(x))$, where $\beta_0, \beta_1, \dots, \beta_9$ are the regression coefficients.
We perform logistic regression on the Debit Card users data-set to obtain a prediction model for FRAUD with the following regressors and response variable. Logistic regression belongs to a family named Generalized Linear Models (GLM), developed to extend the linear regression model to other situations. Other names for it are binary logistic regression, binomial logistic regression and logit model. Logistic regression does not directly return the class of an observation. Instead, it allows us to estimate the probability (p) of class membership, which ranges between 0 and 1. One needs to decide the threshold probability at which the category flips from one to the other. By default this is set to p = 0.5, but in practice it should be chosen based on the purpose of the analysis.
Also, an important fact to note is that in the 4th graph there is a sharp decrease in transactions, which can be attributed to the nationwide lockdown imposed due to the COVID-19 outbreak. In the same graph one can also observe a sharp, irregular increase in the number of transactions around Nov 2016, which can be attributed to the demonetization exercise carried out by the Govt. of India.
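Returning to the logistic model, the probability estimate π(x) and the threshold rule can be sketched as follows. This is a minimal illustration and not the model fitted in this study: the coefficient values, the two regressors, and the function names `predict_probability` and `classify` are hypothetical, chosen only to show the mechanics.

```python
import math

def predict_probability(coefficients, features):
    """Return pi(x) = e^z / (1 + e^z), where z = b0 + b1*x1 + ... + bk*xk."""
    intercept, *slopes = coefficients
    z = intercept + sum(b * x for b, x in zip(slopes, features))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)

def classify(probability, threshold=0.5):
    """Label an observation as fraud (1) when its probability reaches the threshold."""
    return 1 if probability >= threshold else 0

# Hypothetical coefficients (intercept, then one slope per regressor);
# these are NOT estimates from the study.
example_coefficients = [-2.0, 0.8, 1.5]
example_features = [1.0, 0.5]

p = predict_probability(example_coefficients, example_features)
print(round(p, 4), classify(p), classify(p, threshold=0.3))
```

With these made-up numbers the same observation is labelled non-fraud at the default threshold of 0.5 but fraud at a threshold of 0.3, which is why the threshold should follow from the purpose of the analysis.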
We now fit a 3-period moving average to the series. However, this method is of limited use when it comes to forecasting the time series.
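As a brief illustration of a 3-period moving average, the sketch below averages each run of three consecutive values. It assumes a trailing (not centered) window, which the source does not specify, and the function name `moving_average` and the sample counts are hypothetical.

```python
def moving_average(values, window=3):
    """Return the mean of every run of `window` consecutive values.

    The result has len(values) - window + 1 entries, because the first
    window - 1 observations do not have enough history to average.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        averages.append(sum(chunk) / window)
    return averages

# Hypothetical monthly transaction counts, for illustration only.
monthly_counts = [10, 12, 14, 13, 17]
print(moving_average(monthly_counts))
```

Because every average only summarises values that have already been observed, the smoothed series lags behind turning points, which is one reason the method is weak for prediction.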

Monthly moving average for Credit card transactions
We can use the Holt-Winters Triple exponential smoothing model to predict the data.

HOLT-WINTERS TRIPLE EXPONENTIAL SMOOTHING-
Here is a brief recall of exponential smoothing. Triple exponential smoothing is used to handle time series data containing a seasonal component. The method is based on three smoothing equations: stationary component, trend, and seasonal. Both the seasonal and trend components can be additive or multiplicative. The three aspects of the time series behavior (value, trend, and seasonality) are expressed as three types of exponential smoothing, which is why Holt-Winters is called triple exponential smoothing. The model predicts a current or future value by computing the combined effects of these three influences. The model requires several parameters: one for each smoothing (α, β, γ), the length of a season, and the number of periods in a season.
In this plot we can observe that the observed values lie within the 90% confidence bands of our predictions for both Credit and Debit Card transactions, which suggests that our predictions are accurate. This is further supported by the value of RMSE for both series.
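For reference, the three smoothing equations can be written out for the additive form of Holt-Winters. The source does not state whether the additive or the multiplicative form was fitted, so the additive form and the notation below are assumptions: $y_t$ is the observed value, $\ell_t$ the level, $b_t$ the trend, $s_t$ the seasonal component, $m$ the season length (12 for monthly data) and $h$ the forecast horizon.

```latex
\begin{align*}
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) && \text{(level)}\\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} && \text{(trend)}\\
s_t &= \gamma\,(y_t - \ell_t) + (1-\gamma)\,s_{t-m} && \text{(seasonal)}\\
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t-m+1+((h-1) \bmod m)} && \text{(forecast)}
\end{align*}
```

Under this form, β = 0 means the trend slope is never updated from its starting value, and γ = 1 means the seasonal component is replaced entirely by the latest deviation $y_t - \ell_t$ at every step.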

Stationarity and ACF, PACF plots
We will now check the stationarity of both time series. Before that, let us recall the stationarity of a time series: in the most intuitive sense, stationarity means that the statistical properties of the process generating a time series do not change over time, i.e. the time series shows a constant mean and variance. We perform the KPSS test for stationarity on both datasets.
H0: The time series is stationary.
H1: The time series is non-stationary.
We can observe that both time series are non-stationary. We now proceed to plot the partial autocorrelation function (PACF) and the autocorrelation function (ACF), which will help us identify whether the time series is white noise.
White Noise- A time series is white noise if the variables are independent and identically distributed with a mean of zero.
Interpretation- We can conclude from the plots that we will get an AR(II) component in the model; however, there is a chance that the model will be a mixed model. We can also say that the time series is stationary.
[Figure: plots for Series CCtot_ts and Series DCtot_ts; horizontal axis: Lag]
From both graphs we can observe that most of the points lie inside the autocorrelation band. We now verify this by performing the Ljung-Box test.
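The Ljung-Box statistic itself is simple to compute from the sample autocorrelations: Q = n(n+2) Σ r_k² / (n − k), summed over lags k = 1, ..., h. The sketch below is a plain-Python illustration under that standard definition; the function names and the sample residuals are hypothetical, and it stops at Q without computing a p-value.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator

def ljung_box_q(series, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum over k of r_k^2 / (n - k)."""
    n = len(series)
    return n * (n + 2) * sum(
        autocorrelation(series, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )

# Hypothetical residuals that alternate in sign, so they are clearly not white noise.
residuals = [1.0, -1.0] * 10
print(ljung_box_q(residuals, max_lag=1))
```

Under the null hypothesis of no autocorrelation, Q is compared with a chi-square distribution whose degrees of freedom equal the number of lags tested (reduced by the number of fitted parameters when the test is applied to model residuals). For this artificial series Q is about 20.9, far above the 5% critical value of 3.841 for one degree of freedom, so the null would be rejected.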

ARIMA Parameters-
Each component in ARIMA functions as a parameter with a standard notation. For ARIMA models, the standard notation is ARIMA(p, d, q), where integer values substitute for the parameters to indicate the type of ARIMA model used. The parameters can be defined as:
p: the number of lag observations in the model; also known as the lag order.
d: the number of times that the raw observations are differenced; also known as the degree of differencing.
q: the size of the moving average window; also known as the order of the moving average.
Now, from the graph we can say that our model is a good fit, because all of our predictions lie within the 95% confidence bands.
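To make the degree of differencing d concrete, the sketch below applies first differencing d times to a short series. The function name `difference` and the sample series are hypothetical and serve only as an illustration.

```python
def difference(series, d=1):
    """Apply first differencing d times: y'_t = y_t - y_{t-1}."""
    result = list(series)
    for _ in range(d):
        result = [current - previous for previous, current in zip(result, result[1:])]
    return result

# Hypothetical series with a quadratic trend, for illustration only.
series = [1, 4, 9, 16, 25]
print(difference(series, d=1))  # [3, 5, 7, 9]
print(difference(series, d=2))  # [2, 2, 2]
```

Each round of differencing shortens the series by one observation. In this example one difference still leaves a linear trend, while two differences give a constant series, which corresponds to d = 2.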

Just like in
The accuracy of the forecast can be measured using the RMSE value. The assumption which we made from the AR(II) plot turns out to be right.
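The RMSE is the square root of the mean squared difference between the observed and forecast values. A minimal sketch follows; the function name `rmse` and the sample numbers are hypothetical and are not results from this study.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out values and forecasts, for illustration only.
actual = [100.0, 110.0, 120.0]
forecast = [98.0, 113.0, 120.0]
print(rmse(actual, forecast))
```

RMSE is expressed in the same units as the series (here, number of transactions), so it is best judged relative to the typical size of the observations.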

OBJECTIVES
1. To analyze the credit card usage pattern of individuals.
2. To fit a prediction model using logistic regression on the basis of the usage pattern for the detection of fraud.
3. To fit a time series on credit and debit card transactions per month and to identify its seasonal, trend and irregular components.
4. To develop a forecasting model using the ARIMA technique to predict a forecast for a given year using the data of previous years.
Email: editor@ijfmr.com, IJFMR23069522, Volume 5, Issue 6, November-December 2023
Here, we can observe that close to 80% of the people in our data-set use Debit cards, while only 7% use a Credit card. The percentage of people using both Debit and Credit cards is 13%.
Here, we can observe that in both the Credit and Debit card categories, people with at most an undergraduate degree use Debit/Credit cards the most. Most importantly, the penetration of Credit/Debit cards is lowest among people with at most a High School qualification.

Logistic regression is used to predict the class (or category) of individuals based on one or multiple predictor variables (x). It is used to model a binary outcome, that is, a variable which can take only two possible values: 0 or 1, yes or no, diseased or non-diseased. Here we use the Binary Logistic Regression Model, which is used when the response is binary (i.e., it has two possible outcomes). The cracking example given above would utilize binary logistic regression. Other examples of binary responses include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure. The model for logistic regression is expressed in terms of the probability π(x) and an error term ε.
[Figure: decomposition of the time series; panels: observed, trend, seasonal, random]
From the above graph it is clearly visible that the number of transactions has an increasing secular trend. One can also observe a seasonal pattern, which may be attributed to festivals like Diwali (when a high spending pattern is observed). A reduction in the number of transactions is observed in the months of February, which can be attributed to the financial year ending, when all the banks are closing their books and the failure rate of transactions grows.

EXPLORATORY DATA ANALYSIS FOR THE FORM DATA
1. Fraud among Job profiles
A total of 514 observations were recorded; of these, close to 80% were debit card users, 7% were credit card users and 13% used both credit and debit cards. The secondary data has

Interpretation- We can say that students are the category facing the maximum number of frauds.
2. Pie chart for Area of residence for Credit card users
Conclusion- We can observe that 31.4% of the Credit card users in our data-set reside in rural areas while 68.6% reside in urban areas.
5. Pie chart for Area of residence of Debit card users
Conclusion- We can observe that 24.4% of the Debit card users in our data-set reside in rural areas while 75.6% reside in urban areas.
6. Pie Chart for types of frauds
Conclusion- We can observe that the majority fraud types are Phishing and Hacking, which each account for 30.4% of the total frauds; the next most prevalent fraud type is Stealing/dumpster diving.

9. Histogram for the age of Credit Card users
Conclusion- Here, we can observe that for Credit card users the modal age group is 20-30. However, it is important to note that this is a sample of only about 500 observations, so there may be some deviation from the age distribution of the population under study.

ANALYSIS
Decomposing the Time Series
Time series arise as recordings of processes which vary over time. A time series plot displays the values of the process output in the order in which the values occur. A recording can either be a continuous trace or a set of discrete observations. We will concentrate on the case where observations are made at discrete, equally spaced times. By appropriate choice of origin and scale we can take the observation times to be 1, 2, ..., T and we can denote the observations by Y1, Y2, ..., YT. A key step in analyzing a time series is to understand the form of any underlying pattern in the data ordered over time. The pattern potentially consists of several different components, all of which combine to yield the observed values of the time series. There are 4 components of a time series: Trend, Seasonality, Cyclical Component and Random Component.
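The way these four components combine is usually written in one of two standard forms. The source does not state which form was used, so both are shown; the symbols $T_t$, $S_t$, $C_t$ and $R_t$ for the trend, seasonal, cyclical and random components at time $t$ are assumed notation.

```latex
\begin{align*}
\text{Additive:} \quad & Y_t = T_t + S_t + C_t + R_t \\
\text{Multiplicative:} \quad & Y_t = T_t \times S_t \times C_t \times R_t
\end{align*}
```

The additive form suits a series whose seasonal swings stay roughly constant in size, while the multiplicative form suits a series whose swings grow with the level. In practice the cyclical component is often not estimated separately and is absorbed into the trend.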
This data is now segregated into an 80% training and a 20% test data-set. A logistic regression model is fitted on the training data-set and then used to predict the values of the test data-set. We then construct the confusion matrix. From the confusion matrix it is clear that the accuracy of our model is 90.9%. Further, we find that the following regressors are significant: Debit card limit, Debit card expense, Debit card usage frequency, Job and Gender. We now proceed to obtain the confusion matrix and the accuracy of our predictions.

TIME SERIES
Trend: Long-term, gradual increasing, decreasing or stagnating tendency of the variable Y.
Seasonality: Regular, relatively short-term (yearly) repetitive up and down fluctuations of the variable Y depending on the season.
Cyclical Component: Gradual, long-term, potentially irregular up and down swings of the variable Y.
Random Component: A random increase or decrease of the variable Y for a specific time period.
The data we have consists of monthly Debit and Credit Card transactions from April 2011 to Feb 2022. First we decompose the time series for Debit cards and Credit cards.
To perform Holt-Winters triple exponential smoothing, we divide the data into two parts, the training set and the test set. The training set runs from April 2011 to July 2021, and the test set runs from August 2021 onward. Here we can observe that in the case of Credit card transactions, α=0.9244575, β=0 and γ=1, and in the case of Debit card transactions, α=0.8947077, β=0 and γ=0.3572679. Now, we will check the accuracy of both of our models.