Vulnerabilities and Ethical Implications in Machine Learning

Recent years have witnessed a convergence of large datasets, cost-effective parallel computation, and progress in statistical learning techniques, particularly deep learning. This convergence has significantly accelerated the integration of machine learning (ML) into everyday applications. ML models have proven useful across diverse contexts, from visual recognition tasks to personalized recommendation systems and the analysis of human language. Despite their widespread use, the exact nature of the more complex models and the details of their decision-making processes elude much of the technical community. Such systems contain poorly understood vulnerabilities that need to be better characterized and guarded against, especially in critical applications like autonomous vehicle navigation. Recent research has exposed some of these threats against ML systems, known as "adversarial attacks," and has attempted to describe mechanisms for both attack and defense. In this paper, we survey ongoing research, present concrete examples of adversarial attacks, compare several approaches for crafting adversarial examples, and finally examine the ethical ramifications of these vulnerabilities in ML systems. We conclude that certain defensive measures, namely adversarial training, should be employed when creating production-ready ML models.


Introduction
Recent progress in ML and deep learning has led to the development of highly effective models used in image classification, machine translation, game playing, and many other practical problem domains (10; 2; 15). Though these models demonstrate considerable performance in classification tasks, they are susceptible to adversarial inputs, which confound models and lead to inaccurate predictions. Neural networks and similar classes of ML models seem to exhibit particular vulnerability to these adversarial examples, which, concerningly, can be constructed with perturbations subtle enough to be completely imperceptible to humans.
Though adversarial examples can take on many forms depending on the classification system, for the purposes of this paper we focus on image classification systems and the generation of adversarial images. In framing the adversarial example problem, we draw on the formulation of (11). Consider an ML model M that takes an input X and generates a correct class prediction $y_{true}$: $M(X) = y_{true}$. It is possible to generate an adversarial input A that is nearly indistinguishable from X but yields an incorrect class prediction: $M(A) \neq y_{true}$. Though A may be generated by adding only a small amount of noise to X, the model may be highly confident in its incorrect class prediction.
As ML models have become more ubiquitous, understanding their vulnerabilities and seeking to mitigate them is increasingly necessary. In certain critical problem domains such as health care or autonomous vehicle navigation, it is entirely conceivable that such models might make ethically challenging decisions that directly affect human lives. These domains highlight the importance of developing a deeper understanding of how models generate decisions and what their shortcomings may be. With this in mind, exploration of both adversarial attacks and proposed defenses is unavoidable if we are to employ ML models confidently in ethically fraught domains.
For a comprehensive empirical and qualitative exploration of the characteristics of adversarial examples, we generate adversarial images against the well-established ImageNet Inception v3 model (see Figure 1), a cutting-edge convolutional neural network architecture developed by Google (20). Using a pre-trained model facilitates faster experimentation and provides us with a classification system that outperforms any model we could train ourselves on a reasonable timeline. We experiment with limiting perturbation magnitude and learning rate in multiple kinds of attacks against Inception v3. While our experiments mainly focus on settings where the attacker has full knowledge of a system's parameters when generating adversarial examples, earlier work has demonstrated the transfer of adversarial examples across models and techniques that can be employed against black box systems (16; 17
The property of transferability between models was then examined by Papernot et al. (16), which opened the door to the study of adversarial attacks against black box systems in which only outputs, not parameters, were available. Kurakin et al. (11) illustrated attacks in fully black box settings where models were hosted by third parties and began the transition of research into more real world domains. Papernot (17) demonstrated the robustness of adversarial examples in real world applications by feeding physical examples through a cell phone camera before classification. This work also introduced an effective iterative approach to generating both targeted and non-targeted adversarial examples. Evtimov et al. (7) established a new general attack algorithm called "Robust Physical Perturbations," which they used to modify street signs so that they fooled a classifier from multiple angles and distances. This represents the current state of the art in adversarial attack vectors, and its robustness certainly raises concerns about models employed in the field today and going forward.
On the defensive side, the aforementioned work from Papernot et al. (17) explored the idea of using adversarial examples during training to improve resilience against these types of attacks. They also demonstrated the added effects of adversarial training in network regularization. Gradient masking techniques and defensive distillation, which focus on hiding information from attackers through deployment considerations, were formalized in (19). These techniques make attacks more difficult, but models will continue to be vulnerable to the same types of attacks when executed with greater computing power. Finally, further work (18) established an overview of the defensive landscape, discussed the trade-offs between model accuracy and resilience, and began to situate these types of attacks within the AI safety discourse. That said, there is still a dearth of research in the spheres of defense and ethics in the adversarial attack space.

Adversarial Attacks
When reviewing adversarial attacks in ML, there are multiple attack environments and vectors that must be considered. As mentioned earlier, this paper focuses on attacks in the domain of computer vision. This section surveys the environments and vectors of adversarial attacks. We then introduce mathematical formulations of several adversarial methods. Finally, we assess each method's efficacy and present examples of the attacks they produce.

Attack Environments
There are two primary situations in regard to model information availability: "full knowledge," where a system's architecture and parameters are accessible, and "black box," where only network outputs are accessible. The full knowledge environment is by far the most dangerous, but it is also the least likely attack vector in deployment situations with effective security practices. In this scenario, an attacker knows the architecture of the underlying system and has access to the model's parameters. This parameter access enables computation of the error gradient, which can be used to construct adversarial examples directly. These examples prey upon the weakest regions of the decision manifold and can be constructed with the least amount of noise and visual perturbation. These methods are explored in the following subsection.
The more likely environment in which attacks are to occur is the black box setting. In this situation, the model and its parameters are hidden, but its outputs are available. For example, an attacker might be able to upload an image to a computer vision system that returns information about the model's class predictions and corresponding likelihoods. Though no error gradient is available, iterative probing of the network using adversarial examples can lend directional insight into the underlying decision boundaries. It has been demonstrated (16) that the property of transferability also allows an attacker to train a "surrogate" model similar to the target model, which can be used to estimate the target gradient and speed up adversarial attack generation.
Aside from considerations regarding information availability, there are also two primary types of input environments: software, where information is passed directly to a system (e.g., a picture is uploaded to a publicly available API), and physical, where a system processes information from the real world (e.g., a stop sign has been modified with adversarial stickers to mislead an autonomous vehicle). The primary distinction between these cases is environmental sterility. In the real world scenario, the attack must be physically manufactured and placed in the vicinity of the system. In the context of computer vision, such an attack must be robust across multiple viewing angles and distances. No such difficulties apply to the software scenario. That said, real world attacks are certainly feasible and improving in effectiveness (17; 7).
Lastly, there are two primary categories of attacks that can be mounted against multinomial classifiers: non-targeted and targeted. In non-targeted attacks, the attacker aims to reduce the likelihood that the model outputs the correct class, without a specific alternative in mind. An example of such an attack can be seen in Figure 3b. The sole intention of this kind of attack is to lower the probability of the current class, which incidentally raises the likelihood of other classes. In models with many potential output classes, such as Inception v3 with its 1000 possible output predictions, non-targeted attacks tend to be less interesting (20). Due to the similar nature of certain classes, a non-targeted attack may lead to an image of a dog being misclassified as another, similar breed of dog (11). This phenomenon led to the development of the more focused, and perhaps more nefarious, targeted attack. In this setting, a specific alternate class is selected, and the attack optimizes the prediction probability of that class. Examples of this attack can be seen in Figures 3c, 3d, and 3e.

Creating Adversarial Instances
Note that the techniques detailed here provide no guarantee that a generated image will be misclassified by a targeted ML system. Nonetheless, these images are denoted "adversarial." This paper employs the following notation, largely informed by (11):
- X: an input image, represented as a tensor along the dimensions of width, height, and depth.
- C(X, y): the neural network's cost function for an input X and class prediction y. If a network outputs a softmax distribution across classes and uses a cross-entropy cost function, the cost will be equal to the negative log-likelihood of the correct class: $C(X, y) = -\log p(y \mid X)$.
- $\mathrm{Clamp}_{X,\epsilon}\{X'\}$: a function that clamps the pixel values of the adversarial example X′ so that they stay within ϵ of the original image X, where ϵ is a modifiable hyperparameter. This limits the added noise and ensures the adversarial example is nearly visually identical to the input. The function is defined as follows:
$\mathrm{Clamp}_{X,\epsilon}\{X'\}(x, y, z) = \min\big(X(x, y, z) + \epsilon,\ \max\big(0,\ X(x, y, z) - \epsilon,\ X'(x, y, z)\big)\big)$
where X(x, y, z) refers to the z channel's value at pixel location (x, y).

IJFMR23056516
Volume 5, Issue 5, September-October 2023
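As a minimal sketch of the clamp operation defined above, the following treats an image as a flat list of pixel values. The function name `clamp`, the flat-list representation, and the [0, 1] pixel scale in the example are illustrative assumptions, not details taken from our experimental code. Like the definition above, the sketch enforces only the lower bound of 0 and the ϵ-band around the original pixel.

```python
def clamp(x_adv, x, eps):
    """Element-wise Clamp_{X,eps}{X'}: min(X + eps, max(0, X - eps, X')).

    x_adv: adversarial pixel values X' (flat list, hypothetical layout)
    x:     original pixel values X (same length)
    eps:   maximum allowed per-pixel deviation from the original
    """
    return [min(xi + eps, max(0.0, xi - eps, xa)) for xa, xi in zip(x_adv, x)]


# Illustrative values only: three pixels on an assumed [0, 1] scale.
original = [0.50, 0.02, 0.90]
perturbed = [0.80, -0.30, 0.88]
print(clamp(perturbed, original, eps=0.05))
```

The first pixel is pulled back to the upper edge of the ϵ-band, the second is held at the floor of 0, and the third is already within the band and passes through unchanged.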
Fast Gradient Sign Method
This method (8) requires only a single backpropagation call to retrieve an error signal and assumes a relatively linear cost function. It is less precise and generates successful adversarial examples with less subtlety than the methods that follow, but it can be calculated rapidly.
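To make the single-step idea concrete, the sketch below applies the commonly stated fast gradient sign update, X_adv = X + ϵ · sign(∇_X C(X, y_true)), to a toy linear softmax classifier. The update rule is the standard form from the literature rather than an equation reproduced from this section, and the two-class model, the weights `W`, and all helper names are hypothetical stand-ins for a real network such as Inception v3.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost_and_grad(W, x, y):
    """Return C(X, y) = -log p(y|X) and its gradient with respect to x.

    The model is a toy linear softmax classifier (logits = W x), an assumption
    made so the gradient can be written analytically.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    # dC/dlogit_k = p_k - 1[k == y]; chain rule through logits = W x
    grad = [
        sum((p[k] - (1.0 if k == y else 0.0)) * W[k][j] for k in range(len(W)))
        for j in range(len(x))
    ]
    return -math.log(p[y]), grad


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(W, x, y_true, eps):
    """Single-step attack: move every input value by eps in the cost-increasing direction."""
    _, grad = cost_and_grad(W, x, y_true)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical two-class, two-pixel example.
W = [[1.0, -1.0], [-1.0, 1.0]]
x = [0.6, 0.4]
x_adv = fgsm(W, x, y_true=0, eps=0.2)
print(cost_and_grad(W, x, 0)[0], cost_and_grad(W, x_adv, 0)[0])
```

Because every pixel moves by the full ϵ in one shot, the cost of the true class rises quickly, which is consistent with the method being fast but less subtle.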
Iterative Non-targeted Method
This is an iterated extension of the fast method in which an adversarial example is repeatedly updated and clamped at each step. The gradient step applied at each iteration is scaled by a learning rate α:
$X_0^{adv} = X, \quad X_{N+1}^{adv} = \mathrm{Clamp}_{X,\epsilon}\{X_N^{adv} + \alpha\,\mathrm{sign}(\nabla_X C(X_N^{adv}, y_{true}))\}$
The number of iterations run during experimentation was chosen to balance fast generation times against interesting results.
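The iterative update can be sketched on a toy model as follows. This is an illustration under stated assumptions rather than our experimental code: the linear softmax classifier, its weights `W`, the helper names, and the choices α = 0.05 and 10 steps are all hypothetical and were picked only to suit the toy scale.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost_and_grad(W, x, y):
    """C(X, y) = -log p(y|X) and its gradient w.r.t. x for a toy linear softmax model."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    grad = [
        sum((p[k] - (1.0 if k == y else 0.0)) * W[k][j] for k in range(len(W)))
        for j in range(len(x))
    ]
    return -math.log(p[y]), grad


def sign(v):
    return (v > 0) - (v < 0)


def clamp(x_adv, x, eps):
    """Keep each adversarial value within eps of the original and at or above 0."""
    return [min(xi + eps, max(0.0, xi - eps, xa)) for xa, xi in zip(x_adv, x)]


def iterative_non_targeted(W, x, y_true, eps, alpha, steps):
    x_adv = list(x)  # X_0^adv = X
    for _ in range(steps):
        _, grad = cost_and_grad(W, x_adv, y_true)
        # Step in the direction that increases the cost of the true class ...
        stepped = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # ... then project back into the eps-band around the original input.
        x_adv = clamp(stepped, x, eps)
    return x_adv


W = [[1.0, -1.0], [-1.0, 1.0]]
x = [0.6, 0.4]
x_adv = iterative_non_targeted(W, x, y_true=0, eps=0.2, alpha=0.05, steps=10)
print(x_adv)
```

Clamping after every step is what keeps the accumulated perturbation bounded by ϵ no matter how many iterations are run.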
Iterative Targeted Method
This is a modified version of the iterated method and a slight modification of the iterative least-likely class method devised by (11). Instead of just increasing the error of the originally predicted class, this method seeks to decrease the error of a specific selected class label $y_{target}$. To generate an adversarial example, the method maximizes $\log p(y_{target} \mid X)$ by transforming the image in the direction of $\mathrm{sign}(\nabla_X \log p(y_{target} \mid X))$.
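Written in the same notation as the iterative non-targeted update, the targeted step can be expressed as below. This is a reconstruction under the assumption that the same learning rate α and the same clamp function are reused; it is not an equation reproduced from the cited work.

```latex
X_0^{adv} = X, \qquad
X_{N+1}^{adv} = \mathrm{Clamp}_{X,\epsilon}\left\{ X_N^{adv}
  - \alpha\,\mathrm{sign}\left(\nabla_X C(X_N^{adv}, y_{target})\right) \right\}
```

Because $C(X, y) = -\log p(y \mid X)$, stepping against the gradient of $C(X, y_{target})$ is the same as stepping along $\mathrm{sign}(\nabla_X \log p(y_{target} \mid X))$, so the only changes from the non-targeted update are the label and the sign of the step.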

Experimental Design
To compare the capabilities of each adversarial example generation method, we examine the performance of each method on a subset of ImageNet images across a range of ϵ values. Intuitively, an adversarial image generated with ϵ = 0 is identical to the original image. A higher ϵ indicates more leeway for modification of the original image. The learning rate α is held constant at 1 across all experiments. To examine the targeted method, we choose a random target class to optimize for. Although there is a chance that this random class will lie close to the true class, this is rather unlikely across 1000 possible outputs. Due to the computationally expensive nature of generating adversarial examples with the iterative methods, we had to resort to a relatively small sample size at each value of ϵ, quantitatively evaluating each method on a subset of only 20 random samples. We also perform a qualitative analysis of each method's performance and behavioral tendencies by examining generated adversarial images.

Experimental Results
Fig. 2: Top-1 and top-5 model accuracy when processing adversarial examples. The fast gradient sign, iterative non-targeted, and iterative targeted methods are compared for different values of ϵ.

To evaluate performance, we compare the model's predictions for adversarial images against its predictions for unmodified images. As mentioned above, there are no guarantees on whether an adversarial image generated with these methods will successfully fool the model. First, each method was evaluated on its ability to alter the model's top-1 accuracy. We define top-1 accuracy to be the rate at which the model's top prediction for each initial image matches its top prediction for the corresponding adversarial image. Next, we evaluated each method on its ability to alter the model's top-5 accuracy. We define top-5 accuracy to be the rate at which the model's top prediction for each initial image appears within its five top predictions for the corresponding adversarial image. Results for each adversarial method's effectiveness at different values of ϵ can be seen in Figure 2.
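The two metrics defined above can be sketched as a single agreement rate parameterized by k. The function names and the three-class probability vectors below are hypothetical; the toy example uses k = 2 in place of k = 5 only because it has three classes.

```python
def top_k(probs, k):
    """Indices of the k highest-probability classes, most likely first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def agreement_rate(clean_probs, adv_probs, k):
    """Fraction of samples whose clean top-1 class appears in the adversarial top-k.

    With k=1 this matches the top-1 accuracy defined above; with k=5 it matches
    the top-5 accuracy. Both compare against the model's own clean prediction,
    not against a ground-truth label.
    """
    hits = sum(
        top_k(clean, 1)[0] in top_k(adv, k)
        for clean, adv in zip(clean_probs, adv_probs)
    )
    return hits / len(clean_probs)


# Illustrative model outputs for two images before and after an attack.
clean = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
adv = [[0.3, 0.6, 0.1], [0.2, 0.7, 0.1]]
print(agreement_rate(clean, adv, k=1), agreement_rate(clean, adv, k=2))
```

A lower agreement rate means the attack was more effective at changing the model's behavior.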
The motivation for examining both top-1 and top-5 accuracy lies in the difference between the targeted and non-targeted methods. The targeted method seeks to increase the likelihood of a random alternative class. As discussed, it is likely that this class is far away from the initially predicted class. Because of this distance, the error gradient will pull the adversarial image away from all the top classes (which should be grouped together in the class space) faster. When the non-targeted method is run, the likelihood of the top class is reduced, and though this in turn will reduce the likelihood of classes in its neighborhood, there is less of a pull towards a completely new area in the latent class space. In summary, we might hypothesize that the targeted method would reduce the top-5 accuracy of the model more rapidly than the non-targeted method. In our limited results, this hypothesis was not confirmed, as the targeted and non-targeted attacks performed very similarly.
One interesting thing to note is the apparent effectiveness of the fast gradient sign method. This is slightly misleading, however, as this method operates in a less subtle fashion than the other two by effectively introducing ϵ-scaled noise in a single shot. For a given ϵ, the fast method is far more "destructive" as it modifies individual pixel data more aggressively. The iterative methods attempt to create an attack smoothed over the image space, thereby reducing the visual artifacts introduced by modification. Figure 3a visually demonstrates the destructive nature of the fast method. The swans in the image have been visibly manipulated through the introduction of foreign colors even at a relatively small ϵ. Figure 3b illustrates an iterated attack at the same ϵ that introduces far less visible noise.

Figures 3b and 3c showcase the non-targeted and targeted adversarial methods best. Any noise introduced is remarkably subtle, even though the model's class predictions for the attacking image are completely different. In the first image, the model changes its prediction from "convertible" to "crayfish"; in the second, from "Granny Smith" apples to "cello." A human looking closely might be able to spot defects within these images, but would never make classification mistakes as severe as the model's. Figures 3d and 3e are included to demonstrate the destruction introduced by the iterative algorithms at higher values of ϵ. Images generated with ϵ values of 1 or higher display significant artifacts. Interestingly, these adversarial images would still likely be correctly classified by a human even though the model may produce an incorrect prediction with near 100% confidence. This last point is an important one to note regarding adversarial attacks. When the model processes these adversarial images, its misclassifications are incredibly confident. In Figure 3d, for example, the model does not "see" strange looking apples; it very confidently "sees" a cello.
In summary, the fast gradient sign method is effective, but more destructive and less subtle than the iterative methods. Both iterative methods performed similarly quantitatively, though in models with a large number of output classes we expect the iterative targeted approach to be more successful in reducing accuracy, especially top-5 accuracy. While generating adversarial images with iterative methods is slower, a qualitative analysis indicates their superiority in fooling the model while still retaining visual information from the original image.

Defensive and Ethical Considerations
In this section, we will discuss current research on potential defensive mechanisms that could be employed to combat adversarial attacks, as well as the ethical considerations that must be weighed before deploying systems susceptible to such attacks.

Defensive Approaches
An adversary may attempt to gain access to the deployment of the model itself or perhaps provide malicious inputs if a model is trained online. We saw this latter case play out particularly poorly for Microsoft with the launch of its chatbot Tay (12). For the purposes of this paper, we will narrow the scope to focus only on defending against attackers employing the types of subtle adversarial techniques shown thus far. These defenses can be divided into two primary camps: reactive and proactive (14). An example of a reactive defense might be the preprocessing and/or sanitization of all inputs by another model. It is feasible to train a model to recognize adversarial inputs before they ever reach the primary classifier. This solution is far from ideal, however, as it requires the maintenance of two heavy-duty models in production instead of one. It also opens up the possibility of incorrectly flagging legitimate inputs, which could be more limiting to the original system than the possibility of adversarial inputs in the first place. There is a certain inelegance to this approach as well; though the problem of robust vision is far from simple, the fact that humans do not fall prey to these types of schemes encourages the technical community to believe that more complete proactive solutions exist.
Adversarial Training
A leading proactive strategy is adversarial training. In this technique, the model's gradients during training are used to create adversarial examples using the methods previously outlined. These generated adversarial examples are then incorporated into the training dataset, improving the model's resilience against images that carry imperceptible distortions (18). Though adversarial training does improve the resilience of a model against such attacks, along with the added benefit of increased regularization, the decision boundaries remain relatively fragile. An attacker with more computing power still seems able to locate weaknesses. One drawback is the increased complexity
Gradient Masking, Defensive Distillation, and Label Smoothing
Another set of approaches relies on minimizing the information available to an attacking algorithm. Though these techniques are technically different, they are similarly motivated and will thus be considered in tandem. In the process of distillation, a larger, more complicated model is compressed into a smaller form while sacrificing a very small amount of predictive accuracy. The intuition behind this approach is that the smaller model will learn a "softer" probability distribution and will encode less helpful information to an attacker in its output predictions than the larger model (19). Gradient masking similarly relies on information reduction by artificially limiting the gradient norm during training, thus dramatically decreasing the signal available to an attacker at inference time. A final approach in this category is label smoothing, where the output probabilities are held closer together (i.e., there is no single class with extremely high probability); this should theoretically limit information by making it more difficult to target a specific class (18). Though these defenses have been shown to increase the difficulty of generating adversarial examples, they are not insurmountable at this time. By the transferability property (16), an attacker can create a surrogate model given access to similar training data and basic knowledge of the target model's architecture. This surrogate model can then be used to obtain a gradient not dissimilar from that of the target, which can in turn be used to fine-tune adversarial examples (18). This surrogate model makes the distinction between full knowledge and black box scenarios far less relevant.
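As a rough illustration of the label smoothing idea, the sketch below builds softened training targets in which no class receives a probability of exactly 0 or 1. This is one common formulation, assumed here for illustration; the function name, the smoothing value of 0.1, and the four-class example are hypothetical and are not taken from the cited work.

```python
def smooth_labels(true_class, num_classes, smoothing):
    """Return a softened target distribution for one training example.

    A share `smoothing` of the probability mass is spread uniformly over all
    classes; the remaining 1 - smoothing stays on the true class.
    """
    off = smoothing / num_classes
    return [
        off + (1.0 - smoothing if k == true_class else 0.0)
        for k in range(num_classes)
    ]


# Hypothetical four-class example with the true class at index 1.
targets = smooth_labels(true_class=1, num_classes=4, smoothing=0.1)
print(targets)
```

A model trained against such targets is discouraged from producing extremely peaked output distributions, which is the sense in which the output probabilities are "held closer together."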

Ethical Considerations
We have technically described the nature of adversarial attacks and the limitations of available defenses, but we have not yet made it clear why we should care about such vulnerabilities. Though ML systems have reached mass application in some domains, we are only at the beginning of the adoption curve. Many traditional software systems will be replaced by more intelligent ML algorithms, and sometimes not without cost. Despite their increased flexibility and intelligence, systems built with neural networks and similar approaches are not only susceptible to adversarial attacks, but are also increasingly opaque in their decision-making. As our society continues to place a greater quantity of increasingly complex decisions in the hands of these systems, it is concerning that our understanding of their operation seems to be decreasing.
We can conceive of highly critical applications in which adversarial examples are particularly troubling. Banks that process checks programmatically might easily be defrauded by an adversarial forgery that appears legitimate to a human. Autonomous weapon systems that rely on computer vision, as described by Arkin (1) and others, might be confused by disguised weapons or tricked into firing on innocents who have been invisibly marked without their knowledge. An autonomous vehicle might be fooled into interpreting a stop sign as a 45 mph sign (7). This last example is specifically relevant given the seeming inevitability of autonomous vehicles overtaking our streets in the coming years. Further, it is concerning that this specific attack has already been demonstrated successfully in real world settings.
Even though these models exhibit clear defects in adversarial settings, it is difficult to make a conclusive recommendation on their deployment. Improving training practices is a good place for the technical community to start, though this is easier said than done. Work (13) that focuses on improving outcomes through the tighter integration of human oversight during training is a positive contribution to the advancement of AI safety; unfortunately, it is not applicable when the problems encountered during training are at times more subtle than humans can perceive. Adversarial training should be encouraged in the development of these models, and it brings along the positive side effect of increased regularization. Information masking techniques, despite their provable fragility, also increase the safety and resilience of ML systems deployed in the real world.
Fairness
Even beyond the scope of adversarial examples, it is important to think about the fairness of employing such opaque models. Recent work (4) evaluating the fairness of recidivism prediction systems demonstrated that bias-free predictive instruments can still result in disparate impact across populations when input data is not carefully curated. Beyond augmenting ML training data with adversarial examples, it is necessary for the technical community to become more careful about dataset construction, especially in ethically complex domains like recidivism prediction. Institutions and organizations could require the use of black box auditing tools like FairML to prevent the manifestation of obvious biases in production systems.
Accountability
As we employ these more advanced systems, we develop an increased expectation of their accountability. This accountability is often a justification for their necessity. Despite limitations, autonomous vehicles are easy to justify when they are expected to dramatically reduce the frequency of accidents (3). Our evaluation of the accountability of these systems, however, is framed by an assumption of generally good faith actors. The ethical calculus of autonomous vehicles and many other systems certainly changes in the face of malicious adversarial attacks. Can we permit these systems to be responsible for human lives when it is seemingly simple for a knowledgeable actor to influence them with attacks hidden in plain sight? At this juncture, given the limited scope of these systems in real world applications, there is little cause for concern. But in coming years, when ML systems dictate greater portions of our lives, slight perturbations may indeed be a worthy anxiety.

Trade-offs
The systems in development today are far from perfect, and in all likelihood they will continue to have flaws for the remainder of their existence. It is necessary, then, not to write them off entirely, but to evaluate their trade-offs. One of the essential trade-offs in models vulnerable to adversarial attacks is between representative capacity and interpretability. Studies (9) showed that multilayer feedforward neural networks are universal approximators; essentially, a model with enough parameters can represent any function mathematically. As model size and complexity increase, however, the variety and quantity of data required to prevent overfitting increase as well.
At this time, our models are limited in representative capacity by available computing power and data; this opens them up to the kinds of manipulations exploited by adversarial attacks. In order to combat these attacks in coming years, more complicated models will be trained on larger datasets. This practice, though, comes at the cost of interpretability. The decisions made by such models will become progressively more difficult to explain, which might prove problematic in fields like health care where the reasoning behind certain decisions should be comprehensible (5). There are no obvious solutions to these concerns. At first, it makes sense to require that models used in such critical applications be highly interpretable. Yet models limited by such a requirement will have less representative power and will in turn make less intelligent decisions. Would we be willing to accept potentially worse outcomes in exchange for a better understanding of how they came about?
In total, a model should be evaluated not just on its predictive accuracy, but also along the dimensions of robustness, fairness, accountability, and interpretability. There is no free lunch when it comes to ML models, and these properties are at times orthogonal to one another. That said, the threat of adversarial attacks, especially in ethically critical domains, should be taken seriously. Defensive techniques like adversarial training should be used to improve the resilience and safety of models in the field.

Conclusion and Future Work
In this paper we have provided an overview of the current state of adversarial attack research as it pertains to machine learning models. We have reviewed several adversarial image generation methods and identified the family of iterative methods as particularly promising given their more subtle approach. Aside from a technical evaluation of attacks, we have outlined the primary known defensive mechanisms alongside an analysis of their shortcomings. Finally, we have situated this work within the AI safety discussion and described ethical concerns surrounding the deployment of models susceptible to adversarial attacks. Despite deficiencies in defensive mechanisms, we encourage the technical community to take these methods seriously to increase the safety and accountability of models employed in real world settings. Future work should focus on developing adversarial attacks and defenses together while at the same time improving model interpretability. A more nuanced understanding of how opaque ML models form decision boundaries will be vital to their successful deployment in coming years.

A Appendix
Code
All code used to generate adversarial examples can be found at https://github.com/cgyulay/adversarial-attacks. Experiments were run using the PyTorch library.
Word Count
The final word count is 5051.
; 7). The property of transferability renders the conclusions of our specific research setting applicable to other black box models. After discussing the nature of these adversarial examples, we delve into possible defenses and ethical concerns surrounding adversarial examples in the context of AI safety. Section one describes related work and the current research environment. Section two formalizes the adversarial attack problem, classifies domains in which attacks may occur, and outlines methods used for generating adversarial examples. Section three presents experimental results surrounding generated examples. Section four introduces defense mechanisms and describes the implications of these adversarial examples as they relate to AI safety and security. Section five concludes the study and presents various directions for future work.

Fig. 1: Inception v3 model.

Related Work
The examination of adversarial attacks against AI classifiers began over a decade ago with naive Bayes models (6). It wasn't until the proliferation of neural networks and deep learning, however, that adversarial examples caught the attention of the technical community. As these deeper, more complex models began

(a) Fast gradient sign attack with ϵ = 0.05. The attack modifies the prediction, but yields a similar class.
(b) Iterative non-targeted attack with ϵ = 0.05. The attack successfully modifies the prediction.
(c) Iterative targeted attack with ϵ = 0.02. The attack was successful given the target class "cello."
(d) Iterative targeted attack with ϵ = 0.08. The attack was successful given the target class "cowboy boot."
(e) Iterative targeted attack with ϵ = 10.0. The attack was successful given the target class "wall clock."
Fig. 3: Examples of adversarial images generated with different methods and ϵ values. Above each image the model's most likely class prediction is shown.

of training; additional computing resources and time are necessary to realize this training schema. Nevertheless, the adversarial training approach is a first step in the direction of developing more durable models.