The Emerging Threats of AI in CyberSecurity

10 min readDec 13, 2024

Generative AI based Attacks

Generative AI (GenAi) is in vogue nowadays and impacts every domain. Some of its capabilities not only exceed existing models by a large extent, but the key differentiation is its ability to create almost human-like interactions and content generation. In the domain of Cybersecurity, GenAi presents new challenges and threats that every organization must be aware of. The malicious actors can leverage this technology to create more complex, novel, and sophisticated attacks with just a few prompts while making the attacks look more human-like (e.g. phishing) and difficult to detect. We discuss some of the challenges in cybersecurity that have come to the fore with the advent of Generative AI.

Prompt Injection Attacks

The prompt injection attacks may sound simple, easy to execute, and easy to detect but the the Open Worldwide Application Security Project (OWASP) has identified it as the top threat for the Large Language Models. Some of them may seem like harmless pranks — like the case where a customer convinced the chatbot of an SUV dealership to get a legally binding agreement to sell the SUV for $1 by coercing the chatbot into “the customer is always right” mode. More seriously, the right definition would be to manipulate the LLM input by appending malicious instructions along with the user input in such a way so as to change the behind-the-scene model’s behavior, which causes the model to execute unintended instructions instead of following user inputs..

Here is a very simple example that anyone can try — this is how you can manipulate OpenAi’s GPT-4o model which is the most advanced one to date

Think about it in a corporate scenario — if a company’s chatbot is not very secure and it has access to all its databases including PII information, the chatbots can be ‘tricked’ to provide any information. People have been using prompt injection attacks on different chatbots that have access to lots of documents while training to get the access keys, PII information, codebase, or just about anything. Even the initial versions of chatbots launched by Microsoft and Meta have fallen victim to prompt injection attacks which revealed information it was not supposed to reveal, or turning the future responses into abusive and racist.

One of the reasons why these types of attacks have increased in recent times is that you do not need knowledge of advanced SQL injection attacks, malware injection, or any specialized knowledge. Users have been getting under the skin of LLMs/Chatbots with just plain English!

Types of Prompt Injection Attacks

Regular Scenario

In a regular scenario, there will be some checks and filters that are usually applied to the prompts sent by users. The filters may block some of the prompts from user (“How do I make a bomb?”) because they have been trained to do that.

Regular Scenario — Some prompts are blocked

Direct Prompt Injection Attacks

Direct Prompt Injection attacks are cases where the attacked directly inserts a malicious prompt into a model (using a chatbot or any other way). These are some of the simplest attacks, but they can be made more advanced when these malicious instructions combine with system prompts to generate unintended consequences and results by the AI system.

Direct Prompt Injection Attack

Jailbreaking an LLM is the task of getting around all the guardrails around the model which are intended to prevent these types of prompts. The famous “Do Anything Now (DAN)” prompts allow OpenAi’s GPT models to ignore all guardrails and policies to prevent the model from making a malicious comment. Jailbreaking a model has somewhat become like a sport where hackers and malicious actors discuss and share ways of jailbreaking various LLMs on the dark web.

This is emerging as a major threat as most of the LLMs are black box models, and implemented as-is by companies. If the LLM has access to any data or is integrated with the larger system and environments, a malicious actor can use direct prompt injection with jailbreaks not only to execute malicious commands but also to extract sensitive data and write to databases, or enable a financial transaction, or delete sensitive data.

Indirect Prompt Injection Attacks

Indirect prompt injection attacks are more sophisticated than direct attacks. These occur when a malicious actor ‘poisons’ or manipulates the data sources (documents & training data for RAG, websites & sources being used by LLM for training, or any data used by the model where the attacker can infect the external sources of data. The model itself or the prompts are not changed, and they remain as is, but the context is completely changed which forces the LLM to provide unintended results.

A simple example to explain this injection is an LLM model which has been designed to test a website’s security and the code that takes in the HTML of the website and does analysis on it. A plain text in white font can be inserted somewhere in the HTML which will not appear on the website, but which instructs the LLM to provide a favorable rating irrespective of the site’s security or codebase.

Here is an illustration of an indirect prompt injection attack:

Indirect Prompt Injection Attack

The implications of these prompt injection attacks are huge — not only it can leak out sensitive data, but the LLM can be tricked into writing malware to bypass its security or provide a jailbreak prompt. With more and more companies deploying off-the-shelf LLMs into their everyday tasks, there is a dire need to detect and prevent the Prompt injection attacks.

Other types of Prompt Injection Attacks

There are many more ways of direct and indirect prompt injections — some may look quite simple but are very effective, while some are more sophisticated and require knowledge about underlying models

  • Multiple payloads — This is when the attacker injects multiple prompts which may seem harmless individually but when combined together or with system prompts, they become dangerous
  • Adversarial payloads — This requires detailed knowledge of how models are computing weights but these are highly transferable between models. The payloads couple be added as prefixes or suffix to prompts and may look complete gibberish but when they are injected into the model, it may cause the model to provide unintended results

How to prevent prompt injection attacks

  • Scrub the training data for all sensitive information — great care should be taken to ensure that the LLM gets only the data it needs
  • Isolate the model from other environments and limit its capability to do only the small set of tasks it is intended for
  • For any actions or instructions coming in from the model, deploy a Human-in-the-loop (HITL) to ensure the actions are vetted before they are executed
  • Threat detection and response for models — continuously vet the model’s ability to write malware, and any spike in access to unintended destinations or accessing data it is not supposed to access (in case it has been jailbroken)
  • Sanitize input and output with another model to ensure the prompts going in and output going out do not contain any sensitive or unintended information
  • Monitor, log, and evaluate model instructions, prompts, and actions to check for any deviations
  • Never expose the model to system-level interactions or databases

Adversarial Attacks

Quis custodiet ipsos custodes? (Who guards the guardians?)” — latin proverb

The Latin proverb above (in the context of cybersecurity means that if you have a security system with advanced AI capabilities which is designed to protect your organization, who is guarding these AI capabilities? What if the attacker hacks into these AI algorithms? How do you protect yourself from these attacks?

These types of attacks are highlighted in the 2024 annual threat assessment of the US intelligence community which highlights the importance of securing your models from adversarial attacks.

What is an Adversarial Attack

The word ‘adversarial ’ comes from a class of ML algorithms called Generative Adversarial Networks (GANs). GANs are the precursor of existing Generative AI and GPT models in some form. GANs consist of two networks — one a Generator and the other a Discriminator. In the case of images, the Generator generates an image, and the Discriminator compares it with the actual image and sends the feedback to the Generator, which now generates a better image that is closer to the actual image. The loop continues till the Generator generates an image that the Discriminator cannot differentiate from the real image, thus passing an artificially generated image as an actual image. This works for other domains too — speech, text, videos, music, etc.

Example of an Adversarial attack

Adversarial attacks can happen on any AI or ML model — image classification, facial recognition, biometrics, speech recognition, threat prediction models, malware detection, password attack detection, anomaly detection algorithms, and just about any other model.

An example of this adversarial attack is the Panda-Gibbon classification[1]. The image of a Panda is fed to the network, which correctly classifies the image as that of a panda. Now, a carefully constructed noise (the noise has to be constructed specifically for the algorithm that is being used ) is added to the image. The new image looks just like the old one to the human eye, but to an algorithm, the underlying pixel values have changed. The weights, which now are applied to the new pixels generate a different matrix, which forces the algorithm to classify the image as a ‘Gibbon’ instead of a ‘Panda’

Another example is presented by Simen Thys et.al shows how to fool the surveillance cameras [3]. The researcher used an AI algorithm that classifies a person correctly (left part of the image below). Now, a patch of the image is put on the body of the person, and the algorithm fails to classify the image as that of a person (the right part of the image below). Why? Because pixel values in a certain part of the image have been manipulated in such a way that it affects the activations in the layers of the deep neural network, forcing it to misclassify the input image and failing to detect the person in the image.

Implications of Adversarial Attacks in Cybersecurity

So what if the model fails to detect an image? Imagine an attacker using this technology to create an ‘adversarial patch’ that can bypass your biometric systems (think FaceId on your phone or fingerprint on your laptop) and bypass these systems to gain access to your sensitive locations or do social engineering-based attacks. There could be a model deployed in a bank that profiles users’ behavior to assign credit scores to them, and the ‘adversarial patch’ bypasses it to provide the highest rating to itself. A threat actor who understands how your model detects brute-force password attacks can use this technology to ensure your AI-enabled threat detection framework does not catch these attacks.

In short, any machine learning model that is more like a black box (Neural Networks — CNN, LSTM, RNN, or more advanced LLMs) can be tricked with these adversarial data to either bypass them or give an entirely different output. Even the classical ML models can be fooled to some extent but these attacks are easier to detect there because of their interpretability and not being a black box. Also, since a lot of these ‘black box’ models are used off the shelf, they can be explored by the threat actors and if they find a way to bypass one model, they can bypass the same models deployed elsewhere.

How to prevent adversarial attacks

  • Secure your models: if you have built your machine learning and AI models, make sure the modeling code, network architecture, model weights, and training data are secure and never revealed
  • Model ensembling: If you have to use a black-box model, make sure you use an ensemble of these models which generate independent predictions and the final prediction is an aggregation of these models. In case one of the models is compromised by the adversarial attack, other models will still work to provide the correct prediction or atleast to inform that one of the models is not behaving as intended
  • Adversarial learning: This is essentially thinking like an attacker to identify the potential vulnerabilities in the models figuring out how any potential gap in the model can be exploited, and then plugging the gap
  • Data sanitization: This is to ensure no data poisoning has happened in your training data that may cause the model to predict something that it is not intended to.

References:

[1]Xiaoyong Yuan, Pan He, Qile Zhu, Xiaolin Li; Adversarial Examples: Attacks and Defenses for Deep Learning https://arxiv.org/pdf/1712.07107.pdf

[2] Andrew Ilyas, Logan Engstrom, Anish Athalye, Jessy Lin. Black-box Adversarial Attacks with Limited Queries and Information. https://arxiv.org/abs/1804.08598

[3] Simen Thys, Wiebe Van Ranst, Toon Goedeme, KU Leuven. Fooling automated surveillance cameras: adversarial patches to attack person detection https://arxiv.org/pdf/1904.08653.pdf

--

--

Ashutosh Kumar
Ashutosh Kumar

Written by Ashutosh Kumar

Data Science in CyberSecurity @ AuthMind ; interested in technology, data , algorithms and blockchain. Reach out to me at ashu.iitkgp@gmail.com

No responses yet