Google researchers have identified six specific attacks against real-world AI systems, finding that these common attack vectors exhibit a unique level of sophistication that they note will require a combination of adversarial simulations and the help of AI experts to build a solid defense.
In a report released this week, the company revealed that its dedicated AI Red Team has identified a variety of threats to this rapidly evolving technology, based primarily on how attackers manipulate the Large Language Models (LLMs) that drive generative AI products like ChatGPT, Google Bard, and others.
These attacks largely lead to technologies that produce unintended or even maliciously driven results, which can lead to consequences ranging from the mundane, such as photos of ordinary people appearing on celebrity photo sites, to the more serious, such as security-evading phishing attacks or data theft.
Google's findings come hot on the heels of its release of the Secure Artificial Intelligence Framework (SAIF), which the company says is designed to address AI security before it's too late, as the technology has experienced rapid adoption, creating new security threats.
6 Common Attacks Facing Modern AI Systems The first set of common attacks identified by Google are hint attacks, which involve "hint engineering." This is a term that refers to the production of effective hints that direct LLM to perform desired tasks. When this influence on the model is malicious, it can in turn maliciously influence the output of an LLM-based application in ways that are not intended, the researchers said.
One example would be if someone added a paragraph to an AI-based phishing attack that was not visible to the end user, but could instruct the AI to classify the phishing email as legitimate. This could allow it to bypass email anti-phishing protections and increase the chances of a successful phishing attack.
Another attack the team discovered is training data extraction, which targets the reconstruction of verbatim training examples used by LLM - such as content from the Internet.
In this way, attackers can extract confidential information, such as verbatim personally identifiable information or passwords, from the data. "Attackers have an incentive to target personalized models or models trained on data containing personally identifiable data to collect sensitive information," the researchers wrote.
A third potential AI attack is backdoor manipulation of models, where an attacker "may attempt to covertly alter the behavior of a model to produce outputs that are incorrectly characterized by specific 'trigger' words or features, also known as backdoors," the researchers wrote. In this type of attack, a threat actor can hide code in the model or its output to perform malicious activities.
The fourth type of attack, called adversarial examples, is when an attacker provides an input to a model that results in a "deterministic, but highly unexpected output," the researchers wrote. In one example, the model could display an image that looks like one thing to the human eye, but the model recognizes it as something completely different. Such attacks can be fairly benign, and in one case, someone could train the model to recognize a photo of himself or herself as one deemed worthy of appearing on a celebrity website.
An attacker could also use a data contamination attack to manipulate the model's training data to influence the model's output based on the attacker's preferences-which could also threaten the security of the software supply chain if developers are using AI to help them develop software. The impact of such an attack could be similar to backdoor manipulation of models, the researchers noted.
The final type of attack identified by Google's specialized AI red team is a data leakage attack, in which an attacker can copy a model's file representation to steal sensitive intellectual property or other information. For example, if a model is used for speech recognition or text generation, an attacker might try to extract speech or text information from the model.