Thousands of companies rely on the Ray framework to scale and manage complex, compute-intensive AI workloads. In fact, it's challenging to find a large language model (LLM) that hasn't been developed using Ray.
However, these workloads often contain sensitive data, and researchers have found them exposed through a critical security flaw (CVE) in the open-source unified compute framework.
For the past seven months, this flaw has allowed attackers to exploit AI production workloads, gaining access to computing power, credentials, passwords, keys, tokens, and a multitude of other sensitive information, according to research by Oligo Security.
This vulnerability, dubbed “ShadowRay,” remains under dispute. It is classified as a “shadow vulnerability,” meaning it isn’t recognized as a threat and lacks an official patch. As a result, it doesn’t show up in standard scanning processes.
This situation marks the “first known instance of AI workloads being actively exploited via vulnerabilities in contemporary AI infrastructure,” according to researchers Avi Lumelsky, Guy Kaplan, and Gal Elbaz. They state, “When attackers access a Ray production cluster, it’s a jackpot. Valuable company data combined with remote code execution creates opportunities for monetization, all while remaining undetected.”
A Significant Blind Spot
Many organizations depend on Ray for large-scale AI, data, and SaaS workloads, including Amazon, Instacart, Shopify, LinkedIn, and OpenAI, which used Ray to train GPT-3. The framework is essential for models with billions of parameters that demand more computational power than any single machine can provide. Ray, maintained by Anyscale, supports distributed workloads for training, serving, and tuning diverse AI models. Users do not need extensive Python knowledge, and installation is straightforward with minimal dependencies.
Oligo researchers refer to Ray as the “Swiss Army knife for Pythonistas and AI practitioners.”
Despite its advantages, the ShadowRay vulnerability makes this reliance on Ray even more troubling. Known as CVE-2023-48022, the vulnerability arises from insufficient authorization in the Ray Jobs API, exposing it to remote code execution attacks. Anyone with access to the dashboard can execute arbitrary jobs without permission.
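The attack surface is easy to picture. The Ray Jobs API accepts job submissions over HTTP on the dashboard port (8265 by default), and a job’s entrypoint is an arbitrary shell command. The sketch below, with a hypothetical hostname and the endpoint path taken from Ray’s Jobs REST API, builds such a request simply to show that nothing in it carries a credential:

```python
import json

# Hypothetical dashboard address; in the attacks described above, these
# were dashboards reachable from the internet with no authentication.
RAY_DASHBOARD = "http://ray-head.internal:8265"

def build_job_request(entrypoint: str) -> dict:
    """Sketch of the HTTP request a client sends to the Ray Jobs API
    (POST /api/jobs/). Note that no credential, token, or signature
    field appears anywhere in the request."""
    return {
        "method": "POST",
        "url": f"{RAY_DASHBOARD}/api/jobs/",
        "headers": {"Content-Type": "application/json"},
        # The entrypoint is an arbitrary shell command the cluster runs.
        "body": json.dumps({"entrypoint": entrypoint}),
    }

req = build_job_request("echo arbitrary code runs here")
```

Anyone who can reach the dashboard can send this request, which is exactly why Anyscale treats network-level isolation as a hard requirement rather than an optional hardening step.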
Though this vulnerability was reported to Anyscale alongside four others in late 2023, the only one left unaddressed is CVE-2023-48022. Anyscale disputed the vulnerability, claiming it represents expected behavior and a product feature that facilitates job triggering and dynamic code execution within a cluster.
Anyscale asserts that dashboards should not be publicly accessible or should be restricted to trusted users; thus, Ray lacks authorization because it assumes operation within a secure environment with “proper routing logic” via network isolation, Kubernetes namespaces, firewall rules, or security groups.
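In practice, that network isolation means blocking the dashboard port at the perimeter. A minimal sketch, assuming the default dashboard port 8265 and a placeholder trusted admin subnet, using iptables:

```shell
# Allow the Ray dashboard (default port 8265) only from a trusted
# admin subnet (10.0.42.0/24 is a placeholder), drop everything else.
iptables -A INPUT -p tcp --dport 8265 -s 10.0.42.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8265 -j DROP
```

The same effect can be achieved with Kubernetes network policies or cloud security groups; the point is that the port must never be reachable from untrusted networks.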
This decision illustrates “the complexity of balancing security and usability in software development,” Oligo researchers note, emphasizing the need for careful consideration when modifying critical systems like Ray.
Moreover, because disputed vulnerabilities often evade detection, many security scanners overlook them. Oligo researchers discovered that ShadowRay did not appear in multiple databases, including Google’s Open Source Vulnerability Database (OSV), nor was it visible to static application security testing (SAST) and software composition analysis (SCA) solutions.
“This created a blind spot: Security teams were unaware of potential risks,” researchers highlighted, noting, “AI experts are not security experts, leaving them vulnerable to risks posed by AI frameworks.”
From Production Workloads to Critical Tokens
The researchers revealed that compromised servers leaked a “trove” of sensitive information and opened the door to a range of attacks, including:
- Disruption of AI production workloads, leading to compromised model integrity or accuracy during training.
- Access to sensitive cloud environments (AWS, GCP, Azure) which could expose customer databases and sensitive production data.
- Access to Kubernetes API, enabling infections of cloud workloads or extraction of Kubernetes secrets.
- Sensitive credentials for platforms like OpenAI, Stripe, and Slack.
- Database credentials that allow silent downloads or modifications of complete databases.
- Private SSH keys for accessing additional machines for malicious activities.
- OpenAI tokens, potentially draining account credits.
- Hugging Face tokens, which provide access to private repositories, facilitating supply chain attacks.
- Stripe tokens that could be exploited to deplete payment accounts.
- Slack tokens, which could be used for unauthorized messaging or reading.
The researchers also noted that the GPUs powering many compromised clusters are currently scarce and costly. They identified “hundreds” of compromised clusters, most of them hijacked for cryptocurrency mining.
“Attackers target these systems not only for valuable information but also because GPUs are expensive and difficult to acquire, especially today,” researchers noted, with some GPU on-demand prices on AWS reaching an annual cost of $858,480.
With attackers having had seven months to exploit this hardware, estimates put the value of the compromised machines and compute power at $1 billion.
Addressing Shadow Vulnerabilities
The Oligo researchers acknowledge that “shadow vulnerabilities will always exist” and indicators of exploitation can vary. They recommend several actions for organizations:
- Operate Ray within a secure and trusted environment.
- Implement firewall rules and security groups to prevent unauthorized access.
- Continuously monitor AI clusters and production environments for anomalies.
- Use a proxy that adds an authorization layer if a Ray dashboard needs to be publicly accessible.
- Never assume default security is sufficient.
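For the proxy recommendation above, a common pattern is a reverse proxy that enforces authentication before traffic ever reaches the dashboard. A minimal sketch in nginx, with placeholder hostnames and file paths that are not from the Oligo report:

```nginx
# Reverse proxy adding HTTP basic auth in front of the Ray dashboard.
server {
    listen 443 ssl;
    server_name ray.example.internal;             # placeholder hostname

    ssl_certificate     /etc/nginx/tls/ray.crt;  # placeholder paths
    ssl_certificate_key /etc/nginx/tls/ray.key;

    location / {
        auth_basic           "Ray dashboard";
        auth_basic_user_file /etc/nginx/.htpasswd;   # htpasswd-managed users
        proxy_pass           http://127.0.0.1:8265;  # dashboard bound locally
        proxy_set_header     Host $host;
    }
}
```

For this to help, the dashboard itself must be bound to localhost or an internal interface so the proxy is the only path in.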
Ultimately, they emphasize: “The technical burden of securing open source falls on you. Do not solely rely on maintainers.”