OpenAI is expanding its internal safety processes to guard against the risks of advanced AI. A newly established "Safety Advisory Group" will sit above the technical teams and make recommendations to leadership and the board; the board now holds veto power, though whether it will actually use it is another question entirely.
Typically, policy updates like these go unnoticed: they are hashed out in closed-door meetings and end up as obscure internal processes that outsiders rarely see. But given the recent leadership turmoil and the ongoing debate over AI risk, it's worth looking at how the leading AI development company is approaching safety.
In a recent document and blog post, OpenAI introduced its revamped “Preparedness Framework,” which appears to have been adjusted following the leadership shake-up that led to the departure of two board members known for their “decelerationist” stance: Ilya Sutskever (who remains in a modified role) and Helen Toner (who has left the company).
The primary goal of the update is to lay out a clear process for identifying, analyzing, and addressing the "catastrophic" risks inherent in the models OpenAI is developing. As the company defines them, catastrophic risks are those that could cause hundreds of billions of dollars in economic damage or lead to the severe harm or death of many people; that includes existential risks, such as the potential for AI to act entirely on its own.
Models already in production are governed by a "safety systems" team, which addresses systematic abuses of ChatGPT through API restrictions or fine-tuning. Models still in development fall under the "Preparedness" team, which identifies and quantifies risks prior to release. And a "superalignment" team works on theoretical guardrails for hypothetical "superintelligent" models.
For the first two teams, which deal with real models rather than theoretical ones, the evaluation process is relatively straightforward. Each model is rated across four risk categories: cybersecurity, manipulation (including disinformation), autonomy (i.e., acting on its own), and CBRN (chemical, biological, radiological, and nuclear threats).
Various mitigations are assumed to be in place; for example, a model's reticence to describe how dangerous materials are made. If a model is still judged to pose a "high" risk after those mitigations are accounted for, it cannot be deployed, and any model rated as a "critical" risk will not be developed further.
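To make the gating concrete, here is a minimal sketch of how post-mitigation ratings across the four categories could translate into deploy-or-develop decisions. The class and function names are invented for illustration; this is not OpenAI's actual tooling, just one way the documented rules could be encoded.

```python
from enum import IntEnum
from dataclasses import dataclass


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


@dataclass
class Scorecard:
    """Hypothetical post-mitigation ratings for the four tracked categories."""
    cybersecurity: Risk
    manipulation: Risk  # persuasion / disinformation
    autonomy: Risk      # the model acting on its own
    cbrn: Risk          # chemical, biological, radiological, nuclear

    def overall(self) -> Risk:
        # Gate on the worst-scoring category.
        return max(self.cybersecurity, self.manipulation, self.autonomy, self.cbrn)


def can_deploy(card: Scorecard) -> bool:
    # A post-mitigation rating above "medium" blocks deployment.
    return card.overall() <= Risk.MEDIUM


def can_continue_development(card: Scorecard) -> bool:
    # A "critical" rating halts further development entirely.
    return card.overall() <= Risk.HIGH


# Example: high post-mitigation cybersecurity risk blocks deployment
# but does not, by itself, stop development.
card = Scorecard(Risk.HIGH, Risk.LOW, Risk.MEDIUM, Risk.LOW)
print(can_deploy(card), can_continue_development(card))  # False True
```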
These risk levels and examples are documented in the framework itself, so the assessments are not left to the discretion of an individual engineer or product manager.
For instance, within the cybersecurity category, increasing operators' productivity on significant cyber operations poses a medium risk. Conversely, a high-risk model could autonomously develop proofs of concept for high-value exploits against fortified targets. A critical risk might involve devising and executing comprehensive cyberattack strategies against such targets based merely on a broad objective. Undoubtedly, this type of capability would be concerning.
I've reached out to OpenAI for clarification on how these risk categories evolve, specifically whether new risks, such as the creation of photorealistic fake videos, would fall under manipulation or warrant their own category, and will update this post with any response.
Only medium and high risks, then, are tolerable under the framework in one way or another. But the people building these models aren't necessarily the best ones to evaluate them. To address this, OpenAI is establishing a "cross-functional Safety Advisory Group" that will review the engineers' technical reports and make recommendations from a broader vantage point. The hope is that this will surface some "unknown unknowns," though by their nature those are hard to catch.
Recommendations from this group will be sent simultaneously to both the board and the leadership team, which includes CEO Sam Altman and CTO Mira Murati. While leadership will make the final decision regarding product deployment, the board can override these choices.
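As a rough illustration of that routing, the flow might look like the sketch below. The names and the string-based decisions are invented for the example; nothing here is drawn from OpenAI's published process beyond the simultaneous reporting and the board's override.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recommendation:
    model_name: str
    findings: str          # the advisory group's reading of the technical reports
    suggested_action: str  # e.g. "deploy", "hold", "halt development"


def review_cycle(rec: Recommendation,
                 leadership_decides: Callable[[Recommendation], str],
                 board_approves: Callable[[Recommendation, str], bool]) -> str:
    """Illustrative flow: the same recommendation reaches leadership and the
    board at the same time; leadership decides, but the board can veto."""
    decision = leadership_decides(rec)
    return decision if board_approves(rec, decision) else "vetoed by board"


# Hypothetical usage: leadership approves deployment and the board lets it stand.
rec = Recommendation("model-x", "post-mitigation risk: medium", "deploy")
print(review_cycle(rec, lambda r: r.suggested_action, lambda r, d: True))  # deploy
```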
This mechanism aims to prevent situations like the one said to have preceded the recent drama, in which a high-risk product or process may have advanced without the board's awareness or approval. Notably, though, that upheaval ended with the board's more critical voices sidelined and the appointment of business-minded figures like Bret Taylor and Larry Summers, whose expertise lies outside AI.
If an expert panel makes a recommendation and the CEO decides based on that information, will the board really feel empowered to contradict him and hit the brakes? And if it does, will we hear about it? Beyond OpenAI's commitment to solicit audits from independent third parties, transparency goes largely unaddressed.
Suppose a model is developed that warrants a "critical" risk rating. OpenAI has not been shy in the past about touting how powerful its models are, even to the point of declining to release them; that reticence has become part of its brand. But is there any guarantee the same accountability will hold when the risks are this real? Whether publicizing such a decision would even be wise is debatable, but either way the framework doesn't really address it.