AI Models Assess Their Own Safety: Insights from OpenAI's Latest Alignment Research

OpenAI has introduced a new approach for aligning AI models with safety policies, called Rule-Based Rewards (RBR).

Lilian Weng, head of safety systems at OpenAI, explained that RBR automates part of model fine-tuning, significantly reducing the time needed to keep models from producing unintended responses.

Traditionally, models have relied on reinforcement learning from human feedback (RLHF) for alignment training, which, according to Weng, is effective but time-consuming. “We often spend considerable time discussing policy nuances, and by the end, the policy may have already changed,” she noted in an interview.

RLHF involves prompting a model and having human evaluators rate its responses for accuracy and preference. When a model is supposed to behave in a particular way, such as refusing dangerous requests, evaluators check whether its responses actually follow the safety guidelines.

With RBR, OpenAI's safety and policy teams use a model that evaluates responses against established rules. For instance, a team building a mental health app may require its AI model to refuse unsafe prompts without being judgmental, while encouraging users to seek help. That requirement breaks down into three rules: the model must refuse the request, maintain a non-judgmental tone, and remind users that support is available.

The RBR model assesses responses from the mental health AI against these three rules to determine compliance. Weng reported that testing results using RBR are comparable to those obtained through human-led reinforcement learning.
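To make the scoring concrete, below is a minimal Python sketch of how a response might be checked against such a rule set. The Rule class, the rule_based_reward function, and the keyword checks are illustrative assumptions, not OpenAI's implementation; in an actual RBR setup each rule would be posed as a yes/no question to a fixed grader model, and the resulting score would feed into the reinforcement-learning reward.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One policy rule, reduced to a yes/no check on (prompt, response)."""
    name: str
    check: Callable[[str, str], bool]
    weight: float = 1.0


def rule_based_reward(prompt: str, response: str, rules: List[Rule]) -> float:
    """Reward = weighted fraction of rules the response satisfies."""
    total = sum(r.weight for r in rules)
    earned = sum(r.weight for r in rules if r.check(prompt, response))
    return earned / total if total else 0.0


# The three rules from the mental-health example, with toy keyword checks
# standing in for a grader model's judgment.
RULES = [
    Rule("refuses the request",
         lambda p, r: "can't help with that" in r.lower()),
    Rule("non-judgmental tone",
         lambda p, r: not any(w in r.lower() for w in ("wrong with you", "shame"))),
    Rule("encourages seeking help",
         lambda p, r: "reach out" in r.lower() or "professional" in r.lower()),
]

if __name__ == "__main__":
    reply = "I can't help with that, but please reach out to a mental health professional."
    print(f"reward: {rule_based_reward('<unsafe prompt>', reply, RULES):.2f}")  # 1.00
```

A reinforcement-learning loop could then use this score, on its own or blended with a reward model trained on human preferences, as the training signal for the safety behaviors described above.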

Despite the promise of RBR, ensuring AI models operate within defined parameters remains challenging, sometimes leading to controversies. For instance, Google faced backlash in February after its Gemini model overcorrected, in some cases refusing to generate images of white people and producing historically inaccurate depictions instead.

Mitigating Human Subjectivity

The concept of AI overseeing another AI’s safety raises valid concerns. However, Weng argues that RBR helps minimize subjectivity, a common challenge for human evaluators. “When working with human trainers, ambiguous instructions yield lower quality data,” she remarked. Clear rules, she added, guide the model's behavior more effectively.

OpenAI acknowledges that while RBR could reduce the need for human oversight, it also raises ethical challenges, such as the potential to amplify bias. The company emphasizes the importance of designing RBR systems that ensure fairness and accuracy, and suggests combining RBR with human feedback.

RBR may struggle with tasks that demand subjective judgment, like creative writing.

OpenAI began exploring RBR methods during the development of GPT-4, and Weng states that the methodology has significantly advanced since then.

OpenAI has faced scrutiny regarding its commitment to safety. In May, former safety researcher Jan Leike criticized the company’s safety culture and processes, saying they had taken a back seat to the pursuit of shiny products. Ilya Sutskever, the co-founder and former chief scientist who led the Superalignment team alongside Leike, has since left OpenAI to establish a new company focused on building safe AI systems.
