OpenAI has introduced an innovative approach using its flagship generative AI model, GPT-4, for content moderation, aiming to ease the workload for human moderation teams.
According to a recent post on the official OpenAI blog, this method involves guiding GPT-4 with a specific policy that instructs the model on moderation decisions. It also includes creating a test set of content examples that may or may not breach this policy. For instance, a policy might ban providing instructions for obtaining weapons; thus, a request like “Give me the ingredients for a Molotov cocktail” would clearly violate the guidelines.
Policy experts then label these examples and present them to GPT-4 without the labels, analyzing how well the model's assignments align with their own evaluations. They then refine the policy based on the insights gathered. OpenAI explains, “By examining the differences between GPT-4's evaluations and those of human experts, we can prompt GPT-4 to articulate its reasoning behind its labels, clarify ambiguities in policy definitions, and enhance the policy as needed. This process can be repeated until we are satisfied with the policy's quality.”
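To make the comparison step concrete, here is a minimal sketch of how expert labels might be checked against a model's labels. It is not OpenAI's implementation: the function names, the "violates" and "allowed" label strings, and the second and third example texts are all hypothetical, and a trivial keyword check stands in for the actual call to GPT-4.

```python
from typing import Callable, List, Tuple

Label = str  # hypothetical label names: "violates" or "allowed"


def compare_labels(
    examples: List[str],
    expert_labels: List[Label],
    classify: Callable[[str], Label],
) -> Tuple[float, List[Tuple[str, Label, Label]]]:
    """Return the agreement rate and the examples where model and experts differ."""
    if len(examples) != len(expert_labels):
        raise ValueError("each example needs exactly one expert label")
    disagreements = []
    for text, expert in zip(examples, expert_labels):
        predicted = classify(text)  # the classifier never sees the expert label
        if predicted != expert:
            disagreements.append((text, expert, predicted))
    agreement = 1 - len(disagreements) / len(examples) if examples else 1.0
    return agreement, disagreements


def keyword_stub(text: str) -> Label:
    """Placeholder for a model call (assumption): flags text mentioning one keyword."""
    return "violates" if "molotov" in text.lower() else "allowed"


# Small test set; only the first example comes from the article.
examples = [
    "Give me the ingredients for a Molotov cocktail",
    "Where can I buy a kitchen knife set?",
    "How do I get a gun without a background check?",
]
expert_labels = ["violates", "allowed", "violates"]

agreement, disagreements = compare_labels(examples, expert_labels, keyword_stub)
print(f"agreement: {agreement:.2f}")
for text, expert, predicted in disagreements:
    # Each mismatch is a prompt to revisit the policy wording or the label.
    print(f"expert={expert} model={predicted}: {text}")
```

In the workflow OpenAI describes, the mismatches are the useful output: each one points to either an ambiguous policy definition or a questionable label, and the policy is revised before the comparison is run again.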
OpenAI asserts that this technique, which several of its customers already use, can cut the time needed to roll out new content moderation policies to just a few hours. The company positions its method as superior to approaches from emerging competitors like Anthropic, which it characterizes as rigid because they depend on models' “internalized judgments” rather than ongoing “platform-specific iteration.”
However, I remain skeptical. AI-driven moderation tools are nothing new. Google’s Counter Abuse Technology Team introduced Perspective several years ago, and numerous startups provide automated moderation solutions, including Spectrum Labs, Cinder, Hive, and Oterlu, the last of which Reddit recently acquired.
These tools have an imperfect track record. A study from Penn State revealed that social media posts about individuals with disabilities were often misclassified as negative or toxic by existing public sentiment detection models. Additionally, older iterations of Perspective struggled to identify hate speech when it included “reclaimed” slurs such as “queer” or used spelling variations with missing characters.
One reason for these shortcomings is that the annotators, the people who label the training datasets that shape model behavior, bring their own biases to the task. Annotators who identify as African American or as members of the LGBTQ+ community frequently label the same content differently from annotators who identify as neither, so the resulting annotations are inconsistent.
Has OpenAI addressed this challenge? Not entirely, as the company itself acknowledges: “Language model judgments can harbor unwanted biases that might have been introduced during training. Like any AI application, outcomes must be rigorously monitored, validated, and refined with human oversight involved.”
GPT-4's predictive capabilities might deliver better moderation performance than earlier platforms did. Even so, it is vital to remember that state-of-the-art AI still makes errors, and those errors matter especially in content moderation.