OpenAI Unveils DALL-E 3 API and Innovative Text-to-Speech Models

At its inaugural Developer Day, OpenAI unveiled several new APIs. At the forefront is the DALL-E 3 API, which follows the model's integration into ChatGPT and Bing Chat. Building on the capabilities of DALL-E 2, the DALL-E 3 API includes built-in moderation features intended to prevent misuse.

The DALL-E 3 API provides users with multiple format and quality options, delivering resolutions from 1024×1024 to 1792×1024, with costs starting at just $0.04 per generated image. However, it currently offers fewer features than its predecessor, DALL-E 2. Notably, the DALL-E 3 API does not support the creation of edited versions of existing images or the generation of variations from them. Additionally, when a prompt is submitted, OpenAI automatically rewrites it to enhance detail and safety, which may result in less precise output depending on the specifics of the request.
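As a rough illustration of the options described above, the following Python sketch assembles a request body for an image-generation call and checks the size before anything is sent. It makes no network request. The helper name `build_image_request` is hypothetical, the field names (`model`, `prompt`, `size`, `quality`, `n`) and the `"standard"` quality value are assumptions based on OpenAI's public image-generation endpoint rather than anything stated in this article, and only the two sizes named above are accepted here.

```python
import json

# Sizes named in the article; the live API may accept others (assumption).
ARTICLE_SIZES = {"1024x1024", "1792x1024"}


def build_image_request(prompt: str,
                        size: str = "1024x1024",
                        quality: str = "standard") -> dict:
    """Return a JSON-serializable request body (hypothetical helper).

    Nothing is sent over the network; this only validates and
    assembles the fields a caller would post.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if size not in ARTICLE_SIZES:
        raise ValueError(f"unsupported size: {size}")
    return {
        "model": "dall-e-3",  # assumed model identifier
        "prompt": prompt,
        "size": size,
        "quality": quality,   # assumed field and value
        "n": 1,               # one image per request
    }


if __name__ == "__main__":
    body = build_image_request("a lighthouse at dusk", size="1792x1024")
    print(json.dumps(body, indent=2))
```

Because OpenAI rewrites prompts automatically, a caller who needs to know what was actually rendered would likely want to log whatever rewritten prompt the response returns alongside the original.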

OpenAI also introduced its Audio API, a text-to-speech service offering six preset voices (Alloy, Echo, Fable, Onyx, Nova, and Shimmer) and two generative AI model variants. It is available now, with pricing starting at $0.015 per 1,000 characters of input. “This is much more natural than anything else out there, which can make applications easier to interact with and more accessible,” said OpenAI CEO Sam Altman during the announcement. “It also opens up numerous possibilities for language learning and voice assistance.”
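To make the per-character pricing concrete, here is a minimal Python sketch that estimates the cost of synthesizing a piece of text at the starting rate quoted above. The function name `estimate_tts_cost` is hypothetical, and the sketch assumes cost scales linearly with character count; how OpenAI actually rounds or bills partial blocks of 1,000 characters is not stated in this article.

```python
from decimal import Decimal

# Starting price quoted in the article: $0.015 per 1,000 input characters.
PRICE_PER_1K_CHARS = Decimal("0.015")


def estimate_tts_cost(text: str,
                      price_per_1k: Decimal = PRICE_PER_1K_CHARS) -> Decimal:
    """Estimate the synthesis cost in dollars (hypothetical helper).

    Assumes simple linear pricing by character count; actual billing
    granularity may differ.
    """
    return Decimal(len(text)) / Decimal(1000) * price_per_1k


if __name__ == "__main__":
    sample = "Hello, and welcome to the show. " * 100
    print(f"{len(sample)} characters: about ${estimate_tts_cost(sample)}")
```

Using `Decimal` rather than floating point keeps small per-request amounts exact when they are summed across many requests.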

Unlike some other speech synthesis tools, however, OpenAI’s Audio API offers no direct control over the emotional tone of the generated audio. The documentation notes that certain characteristics of the input text, such as grammar and capitalization, may influence how the voices sound, though internal tests have produced mixed results. Developers who use the API are also required to inform users that the audio is generated by artificial intelligence.

Lastly, OpenAI has released Whisper large-v3, an updated version of its automatic speech recognition model. OpenAI says the open-source model delivers improved performance across multiple languages; it is available on GitHub under a permissive license.
