Recently, OpenAI unveiled its video generation model, Sora, which has made significant strides in text-to-video technology. However, testing by Bloomberg highlighted some notable limitations. In one scene, a parrot flying past a monkey had visibly distorted wings, and the monkey bizarrely sported the parrot's tail.
These occurrences underscore Sora's challenges in comprehending the physical properties of objects. Bill Peebles, an OpenAI scientist, acknowledged these issues, stating, "There are indeed some strange movements within the clips."
Sora employs a diffusion transformer, which breaks video content down into smaller segments (spacetime patches) and learns, through iterative denoising, to predict the original clean content from noised versions. While this approach improves video generation quality, Sora still faces numerous hurdles: accuracy in physical interactions, consistency as objects change state, coherence over long videos, objects appearing spontaneously, correct rendering of hands and body parts, heavy computational demands, limited generalization, and the ability to edit and extend videos. In complex scenes, Sora can produce physically implausible behavior, such as a basketball passing through the side of a hoop or dogs phasing through one another while walking.
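The patch-and-denoise idea can be illustrated with a toy sketch. This is not Sora's implementation: the patch size, noise schedule, and shapes below are assumptions chosen for readability, the "model" is replaced by a supplied noise estimate, and real systems operate on learned latents rather than raw pixels.

```python
import numpy as np

def patchify(video, patch=4):
    """Split a (frames, height, width) array into flat spacetime patches.
    Illustrative only: real models patch compressed latents, not raw pixels."""
    f, h, w = video.shape
    return (video
            .reshape(f, h // patch, patch, w // patch, patch)
            .transpose(1, 3, 0, 2, 4)          # group by spatial patch location
            .reshape(-1, f * patch * patch))    # one row per spacetime patch

def add_noise(x0, t, rng):
    """Forward diffusion: blend clean patches with Gaussian noise.
    alpha_bar = 1 - t is a toy schedule, with t in (0, 1)."""
    alpha_bar = 1.0 - t
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

def denoise_step(xt, predicted_noise, t):
    """One reverse step: estimate the clean patches from a noisy input.
    In a real model, predicted_noise comes from the transformer."""
    alpha_bar = 1.0 - t
    return (xt - np.sqrt(1.0 - alpha_bar) * predicted_noise) / np.sqrt(alpha_bar)
```

With a perfect noise prediction the clean patches are recovered exactly, which is the property the training objective pushes the transformer toward; generation runs many such steps starting from pure noise.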
Despite Sora's impressive performance in specific scenarios, OpenAI acknowledges that the model still has a range of technical challenges to overcome. As the technology matures, there is reason to anticipate a brighter future for Sora.