OpenAI has unveiled its text-to-video model, Sora, the Japanese word for "sky", a name that reflects its limitless creative potential.
What Sora can do:
Generate hyper-realistic videos up to a minute long that follow the user's prompt while portraying the physical world in lifelike detail.
Generate video from text prompts
Generate video from still images
Extend an input video forwards or backwards in time
Edit existing videos (video-to-video)
Seamlessly connect two input videos
How Sora works:
Sora is a diffusion model: it starts from a video that looks like pure static noise and, conditioned on the text prompt, gradually removes that noise over many steps until the desired video emerges (sketched below).
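To make the denoising idea concrete, here is a minimal, illustrative sketch of a diffusion-style sampling loop in PyTorch. The toy_denoiser, the latent shape, and the update rule are all simplifying assumptions for illustration; Sora's actual model and sampler are not public.

```python
import torch

def toy_denoiser(x, t, prompt_embedding):
    # Hypothetical stand-in for the learned model, which would be a large
    # transformer conditioned on the prompt. Here it just returns a small
    # random tensor so the loop runs end to end.
    return 0.1 * torch.randn_like(x)

def generate_video_latent(prompt_embedding, shape=(16, 8, 8, 4), steps=50):
    """Start from pure noise and iteratively remove the predicted noise."""
    x = torch.randn(shape)                   # latent shaped (frames, height, width, channels)
    for t in reversed(range(steps)):
        predicted_noise = toy_denoiser(x, t, prompt_embedding)
        x = x - predicted_noise / steps      # crude update; real samplers are more involved
    return x                                 # a separate decoder would turn this into pixels

latent = generate_video_latent(prompt_embedding=torch.randn(512))
print(latent.shape)  # torch.Size([16, 8, 8, 4])
```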
A video compression network first reduces the dimensionality of the raw video, and the resulting latent is decomposed into spacetime patches, similar to the token-based approach used in Large Language Models.
These patches let Sora train on videos and images of varying resolutions, durations, and aspect ratios, and generate outputs at different sizes with improved framing and composition.
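As an illustration of the patch idea, the sketch below splits a hypothetical compressed video latent into flattened spacetime patches; the tensor shape and patch sizes are assumptions, not Sora's actual configuration.

```python
import torch

def extract_spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Split a latent of shape (T, H, W, C) into a sequence of flattened patches."""
    T, H, W, C = latent.shape
    patches = (
        latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
              .permute(0, 2, 4, 1, 3, 5, 6)      # group each patch's dims together
              .reshape(-1, pt * ph * pw * C)     # (num_patches, patch_dim)
    )
    return patches

latent = torch.randn(16, 32, 32, 8)              # e.g. 16 latent frames of 32x32 with 8 channels
tokens = extract_spacetime_patches(latent)
print(tokens.shape)                              # torch.Size([512, 256])
```

Because the number of patches simply grows or shrinks with the input's duration and resolution, the same model can handle variable-sized videos without cropping them to a fixed frame.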
Limitations to note:
OpenAI notes that Sora still struggles with complex scenes, cause-and-effect relationships, spatial details in prompts (such as left versus right), and accurately simulating the physics of basic interactions.
OpenAI is actively working on addressing potential biases and ensuring ethical usage, with plans to develop tools for distinguishing AI-generated content from original material.
While Sora isn't yet available to the public (apart from some testers and creative professionals), the results showcased on the OpenAI blog are incredibly promising. I am keen to see how OpenAI ensures the model's ethical robustness before launching it to the public.