This technical report focuses on two points: (1) our approach to converting visual data of all types into a unified representation that enables large-scale training of generative models, and (2) a qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this report.
Prior research has explored generative modeling of video data with a variety of techniques, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. However, these studies often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. In contrast, Sora is a versatile model of visual data: it can generate videos and images spanning diverse durations, aspect ratios, and resolutions, up to a full minute of high-definition video.