Can Sora-like Models Understand Physics? ByteDance’s Doubao Team Unveils the Truth
Shared by LeCun: systematic research by ByteDance’s Doubao team shows that while video generation models can create videos that seem to follow common sense, they do not understand physical laws.
Since the emergence of Sora, the industry has been embroiled in a debate about whether video generation models truly understand physical laws. Turing Award winner Yann LeCun has stated unequivocally that realistic videos generated from text prompts do not indicate that the model truly understands the physical world. He went on to say that modeling the world by generating pixels, as Sora does, is destined to fail.
Keras creator François Chollet believes that video generation models like Sora do indeed embed physical models, but the question is: is this physical model accurate? Can it generalize to new situations, those that are not simply interpolations of training data? These questions are crucial because they determine the scope of application of generated images: whether they are limited to media production or can be used as reliable simulations of the real world. He concludes that one cannot simply expect to obtain a model that generalizes to all possible situations in the real world by fitting a large amount of data.
Since then, the industry has had no definitive answer on whether video generation models actually learn and understand physical laws. That changed recently, when ByteDance’s Doubao large model team published a systematic study that gives a clear answer. Through large-scale experiments, the team found that even when models are scaled up in line with the Scaling Law, they still do not understand physical laws.
The Doubao team’s research focused on three key aspects:
- Object Dynamics: The team designed a series of experiments to test the model’s ability to predict the motion of objects in videos. The results showed that while the model could generate videos with objects moving in a seemingly realistic manner, it often failed to accurately predict the trajectory of objects under different conditions.
- Physical Constraints: The team further investigated the model’s understanding of physical constraints, such as gravity and collisions. They found that the model could generate videos that seemingly obeyed these constraints, but often produced unrealistic results when faced with complex scenarios.
- Generalization: The team tested the model’s ability to generalize to new situations not seen during training. The results showed that the model struggled to adapt to new environments and conditions, suggesting a lack of true understanding of underlying physical principles.
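To make the difference between interpolating training data and generalizing more concrete, here is a minimal toy sketch in Python. It is not the Doubao team’s method or code. Every name in it (true_position, nearest_neighbor_predict, the velocity ranges) is a hypothetical chosen for illustration. It assumes a "model" that simply memorizes training examples of uniform linear motion and copies the closest one, then compares its error on velocities inside and outside the training range.

```python
import random


def true_position(x0, v, t):
    """Ground truth for uniform linear motion: x = x0 + v * t."""
    return x0 + v * t


def nearest_neighbor_predict(train_velocities, v, t):
    """Toy 'memorizing' predictor (an assumption for illustration):
    reuse the motion of the closest velocity seen during training."""
    nearest_v = min(train_velocities, key=lambda tv: abs(tv - v))
    return true_position(0.0, nearest_v, t)


def mean_abs_error(train_velocities, test_velocities, t=1.0):
    """Average absolute position error at time t over a test set."""
    errors = [
        abs(nearest_neighbor_predict(train_velocities, v, t)
            - true_position(0.0, v, t))
        for v in test_velocities
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Training velocities cover only the range [1, 4].
train_velocities = [rng.uniform(1.0, 4.0) for _ in range(200)]

# In-distribution test: same range as training.
in_dist = [rng.uniform(1.0, 4.0) for _ in range(50)]
# Out-of-distribution test: velocities never seen in training.
out_dist = [rng.uniform(6.0, 9.0) for _ in range(50)]

in_error = mean_abs_error(train_velocities, in_dist)
out_error = mean_abs_error(train_velocities, out_dist)

print(f"in-distribution error:     {in_error:.4f}")
print(f"out-of-distribution error: {out_error:.4f}")
```

In this toy setup the memorizing predictor looks accurate as long as the test velocities resemble the training ones, and its error jumps once they fall outside that range, even though the underlying law (x = x0 + v * t) never changes. A predictor that had actually captured the law would show no such gap.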
The findings of this research have significant implications for the development and application of video generation models. While these models can create impressive and realistic videos, they are currently limited in their ability to understand and simulate the physical world. This suggests that further research is needed to develop models that can truly understand and reason about physical laws.
LeCun himself shared and commented on the research, acknowledging the importance of these findings. This research provides valuable insights into the current limitations of video generation models and highlights the need for continued exploration in this field.
The Doubao team’s research serves as a crucial reminder that the ability to generate realistic videos does not equate to understanding the physical world. As the field of video generation continues to evolve, it is essential to focus on developing models that can not only generate visually compelling content but also truly understand and reason about the underlying physical principles.