Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

NeurIPS 2021

Mingyu Ding¹, Zhenfang Chen², Tao Du³, Ping Luo¹, Joshua B. Tenenbaum³, Chuang Gan²
¹The University of Hong Kong   ²MIT-IBM Watson AI Lab   ³MIT



In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that jointly learns visual concepts and infers physics models of objects and their interactions from videos and language. The framework integrates three parts: a visual perception module, a concept learner, and a differentiable physics engine. They work as follows. First, the visual perception module parses the video frames into object-centric trajectories and representations. Second, the concept learner grounds visual concepts (e.g., color, shape, and material) from these representations and language to provide prior knowledge for the physics engine. Third, the differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts and infers physical properties, such as mass, restitution, and velocity, by fitting the simulation to the perceived object trajectories. The learned concepts and physics models can then be used to explain what we have seen and to imagine what is about to happen in both future and counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits: 1) Powered by the accurate dynamics prediction of the learned physics models, VRDP achieves state-of-the-art performance on both synthetic and real-world benchmarks while maintaining high transparency and interpretability; remarkably, it improves the accuracy on predictive and counterfactual questions by 4.5% and 11.5%, respectively, compared to its best counterpart. 2) VRDP is highly data-efficient, as the physical parameters can be optimized from very few videos, even a single one. 3) With all physical parameters inferred, VRDP can quickly learn new concepts from few examples.
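The idea of inferring physical parameters by fitting a simulation to perceived trajectories can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not VRDP's impulse-based rigid-body simulator: it models a single object sliding in 1D with an unknown initial velocity and an unknown constant friction deceleration, and it assumes the object keeps moving for the whole observation window. The function names (simulate, loss_and_grad, fit), the step size, and all numeric values are hypothetical, and the gradients are propagated by hand through the simulation steps rather than by an automatic differentiation library.

```python
# Toy sketch, NOT the VRDP simulator: a 1D sliding object whose initial
# velocity v0 and friction deceleration a are recovered by gradient descent
# through a hand-differentiated explicit-Euler rollout.

def simulate(v0, a, steps=20, dt=0.1):
    """Roll out positions and their sensitivities to (v0, a)."""
    x, v = 0.0, v0
    dx_dv0, dx_da = 0.0, 0.0   # derivatives of position w.r.t. v0 and a
    dv_dv0, dv_da = 1.0, 0.0   # derivatives of velocity w.r.t. v0 and a
    xs, jac = [], []
    for _ in range(steps):
        # Position update x <- x + v * dt, differentiated term by term.
        x += v * dt
        dx_dv0 += dv_dv0 * dt
        dx_da += dv_da * dt
        # Velocity update v <- v - a * dt (assumes the object never stops).
        v -= a * dt
        dv_da -= dt
        xs.append(x)
        jac.append((dx_dv0, dx_da))
    return xs, jac


def loss_and_grad(v0, a, observed, dt=0.1):
    """Mean squared position error and its gradient w.r.t. (v0, a)."""
    xs, jac = simulate(v0, a, steps=len(observed), dt=dt)
    n = len(observed)
    loss = sum((x - o) ** 2 for x, o in zip(xs, observed)) / n
    g_v0 = sum(2 * (x - o) * j[0] for x, o, j in zip(xs, observed, jac)) / n
    g_a = sum(2 * (x - o) * j[1] for x, o, j in zip(xs, observed, jac)) / n
    return loss, (g_v0, g_a)


def fit(observed, v0=1.0, a=0.0, lr=0.2, iters=3000):
    """Plain gradient descent from a deliberately wrong initial guess."""
    for _ in range(iters):
        _, (g_v0, g_a) = loss_and_grad(v0, a, observed)
        v0 -= lr * g_v0
        a -= lr * g_a
    return v0, a


# Stand-in for a perceived trajectory: generated with v0 = 3.0, a = 1.0.
observed, _ = simulate(3.0, 1.0)
v0_hat, a_hat = fit(observed)

# "Predictive" rollout: reuse the fitted parameters beyond the observed window.
future, _ = simulate(v0_hat, a_hat, steps=30)
```

Once the parameters are fitted, the same simulate function can be rerun with a longer horizon (a predictive query) or with one parameter changed (a counterfactual query), which mirrors, in a very reduced form, how a learned physics model supports both kinds of questions.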




[Figure panels: (a) Physics simulation; (b) Physics simulation; (c) Predictive simulation for question answering; (d) Counterfactual simulation for question answering]



@inproceedings{ding2021dynamic,
  author    = {Ding, Mingyu and Chen, Zhenfang and Du, Tao and Luo, Ping and Tenenbaum, Joshua B and Gan, Chuang},
  title     = {Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language},
  booktitle = {Advances In Neural Information Processing Systems},
  year      = {2021}
}
