Children quickly learn to predict what will happen if they turn a cup filled with juice upside down. Robots, on the other hand, don’t have a clue.
Researchers at the Allen Institute for Artificial Intelligence (Ai2) in Seattle have developed a computer program that shows how a machine can predict how objects captured by a camera will most likely behave. This could help make robots and other machines less prone to error, and might help self-driving cars navigate unfamiliar scenes more safely.
The system, developed by Roozbeh Mottaghi and colleagues, draws conclusions about the physical properties of a scene using a combination of machine learning and 3-D modeling. The researchers converted more than 10,000 images into scenes rendered in a simplified format using a 3-D physics engine. The 3-D renderings were created by volunteers through Amazon’s Mechanical Turk crowdsourcing platform.
The researchers fed the images as well as their 3-D representations into a computer running a large “deep learning” neural network, which gradually learned to associate a particular scene with certain simple forces and motions. When the system was then shown unfamiliar images, it could suggest the various forces that might be in play.
It doesn’t work flawlessly, but more often than not the computer draws a sensible conclusion. For an image of a stapler sitting on a desk, for instance, the program can tell that the stapler, if pushed, would slide across the desk and then abruptly fall to the floor. For a picture of a coffee table and sofa, it knows the table could be pushed across the floor until it reaches the sofa.
“The goal is to learn the dynamics of the physics engine,” says Mottaghi. “You need to infer everything based on just the image that you see.”
The work could be especially useful for robots that need to quickly interpret a scene and then act in it. Even a robot equipped with a 3-D scanner would often need to infer the physics of the scene it perceives. And it would be impractical to have a robot learn how to do everything through trial and error. “Data collection for this is very difficult,” Mottaghi says. “If I take my robot to a store, it cannot push objects and collect data; it would be very costly.”
This program is part of a larger effort called Project Plato, aimed at equipping machines with visual intelligence that goes beyond simple object recognition and categorization. A related project, also part of Project Plato, allows a computer to recognize a physical force already in play: for example, how a skier would move down a mountain, or how a kicked soccer ball would fly through the air.
In recent years, computers have become much better at parsing images, thanks to advances in deep learning, more powerful hardware, and large labeled image data sets. After being fed many examples, computers can now describe or answer questions about a scene (see “Google’s Brain-Inspired Software Describes What It Sees in Complex Images” and “Facebook App Can Answer Basic Questions About What’s In Photos”). But these abilities reflect only a superficial understanding of what’s happening in an image. For a deeper understanding, a computer needs to grasp how the physical world works.
Brendan Lake, a research fellow at New York University who specializes in modeling human cognitive capabilities, says the Ai2 work is an important step in that direction.
“True scene understanding requires a lot more than just recognizing objects,” Lake says. “When people see a snapshot of a scene, they tell a story: what are the objects, why are they there, and what will happen next. Understanding physics is a key part of telling this story.”
According to Lake, however, a great deal more reasoning is involved in human perception, something that might hold back progress in robotics and machine vision for a while yet. “While this is exciting progress, it does not yet rival our human ability to understand physics,” he says. “People can understand a far broader range of physical events, and can accurately predict physical events in completely novel types of scenes.”