New deep learning technique opens the way for pizza-making robots


This article is part of our coverage of the latest in AI research.

For humans, working with deformable objects is not much more difficult than handling rigid objects. We naturally learn to shape them, fold them, and manipulate them in different ways, and we still recognize them afterward.

However, for robots and artificial intelligence systems, manipulating deformable objects is a huge challenge. Consider the series of steps a robot must take to shape a ball of dough into a pizza crust. It must keep track of the dough as it changes shape while choosing the right tool for each step of the job. This is a challenging task for current AI systems, which are more reliable at handling rigid objects with more predictable states.

Now, a new deep learning technique, developed by researchers at MIT, Carnegie Mellon University, and the University of California, San Diego, shows promise for making robotic systems more reliable at handling deformable objects. Called DiffSkill, the technique uses deep neural networks to learn simple skills and a planning module to combine those skills to solve tasks that require multiple steps and tools.

Handling deformable objects with reinforcement learning and deep learning

For an AI system to handle an object, it must be able to detect and define the object's state and predict how it will look in the future. This is a largely solved problem for rigid bodies. With a good set of training examples, a deep neural network can detect rigid objects from different angles. For deformable objects, however, the space of possible states becomes much more complex.

“For a rigid body, you can describe a state with six numbers: three numbers for the XYZ coordinates and another three numbers for orientation,” Xingyu Lin, a Ph.D. student at CMU and lead author of the DiffSkill paper, told TechTalks.

“However, deformable objects such as dough or cloth have infinite degrees of freedom, making it much more difficult to describe their state precisely. Moreover, the way they deform is more difficult to model mathematically than rigid bodies.”
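To make the contrast concrete, here is a minimal Python sketch. It is only an illustration: the class names and the particle-based approximation of the dough are assumptions made for this example and are not taken from the DiffSkill paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RigidBodyState:
    """Pose of a rigid body: three position numbers plus three orientation numbers."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

    def as_vector(self) -> List[float]:
        return [self.x, self.y, self.z, self.roll, self.pitch, self.yaw]


@dataclass
class ParticleDoughState:
    """Hypothetical particle approximation of a deformable object."""
    particles: List[Tuple[float, float, float]]

    def as_vector(self) -> List[float]:
        # Three numbers per particle, so the state grows with the resolution.
        return [coord for particle in self.particles for coord in particle]


rolling_pin = RigidBodyState(0.1, 0.0, 0.3, 0.0, 0.0, 1.57)
dough = ParticleDoughState([(0.01 * i, 0.0, 0.02) for i in range(1000)])

rigid_dims = len(rolling_pin.as_vector())
dough_dims = len(dough.as_vector())
print(rigid_dims, dough_dims)
```

Even this coarse approximation needs thousands of numbers, and a finer one needs more, which is one way to picture what “infinite degrees of freedom” means in practice.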

The development of differentiable physics simulators has enabled the application of gradient-based methods to deformable object manipulation tasks. This contrasts with existing reinforcement learning approaches, which try to learn the dynamics of the environment and objects through pure trial-and-error interaction.
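The idea behind gradient-based methods can be sketched with a toy example. The one-dimensional “rolling” simulator below is invented for illustration and has nothing to do with the actual physics used by the researchers. It returns both the final dough width and the derivative of that width with respect to the applied force, so plain gradient descent can find the right force without random trial and error.

```python
def simulate(force, steps=10, width0=1.0, max_width=5.0, k=0.2):
    """Toy simulator: returns the final width and d(width)/d(force)."""
    width, dwidth = width0, 0.0
    for _ in range(steps):
        # Update rule: width_next = width + k * force * (1 - width / max_width).
        # Differentiate the rule first (it uses the old width), then apply it.
        dwidth = dwidth * (1.0 - k * force / max_width) + k * (1.0 - width / max_width)
        width = width + k * force * (1.0 - width / max_width)
    return width, dwidth


def optimize_force(target, lr=0.05, iters=200):
    """Gradient descent on the squared error between final and target width."""
    force = 0.0
    for _ in range(iters):
        width, dwidth = simulate(force)
        grad = 2.0 * (width - target) * dwidth  # chain rule through the simulator
        force -= lr * grad
    return force


target_width = 3.0
best_force = optimize_force(target_width)
final_width, _ = simulate(best_force)
print(round(best_force, 3), round(final_width, 3))
```

A trial-and-error learner would have to sample many forces and observe the results. Here, each simulation also reports in which direction, and by roughly how much, the force should change.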

DiffSkill is inspired by PlasticineLab, a differentiable physics simulator presented at the ICLR 2021 conference. PlasticineLab showed that differentiable simulators can help with short-horizon tasks.

PlasticineLab is a differentiable physics-based simulator for deformable objects.  It is suitable for training gradient-based models.

However, differentiable simulators still struggle with long-horizon problems that require multiple steps and different tools. AI systems based on differentiable simulators also require the agent to know the full simulation state and the relevant physical parameters of the environment. This is particularly limiting for real-world applications, where the agent typically perceives the world through visual and depth sensory data (RGB-D).

“I started asking whether I could extract [the steps required to accomplish a task] as skills and learn abstract notions about those skills, so that they can be chained together to solve more complex tasks,” said Lin.

DiffSkill is a framework in which AI agents learn skill abstractions using a differentiable physics model and compose them to perform complex manipulation tasks.

Lin’s past work has focused on the use of reinforcement learning for the manipulation of deformable objects such as cloth, rope, and liquids. For DiffSkill, he chose dough manipulation because of the problems it poses.

“Dough manipulation is particularly interesting because it cannot be easily performed with a robotic gripper. Instead, it requires the sequential use of different tools, something humans are good at but that is not very common for robots,” said Lin.

Once trained, DiffSkill can successfully perform a set of dough manipulation tasks using only RGB-D input.

Learning abstract skills with neural networks

DiffSkill trains a neural network to predict the feasibility of a target state from an initial state and parameters obtained from a differentiable physics simulator.

DiffSkill consists of two main components: a skill abstractor that uses neural networks to learn individual skills, and a “planner” that sequences those skills to solve long-horizon tasks.

DiffSkill uses a differentiable physics simulator to generate training examples for the skill abstractor. These examples show how to achieve a short-horizon goal with a single tool, such as using a rolling pin to spread the dough or a spatula to move it.

These examples are presented to the skill abstractor as RGB-D videos. Given an image observation, the skill abstractor must predict whether the desired goal is feasible. The model learns and tunes its parameters by comparing its predictions with the actual outcomes of the physics simulator.
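The training loop described above can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the network from the paper: the input is a single made-up number (how far the goal is from the current observation), the model is a logistic regression, and `simulator_reached_goal` is a hypothetical function that plays the role of the physics simulator's verdict.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulator_reached_goal(distance_tenths):
    """Hypothetical simulator outcome: one skill can only cover a short distance."""
    return 1.0 if distance_tenths < 10 else 0.0


# Training set: (distance between observation and goal, simulator outcome).
data = [(i / 10.0, simulator_reached_goal(i)) for i in range(21)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    grad_w = grad_b = 0.0
    for distance, label in data:
        # Error is the prediction minus what the simulator actually reported.
        error = sigmoid(w * distance + b) - label
        grad_w += error * distance
        grad_b += error
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)


def predict_feasible(distance):
    """Predicted probability that the goal can be reached with this skill."""
    return sigmoid(w * distance + b)


near_goal = predict_feasible(0.2)
far_goal = predict_feasible(1.8)
print(round(near_goal, 2), round(far_goal, 2))
```

After training, the predictor rates nearby goals as feasible and distant ones as infeasible, which is the kind of signal a planner can use to rule out skill choices without running the simulator again.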

At the same time, DiffSkill trains a variational autoencoder (VAE) to learn a latent-space representation of the examples generated by the physics simulator. The VAE encodes images in a low-dimensional space that preserves important features and discards information that is not relevant to the task. By mapping the high-dimensional image space into a latent space, the VAE plays an important role in enabling DiffSkill to plan over long horizons and to predict outcomes from sensory observations.

One of the important challenges of training the VAE is to ensure that it learns the right features and generalizes to the real world, where the composition of the visual data differs from what the physics simulator generates. For example, the color of the rolling pin or the table is irrelevant to the task, but the position and angle of the rolling pin and the position of the dough are relevant.

Currently, the researchers use a technique called “domain randomization,” which randomizes irrelevant properties of the training environment, such as background and lighting, while preserving important characteristics, such as the position and orientation of the tools. This makes the VAE more robust when it is applied in the real world.
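As a rough illustration of domain randomization, the sketch below generates several variants of a scene description. The field names and value ranges are assumptions chosen for this example, not the researchers' actual configuration. Only the properties that do not matter for the task are resampled.

```python
import random


def randomize_scene(scene, rng):
    """Return a copy of the scene with task-irrelevant properties resampled."""
    randomized = dict(scene)
    # Irrelevant to the task: background color and lighting.
    randomized["background_rgb"] = tuple(rng.random() for _ in range(3))
    randomized["light_intensity"] = rng.uniform(0.5, 1.5)
    # Relevant to the task (tool and dough poses) are copied unchanged.
    return randomized


base_scene = {
    "tool_position": (0.10, 0.25, 0.05),
    "tool_angle_deg": 30.0,
    "dough_position": (0.00, 0.20, 0.02),
    "background_rgb": (1.0, 1.0, 1.0),
    "light_intensity": 1.0,
}

rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [randomize_scene(base_scene, rng) for _ in range(5)]

for variant in variants:
    print(variant["tool_position"], round(variant["light_intensity"], 2))
```

A model trained on many such variants sees the same tool and dough poses under different backgrounds and lighting, which pushes it to rely on the poses rather than on the appearance of the scene.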

“It is not easy to do this because we have to cover all possible variations between the simulation and the real world [known as the sim2real gap],” Lin said. “A better approach would be to use a 3D point cloud as the scene representation, which is much easier to transfer from simulation to the real world. I’m actually working on a follow-up project using point clouds as input.”

Planning long-horizon deformable object tasks

DiffSkill uses the planner module to evaluate the different combinations and sequences of skills that can achieve a goal.

Once the skill abstractor is trained, DiffSkill uses the planner module to solve long-horizon tasks. The planner must determine the number and sequence of skills needed to get from the initial state to the goal.

The planner iterates over possible combinations of skills and the intermediate results they produce. This is where the variational autoencoder comes in handy. Instead of predicting full image outcomes, DiffSkill uses the VAE to predict the latent-space outcome of the intermediate steps toward the final goal.

The combination of abstract skills and latent-space representations makes it much more computationally efficient to plot a trajectory from the initial state to the goal. In fact, the researchers did not need to optimize the search function and simply used an exhaustive search over every combination.
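A stripped-down version of such an exhaustive search might look like the following Python sketch. It is not the planner from the paper. The two skills, the two-number “latent” state, and the feasibility rule are all made up for illustration, and the sketch simply enumerates sequences rather than optimizing intermediate latent goals.

```python
from itertools import product


# Hypothetical skills acting on a toy latent state (position, flatness).
def lift_with_spatula(state):
    pos, flat = state
    return (pos + 1.0, flat)


def roll_with_pin(state):
    pos, flat = state
    return (pos, flat + 0.5)


SKILLS = {"spatula": lift_with_spatula, "roller": roll_with_pin}


def is_feasible(skill_name, state):
    # Stand-in for a learned feasibility predictor:
    # rolling only works once the dough is on the board (pos >= 1).
    pos, _ = state
    return skill_name != "roller" or pos >= 1.0


def plan(start, goal, max_steps=3):
    """Try every skill sequence up to max_steps and keep the closest result."""
    best_sequence, best_cost = None, float("inf")
    for n in range(1, max_steps + 1):
        for sequence in product(SKILLS, repeat=n):
            state, feasible = start, True
            for name in sequence:
                if not is_feasible(name, state):
                    feasible = False
                    break
                state = SKILLS[name](state)
            if not feasible:
                continue
            cost = sum((s - g) ** 2 for s, g in zip(state, goal))
            if cost < best_cost:
                best_sequence, best_cost = sequence, cost
    return best_sequence, best_cost


best_sequence, best_cost = plan(start=(0.0, 0.0), goal=(1.0, 1.0))
print(best_sequence, best_cost)
```

With two skills and at most three steps there are only 2 + 4 + 8 = 14 sequences to check, which hints at why an exhaustive search stays affordable when planning happens over a handful of skills instead of over low-level motor commands.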

“We are planning over skills, so the computation isn’t too much and the horizon isn’t that long,” said Lin. “This exhaustive search eliminates the need to design a sketch for the planner and may lead to new solutions that the designer did not consider, in a more general way, although we did not observe this in the limited tasks we tried. Also, more sophisticated search techniques could be applied.”

According to the DiffSkill paper, the optimization can be performed efficiently, in about 10 seconds for each combination of skills on a single NVIDIA 2080 Ti GPU.

Preparing Pizza Dough with DiffSkill


The researchers tested the performance of DiffSkill against several baseline methods that have been applied to deformable objects, including two model-free reinforcement learning algorithms and a trajectory optimizer that uses only the physics simulator.

The models were tested on several tasks that required multiple steps and tools. For example, in one of the tasks, the AI agent must lift the dough with a spatula, place it on a cutting board, and roll it out with a rolling pin.

The results show that DiffSkill is far better than the other techniques at solving long-horizon, multi-tool tasks using only sensory information. The experiments show that, when well trained, DiffSkill's planner can find good intermediate states between the initial state and the target state and find an appropriate sequence of skills to solve the task.

DiffSkill's planner can predict intermediate steps with impressive accuracy.

“One implication is that a set of skills can provide a very important temporal abstraction, which allows us to reason over long horizons,” said Lin. “It’s also similar to how humans approach different tasks. Instead of thinking about what to do every second, we think in different temporal abstractions.”

However, DiffSkill also has its limits. For example, on one of the tasks that requires a three-step plan, the performance of DiffSkill degrades significantly (although it is still superior to the other techniques). Lin also noted that, in some cases, the feasibility predictor produces false positives. The researchers believe that learning a better latent space could help solve this problem.

The researchers are also exploring other directions to improve DiffSkill, including a more efficient planning algorithm that can be used for longer-horizon tasks.

Lin hopes that one day he will be able to use DiffSkill on real pizza-making robots. “We still have a long way to go. Various issues arise in control, sim2real transfer, and safety. But now we are more confident about trying some long-horizon tasks,” he said.

This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the downsides of technology, the darker implications of new technologies, and what we need to be wary of. You can read the original article here.
