Ananya Bal (abal), FNU Abhimanyu (abhiman2)

16-824 Project: Modeling 3D Deformations under External Force from 2D Images

Repo: https://github.com/AnanyaBal/DeformNet

Motivation:

A latent assumption in many domains of robotics research, particularly manipulation, is that robots deal only with rigid objects. In reality, many robot-object interactions involve non-rigid objects, so understanding and modeling deformation is vital when teaching robots to grasp. While forces applied to a rigid object produce a sequence of rigid-body transformations, forces applied to a non-rigid object change its shape, which is much harder to model. The exact deformation and translation (if any) of an object depend on its precise material composition, the force being applied, and the force's point of application in 3D space. Most efforts in this space have focused on mechanical modeling, either with simulators or with neural networks that estimate material parameters.

The traditional approach is mechanical modeling with methods such as Finite Element Modeling or mass-spring systems. However, these methods require detailed knowledge of material properties and are time-consuming, often requiring human effort to define conditions; neither is readily available to a robot in the wild. This detailed information is not available to us humans either, yet we are able to intuitively understand how certain objects will deform. This motivates the hypothesis that visual comprehension can help robots understand and even model deformations.

Very few works have tried to solve this as a computer vision problem, let alone a 3D vision problem. Solving it with deep learning further reduces the need for a human in the loop. In this project, we formulate deformation prediction as a 3D vision problem. We have trained a conditional VAE network to learn the different deformations that objects of varying material properties undergo under varying forces. Our pipeline learns from 3D point clouds of the object together with its material properties, the applied force, and the force's point of application, and it predicts a deformed version of the object. Because the point clouds are derived from images, our method uses 2D RGB images to learn 3D deformations.

Prior Work:

One of the first works to use computer vision to determine the material composition of objects was [Xue et al., 2017]. They use the Differential Angular Imaging for Material Recognition framework, trained on the GTOS (Ground Terrain in Outdoor Scenes) material reflectance database, which contains 40 material classes.

The authors of [Wu et al., 2015] first proposed using deep generative networks to learn physics from videos, such as the effect of gravity and friction on moving objects. The key idea was to invert a physics engine to recover model dynamics from observations.

Generative models have been applied successfully to reconstructing 3D objects from partial views and to synthesizing 3D objects. The authors of [Yang et al., 2017] attempted shape completion from a single depth view by combining an autoencoder with a conditional GAN. In [Wang et al., 2018], the authors proposed a conditional VAE-GAN architecture to learn non-rigid deformations of 3D objects. Their framework operates on voxel grids, and their network predicts occupancy in the grid using 2.5D images as input.


Fig 1. 3D-PhysNet [Wang et al., 2018] Architecture

There are also non-vision-based learning methods that predict deformations. In [Mrowca et al., 2018], the authors use an end-to-end differentiable Hierarchical Relation Network with hierarchical graph convolutions that model forces, collisions, past states, etc. The network takes particle graphs as input and learns to predict the physical dynamics of objects.

Our Idea:

Generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) learn a mapping from a latent encoding space to data space. The latent space learned by these models is often organized in a near-linear fashion, so that neighboring points in latent space map to similar points in data space. Generative networks have been applied with success to reconstructing 3D objects from partial views and to synthesizing 3D objects. Conditional Variational Autoencoders (cVAEs) offer a natural way to encode the effects of physical properties and applied forces. Utilizing these properties, we propose a conditional VAE architecture that generates a point cloud of the deformed object.
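For reference, the conditioning mechanism can be made precise with the standard cVAE objective: writing x for the deformed point cloud and c for the condition vector (material parameters, applied force, and its point of application), training maximizes the conditional evidence lower bound, with the conditional prior p(z | c) usually taken to be a standard Gaussian. This is only the textbook objective; our actual training losses, including the cycle-consistency term, are described later.

\mathcal{L}(\theta,\phi; x, c) = \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x, c) \,\|\, p(z \mid c)\big)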

Once we pass RGB images of the object through Pixel2Mesh or MeshRCNN to generate corresponding meshes and point clouds, we use a generative model to produce 3D deformed point clouds. Specifically, we use a conditional Variational Autoencoder (cVAE) with a simple PointNet encoder, conditioned on material properties (Poisson's ratio, Young's modulus), the applied force, and its point of application, which outputs a point cloud of the deformed object. We have tested our pipeline on a custom-generated dataset, which we discuss later. Additionally, we have also trained this network with a cycle-consistency loss to reduce supervision; this is discussed in more detail later.
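Below is a minimal PyTorch sketch of the cVAE described above, included only to make the architecture concrete. The layer sizes, the 8-dimensional condition vector (Young's modulus, Poisson's ratio, 3D force vector, 3D contact point), the fixed point count, and the exact encoder/decoder wiring are illustrative assumptions rather than our final configuration; the training loss combines a reconstruction term (e.g., Chamfer distance) with the KL term returned below.

```python
# Minimal cVAE sketch for deformation prediction (illustrative, not the exact model).
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Simple PointNet: shared per-point MLP followed by max-pooling."""
    def __init__(self, cond_dim, latent_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3 + cond_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)

    def forward(self, pts, cond):
        # pts: (B, N, 3); cond: (B, cond_dim), tiled onto every point
        cond_tiled = cond.unsqueeze(1).expand(-1, pts.shape[1], -1)
        x = torch.cat([pts, cond_tiled], dim=-1).transpose(1, 2)  # (B, 3+C, N)
        feat = self.mlp(x).max(dim=2).values                      # (B, 256)
        return self.fc_mu(feat), self.fc_logvar(feat)

class DeformDecoder(nn.Module):
    """MLP mapping [z, condition] to a full deformed point cloud."""
    def __init__(self, cond_dim, latent_dim, num_points):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, num_points * 3),
        )

    def forward(self, z, cond):
        out = self.mlp(torch.cat([z, cond], dim=-1))
        return out.view(-1, self.num_points, 3)

class DeformCVAE(nn.Module):
    def __init__(self, cond_dim=8, latent_dim=64, num_points=1024):
        super().__init__()
        self.encoder = PointNetEncoder(cond_dim, latent_dim)
        self.decoder = DeformDecoder(cond_dim, latent_dim, num_points)

    def forward(self, input_pts, cond):
        # input_pts: undeformed point cloud (B, N, 3); cond: (B, cond_dim).
        # One plausible wiring; the exact inputs in our pipeline may differ.
        mu, logvar = self.encoder(input_pts, cond)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        pred_deformed = self.decoder(z, cond)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # Training loss (not shown): Chamfer(pred_deformed, gt_deformed) + beta * kl
        return pred_deformed, kl
```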


Fig 2. cVAE Framework without Cycle Consistency Loss

[Image: 1_1 Updates 2022 (1).png]