Marc Habermann

Max-Planck-Institut für Informatik
Department 6: Visual Computing and Artificial Intelligence
 office: Campus E1 4, Room 214
Saarland Informatics Campus
66123 Saarbrücken
 phone: +49 681 9325-4014
 fax: +49 681 9325-4099

Research Interests

  • Computer Vision, Computer Graphics, Machine Learning

  • Human Performance Capture and Synthesis

  • Reconstruction of Non-Rigid Deformations from RGB Video

  • Neural Rendering

  • Motion Capture

Recent Talks

  • Computer Vision Reading Group @EPFL [video]


Neural Actor: Neural Free-view Synthesis of Human Actors with Pose Control

Lingjie Liu   Marc Habermann   Viktor Rudnev   Kripasindhu Sarkar   Jiatao Gu   Christian Theobalt

arxiv 2021

We propose Neural Actor (NA), a new method for high-quality synthesis of humans from arbitrary viewpoints and under arbitrary controllable poses. Our method is built upon recent neural scene representation and rendering works which learn representations of geometry and appearance from only 2D images. While existing works demonstrated compelling rendering of static scenes and playback of dynamic scenes, photo-realistic reconstruction and rendering of humans with neural implicit methods, in particular under user-controlled novel poses, is still difficult. To address this problem, we utilize a coarse body model as the proxy to unwarp the surrounding 3D space into a canonical pose. A neural radiance field learns pose-dependent geometric deformations and pose- and view-dependent appearance effects in the canonical space from multi-view video input. To synthesize novel views of high-fidelity dynamic geometry and appearance, we leverage 2D texture maps defined on the body model as latent variables for predicting residual deformations and the dynamic appearance. Experiments demonstrate that our method achieves better quality than the state of the art on playback as well as novel pose synthesis, and can even generalize well to new poses that starkly differ from the training poses. Furthermore, our method also supports body shape control of the synthesized results.
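
The unwarping step mentioned in the abstract can be illustrated with a toy 2D inverse-skinning sketch. This is an assumption made for illustration, not the paper's implementation: the abstract does not say how the body model is used to unwarp space, and the bone transforms, blend weights, and function names below are hypothetical.

```python
import math

def rigid(theta, tx, ty):
    """Row-major 2x3 rigid transform: rotation by theta, then translation."""
    c, s = math.cos(theta), math.sin(theta)
    return (c, -s, tx, s, c, ty)

def blend(transforms, weights):
    """Blend bone transforms entry-wise with the given (hypothetical) weights."""
    return tuple(sum(w * t[i] for w, t in zip(weights, transforms))
                 for i in range(6))

def warp(m, p):
    """Map a canonical-space point into posed space."""
    a, b, tx, c, d, ty = m
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)

def unwarp(m, p):
    """Map a posed-space point back to canonical space (inverse of warp)."""
    a, b, tx, c, d, ty = m
    det = a * d - b * c
    x, y = p[0] - tx, p[1] - ty
    return ((d * x - b * y) / det, (-c * x + a * y) / det)

# Two hypothetical bones and blend weights for one sample point.
bones = [rigid(0.0, 0.0, 0.0), rigid(0.5, 1.0, 0.2)]
weights = [0.3, 0.7]
m = blend(bones, weights)
posed = warp(m, (0.4, -0.2))
canonical = unwarp(m, posed)  # recovers the canonical-space point
```

In this sketch a point sampled in posed space is mapped back to the canonical pose before any canonical-space quantity would be queried; where the blend weights come from is left open here.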

[pdf], [project page], [arxiv]

Real-time Deep Dynamic Characters

Marc Habermann   Lingjie Liu   Weipeng Xu   Michael Zollhoefer   Gerard Pons-Moll   Christian Theobalt


We propose a deep videorealistic 3D human character model displaying highly realistic shape, motion, and dynamic appearance learned in a new weakly supervised way from multi-view imagery. In contrast to previous work, our controllable 3D character displays dynamics, e.g., the swing of the skirt, dependent on skeletal body motion in an efficient data-driven way, without requiring complex physics simulation. Our character model also features a learned dynamic texture model that accounts for photo-realistic motion-dependent appearance details, as well as view-dependent lighting effects. During training, we do not need to resort to difficult dynamic 3D capture of the human; instead we can train our model entirely from multi-view video in a weakly supervised manner. To this end, we propose a parametric and differentiable character representation which allows us to model coarse and fine dynamic deformations, e.g., garment wrinkles, as explicit space-time coherent mesh geometry that is augmented with high-quality dynamic textures dependent on motion and viewpoint. As input to the model, only an arbitrary 3D skeleton motion is required, making it directly compatible with the established 3D animation pipeline. We use a novel graph convolutional network architecture to enable motion-dependent deformation learning of body and clothing, including dynamics, and a neural generative dynamic texture model creates corresponding dynamic texture maps. We show that by merely providing new skeletal motions, our model creates motion-dependent surface deformations, physically plausible dynamic clothing deformations, as well as video-realistic surface textures at a much higher level of detail than previous state-of-the-art approaches, and even in real time.

[pdf], [video], [project page], [arxiv]

Efficient and Differentiable Shadow Computation for Inverse Problems

Linjie Lyu   Marc Habermann   Lingjie Liu   Mallikarjun B R   Ayush Tewari   Christian Theobalt

arxiv 2021

Differentiable rendering has received increasing interest for image-based inverse problems. It can benefit traditional optimization-based solutions to inverse problems, but also allows for self-supervision of learning-based approaches for which training data with ground truth annotation is hard to obtain. However, existing differentiable renderers either do not model visibility of the light sources from the different points in the scene, responsible for shadows in the images, or are too slow to be used for training deep architectures over thousands of iterations. To this end, we propose an accurate yet efficient approach for differentiable visibility and soft shadow computation. Our approach is based on spherical harmonics approximations of the scene illumination and visibility, where the occluding surface is approximated with spheres. This allows for a significantly more efficient shadow computation compared to methods based on ray tracing. As our formulation is differentiable, it can be used to solve inverse problems such as texture, illumination, rigid pose, and geometric deformation recovery from images using analysis-by-synthesis optimization.

[pdf], [video], [project page], [arxiv]

Monocular Real-time Full Body Capture with Inter-part Correlations

Yuxiao Zhou   Marc Habermann   Ikhsanul Habibie   Ayush Tewari   Christian Theobalt   Feng Xu

CVPR 2021

We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image. Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency. Unlike previous works, our approach is jointly trained on multiple datasets focusing on hand, body or face separately, without requiring data where all the parts are annotated at the same time, which is much more difficult to create at sufficient variety. The possibility of such multi-dataset training enables superior generalization ability. In contrast to earlier monocular full body methods, our approach captures more expressive 3D face geometry and color by estimating the shape, expression, albedo and illumination parameters of a statistical face model. Our method achieves competitive accuracy on public benchmarks, while being significantly faster and providing more complete face reconstructions.

[pdf], [video], [project page], [arxiv]

Deep Physics-aware Inference of Cloth Deformation for Monocular Human Performance Capture

Yue Li   Marc Habermann   Bernhard Thomaszewski   Stelian Coros   Thabo Beeler   Christian Theobalt

arxiv 2020

Recent monocular human performance capture approaches have shown compelling dense tracking results of the full body from a single RGB camera. However, existing methods either do not estimate clothing at all or model cloth deformation with simple geometric priors instead of taking into account the underlying physical principles. This leads to noticeable artifacts in their reconstructions, such as baked-in wrinkles, implausible deformations that seemingly defy gravity, and intersections between cloth and body. To address these problems, we propose a person-specific, learning-based method that integrates a finite element-based simulation layer into the training process to provide for the first time physics supervision in the context of weakly-supervised deep monocular human performance capture. We show how integrating physics into the training process improves the learned cloth deformations, allows modeling clothing as a separate piece of geometry, and largely reduces cloth-body intersections. Relying only on weak 2D multi-view supervision during training, our approach leads to a significant improvement over current state-of-the-art methods and is thus a clear step towards realistic monocular capture of the entire deforming surface of a clothed human.


Differentiable Rendering Tool

Marc Habermann   Mallikarjun B R   Ayush Tewari   Linjie Lyu   Christian Theobalt


This is a simple and efficient differentiable rasterization-based renderer which has been used in several GVV publications. The implementation is largely free of third-party dependencies such as OpenGL. The core implementation is in CUDA and C++. We use the layer as a custom TensorFlow op. The renderer supports the following features:
  • Shading based on spherical harmonics illumination. This shading model is differentiable with respect to geometry, texture, and lighting.
  • Different visualizations, such as normals, UV coordinates, Phong-shaded surface, spherical-harmonics shading, and colors without shading.
  • Texture map lookups.
  • Rendering from multiple camera views in a single batch.
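
As a rough illustration of spherical harmonics shading, the sketch below evaluates the standard nine real spherical harmonics basis functions (bands 0 to 2) at a surface normal and scales an albedo by the resulting lighting value. It is a minimal pure-Python sketch under stated assumptions, not the renderer's CUDA code: the number of bands, the coefficient ordering, and the function names `sh_basis` and `shade` are choices made here for illustration.

```python
import math

def sh_basis(normal):
    """Evaluate the 9 real SH basis functions (bands 0-2) at a normal."""
    x, y, z = normal
    length = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / length, y / length, z / length  # normalize the input
    return [
        0.282095,
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]

def shade(albedo, normal, sh_coeffs):
    """Shaded RGB = albedo * sum_k sh_coeffs[k] * basis_k(normal)."""
    basis = sh_basis(normal)
    lighting = sum(c * b for c, b in zip(sh_coeffs, basis))
    return [a * lighting for a in albedo]

# Hypothetical inputs: constant ambient term plus light from +z.
coeffs = [1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rgb = shade([0.8, 0.6, 0.4], (0.0, 0.0, 1.0), coeffs)
```

The shaded value is linear in the albedo and in the lighting coefficients and a smooth polynomial in the normal, which is why such a model admits derivatives with respect to texture, lighting, and geometry.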


DeepCap: Monocular Human Performance Capture Using Weak Supervision

Marc Habermann   Weipeng Xu   Michael Zollhoefer   Gerard Pons-Moll   Christian Theobalt

CVPR 2020 (Oral), Best Student Paper Honorable Mention

Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness.

[pdf], [video], [project page], [arxiv], [dataset]

Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation

Lingjie Liu   Weipeng Xu   Marc Habermann   Michael Zollhoefer   Florian Bernard   Hyeongwoo Kim   Wenping Wang   Christian Theobalt

TVCG 2020

Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in the clothing. In this paper, we propose a novel human video synthesis method that addresses these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.

[pdf], [video], [project page], [arxiv]

EventCap: Monocular 3D Capture of High-Speed Human Motions using an Event Camera

Lan Xu   Weipeng Xu   Vladislav Golyanik   Marc Habermann   Lu Fang   Christian Theobalt

CVPR 2020 (Oral)

A high frame rate is a critical requirement for capturing fast human motions. In this setting, existing markerless image-based methods are constrained by the lighting requirement, the high data bandwidth, and the consequent high computation overhead. In this paper, we propose EventCap, the first approach for 3D capturing of high-speed human motions using a single event camera. Our method combines model-based optimization and CNN-based human pose detection to capture high-frequency motion details and to reduce the drifting in the tracking. As a result, we can capture fast motions at millisecond resolution with significantly higher data efficiency than using high-frame-rate videos. Experiments on our new event-based fast human motion dataset demonstrate the effectiveness and accuracy of our method, as well as its robustness to challenging lighting conditions.

[pdf], [video], [project page], [arxiv]

Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data

Yuxiao Zhou   Marc Habermann   Weipeng Xu   Ikhsanul Habibie   Christian Theobalt   Feng Xu

CVPR 2020

We present a novel method for monocular hand shape and pose estimation at unprecedented runtime performance of 100 fps and at state-of-the-art accuracy. This is enabled by a new learning-based architecture designed such that it can make use of all the sources of available hand training data: image data with either 2D or 3D annotations, as well as stand-alone 3D animations without corresponding image data. It features a 3D hand joint detection module and an inverse kinematics module which regresses not only 3D joint positions but also maps them to joint rotations in a single feed-forward pass. This output makes the method more directly usable for applications in computer vision and graphics compared to only regressing 3D joint positions. We demonstrate that our architectural design leads to a significant quantitative and qualitative improvement over the state of the art on several challenging benchmarks. We will make our code publicly available for future research.

[pdf], [video], [project page], [arxiv]

LiveCap: Real-time Human Performance Capture from Monocular Video

Marc Habermann   Weipeng Xu   Michael Zollhoefer   Gerard Pons-Moll   Christian Theobalt

ACM ToG 2019 @ SIGGRAPH 2019

We present the first real-time human performance capture approach that reconstructs dense, space-time coherent deforming geometry of entire humans in general everyday clothing from just a single RGB video. We propose a novel two-stage analysis-by-synthesis optimization whose formulation and implementation are designed for high performance. In the first stage, a skinned template model is jointly fitted to background-subtracted input video, 2D and 3D skeleton joint positions found using a deep neural network, and a set of sparse facial landmark detections. In the second stage, dense non-rigid 3D deformations of skin and even loose apparel are captured based on a novel real-time capable algorithm for non-rigid tracking using dense photometric and silhouette constraints. Our novel energy formulation leverages automatically identified material regions on the template to model the differing non-rigid deformation behavior of skin and apparel. The two resulting nonlinear optimization problems per frame are solved with specially tailored data-parallel Gauss-Newton solvers. In order to achieve real-time performance of over 25 Hz, we design a pipelined parallel architecture using the CPU and two commodity GPUs. Our method is the first real-time monocular approach for full-body performance capture. Our method yields comparable accuracy with off-line performance capture techniques, while being orders of magnitude faster.
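
To give a feel for the Gauss-Newton iteration named in the abstract, the toy sketch below fits a single rotation angle so that a rotated 2D template matches observed points. It is only an illustration under strong simplifying assumptions: one unknown instead of a full skeleton and surface, noise-free observations, plain Python instead of data-parallel GPU code, and hypothetical function names.

```python
import math

def rotate(theta, p):
    """Rotate a 2D point p by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def gauss_newton_rotation(template, observed, theta=0.0, iterations=10):
    """Minimize sum_i |rotate(theta, p_i) - q_i|^2 over theta."""
    for _ in range(iterations):
        c, s = math.cos(theta), math.sin(theta)
        jtj = 0.0  # J^T J, a scalar because there is one unknown
        jtr = 0.0  # J^T r
        for p, q in zip(template, observed):
            px, py = rotate(theta, p)
            rx, ry = px - q[0], py - q[1]  # residual for this point
            # Derivative of the rotated point with respect to theta.
            jx = -s * p[0] - c * p[1]
            jy = c * p[0] - s * p[1]
            jtj += jx * jx + jy * jy
            jtr += jx * rx + jy * ry
        theta -= jtr / jtj  # Gauss-Newton update
    return theta

# Synthetic example: observations are the template rotated by 0.7 rad.
template = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
observed = [rotate(0.7, p) for p in template]
estimate = gauss_newton_rotation(template, observed)
```

Each iteration linearizes the residuals around the current estimate and solves the resulting normal equations; with many unknowns the scalar division becomes a linear system solve, which is the part a data-parallel solver would accelerate.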

[pdf], [video], [project page], [arxiv]

Neural Animation and Reenactment of Human Actor Videos

Lingjie Liu   Weipeng Xu   Michael Zollhoefer   Hyeongwoo Kim   Florian Bernard   Marc Habermann   Wenping Wang   Christian Theobalt

ACM ToG 2019 @ SIGGRAPH 2019

We propose a method for generating (near) video-realistic animations of real humans under user control. In contrast to conventional human character rendering, we do not require the availability of a production-quality photo-realistic 3D model of the human, but instead rely on a video sequence in conjunction with a (medium-quality) controllable 3D template model of the person. With that, our approach significantly reduces production cost compared to conventional rendering approaches based on production-quality 3D models, and can also be used to realistically edit existing videos. Technically, this is achieved by training a neural network that translates simple synthetic images of a human character into realistic imagery. For training our networks, we first track the 3D motion of the person in the video using the template model, and subsequently generate a synthetically rendered version of the video. These images are then used to train a conditional generative adversarial network that translates synthetic images of the 3D model into realistic imagery of the human. We evaluate our method for the reenactment of another person that is tracked in order to obtain the motion data, and show video results generated from artist-designed skeleton motion. Our results outperform the state of the art in learning-based human image synthesis.

[pdf], [video], [project page], [arxiv]

NRST: Non-rigid Surface Tracking from Monocular Video

Marc Habermann   Weipeng Xu   Helge Rhodin   Michael Zollhoefer   Gerard Pons-Moll   Christian Theobalt

Oral @ German Conference on Pattern Recognition (GCPR) 2018

We propose an efficient method for non-rigid surface tracking from monocular RGB videos. Given a video and a template mesh, our algorithm sequentially registers the template non-rigidly to each frame. We formulate the per-frame registration as an optimization problem that includes a novel texture term specifically tailored towards tracking objects with uniform texture but fine-scale structure, such as the regular micro-structural patterns of fabric. Our texture term exploits the orientation information in the micro-structures of the objects, e.g., the yarn patterns of fabrics. This enables us to accurately track uniformly colored materials that have these high-frequency micro-structures, for which traditional photometric terms are usually less effective. The results demonstrate the effectiveness of our method on both general textured non-rigid objects and monochromatic fabrics.

[pdf], [video], [project page]

Awards & Honors

Teaching

  • April 2021 - August 2021
    Supervisor for Computer Vision and Machine Learning for Computer Graphics (lecturers: Prof. Dr. Christian Theobalt, Dr. Mohamed Elgharib, Dr. Vladislav Golyanik) at Saarland University, Saarbrücken, Germany

  • April 2020 - August 2020
    Supervisor for Computer Vision and Machine Learning for Computer Graphics (lecturers: Prof. Dr. Christian Theobalt, Dr. Mohamed Elgharib, Dr. Vladislav Golyanik) at Saarland University, Saarbrücken, Germany

  • April 2019 - August 2019
    Supervisor for Computer Vision and Machine Learning for Computer Graphics (lecturers: Prof. Dr. Christian Theobalt, Dr. Mohamed Elgharib, Dr. Vladislav Golyanik) at Saarland University, Saarbrücken, Germany

  • April 2018 - August 2018
    Supervisor for 3D Shape Analysis (lecturers: Dr. Florian Bernard and Prof. Dr. Christian Theobalt) at Saarland University, Saarbrücken, Germany

  • September 2016 - June 2018
    Tutor for Seminarfach 3D Modellierung at the Leibniz Gymnasium/Albertus Magnus Gymnasium, Sankt Ingbert, Germany

  • July 2013 - September 2016
    Tutor for 3D Modellierung Alte Schmelz, Sankt Ingbert, Germany

Education

  • September 2017 - present
    PhD student at the Max Planck Institute for Informatics in the GVV Group, Saarbrücken, Germany

  • April 2016 - November 2017
    Master Studies in Computer Science at Saarland University, Saarbrücken, Germany
    Title of Master's Thesis (Diplomarbeit): RONDA - Reconstruction of Non-rigid Surfaces from High Resolution Video (supervisor: Prof. Dr. Christian Theobalt) (PDF)

  • October 2012 - April 2016
    Bachelor Studies in Computer Science at Saarland University, Saarbrücken, Germany
    Title of Bachelor's Thesis: Drone Path Planning (supervisor: Dr.-Ing. Tobias Ritschel) (PDF)

  • July 2012
    Abitur at the Albertus Magnus Gymnasium, Sankt Ingbert, Germany


I regularly serve as a reviewer for the following conferences and journals:



  • CVPR

  • ToG

  • IJCV

  • TVCG



  • Photography

  • Bouldering

  • Reading Books