

2025


OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics

Gozlan, Y., Falisse, A., Uhlrich, S., Gatti, A., Black, M., Chaudhari, A.

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), February 2025 (inproceedings)

Abstract
Pose estimation has promised to impact healthcare by enabling more practical methods to quantify nuances of human movement and biomechanics. However, despite the inherent connection between pose estimation and biomechanics, these disciplines have largely remained disparate. For example, most current pose estimation benchmarks use metrics such as Mean Per Joint Position Error, Percentage of Correct Keypoints, or mean Average Precision to assess performance, without quantifying kinematic and physiological correctness, key aspects for biomechanics. To alleviate this challenge, we develop OpenCapBench to offer an easy-to-use unified benchmark to assess common tasks in human pose estimation, evaluated under physiological constraints. OpenCapBench computes consistent kinematic metrics through joint angles provided by open-source musculoskeletal modeling software (OpenSim). Through OpenCapBench, we demonstrate that current pose estimation models use keypoints that are too sparse for accurate biomechanics analysis. To mitigate this challenge, we introduce SynthPose, a new approach that uses synthetic data to finetune pre-trained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis. Finetuning prior models on such synthetic data leads to a twofold reduction in joint angle errors. Moreover, OpenCapBench allows users to benchmark their own models on our clinically relevant cohort. Overall, OpenCapBench bridges the computer vision and biomechanics communities, aiming to drive simultaneous advances in both areas.
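As an informal illustration only (this is not OpenCapBench's actual code; it assumes joint-angle trajectories, e.g. from an OpenSim inverse-kinematics run, are already available as arrays), the kind of kinematic error metric described above can be computed as follows:

import numpy as np

def joint_angle_errors(pred_deg, ref_deg):
    # pred_deg, ref_deg: arrays of shape (frames, joints), joint angles in degrees
    err = pred_deg - ref_deg
    mae = np.mean(np.abs(err), axis=0)         # per-joint mean absolute error
    rmse = np.sqrt(np.mean(err ** 2, axis=0))  # per-joint root-mean-square error
    return mae, rmse

# toy usage with synthetic trajectories: 100 frames, 3 joint angles
rng = np.random.default_rng(0)
ref = rng.uniform(-30, 60, size=(100, 3))
pred = ref + rng.normal(0, 5, size=(100, 3))
mae, rmse = joint_angle_errors(pred, ref)
print("MAE per joint (deg):", mae)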

ps

arXiv [BibTex]



Policy Design in Long-run Welfare Dynamics

Wu, J., Abebe, R., Hardt, M., Stoica, A.

2025 (misc) Submitted

sf

[BibTex]

2024


SPARK: Self-supervised Personalized Real-time Monocular Face Capture

Baert, K., Bharadwaj, S., Castan, F., Maujean, B., Christie, M., Abrevaya, V., Boukhayma, A.

In SIGGRAPH Asia 2024 Conference Proceedings, SIGGRAPH Asia 2024, December 2024 (inproceedings) Accepted

Abstract
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real-time across a wide range of identities, lighting conditions, and poses by leveraging large image datasets of human faces. These methods, however, suffer from a clear limitation: the underlying parametric face model provides only a coarse estimate of the face shape, limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture that takes advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and high-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model as compared to state-of-the-art baselines, and demonstrate its generalization ability to unseen pose, expression, and lighting.

ps

Website Code Paper+Supmat link (url) DOI [BibTex]

PuzzleAvatar: Assembling 3D Avatars from Personal Albums

Xiu, Y., Liu, Z., Tzionas, D., Black, M. J.

ACM Transactions on Graphics, 43(6), ACM, December 2024 (article) To be published

Abstract
Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into (separate) learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only has high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also a unique scalability to album photos, and strong robustness. Our code and data are publicly available for research purposes.

ps

Page Code Video DOI [BibTex]

Latent Diffusion for Neural Spiking Data

Kapoor, J., Schulz, A., Vetter, J., Pei, F., Gao, R., Macke, J. H.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

Didolkar, A. R., Goyal, A., Ke, N. R., Guo, S., Valko, M., Lillicrap, T. P., Rezende, D. J., Bengio, Y., Mozer, M. C., Arora, S.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Learning partitions from Context

Buchholz, S.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

From Causal to Concept-Based Representation Learning

Rajendran*, G., Buchholz*, S., Aragam, B., Schölkopf, B., Ravikumar, P. K.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Theoretical Characterisation of the Gauss Newton Conditioning in Neural Networks

Zhao, J., Singh, S. P., Lucchi, A.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

A Generative Model of Symmetry Transformations

Allingham, J. U., Mlodozeniec, B. K., Padhy, S., Antoran, J., Krueger, D., Turner, R. E., Nalisnick, E., Hernández-Lobato, J. M.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Causal vs. Anticausal merging of predictors

Garrido, S., Blöbaum, P., Schölkopf, B., Janzing, D.

In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (inproceedings) Accepted

ei

[BibTex]

Neural Characteristic Activation Analysis and Geometric Parameterization for ReLU Networks

Chen, W., Ge, H.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Robust Mixture Learning when Outliers Overwhelm Small Groups

Dmitriev, D., Buhai, R., Tiegel, S., Wolters, A., Novikov, G., Sanyal, A., Steurer, D., Yang, F.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Lin, J. A., Padhy, S., Mlodozeniec, B. K., Antoran, J., Hernández-Lobato, J. M.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

MotionFix: Text-Driven 3D Human Motion Editing

Athanasiou, N., Cseke, A., Diomataris, M., Black, M. J., Varol, G.

In SIGGRAPH Asia 2024 Conference Proceedings, ACM, December 2024 (inproceedings) To be published

Abstract
The focus of this paper is 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The challenges include the lack of training data and the design of a model that faithfully edits the source motion. In this paper, we address both of these challenges. We build a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text, and use it to create a new dataset. Having access to such data allows us to train a conditional diffusion model that takes both the source motion and the edit text as input. We further build various baselines trained only on text-motion pair datasets and show the superior performance of our model trained on triplets. We introduce new retrieval-based metrics for motion editing and establish a new benchmark on the evaluation set. Our results are encouraging, paving the way for further research on fine-grained motion generation. Code and models will be made publicly available.

ps

link (url) Project Page [BibTex]

Cooperate or Collapse: Emergence of Sustainability in a Society of LLM Agents

Piatti*, G., Jin*, Z., Kleiman-Weiner*, M., Schölkopf, B., Sachan, M., Mihalcea, R.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024, *equal contribution (conference) Accepted

ei

[BibTex]

What Makes Safety Fine-tuning Methods Safe? A Mechanistic Study

Jain, S., Lubana, E. S., Oksuz, K., Joy, T., Torr, P., Sanyal, A., Dokania, P. K.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Sourcerer: Sample-based Maximum Entropy Source Distribution Estimation

Vetter, J., Moss, G., Schröder, C., Gao, R., Macke, J. H.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal

Ye, C., Qiu, L., Gu, X., Zuo, Q., Wu, Y., Dong, Z., Bo, L., Xiu, Y., Han, X.

ACM Transactions on Graphics, 43(6), ACM, December 2024 (article) To be published

Abstract
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field that has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, which conflicts with the deterministic nature of the Image2Normal task, and with a costly ensembling step that slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy that starts with a one-step normal estimator (YOSO) to derive an initial normal guess that is relatively coarse but reliable, followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance on standard datasets such as DIODE-indoor, iBims, ScanNetV2, and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results show that StableNormal retains both the "stability" and "sharpness" needed for accurate normal estimation. StableNormal represents an initial attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been made publicly available.

ps

Page Huggingface Demo Code Video DOI [BibTex]

Do Finetti: On Causal Effects for Exchangeable Data

Guo, S., Zhang, C., Mohan, K., Huszár*, F., Schölkopf*, B.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024, *joint senior authors (conference) Accepted

ei

[BibTex]

On Affine Homotopy between Language Encoders

Chan, R., Bourmasmoud, R., Svete, A., Ren, Y., Guo, Q., Jin, Z., Ravfogel, S., Sachan, M., Schölkopf, B., El-Assady, M., Cotterell, R.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Inferring stochastic low-rank recurrent neural networks from neural data

Pals, M., Sağtekin, A. E., Pei, F., Gloeckler, M., Macke, J.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 38th Annual Conference on Neural Information Processing Systems, December 2024 (conference) Accepted

ei

[BibTex]

Demonstration: OCRA - A Kinematic Retargeting Algorithm for Expressive Whole-Arm Teleoperation

Mohan, M., Kuchenbecker, K. J.

Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (misc) Accepted

Abstract
Traditional teleoperation systems focus on controlling the pose of the end-effector (task space), often neglecting the additional degrees of freedom present in human and many robotic arms. This demonstration presents the Optimization-based Customizable Retargeting Algorithm (OCRA), which was designed to map motions from one serial kinematic chain to another in real time. OCRA is versatile, accommodating any robot joint counts and segment lengths, and it can retarget motions from human arms to kinematically different serial robot arms with revolute joints both expressively and efficiently. One of OCRA's key features is its customizability, allowing the user to adjust the emphasis between hand orientation error and the configuration error of the arm's central line, which we call the arm skeleton. To evaluate the perceptual quality of the motions generated by OCRA, we conducted a video-watching study with 70 participants; the results indicated that the algorithm produces robot motions that closely resemble human movements, with a median rating of 78/100, particularly when the arm skeleton error weight and hand orientation error are balanced. In this demonstration, the presenter will wear an Xsens MVN Link and teleoperate the arms of a NAO child-size humanoid robot to highlight OCRA's ability to create intuitive and human-like whole-arm motions.

hi

Project Page [BibTex]

Demonstration: Minsight - A Soft Vision-Based Tactile Sensor for Robotic Fingertips

Andrussow, I., Sun, H., Martius, G., Kuchenbecker, K. J.

Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (misc) Accepted

Abstract
Beyond vision and hearing, tactile sensing enhances a robot's ability to dexterously manipulate unfamiliar objects and safely interact with humans. Giving touch sensitivity to robots requires compact, robust, affordable, and efficient hardware designs, especially for high-resolution tactile sensing. We present a soft vision-based tactile sensor engineered to meet these requirements. Comparable in size to a human fingertip, Minsight uses machine learning to output high-resolution directional contact force distributions at 60 Hz. Minsight's tactile force maps enable precise sensing of fingertip contacts, which we use in this hands-on demonstration to allow a 3-DoF robot arm to physically track contact with a user's finger. While observing the colorful image captured by Minsight's internal camera, attendees can experience how its ability to detect delicate touches in all directions facilitates real-time robot interaction.

al hi ei

Project Page [BibTex]

Active Haptic Feedback for a Virtual Wrist-Anchored User Interface

Bartels, J. U., Sanchez-Tamayo, N., Sedlmair, M., Kuchenbecker, K. J.

Hands-on demonstration presented at the ACM Symposium on User Interface Software and Technology (UIST), Pittsburgh, USA, October 2024 (misc) Accepted

hi

DOI [BibTex]

Reinforcement learning in cold atom experiments

Reinschmidt, M., Fortágh, J., Günther, A., Volchkov, V.

Nature Communications, 15:8532, October 2024 (article)

Abstract
Cold atom traps are at the heart of many quantum applications in science and technology. The preparation and control of atomic clouds involve complex optimization processes that could be supported and accelerated by machine learning. In this work, we introduce reinforcement learning to cold atom experiments and demonstrate a flexible and adaptive approach to control a magneto-optical trap. Instead of following a set of predetermined rules to accomplish a specific task, the objectives are defined by a reward function. This approach not only optimizes the cooling of atoms just as an experimentalist would do, but also enables new operational modes such as the preparation of pre-defined numbers of atoms in a cloud. The machine control is trained to be robust against external perturbations and able to react to situations not seen during training. Finally, we show that the time-consuming training can be performed in silico using a generic simulation and demonstrate successful transfer to the real-world experiment.

OS Lab

link (url) DOI [BibTex]

Human Hair Reconstruction with Strand-Aligned 3D Gaussians

Zakharov, E., Sklyarova, V., Black, M. J., Nam, G., Thies, J., Hilliges, O.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (inproceedings)

Abstract
We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. In contrast to recent approaches that leverage unstructured Gaussians to model human avatars, our method reconstructs the hair using 3D polylines, or strands. This fundamental difference allows the use of the resulting hairstyles out-of-the-box in modern computer graphics engines for editing, rendering, and simulation. Our 3D lifting method relies on unstructured Gaussians to generate multi-view ground truth data to supervise the fitting of hair strands. The hairstyle itself is represented in the form of the so-called strand-aligned 3D Gaussians. This representation allows us to combine strand-based hair priors, which are essential for realistic modeling of the inner structure of hairstyles, with the differentiable rendering capabilities of 3D Gaussian Splatting. Our method, named Gaussian Haircut, is evaluated on synthetic and real scenes and demonstrates state-of-the-art performance in the task of strand-based hair reconstruction.

ps

pdf project code video arXiv [BibTex]

Decline Now: A Combinatorial Model for Algorithmic Collective Action

Sigg, D., Hardt, M., Mendler-Dünner, C.

arXiv preprint arXiv:2410.12633, October 2024 (conference) Submitted

sf

[BibTex]

Stable Video Portraits

Ostrek, M., Thies, J.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (inproceedings) Accepted

Abstract
Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.

ncs ps

link (url) [BibTex]

Generating Human Interaction Motions in Scenes with Text Control

Yi, H., Thies, J., Black, M. J., Peng, X. B., Rempe, D.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (inproceedings)

Abstract
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions.

ps

pdf project [BibTex]

On predicting 3D bone locations inside the human body

Dakri, A., Arora, V., Challier, L., Keller, M., Black, M. J., Pujades, S.

In 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), October 2024 (inproceedings)

Abstract
Knowing the precise location of the bones inside the human body is key in several medical tasks, such as patient placement inside an imaging device or surgical navigation inside a patient. Our goal is to predict the bone locations using only an external 3D body surface observation. Existing approaches either validate their predictions on 2D data (X-rays) or with pseudo-ground truth computed from motion capture using biomechanical models. Thus, methods either suffer from a 3D-2D projection ambiguity or directly lack validation on clinical imaging data. In this work, we start with a dataset of segmented skin and long bones obtained from 3D full-body MRI images that we refine into individual bone segmentations. To learn the skin-to-bones correlations, one needs to register the paired data. Few anatomical models allow registering a skeleton and the skin simultaneously. One such method, SKEL, has a skin and skeleton that are jointly rigged with the same pose parameters. However, it lacks the flexibility to adjust the bone locations inside its skin. To address this, we extend SKEL into SKEL-J to allow its bones to fit the segmented bones while its skin fits the segmented skin. These precise fits allow us to train SKEL-J to more accurately infer the anatomical joint locations from the skin surface. Our qualitative and quantitative results show how our bone location predictions are more accurate than all existing approaches. To foster future research, we make available for research purposes the individual bone segmentations, the fitted SKEL-J models, as well as the new inference methods.

ps

Project page [BibTex]

Synthesizing Environment-Specific People in Photographs

Ostrek, M., O’Sullivan, C., Black, M., Thies, J.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (inproceedings) Accepted

Abstract
We present ESP, a novel method for context-aware full-body generation, that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.

ncs ps

link (url) [BibTex]

Explorative Inbetweening of Time and Space

Feng, H., Ding, Z., Xia, Z., Niklaus, S., Fernandez Abrevaya, V., Black, M. J., Zhang, X.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (inproceedings)

Abstract
We introduce bounded generation as a generalized task to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path results in a video that smoothly connects the two frames, generating inbetweening of faithful subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting the ability to generate complex motions and 3D-consistent views guided by bounded frames.
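The abstract's core idea, fusing a forward denoising path conditioned on the start frame with a backward path conditioned on the end frame, can be sketched very loosely as below. This is a toy illustration under stated assumptions (a placeholder denoiser and simple linear blending weights), not the paper's Time Reversal Fusion sampler:

import numpy as np

def toy_denoise_step(frames, cond_frame, t):
    # placeholder "denoiser": nudge noisy frames toward the conditioning frame
    return frames + 0.1 * (cond_frame - frames) + 0.01 * t * np.random.randn(*frames.shape)

def bounded_generation(start, end, n_frames=16, n_steps=50):
    # one path conditioned on the start frame, one on the end frame
    shape = (n_frames,) + start.shape
    fwd = np.random.randn(*shape)
    bwd = np.random.randn(*shape)
    # weights favor the start-conditioned path early in the clip, the end-conditioned path late
    w = np.linspace(1.0, 0.0, n_frames).reshape(-1, *([1] * start.ndim))
    for step in range(n_steps, 0, -1):
        t = step / n_steps
        fwd = toy_denoise_step(fwd, start, t)
        bwd = toy_denoise_step(bwd, end, t)
        fused = w * fwd + (1.0 - w) * bwd  # fuse the two trajectories at every step
        fwd = bwd = fused
    return fused

video = bounded_generation(np.zeros((8, 8, 3)), np.ones((8, 8, 3)))
print(video.shape)  # (16, 8, 8, 3)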

ps

Paper Website [BibTex]

Limits to Scalable Evaluation at the Frontier: LLM as Judge Won’t Beat Twice the Data

Dorner, F. E., Nastl, V. Y., Hardt, M.

arXiv preprint arXiv:2410.13341, October 2024 (conference) Submitted

Abstract
High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.
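For intuition about the setting analyzed here (not the paper's proof), the following sketch shows one generic debiasing scheme: a large pool of judge verdicts is corrected with a small number of ground-truth labels. The simulation parameters are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)

# simulated evaluation: 1 = the evaluated model's answer is correct, 0 = incorrect
n_unlabeled, n_labeled = 10_000, 200
true_acc = 0.70
judge_noise = 0.15  # probability that the judge flips a verdict

truth_l = rng.binomial(1, true_acc, n_labeled)
judge_l = np.where(rng.random(n_labeled) < judge_noise, 1 - truth_l, truth_l)
truth_u = rng.binomial(1, true_acc, n_unlabeled)
judge_u = np.where(rng.random(n_unlabeled) < judge_noise, 1 - truth_u, truth_u)

naive = judge_u.mean()                                  # judge-only estimate (biased)
debiased = judge_u.mean() + (truth_l - judge_l).mean()  # corrected with a few true labels
print(f"naive {naive:.3f}  debiased {debiased:.3f}  true accuracy {true_acc:.3f}")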

sf

[BibTex]


HUMOS: Human Motion Model Conditioned on Body Shape

Tripathi, S., Taheri, O., Lassner, C., Black, M. J., Holden, D., Stoll, C.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (inproceedings)

Abstract
Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it's possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods.

ps

project arXiv [BibTex]

Training on the Test Task Confounds Evaluation and Emergence

Dominguez-Olmedo, R., Dorner, F. E., Hardt, M.

arXiv preprint arXiv:2407.07890, October 2024 (conference) In revision

Abstract
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.

sf

ArXiv [BibTex]

Hexagonal electrohydraulic modules for rapidly reconfigurable high-speed robots

Yoder, Z., Rumley, E., Schmidt, I., Rothemund, P., Keplinger, C.

Science Robotics, 9, September 2024 (article)

Abstract
Robots made from reconfigurable modular units feature versatility, cost efficiency, and improved sustainability compared with fixed designs. Reconfigurable modules driven by soft actuators provide adaptable actuation, safe interaction, and wide design freedom, but existing soft modules would benefit from high-speed and high-strain actuation, as well as driving methods well-suited to untethered operation. Here, we introduce a class of electrically actuated robotic modules that provide high-speed (a peak contractile strain rate of 4618% per second, 15.8-hertz bandwidth, and a peak specific power of 122 watts per kilogram), high-strain (49% contraction) actuation and that use magnets for reversible mechanical and electrical connections between neighboring modules, thereby serving as building blocks for rapidly reconfigurable and highly agile robotic systems. The actuation performance of each hexagonal electrohydraulic (HEXEL) module is enabled by a synergistic combination of soft and rigid components; a hexagonal exoskeleton of rigid plates amplifies the motion produced by soft electrohydraulic actuators and provides a mechanical structure and connection platform for reconfigurable robots composed of many modules. We characterize the actuation performance of individual HEXEL modules, present a model that captures their quasi-static force-stroke behavior, and demonstrate both a high-jumping and a fast pipe-crawling robot. Using embedded magnetic connections, we arranged multiple modules into reconfigurable robots with diverse functionality, including a high-stroke muscle, a multimodal active array, a table-top active platform, and a fast-rolling robot. We further leveraged the magnetic connections for hosting untethered, snap-on driving electronics, together highlighting the promise of HEXEL modules for creating rapidly reconfigurable high-speed robots.

rm

link (url) DOI [BibTex]


GraspXL: Generating Grasping Motions for Diverse Objects at Scale

Zhang, H., Christen, S., Fan, Z., Hilliges, O., Song, J.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, September 2024 (inproceedings) Accepted

ps

Code Video Paper [BibTex]

Fiber-Optic Shape Sensing Using Neural Networks Operating on Multispecklegrams

Cao, C. G. L., Javot, B., Bhattarai, S., Bierig, K., Oreshnikov, I., Volchkov, V. V.

IEEE Sensors Journal, 24(17):27532-27540, September 2024 (article)

Abstract
Application of machine learning techniques on fiber speckle images to infer fiber deformation allows the use of an unmodified multimode fiber to act as a shape sensor. This approach eliminates the need for complex fiber design or construction (e.g., Bragg gratings and time-of-flight). Prior work in shape determination using neural networks trained on a finite number of possible fiber shapes (formulated as a classification task), or trained on a few continuous degrees of freedom, has been limited to reconstruction of fiber shapes only one bend at a time. Furthermore, generalization to shapes that were not used in training is challenging. Our innovative approach improves generalization capabilities, using computer vision-assisted parameterization of the actual fiber shape to provide a ground truth, and multiple specklegrams per fiber shape obtained by controlling the input field. Results from experimenting with several neural network architectures, shape parameterization, number of inputs, and specklegram resolution show that fiber shapes with multiple bends can be accurately predicted. Our approach is able to generalize to new shapes that were not in the training set. This approach of end-to-end training on parameterized ground truth opens new avenues for fiber-optic sensor applications. We publish the datasets used for training and validation, as well as an out-of-distribution (OOD) test set, and encourage interested readers to access these datasets for their own model development.
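A minimal sketch of the general regression setup described above, assuming a small convolutional network that maps a grayscale specklegram to a vector of continuous shape parameters (the architecture, input size, and parameter count are illustrative assumptions, not the authors' model):

import torch
import torch.nn as nn

class SpeckleToShape(nn.Module):
    # toy CNN regressor: specklegram image -> fiber-shape parameter vector
    def __init__(self, n_params=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, n_params))

    def forward(self, x):
        return self.head(self.features(x))

model = SpeckleToShape()
specklegrams = torch.rand(4, 1, 128, 128)  # batch of grayscale speckle images
shape_params = torch.rand(4, 6)            # parameterized ground-truth shapes
loss = nn.functional.mse_loss(model(specklegrams), shape_params)
loss.backward()                            # one regression step (optimizer omitted)
print(loss.item())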

hi ei OS Lab zwe-sw

DOI [BibTex]


Leveraging Unpaired Data for the Creation of Controllable Digital Humans

Sanyal, S.

Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, September 2024 (phdthesis) To be published

Abstract
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels—such as identity, type of clothing, appearance of clothing, etc. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods—RingNet, SPICE, and SCULPT—each tackles different aspects of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images—such as identity consistency—can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses. Finally, SCULPT (Sanyal et al. [2024]) generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.

ps

[BibTex]

Predictors from Causal Features Do Not Generalize Better to New Domains

Nastl, V. Y., Hardt, M.

arXiv preprint arXiv:2402.09891, September 2024 (conference) Accepted

Abstract
We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select features that have a causal influence on the target of prediction. Our goal is to test the hypothesis that models trained on causal features generalize better across domains. Without exception, we find that predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features. Moreover, even the absolute drop in accuracy from one domain to the other is no better for causal predictors than for models that use all features. If the goal is to generalize to new domains, practitioners might as well train the best possible model on all available features.
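A toy version of the comparison described above, using synthetic data and scikit-learn (the feature split and the domain shift are invented for illustration; the paper's finding is empirical across 16 real tasks, so this sketch only shows the experimental pattern, not the result):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_domain(n, shift):
    causal = rng.normal(0, 1, (n, 3))  # features that causally influence the label
    y = (causal.sum(axis=1) + rng.normal(0, 1, n) > 0).astype(int)
    # non-causal features correlated with the label; their noise level shifts across domains
    spurious = y[:, None] + rng.normal(0, 1 + shift, (n, 2))
    return np.hstack([causal, spurious]), y

X_tr, y_tr = make_domain(5000, shift=0.0)  # training domain
X_te, y_te = make_domain(5000, shift=2.0)  # new domain

causal_only = LogisticRegression().fit(X_tr[:, :3], y_tr)
all_features = LogisticRegression().fit(X_tr, y_tr)
print("causal features, new domain accuracy:", causal_only.score(X_te[:, :3], y_te))
print("all features,    new domain accuracy:", all_features.score(X_te, y_te))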

sf

ArXiv link (url) [BibTex]

Localization and recognition of human action in 3D using transformers

Sun, J., Huang, L., Wang, H., Zheng, C., Qiu, J., Islam, M. T., Xie, E., Zhou, B., Xing, L., Chandrasekaran, A., Black, M. J.

Nature Communications Engineering, 13(125), September 2024 (article)

Abstract
Understanding a person’s behavior from their 3D motion sequence is a fundamental problem in computer vision with many applications. An important component of this problem is 3D action localization, which involves recognizing what actions a person is performing, and when the actions occur in the sequence. To promote the progress of the 3D action localization community, we introduce a new, challenging, and more complex benchmark dataset, BABEL-TAL (BT), for 3D action localization. Important baselines and evaluating metrics, as well as human evaluations, are carefully established on this benchmark. We also propose a strong baseline model, i.e., Localizing Actions with Transformers (LocATe), that jointly localizes and recognizes actions in a 3D sequence. The proposed LocATe shows superior performance on BABEL-TAL as well as on the large-scale PKU-MMD dataset, achieving state-of-the-art performance by using only 10% of the labeled training data. Our research could advance the development of more accurate and efficient systems for human behavior analysis, with potential applications in areas such as human-computer interaction and healthcare.

ps

paper DOI [BibTex]

Realistic Digital Human Characters: Challenges, Models and Algorithms

Osman, A. A. A.

University of Tübingen, September 2024 (phdthesis)

Abstract
Statistical models for the body, head, and hands are essential in various computer vision tasks. However, popular models like SMPL, MANO, and FLAME produce unrealistic deformations due to inherent flaws in their modeling assumptions and how they are trained, which have become standard practices in constructing models for the body and its parts. This dissertation addresses these limitations by proposing new modeling and training algorithms to improve the realism and generalization of current models. We introduce a new model, STAR (Sparse Trained Articulated Human Body Regressor), which learns a sparse representation of the human body deformations, significantly reducing the number of model parameters compared to models like SMPL. This approach ensures that deformations are spatially localized, leading to more realistic deformations. STAR also incorporates shape-dependent pose deformations, accounting for variations in body shape to enhance overall model accuracy and realism. Additionally, we present a novel federated training algorithm for developing a comprehensive suite of models for the body and its parts. We train an expressive body model, SUPR (Sparse Unified Part-Based Representation), on a federated dataset of full-body scans, including detailed scans of the head, hands, and feet. We then separate SUPR into a full suite of state-of-the-art models for the head, hands, and foot. The new foot model captures complex foot deformations, addressing challenges related to foot shape, pose, and ground contact dynamics. The dissertation concludes by introducing AVATAR (Articulated Virtual Humans Trained By Bayesian Inference From a Single Scan), a novel, data-efficient training algorithm. AVATAR allows the creation of personalized, high-fidelity body models from a single scan by framing model construction as a Bayesian inference problem, thereby enabling training from small-scale datasets while reducing the risk of overfitting. These advancements push the state of the art in human body modeling and training techniques, making them more accessible for broader research and practical applications.

ps

[BibTex]


Cutaneous Electrohydraulic (CUTE) Wearable Devices for Pleasant Broad-Bandwidth Haptic Cues

Sanchez-Tamayo, N., Yoder, Z., Rothemund, P., Ballardini, G., Keplinger, C., Kuchenbecker, K. J.

Advanced Science, (2402461):1-14, September 2024 (article)

Abstract
By focusing on vibrations, current wearable haptic devices underutilize the skin's perceptual capabilities. Devices that provide richer haptic stimuli, including contact feedback and/or variable pressure, are typically heavy and bulky due to the underlying actuator technology and the low sensitivity of hairy skin, which covers most of the body. This paper presents a system architecture for compact wearable devices that deliver salient and pleasant broad-bandwidth haptic cues: Cutaneous Electrohydraulic (CUTE) devices combine a custom materials design for soft haptic electrohydraulic actuators that feature high stroke, high force, and electrical safety with a comfortable mounting strategy that places the actuator in a non-contact resting position. A prototypical wrist-wearable CUTE device produces rich tactile sensations by making and breaking contact with the skin (2.44 mm actuation stroke), applying high controllable forces (exceeding 2.3 N), and delivering vibrations at a wide range of amplitudes and frequencies (0-200 Hz). A perceptual study with fourteen participants achieved 97.9% recognition accuracy across six diverse cues and verified their pleasant and expressive feel. This system architecture for wearable devices gives unprecedented control over the haptic cues delivered to the skin, providing an elegant and discreet way to activate the user's sense of touch.

hi rm

DOI [BibTex]


Electrohydraulic Musculoskeletal Robotic Leg for Agile, Adaptive, yet Energy-Efficient Locomotion

Buchner, T. J. K., Fukushima, T., Kazemipour, A., Gravert, S., Prairie, M., Romanescu, P., Arm, P., Zhang, Y., Wang, X., Zhang, S. L., Walter, J., Keplinger, C., Katzschmann, R. K.

Nature Communications, 15(1), September 2024 (article)

Abstract
Robotic locomotion in unstructured terrain demands an agile, adaptive, and energy-efficient architecture. To traverse such terrains, legged robots use rigid electromagnetic motors and sensorized drivetrains to adapt to the environment actively. These systems struggle to compete with animals that excel through their agile and effortless motion in natural environments. We propose a bio-inspired musculoskeletal leg architecture driven by antagonistic pairs of electrohydraulic artificial muscles. Our leg is mounted on a boom arm and can adaptively hop on varying terrain in an energy-efficient yet agile manner. It can also detect obstacles through capacitive self-sensing. The leg performs powerful and agile gait motions beyond 5 Hz and high jumps up to 40 % of the leg height. Our leg’s tunable stiffness and inherent adaptability allow it to hop over grass, sand, gravel, pebbles, and large rocks using only open-loop force control. The electrohydraulic leg features a low cost of transport (0.73), and while squatting, it consumes only a fraction of the energy (1.2 %) compared to its conventional electromagnetic counterpart. Its agile, adaptive, and energy-efficient properties would open a roadmap toward a new class of musculoskeletal robots for versatile locomotion and operation in unstructured natural environments.

rm

Press release Video (overview) Video (technical description) Article in pdf link (url) DOI [BibTex]

Building Instructions You Can Feel: Edge-Changing Haptic Devices for Digitally Guided Construction

Tashiro, N., Faulkner, R., Melnyk, S., Rodriguez, T. R., Javot, B., Tahouni, Y., Cheng, T., Wood, D., Menges, A., Kuchenbecker, K. J.

ACM Transactions on Computer-Human Interaction, September 2024 (article) Accepted

Abstract
Recent efforts to connect builders to digital designs during construction have primarily focused on visual augmented reality, which requires accurate registration and specific lighting, and which could prevent a user from noticing safety hazards. Haptic interfaces, on the other hand, can convey physical design parameters through tangible local cues that don't distract from the surroundings. We propose two edge-changing haptic devices that use small inertial measurement units (IMUs) and linear actuators to guide users to perform construction tasks in real time: Drangle gives feedback for angling a drill relative to gravity, and Brangle assists with orienting bricks in the plane. We conducted a study with 18 participants to evaluate user performance and gather qualitative feedback. All users understood the edge-changing cues from both devices with minimal training. Drilling holes with Drangle was somewhat less accurate but much faster and easier than with a mechanical guide; 89% of participants preferred Drangle over the mechanical guide. Users generally understood Brangle's feedback but found its hand-size-specific grip, palmar contact, and attractive tactile cues less intuitive than Drangle's generalized form factor, fingertip contact, and repulsive cues. After summarizing design considerations, we propose application scenarios and speculate how such devices could improve construction workflows.

hi

[BibTex]

Evaluating Language Models as Risk Scores

Cruz, A. F., Hardt, M., Mendler-Dünner, C.

arXiv preprint arXiv:2407.14614, September 2024 (conference) Accepted

Abstract
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.

sf

ArXiv [BibTex]

Learning to Control Emulated Muscles in Real Robots: Towards Exploiting Bio-Inspired Actuator Morphology

Schumacher, P., Krause, L., Schneider, J., Büchler, D., Martius, G., Haeufle, D.

In 10th International Conference on Biomedical Robotics and Biomechatronics (BioRob), September 2024 (inproceedings) Accepted

ei

arXiv [BibTex]

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Fan, Z., Ohkawa, T., Yang, L., Lin, N., Zhou, Z., Zhou, S., Liang, J., Gao, Z., Zhang, X., Zhang, X., Li, F., Zheng, L., Lu, F., Zeid, K. A., Leibe, B., On, J., Baek, S., Prakash, A., Gupta, S., He, K., Sato, Y., Hilliges, O., Chang, H. J., Yao, A.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, September 2024 (inproceedings) Accepted

ps

Paper Leaderboard [BibTex]

AWOL: Analysis WithOut synthesis using Language

Zuffi, S., Black, M. J.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, September 2024 (inproceedings)

ps

Paper [BibTex]
