Perceiving Systems Book 2014

Human Pose Estimation from Video and Inertial Sensors

The analysis and understanding of human movement is central to many applications such as sports science, medical diagnosis and movie production. The ability to automatically monitor human activity in security-sensitive areas such as airports, lobbies or borders is of great practical importance. Furthermore, automatic pose estimation from images facilitates the processing and understanding of massive digital libraries available on the Internet. We build upon a model-based approach where the human shape is modelled with a surface mesh and the motion is parametrized by a kinematic chain. We then seek the pose of the model that best explains the available observations coming from different sensors.

In a first scenario, we consider a calibrated multi-view setup in an indoor studio. To obtain very accurate results, we propose a novel tracker that combines information coming from video and a small set of Inertial Measurement Units (IMUs). We do so by locally optimizing a joint energy consisting of a term that measures the likelihood of the video data and a term for the IMU data. This is the first work to successfully combine video and IMU information for full-body pose estimation. When compared to commercial marker-based systems, the proposed solution is more cost-efficient and less intrusive for the user.

In a second scenario, we relax the assumption of an indoor studio and we tackle outdoor scenes with background clutter, illumination changes, large recording volumes and difficult motions of people interacting with objects. Again, we combine information from video and IMUs. Here we employ a particle-based optimization approach that allows us to be more robust to tracking failures. To satisfy the orientation constraints imposed by the IMUs, we derive an analytic Inverse Kinematics (IK) procedure to sample from the manifold of valid poses. The generated hypotheses come from a lower-dimensional manifold and therefore the computational cost can be reduced. Experiments on challenging sequences suggest the proposed tracker can be applied to motion capture in outdoor scenarios. Furthermore, the proposed IK sampling procedure can be used to integrate any kind of constraints derived from the environment.

Finally, we consider the most challenging possible scenario: pose estimation from monocular images. Here, we argue that estimating the pose to the degree of accuracy achievable in an engineered environment is too ambitious with the current technology. Therefore, we propose to extract meaningful semantic information about the pose directly from image features in a discriminative fashion. In particular, we introduce posebits, which are semantic pose descriptors of the geometric relationships between parts of the body. The experiments show that the intermediate step of inferring posebits from images can improve pose estimation from monocular imagery. Furthermore, posebits can be very useful as input features for many computer vision algorithms.

Author(s): Gerard Pons-Moll
Book Title: Ph.D Thesis
Year: 2014
Publisher: -
Bibtex Type: Book (book)
Electronic Archiving: grant_archive

BibTex

@book{Pons-Moll_dissertation,
  title = {Human Pose Estimation from Video and Inertial Sensors},
  booktitle = {Ph.D Thesis},
  abstract = {The analysis and understanding of human movement is central to many applications
  such as sports science, medical diagnosis and movie production. The ability to
  automatically monitor human activity in security-sensitive areas such as airports,
  lobbies or borders is of great practical importance. Furthermore, automatic
  pose estimation from images facilitates the processing
  and understanding of massive digital libraries available on the Internet.
  We build upon a model-based approach where the human shape is modelled with a surface mesh
  and the motion is parametrized by a kinematic chain. We then seek the pose
  of the model that best explains the available observations coming from different sensors.
  
  In a first scenario, we consider a calibrated multi-view setup in an indoor studio. To obtain very accurate
  results, we propose a novel tracker that combines information coming from video and a
  small set of Inertial Measurement Units (IMUs). We do so by locally optimizing a joint
  energy consisting of a term that measures the likelihood of the video data and a term
  for the IMU data. This is the first work to successfully combine video and IMU
  information for full-body pose estimation. When compared to commercial marker-based systems,
  the proposed solution is more cost-efficient and less intrusive for the user.
  
  In a second scenario, we relax the assumption of an indoor studio and we tackle outdoor scenes
  with background clutter, illumination changes, large recording volumes and difficult motions
  of people interacting with objects. Again, we combine information from video and IMUs.
  Here we employ a particle-based optimization approach
  that allows us to be more robust to tracking failures. To satisfy the orientation constraints
  imposed by the IMUs, we derive an analytic Inverse Kinematics (IK) procedure to sample from the manifold
  of valid poses. The generated hypotheses come from a lower-dimensional manifold and therefore the computational
  cost can be reduced. Experiments on challenging sequences suggest the proposed tracker can be applied
  to motion capture in outdoor scenarios. Furthermore, the proposed IK sampling procedure can be used
  to integrate any kind of constraints derived from the environment.
  
  Finally, we consider the most challenging possible scenario: pose estimation from monocular images.
  Here, we argue that estimating the pose to the degree of accuracy achievable in an engineered environment is
  too ambitious with the current technology. Therefore, we propose to extract meaningful semantic information about
  the pose directly from image features in a discriminative fashion. In particular, we introduce posebits,
  which are semantic pose descriptors of the geometric relationships between parts of the body.
  The experiments
  show that the intermediate step of inferring posebits from images can improve pose estimation from
  monocular imagery. Furthermore, posebits can be very useful as input features for many computer vision
  algorithms.},
  publisher = {-},
  year = {2014},
  slug = {pons-moll_dissertation},
  author = {Pons-Moll, Gerard}
}