
Deep Learning for 3D Human Action Modeling and Understanding

  • Author / Creator
    Guo, Chuan
  • Studying 3D human actions is a fundamental task in computer graphics, computer vision, and robotics, with a broad range of applications such as VR/AR, AAA gaming, filmmaking, and artistic creation. Conventional approaches often require labor-intensive manual intervention, consuming substantial time and resources. While deep learning has fundamentally transformed the analysis of many other visual modalities such as images and videos, its potential in human action analysis remains largely unexplored. In this thesis, we address this gap by developing solutions and resources for modeling and understanding human motions using deep generative models.

    Firstly, we collect and annotate two large-scale multimodal human action datasets: (i) HumanAct12, which consists of 1,191 motion clips and 90,099 frames, annotated with 12 coarse-grained and 34 fine-grained action classes; and (ii) HumanML3D, which contains 44,970 textual descriptions and 14,616 motions, totaling 28.59 hours of data. Secondly, we investigate neural networks designed to accommodate various input modalities for motion generation, including action categories (action2motion), textual descriptions (text2motion), and style cues (motion stylization). Distinct from previous deterministic approaches to motion synthesis, our methods adopt a series of Variational Autoencoder (VAE)-based frameworks that capture the inherently stochastic nature of human movements, thereby generating diverse and natural 3D human motions from various conditions (an illustrative sketch of such a conditional VAE follows the abstract). Thirdly, we develop an automated character animation pipeline that transfers motions to a character depicted in a single image (motion2video). This is accomplished by reconstructing the 3D shape and texture of the character from the input image, then rigging, animating, and rendering the 3D sequence into a 2D video.
    Fourthly, we explore the reciprocal relationship between human motion and human language (i.e., text), offering a unified framework for both motion understanding and generation. To bridge the gap between the motion and text modalities, we introduce the concept of discrete motion tokens, a novel motion representation formed through deep vector quantization (see the second sketch below). These motion tokens allow human motion to be translated into natural language seamlessly, akin to neural machine translation. Furthermore, we present an inverse alignment technique and demonstrate that understanding motions facilitates more accurate and robust motion generation.
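
    To make the conditional VAE framing above concrete, the following minimal PyTorch sketch encodes a motion sequence together with a condition (e.g., a one-hot action class) into a latent distribution and decodes a motion back out. The architecture, dimensions, and names here are illustrative assumptions, not the actual models used in the thesis.

    import torch
    import torch.nn as nn

    class ConditionalMotionVAE(nn.Module):
        """Minimal conditional VAE sketch for motion generation (illustrative only)."""

        def __init__(self, pose_dim=72, cond_dim=12, latent_dim=64, hidden=256):
            super().__init__()
            self.encoder = nn.GRU(pose_dim + cond_dim, hidden, batch_first=True)
            self.to_mu = nn.Linear(hidden, latent_dim)
            self.to_logvar = nn.Linear(hidden, latent_dim)
            self.decoder = nn.GRU(latent_dim + cond_dim, hidden, batch_first=True)
            self.to_pose = nn.Linear(hidden, pose_dim)

        def forward(self, motion, cond):
            # motion: (B, T, pose_dim); cond: (B, cond_dim), e.g. a one-hot action class
            T = motion.size(1)
            cond_seq = cond.unsqueeze(1).expand(-1, T, -1)
            _, h = self.encoder(torch.cat([motion, cond_seq], dim=-1))
            mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
            # Reparameterization trick: sample z while keeping gradients to mu/logvar
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            z_seq = z.unsqueeze(1).expand(-1, T, -1)
            out, _ = self.decoder(torch.cat([z_seq, cond_seq], dim=-1))
            return self.to_pose(out), mu, logvar

    # Training would minimize reconstruction error plus a KL term; at test time,
    # sampling different z from the prior yields diverse motions for one condition,
    # which is the stochastic behaviour the abstract contrasts with deterministic synthesis.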
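
    Likewise, the discrete motion tokens described in the abstract rest on vector quantization: continuous motion features are snapped to the nearest entry of a learned codebook, and the resulting indices act as a motion "vocabulary" for translation-style models. The sketch below, again with assumed names and sizes rather than the thesis's exact design, shows the core codebook lookup with a straight-through gradient estimator.

    import torch
    import torch.nn as nn

    class MotionQuantizer(nn.Module):
        """Illustrative vector-quantization layer mapping motion features to discrete tokens."""

        def __init__(self, num_tokens=512, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(num_tokens, dim)

        def forward(self, feats):
            # feats: (B, T, dim) continuous features from a motion encoder
            B, T, D = feats.shape
            flat = feats.reshape(-1, D)                           # (B*T, dim)
            dists = torch.cdist(flat, self.codebook.weight)       # (B*T, num_tokens)
            token_ids = dists.argmin(dim=-1).reshape(B, T)        # discrete motion tokens
            quantized = self.codebook(token_ids)                  # (B, T, dim)
            # Straight-through estimator: forward uses quantized values,
            # backward copies gradients to the continuous features
            quantized = feats + (quantized - feats).detach()
            return quantized, token_ids

    # The token_ids form a "sentence" of motion tokens that a sequence-to-sequence model
    # can translate to or from natural language, in the spirit of neural machine translation.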

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-v08q-4074
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.