Understanding Manipulation Contexts by Vision and Language for Robotic Vision

  • Author / Creator
    Jiang, Chen
  • In Activities of Daily Living (ADLs), humans perform thousands of arm and hand object manipulation tasks, such as picking up, pouring, and drinking from a cup. Interpreting such tasks and grasping the underlying concepts of manipulation from vision is straightforward for humans, but difficult for robots. In recent years, fusing computer vision with natural language processing has aided many visual understanding tasks, such as action recognition and video captioning. Despite these advances on natural image tasks, applying visual understanding methods to robotic vision has proven challenging.

    Given visual observations of a manipulation scene over time, we aim to estimate visual attention and describe the internal relational structure of all manipulation concepts present as a dynamic knowledge graph. In this thesis, we propose a framework that fuses an attention-based vision-language model with an ontology system. A convolutional neural network (CNN) with a spatial attention mechanism is first invoked for weighted feature extraction. A sequence-to-sequence structure built on recurrent neural networks (RNNs) then encodes temporal information and maps from vision to a command language. Finally, an ontology system, which defines the properties and attributes of various manipulation concepts in a taxonomic manner, is applied to convert the command language into the intended dynamic knowledge graph and to govern manipulation concepts with commonsense knowledge.
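    The dynamic knowledge graph output described above can be pictured as a set of relational triples per frame. The following is a minimal sketch under that assumption; the frame indices, concept names, and relation labels are hypothetical placeholders, not taken from the thesis.

    ```python
    # Minimal sketch of a dynamic knowledge graph: each observed frame maps to
    # a set of (subject, relation, object) triples describing the relational
    # structure of the manipulation scene at that instant.
    # All concept and relation names below are illustrative placeholders.

    def add_triple(graph, frame, subject, relation, obj):
        """Record a relational triple for a given frame index."""
        graph.setdefault(frame, set()).add((subject, relation, obj))

    def triples_at(graph, frame):
        """Return the relational structure of the scene at one frame."""
        return graph.get(frame, set())

    # Hypothetical evolution of a pouring task over two frames.
    kg = {}
    add_triple(kg, 0, "hand", "grasp", "cup")
    add_triple(kg, 0, "cup", "contain", "water")
    add_triple(kg, 1, "hand", "grasp", "cup")
    add_triple(kg, 1, "cup", "pour_into", "bowl")

    print(sorted(triples_at(kg, 1)))
    # → [('cup', 'pour_into', 'bowl'), ('hand', 'grasp', 'cup')]
    ```

    In this picture, "dynamic" simply means the triple set is indexed by time, so the graph can change as the manipulation procedure unfolds.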

    To evaluate the effectiveness of our framework, we construct a specialized RGB-D dataset of 100K images spanning both robot and human manipulation tasks. The dataset is built under a strictly constrained knowledge domain for both robot and human manipulations, with concepts and relations annotated frame by frame. The performance of our framework is evaluated on this constructed Robot Semantics Dataset, plus an additional public benchmark dataset. Furthermore, ablation studies and online experiments with real-time camera streams are conducted. We demonstrate that our framework works well in real-world robot manipulation scenarios, allowing the robot to attend to important manipulation concepts at the pixel level and decompose manipulation relations into dynamic knowledge graphs in real time.

    This study serves as a fundamental baseline for processing robotic vision together with natural language understanding, thus mimicking human-like intentional behaviors and representing the evolution of an intended manipulation procedure. In future work, we aim to enhance this framework further for knowledge-guided assistive robotics.

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.