Understanding Manipulation Contexts by Vision and Language for Robotic Vision

Jiang, Chen

doi:10.7939/r3-tra9-cw93


In Activities of Daily Living (ADLs), humans perform thousands of arm and hand object manipulation tasks, such as picking, pouring and drinking a drink. Interpreting such tasks and grasping the underlying concepts of manipulation from vision is straightforward for humans, but difficult for robots. In recent years, fusing computer vision with natural language processing has aided many visual understanding tasks, such as action recognition and video captioning. Despite these advances in natural image tasks, applying visual understanding methods to robotic vision has proven challenging.

Given visual observations of a manipulation scene over time, we aim to estimate the visual attention over those observations and to describe the internal relational structure of all manipulation concepts present as a dynamic knowledge graph. In this thesis, we propose a framework that fuses an attention-based vision-language model with an ontology system. A convolutional neural network (CNN) with a spatial attention mechanism is used for weighted feature extraction. A sequence-to-sequence structure with recurrent neural networks (RNNs) follows, encoding temporal information and mapping from vision to command language. Finally, inference over an ontology system, which defines the properties and attributes of various manipulation concepts in a taxonomic manner, converts the command language into the intended dynamic knowledge graph and governs manipulation concepts with commonsense knowledge.
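The weighted feature extraction step can be illustrated with a minimal sketch of soft spatial attention, assuming the common formulation in which a softmax over per-location scores weights the CNN feature vectors. The function name `spatial_attention_pool` and its inputs are hypothetical and are not taken from the thesis; the actual model may compute scores and combine features differently.

```python
import math


def spatial_attention_pool(features, scores):
    """Soft spatial attention over N image locations (illustrative only).

    features: list of N feature vectors, one per spatial location.
    scores:   list of N unnormalized attention scores.
    Returns (weights, context): softmax weights and the weighted feature sum.
    """
    if not features or len(features) != len(scores):
        raise ValueError("features and scores must be non-empty and equal in length")

    # Numerically stable softmax over the spatial locations.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the location features.
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = spatial_attention_pool(feats, [2.0, 0.0])
    print(weights, context)
```

The weights form a distribution over locations, so they can also be rendered as a heat map showing where the model attends in the image.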

To evaluate the effectiveness of our framework, we construct a specialized RGB-D dataset with 100K images spanning both robot and human manipulation tasks. The dataset is constructed under a strictly constrained knowledge domain for both robot and human manipulations, with concepts and relations annotated per frame. The performance of our framework is evaluated on our constructed Robot Semantics Dataset, plus an additional public benchmark dataset. Furthermore, ablation studies and online experiments with real-time camera streams are conducted. We demonstrate that our framework works well in real-world robot manipulation scenarios, allowing the robot to attend to important manipulation concepts in the pixels and decompose manipulation relations using dynamic knowledge graphs in real time.
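As a rough illustration of how a per-frame command could be decomposed into a dynamic knowledge graph, the sketch below maps command tokens through a toy taxonomy and tracks which relations appear or disappear between frames. Everything here is an assumption made for illustration: the `ONTOLOGY` table, the relation names `is_a`, `performs` and `acts_on`, and the space-separated command format are hypothetical and do not reproduce the ontology or command language defined in the thesis.

```python
# Hypothetical toy taxonomy: concept -> class. Not the thesis ontology.
ONTOLOGY = {
    "robot": "Agent", "human": "Agent",
    "pick": "Action", "pour": "Action",
    "bottle": "Object", "cup": "Object",
}


def command_to_triples(command):
    """Turn a space-separated command into (subject, relation, object) triples."""
    tokens = command.lower().split()
    unknown = [t for t in tokens if t not in ONTOLOGY]
    if unknown:
        raise ValueError(f"concepts outside the knowledge domain: {unknown}")

    agents = [t for t in tokens if ONTOLOGY[t] == "Agent"]
    actions = [t for t in tokens if ONTOLOGY[t] == "Action"]
    objects = [t for t in tokens if ONTOLOGY[t] == "Object"]

    # Taxonomic edges first, then relational edges between concepts.
    triples = {(t, "is_a", ONTOLOGY[t]) for t in tokens}
    triples |= {(a, "performs", act) for a in agents for act in actions}
    triples |= {(act, "acts_on", o) for act in actions for o in objects}
    return triples


def update_graph(graph, command):
    """Replace the graph with the new frame's triples and report the changes."""
    new_graph = command_to_triples(command)
    added = new_graph - graph
    removed = graph - new_graph
    return new_graph, added, removed


if __name__ == "__main__":
    graph = command_to_triples("robot pick bottle")
    graph, added, removed = update_graph(graph, "robot pour bottle cup")
    print(sorted(added))
    print(sorted(removed))
```

Representing each frame's graph as a set of triples keeps the update a pair of set differences, which is cheap enough to run on every frame of a camera stream.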

The study serves as a fundamental baseline for processing robotic vision together with natural language understanding, thus mimicking human-like intentional behaviors and representing the evolution of an intended manipulation procedure. In the future, we aim to enhance this framework further for knowledge-guided assistive robotics.
Subjects / Keywords
Graduation date

Fall 2021
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-tra9-cw93
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Jagersand, Martin