
Discovering Reward Functions for Language Models

  • Author / Creator
    Hao, Yongchang
  • Language models are a fundamental component of natural language processing (NLP) systems. Many successful modern artificial intelligence systems, including GPT-4 and ChatGPT, are built on language models. In practice, language models are often trained with the teacher forcing objective, where the goal is to predict the next token given the context. One problem with this objective is the inconsistency between training and inference, which leads to compounding errors during inference. Reinforcement learning (RL) is considered a promising approach to address this issue. However, it is usually difficult to design reward functions for language models: previous attempts are typically sparse and handcrafted for a single specific task.

    In this thesis, we propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with the teacher forcing objective. Our work shows that, under a common assumption, language models trained in this way implicitly capture the rewards. This simple connection allows us to derive a reward function from any pre-trained language model, and the derived reward function can then be used to conduct reinforcement learning for language modeling. We conduct experiments on different sequence-to-sequence tasks, including dialogue generation and paraphrase generation. Empirical results show that the model trained with our reward function outperforms self-training and reward regression methods on both tasks, confirming the effectiveness of our derivation.
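    To give a flavor of the idea sketched in the abstract, the snippet below is a minimal, hypothetical illustration (not the thesis's actual derivation): a model trained with teacher forcing assigns a probability to each next token, and those token-level log-probabilities can serve as step-wise rewards for an RL learner. The toy bigram distribution `PROBS` and the helper `step_rewards` are invented for this sketch; in practice the probabilities would come from a pre-trained language model.

    ```python
    import math

    # Toy stand-in for a teacher-forced language model: a bigram
    # next-token distribution p(token | previous token). Purely
    # illustrative; a real setup would query a pre-trained LM.
    PROBS = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.7, "dog": 0.3},
        "a":   {"cat": 0.5, "dog": 0.5},
    }

    def step_rewards(tokens):
        """Assign each generation step the reward log p(token | context)."""
        rewards = []
        prev = "<s>"
        for tok in tokens:
            rewards.append(math.log(PROBS[prev][tok]))
            prev = tok
        return rewards

    # Each step of the sequence "the cat" gets a dense, per-token reward,
    # in contrast to a sparse, end-of-sequence reward.
    rewards = step_rewards(["the", "cat"])
    ```

    The point of the sketch is only that the reward is *dense* (one value per token) and *task-agnostic* (it comes from the model's own probabilities, not from a handcrafted, task-specific scorer).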

  • Subjects / Keywords
  • Graduation date
    Fall 2023
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-a49r-k971
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.