On Coding Schemes for Straggler Mitigation in Distributed Computing

Fetrat Qharabagh, Muhammad

doi:doi:10.7939/r3-m1rv-tr83

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

159 views
168 downloads

On Coding Schemes for Straggler Mitigation in Distributed Computing

Author / Creator

Fetrat Qharabagh, Muhammad
One of the main challenges in distributed computing are straggling servers i.e. servers that have temporary or permanent delay in sending the result of the computations assigned to them. One of the proposed methods to address the straggler problem is using forward error correction codes to encode the subtasks before distributing them among workers where the results of the lagging servers are treated as erasure. This method is referred to as coded distributed computing (CDC). In this thesis we study CDC in general and address two straggler related issues.

First we consider a master-worker setting to perform a matrix-vector multiplication. The main focus in prior CDC approaches for this problem is on minimising the computation time using a family of codes known as minimum distance separable(MDS) codes. The assumption is that the encoding and the decoding time in the master is negligible. However, when the task is large or the number of workers is high, and the encoding/decoding can not be done offline, this time is not small compared to the computation time anymore. Hence, we introduce a new family of binary locally repairable codes (BLRCs) specifically designed for CDC. Being binary removes the costly multiplication operations in the encoding/decoding process and the locally repairable nature of the code further reduces the decoding complexity. Therefore, due to the lower encoding/decoding complexity in our codes, the overall time delay is lower than the conventional MDS scheme. Secondly, we consider a general master-worker setting where each worker is assigned with multiple subtasks. Accordingly, we propose a CDC scheme based on the systematic MDS codes. Then we derive and calculate the analytical probabilities of finishing the coded distributed computation before a deadline in our scheme and obtain the optimal subtask loads and the code rate for it. The overall goal is to reduce the computation time compared to conventional schemes for a multi-subtask per worker distributed computing.
Subjects / Keywords
Graduation date

Fall 2023
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-m1rv-tr83
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Electrical and Computer Engineering
Specialization
- Communications
Supervisor / co-supervisor and their department(s)
- Ardakani, Masoud (Electrical and Computer Engineering)