Compiler-Driven Performance on Heterogeneous Computing Platforms

  • Author / Creator
    Chikin, Artem
  • Modern parallel programming languages such as OpenMP provide simple, portable programming models that support offloading of computation to various accelerator devices. The increasing prevalence of heterogeneous computing platforms, coupled with the battle for supremacy in the co-processor space, places additional demands on compiler and runtime vendors to handle the growing complexity and diversity of shared-memory parallel platforms.

    To start, this thesis presents three kernel-restructuring ideas that focus on improving the execution of high-level parallel code on GPU devices. The first addresses programs that include multiple parallel blocks within a single region of GPU code. A proposed compiler transformation can split such regions into multiple regions, leading to the launching of multiple kernels, one for each parallel region. The second is a code transformation that sets up a pipeline of kernel execution and asynchronous data transfers, enabling the overlap of communication and computation. The third idea is that the selection of a grid geometry for the execution of a parallel region must balance GPU occupancy against the potential saturation of memory throughput in the GPU; adding this parameter to the geometry-selection heuristic can often yield better performance at lower occupancy levels. (Minimal sketches of the first two ideas appear after this record.)

    This thesis next describes the Iteration Point Difference Analysis, a new static-analysis framework that can be used to determine the memory-coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations aimed at improving their memory-access characteristics. GPU kernel execution time across the Polybench suite is improved by up to 25.5x on an Nvidia P100, with overall benchmark improvement of up to 3.2x. An opportunity detected in a SPEC ACCEL benchmark yields a kernel speedup of 86.5x with a benchmark improvement of 3.4x on an Nvidia P100, and a kernel speedup of 111.1x with a benchmark improvement of 2.3x on a V100.

    The task of modelling performance takes on ever-increasing importance as systems must make automated decisions about the most suitable offloading target. The third contribution of this thesis motivates this need with a study of cross-architectural changes in the profitability of kernel offloading to a GPU versus host CPU execution, and presents a prototype design for a hybrid computing-device selection framework.

  • Subjects / Keywords
  • Graduation date
    Spring 2019
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-x23c-9985
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
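
A rough, hedged sketch of the first restructuring idea described in the abstract: a single OpenMP target region that contains two parallel loops (compiled to one GPU kernel), followed by the split form in which each loop becomes its own target region and therefore its own kernel launch. The array names a, b, c and the size N are illustrative assumptions, not code from the thesis, and the split form is written by hand here rather than produced by the proposed compiler transformation.

/* Illustrative sketch only: names and sizes are assumptions, not thesis code. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 0.0f; b[i] = (float)i; c[i] = 0.0f; }

    /* Original form: one target region containing two parallel loops,
       lowered to a single GPU kernel with one grid geometry. */
    #pragma omp target teams map(tofrom: a, c) map(to: b)
    {
        #pragma omp distribute parallel for
        for (int i = 0; i < N; i++) a[i] = b[i] + 1.0f;

        #pragma omp distribute parallel for
        for (int i = 0; i < N; i++) c[i] = a[i] * b[i];
    }

    /* Split form: each parallel loop is its own target region, so the
       runtime launches one kernel per loop and may choose a different
       grid geometry for each. */
    #pragma omp target teams distribute parallel for map(tofrom: a) map(to: b)
    for (int i = 0; i < N; i++) a[i] = b[i] + 1.0f;

    #pragma omp target teams distribute parallel for map(to: a, b) map(tofrom: c)
    for (int i = 0; i < N; i++) c[i] = a[i] * b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}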
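The second idea, overlapping communication with computation, can be sketched with chunked asynchronous target tasks (OpenMP 4.5 nowait). The chunk count, array names, and element-wise computation below are illustrative assumptions; the thesis describes a compiler-driven pipelining transformation, not this hand-written loop.

/* Illustrative sketch only: chunking scheme and names are assumptions. */
#include <stdio.h>

#define N (1 << 20)
#define CHUNKS 4
#define CHUNK (N / CHUNKS)

float in[N], out[N];

int main(void) {
    for (int i = 0; i < N; i++) in[i] = (float)i;

    for (int c = 0; c < CHUNKS; c++) {
        int lo = c * CHUNK;
        /* Each chunk becomes an independent deferred target task, so the
           runtime may overlap chunk c+1's host-to-device copy with chunk
           c's kernel and device-to-host copy. */
        #pragma omp target teams distribute parallel for \
                map(to: in[lo:CHUNK]) map(from: out[lo:CHUNK]) nowait
        for (int i = lo; i < lo + CHUNK; i++)
            out[i] = 2.0f * in[i];
    }
    /* Wait for all chunks before using the results on the host. */
    #pragma omp taskwait

    printf("out[123] = %f\n", out[123]);
    return 0;
}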