Program Analysis and Compiler Transformations for Computational Accelerators

  • Author / Creator
    Lloyd, Taylor
  • Heterogeneous computing is becoming increasingly common in high-end computer systems, with vendors often including compute accelerators such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) for increased throughput and power efficiency. This thesis addresses the usability and performance of compute accelerators, with an emphasis on compiler-driven analyses and transformations.

    First, this thesis studies the challenge of programming for FPGAs. IBM and Intel both now produce systems with integrated FPGAs, but FPGA programming remains extremely challenging. To mitigate this difficulty, FPGA vendors now ship OpenCL-based High-Level Synthesis (HLS) tools, capable of generating Hardware Description Language (HDL) from Open Computing Language (OpenCL) source. Unfortunately, most OpenCL source today is written to be executed on GPUs, and runs poorly on FPGAs. This thesis explores traditional compiler analyses and transformations to automatically transform GPU-targeted OpenCL, achieving speedups of up to 6.7x over unmodified Rodinia OpenCL benchmarks written for GPUs.

    Second, this thesis addresses the problem of automatically mapping OpenMP 4.X target regions to GPU hardware. In OpenMP, the compiler is responsible for determining the number and grouping of GPU threads, and the existing heuristic in LLVM/Clang performs poorly for a large subset of programs. We perform an exhaustive data collection over 23 OpenMP benchmarks from the SPEC ACCEL and Unibench suites. From our dataset, we propose a new grid geometry heuristic resulting in a 25% geometric mean speedup over geometries selected by the original LLVM/Clang heuristic.

    The third contribution of this thesis relates to the performance of applications executing on GPUs. Such performance can be significantly degraded by irregular data accesses and by control-flow divergence. Both of these performance issues arise only in the presence of thread-divergent expressions: expressions that evaluate to different values for different threads. This thesis introduces GPUCheck, a static analysis tool that detects branch divergence and non-coalesceable memory accesses in GPU programs. GPUCheck relies on a static dataflow analysis to find thread-dependent expressions and on a novel symbolic analysis to determine when such expressions could lead to performance issues. Kernels taken from the Rodinia benchmark suite and repaired by GPUCheck execute up to 30% faster than the original kernels.
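
    The idea of a thread-divergent expression can be illustrated with a small model. The sketch below is not GPUCheck and does not perform static dataflow or symbolic analysis; it simply enumerates the threads of one warp and evaluates an expression for each thread ID. The warp size of 32, the 4-byte element size, the 128-byte memory segment, and all function names are assumptions made for illustration only.

```python
WARP_SIZE = 32  # assumption: warps of 32 threads


def is_thread_divergent(expr, warp_size=WARP_SIZE):
    """Return True if expr(tid) takes more than one value across a warp."""
    return len({expr(tid) for tid in range(warp_size)}) > 1


def segments_touched(index, warp_size=WARP_SIZE, elem_bytes=4, segment_bytes=128):
    """Count distinct aligned memory segments touched when every thread in
    the warp loads the element at position index(tid).

    One segment means the accesses fall in a single aligned block; many
    segments indicate a scattered (non-coalesceable) access pattern in this
    simplified model.
    """
    return len({(index(tid) * elem_bytes) // segment_bytes
                for tid in range(warp_size)})


# Branch conditions: one uniform across the warp, one thread-dependent.
def uniform_condition(tid):
    return 7 < 10          # does not depend on tid


def divergent_condition(tid):
    return tid % 2 == 0    # even and odd threads take different paths


# Array indices: unit stride versus a stride of 32 elements.
def unit_stride(tid):
    return tid


def wide_stride(tid):
    return tid * 32


if __name__ == "__main__":
    print(is_thread_divergent(uniform_condition))
    print(is_thread_divergent(divergent_condition))
    print(segments_touched(unit_stride))
    print(segments_touched(wide_stride))
```

    In this model, a branch on `divergent_condition` splits the warp into two groups, while an access indexed by `wide_stride` touches a separate segment for every thread. A static tool has to reach such conclusions without running the program, which is the harder problem the thesis addresses.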

    The fourth contribution of this thesis focuses on data transmission in a heterogeneous computing system. GPUs can be used as specialized accelerators to improve network connectivity. We present Run-Length Base-Delta (RLBD) encoding, a very high-speed compression format and algorithm capable of improving 40GbE throughput by up to 57% on datasets taken from the UCI Machine Learning Repository.

  • Subjects / Keywords
  • Graduation date
    Fall 2018
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/R3Z892X2M
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.