{"@context":"http://iiif.io/api/presentation/2/context.json","@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/manifest.json","@type":"sc:Manifest","label":"Manual, Automatic, and Semi-Automatic Performance Optimization","metadata":[{"label":"dc.description.sponsorship","value":"This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree"},{"label":"dc.format","value":"Monograph"},{"label":"dc.format.medium","value":"Electronic Resource"},{"label":"dc.format.mimetype","value":"Application/PDF"},{"label":"dc.identifier.uri","value":"http://hdl.handle.net/11401/78358"},{"label":"dc.language.iso","value":"en_US"},{"label":"dcterms.abstract","value":"As we approach the post-Moore\u2019s law era, the burden of achieving good performance is shifting from hardware architects to software engineers. Programmers can no longer expect a doubling of performance for each hardware generation. The trend has moved from faster CPU\u2019s to increased parallelism through multiple cores and specialized instructions (e.g. SIMD intrinsics). Taking advantage of the parallelism inherent in modern hardware, however, is a non-trivial task. The programmer is required to either spend significant time to optimize the code for a parallel architecture, or use automatic methods to generate the code (e.g. through compiler optimizations). This dissertation addresses each of these methods of obtaining optimized code and presents novel methods for easing the burden of optimization on the programmer.\nIn the first part of this dissertation, we explore three case studies where we present successful optimizations for different architectures. In the first two case studies, optimizations are presented for GPGPU (General Purpose GPU) accelerators using the CUDA programming language. In the first case study, we present an optimized version of the FDK back-projection algorithm which is commonly used in CT (Computed Tomography) reconstruction. The second case study details how we map an inherently sequential clustering algorithm to a parallel architecture. In the third case study, we detail our optimizations for the LQCD (Lattice Quantum Chromo-Dynamics) physics simulation algorithm. In this study, we describe how we get good performance by combining AVX (Advanced Vector Extension) intrinsics, OpenMP, and MPI.\nIn the second part of this dissertation, we describe an iterative optimization framework based on the ant colony optimization algorithm. Optimization decisions are formulated as a directed acyclic graph. The ant colony optimization algorithm is then used to identify a path through this graph that corresponds to an optimal GPU implementation. This framework is then extended by tightly integrating it with a compiler based on the polyhedral model; thus, allowing it to identify optimal skewing and permutation transformations as well as loop and GPU kernel fusion. Several optimizations not incorporated into previous GPU auto-tuners are also evaluated \u2013 including texture memory and parallel reduction. The framework also extends the traditional ant colony optimization algorithm to include performance metrics as well as a regression tree analysis to segment the search space into regions with promising performance. Results show significant speed-up over a GPU code generator based on the polyhedral model alone.\nIn part 3 of this dissertation, we present a visualization interface which allows users to quickly explore a number of optimizations. This interface is constructed on top of R-Stream \u2013 an optimizing compiler based on the polyhedral model. This interface visualizes performance through heuristics and run time performance data and allows users to identify optimization opportunities and apply common loop transformations in a visually intuitive way. User studies show that users were able to identify optimizations that outperformed R-Stream alone. This interface also greatly improved the user\u2019s understanding of the transformations made by R-Stream and the reasons behind them."},{"label":"dcterms.available","value":"2018-07-09T14:14:21Z"},{"label":"dcterms.contributor","value":"Meister, Benoit."},{"label":"dcterms.creator","value":"Papenhausen, Eric"},{"label":"dcterms.dateAccepted","value":"2018-07-09T14:14:21Z"},{"label":"dcterms.dateSubmitted","value":"2018-07-09T14:14:21Z"},{"label":"dcterms.description","value":"Department of Computer Science."},{"label":"dcterms.extent","value":"214 pg."},{"label":"dcterms.format","value":"Monograph"},{"label":"dcterms.identifier","value":"http://hdl.handle.net/11401/78358"},{"label":"dcterms.issued","value":"2017-08-01"},{"label":"dcterms.language","value":"en_US"},{"label":"dcterms.provenance","value":"Made available in DSpace on 2018-07-09T14:14:21Z (GMT). No. of bitstreams: 1\nPapenhausen_grad.sunysb_0771E_13395.pdf: 7109738 bytes, checksum: 755cf975e90f67cd899a47b057433b5e (MD5)\n  Previous issue date: 2017-08-01"},{"label":"dcterms.subject","value":"Computer science"},{"label":"dcterms.title","value":"Manual, Automatic, and Semi-Automatic Performance Optimization"},{"label":"dcterms.type","value":"Dissertation"},{"label":"dc.type","value":"Dissertation"}],"description":"This manifest was generated dynamically","viewingDirection":"left-to-right","sequences":[{"@type":"sc:Sequence","canvases":[{"@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/canvas/page-1.json","@type":"sc:Canvas","label":"Page 1","height":1650,"width":1275,"images":[{"@type":"oa:Annotation","motivation":"sc:painting","resource":{"@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/15%2F60%2F20%2F156020409179794809566632549039533397559/full/full/0/default.jpg","@type":"dctypes:Image","format":"image/jpeg","height":1650,"width":1275,"service":{"@context":"http://iiif.io/api/image/2/context.json","@id":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/15%2F60%2F20%2F156020409179794809566632549039533397559","profile":"http://iiif.io/api/image/2/level2.json"}},"on":"https://repo.library.stonybrook.edu/cantaloupe/iiif/2/canvas/page-1.json"}]}]}]}