HGPU group
108 followers
High performance computing on graphics processing units

[Thesis]: Cross-Compiling Shading Languages

(Lukas Hermanns)

#GPU #OpenGL #GLSL #HLSL #Vulkan #Rendering #Thesis

Shading languages are the major class of programming languages for modern mainstream Graphics Processing Units (GPUs). Programs written in these languages are called "shaders", as they were originally used to describe shading characteristics in computer graphics applications. Making use of GPU-accelerated shaders requires a sophisticated rendering Application Programming Interface (API); the rendering APIs available at present are OpenGL, Direct3D, Vulkan, and Metal. While Direct3D and Metal are only supported on a limited set of platforms, OpenGL and Vulkan are for the most part platform independent. On the one hand, Direct3D is the leading rendering API for many real-time graphics applications, especially in the video game industry. On the other hand, OpenGL and Vulkan are the prevalent rendering APIs on mobile devices, especially on Android, which has the largest market share. Each rendering API has its own shading language; these languages are very similar to each other but vary enough to make it difficult for developers to write a single shader that can be used across multiple APIs. However, since the proliferation of mobile devices, many graphics systems must be platform independent, and therefore several rendering technologies must be provided as back ends. The naive approach is to write all shaders multiple times, i.e. once for each shading language, which is error-prone, highly redundant, and difficult to maintain. This thesis investigates different approaches to automatically transform shaders from one high-level language into another, so-called "cross-compilation" (sometimes also referred to as "trans-compilation"). High-level to high-level translation is reviewed, as well as algorithms based on an Intermediate Representation (IR) such as the Standard Portable Intermediate Representation (SPIR-V). We focus on the two most prevalent shading languages, the OpenGL Shading Language (GLSL) and the DirectX High Level Shading Language (HLSL), while the Metal Shading Language (MSL) is only briefly examined. The benefits and failings of state-of-the-art approaches are clearly separated, and a novel algorithm for generic shader cross-compilation is presented.
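
As a rough illustration (not taken from the thesis), the Python sketch below shows the naive, token-level kind of mapping between GLSL and HLSL that a cross-compiler has to generalize; the identifier table and function name are illustrative, and a real cross-compiler needs full parsing, type checking, and semantic analysis.

# Minimal sketch: naive whole-word replacement of a few GLSL identifiers
# with their HLSL counterparts.
import re

GLSL_TO_HLSL = {
    "vec2": "float2", "vec3": "float3", "vec4": "float4",
    "mat4": "float4x4",
    "mix": "lerp", "fract": "frac",
    "dFdx": "ddx", "dFdy": "ddy",
}

def naive_glsl_to_hlsl(source: str) -> str:
    """Replace whole-word GLSL identifiers with their HLSL counterparts."""
    pattern = re.compile(r"\b(" + "|".join(GLSL_TO_HLSL) + r")\b")
    return pattern.sub(lambda m: GLSL_TO_HLSL[m.group(1)], source)

glsl = "vec4 color = mix(vec4(a, 1.0), vec4(b, 1.0), fract(t));"
print(naive_glsl_to_hlsl(glsl))
# float4 color = lerp(float4(a, 1.0), float4(b, 1.0), frac(t));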

https://hgpu.org/?p=18377

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

(Simon Garcia De Gonzalo, Wen-Mei Hwu, Simon D. Hammond, Christian R. Trott)

#GPU #Intel #XeonPhi #KNL #MIC #Sparse

Kokkos [1], [2] is a C++ programming model that offers the ability to write portable code targeting the wide range of parallelism found in current HPC systems. It works by providing abstractions for parallel execution and data layouts that are mapped to different hardware resources during compilation. Some parameters, such as the size of thread teams and the vector width, are available for the application scientist to tune their code to a particular hardware platform. For many applications, choosing the right parameters can depend more on the data input or application configuration than on the device characteristics themselves. Sparse matrix-vector products (SpMV) are highly irregular computational kernels found in a diverse collection of high-performance science applications. Performance for this important kernel is often highly correlated with the matrix sparsity, as this ultimately governs the granularity, and therefore the efficiency, of the memory system being used. In this paper, we propose to extend the current set of Kokkos profiling tools with an autotuner that can iterate over possible choices for thread-team size and vector width, taking advantage of runtime information to choose the optimal parameters for a particular input. This approach allows an iterative application that calls the same kernel multiple times to continue to progress towards a solution while, at the same time, relieving the application programmer of the burden of knowing details of the underlying hardware and accounting for variable inputs. We compare the autotuner against a fixed approach that attempts to use all of the hardware resources all of the time, and show that the optimal choice made by the autotuner differs significantly between the two latest classes of accelerator architectures. After 100 iterations we identify which subset of the matrices benefits from improved performance, while the others are near the break-even point, where the overhead of the tool has been completely hidden. We highlight the properties of sparse matrices that help determine when autotuning will be of benefit. Finally, we connect the overhead of the autotuner to specific sparsity patterns and hardware resources.
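
The following Python sketch illustrates the general idea of online autotuning rather than the Kokkos-based tool itself: the first solver iterations explore candidate (team_size, vector_length) pairs while timing a stand-in SpMV kernel, and later iterations reuse the fastest pair. All parameter names and sizes here are illustrative assumptions.

# Online autotuning sketch: explore candidate parameters for the first
# iterations, then exploit the fastest pair for the rest of the run.
import time
import itertools
import numpy as np
import scipy.sparse as sp

def spmv_kernel(A, x, team_size, vector_length):
    # In a real implementation the parameters would map onto hardware
    # resources; here they only label the candidate configuration.
    return A @ x

A = sp.random(20000, 20000, density=1e-3, format="csr", random_state=0)
x = np.ones(A.shape[1])

candidates = list(itertools.product([32, 64, 128, 256], [1, 2, 4, 8]))
timings = {}

for iteration in range(100):
    if iteration < len(candidates):          # exploration phase
        params = candidates[iteration]
    else:                                    # exploitation phase
        params = min(timings, key=timings.get)
    t0 = time.perf_counter()
    y = spmv_kernel(A, x, *params)
    timings.setdefault(params, time.perf_counter() - t0)

print("best (team_size, vector_length):", min(timings, key=timings.get))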

https://hgpu.org/?p=18376

Multicore architecture and cache optimization techniques for solving graph problems

(Alvaro Tzul)

#GPU #CUDA #Graphs

With the advent of the era of Big Data and the Internet of Things, there has been an exponential increase in the availability of large data sets. These data sets require in-depth analysis that provides intelligence for improvements in methods for academia and industry. The majority of these data sets are represented and made available in the form of graphs. Therefore, the problem at hand is solving graph problems efficiently. Since the data sets are large, the time it takes to analyze the data is significant. Hence, in this paper, we explore techniques that can exploit existing multicore architectures to address the issue. Currently, most Central Processing Units incorporate a multicore design; in addition, co-processors such as Graphics Processing Units have a large number of cores that can be used to gain significant speedups. Therefore, this paper studies techniques to exploit the advantages of multicore architectures.

https://hgpu.org/?p=18374

CloudCL: Single-Paradigm Distributed Heterogeneous Computing for Cloud Infrastructures

(Max Plauth, Florian Roesler, Andreas Polze)

#GPU #OpenCL #Cloud #MPI #Java #Package

The ever-growing demand for compute resources has reached a wide range of application domains, and with that has created a larger audience for compute-intensive tasks. In this paper, we present the CloudCL framework, which empowers users to run compute-intensive tasks without having to face the total cost of ownership of operating an extensive high-performance compute infrastructure. CloudCL enables developers to tap the ubiquitous availability of cloud-based heterogeneous resources using a single-paradigm compute framework, without having to consider dynamic resource management and inter-node communication. In an extensive performance evaluation, we demonstrate the feasibility of the framework, yielding close-to-linear scale-out capabilities for certain workloads.

https://hgpu.org/?p=18375

Data-Parallel Hashing Techniques for GPU Architectures

(Brenton Lessley)

#GPU #CUDA #Hashing

Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are discovered and used to suggest best practices and pinpoint areas for further research.
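
As a point of reference (not code from the survey), the sketch below shows open addressing with linear probing, one of the basic schemes such surveys compare; in a data-parallel GPU variant, each thread inserts one key and the plain store below becomes an atomic compare-and-swap on the slot. The table size and hash constant are arbitrary choices here.

# Open addressing with linear probing; sequential reference version.
EMPTY = -1

class OpenAddressingTable:
    def __init__(self, capacity=1024):
        self.keys = [EMPTY] * capacity
        self.values = [None] * capacity
        self.capacity = capacity

    def _hash(self, key):
        return (key * 2654435761) % self.capacity   # Knuth multiplicative hash

    def insert(self, key, value):
        slot = self._hash(key)
        for _ in range(self.capacity):
            if self.keys[slot] in (EMPTY, key):      # atomicCAS on a GPU
                self.keys[slot] = key
                self.values[slot] = value
                return True
            slot = (slot + 1) % self.capacity        # linear probe
        return False                                 # table full

    def lookup(self, key):
        slot = self._hash(key)
        for _ in range(self.capacity):
            if self.keys[slot] == key:
                return self.values[slot]
            if self.keys[slot] == EMPTY:
                return None
            slot = (slot + 1) % self.capacity
        return None

table = OpenAddressingTable()
table.insert(42, "sparse entry")
print(table.lookup(42))   # -> "sparse entry"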

https://hgpu.org/?p=18373

[Thesis]: Application of Deep-Learning to Compiler-Based Graphs

(Tristan Vanderbruggen)

#GPU #OpenCL #Compilers #Graphs #DeepLearning #DL #MachineLearning #ML #Thesis

Graph-structured data is used in many domains to represent complex objects, such as the molecular structure of chemicals or interactions between members of a social network. However, extracting meaningful information from these graphs is a difficult task, which is often undertaken on a case-by-case basis. Devising automated methods to mine information from graphs has become increasingly important as the use of graphs becomes more prevalent. Techniques have been developed that adapt algorithms, such as support vector machines, to extract information from graphs with minimal preprocessing. Unfortunately, none of these techniques permit the use of deep neural networks (DNNs) to learn from graphs. Given the potential of DNNs to learn from large amounts of data, this has become an important area of interest. Recently, a technique based on graph spectral analysis was proposed to characterize graphs in a way that allows them to be used as input by DNNs. We used this technique to apply DNNs to two different systems problems, i.e., 1) classifying malicious applications based on graph-structured representations of executable code and 2) developing prediction models that assist in iterative compilation to optimize and parallelize scientific code. Our results on malicious application classification show that graph-based characterizations increase the ability of DNNs to distinguish malware from different families. We performed a detailed evaluation of deep learning applied to state-of-the-art and graph-based malware characterizations. The graph-based characterizations are obtained by reverse engineering potentially malicious applications. For performance prediction, the graphs represent versions of optimized code. We use machine learning to rank these versions and inform an iterative compilation process. The models are trained using only five percent of the search space. Our work shows that graph-structured data can be used to build powerful deep learning models. The techniques developed for this dissertation show great potential across a diverse pair of systems problems.
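
A hedged sketch of the underlying idea (not the authors' exact pipeline): the spectrum of a graph's normalized Laplacian, padded or truncated to a fixed length k, yields a fixed-size vector that a DNN can consume regardless of the graph's size; the helper name and sizes are illustrative.

# Turn an arbitrary graph into a fixed-length spectral feature vector.
import numpy as np

def laplacian_spectrum_feature(adjacency: np.ndarray, k: int = 16) -> np.ndarray:
    """Return the k smallest normalized-Laplacian eigenvalues as a feature vector."""
    degrees = adjacency.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(degrees), 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adjacency)) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    feature = np.zeros(k)
    feature[: min(k, len(eigvals))] = eigvals[:k]
    return feature   # fixed-size input for a dense or convolutional DNN

# Example: a 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(laplacian_spectrum_feature(A, k=8))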

https://hgpu.org/?p=18371

Calamari – A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

(Christoph Wick, Christian Reul, Frank Puppe)

#OCR #DeepLearning #DL #TensorFlow #Python #Package

Optical Character Recognition (OCR) on contemporary and historical data is still a focus of many researchers. Historical prints in particular require book-specific trained OCR models to achieve applicable results (Springmann and Lüdeling, 2016, Reul et al., 2017a). To reduce the human effort of manually annotating ground truth (GT), various techniques such as voting and pretraining have been shown to be very efficient (Reul et al., 2018a, Reul et al., 2018b). Calamari is a new open-source OCR line recognition software that uses state-of-the-art Deep Neural Networks (DNNs) implemented in TensorFlow and gives native support for techniques such as pretraining and voting. The customizable network architectures, constructed of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers, are trained by the so-called Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, which considerably outperforms the results of the existing software.
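
For orientation only (not Calamari's actual network definition), here is a minimal TensorFlow/Keras model of the kind described: convolutional layers followed by a bidirectional LSTM and a per-timestep softmax over the character set plus a CTC blank. Layer sizes, line dimensions, and the character-set size are illustrative assumptions; training would minimize a CTC loss such as tf.keras.backend.ctc_batch_cost.

# Sketch of a CNN + BiLSTM line recognizer with a CTC-style output layer.
import tensorflow as tf

num_chars = 80                     # size of the character set (assumption)
height, width = 48, 400            # fixed line height, padded line width

inputs = tf.keras.Input(shape=(height, width, 1))
x = tf.keras.layers.Conv2D(40, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(60, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Permute((2, 1, 3))(x)   # make the width axis the time axis
x = tf.keras.layers.Reshape((width // 4, (height // 4) * 60))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(200, return_sequences=True))(x)
outputs = tf.keras.layers.Dense(num_chars + 1, activation="softmax")(x)  # +1 CTC blank

model = tf.keras.Model(inputs, outputs)
model.summary()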

https://hgpu.org/?p=18369

Energy Consumption of Algorithms for Solving the Compressible Navier-Stokes Equations on CPU’s, GPU’s and KNL’s

(Satya P. Jammy, Christian T. Jacobs, David J. Lusher, Neil D. Sandham)

#GPU #CUDA #Intel #XeonPhi #KNL #CFD #FluidDynamics #NSE

In addition to the hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional approach of storing the discretised derivatives in global arrays for solution advancement is found to be inefficient in terms of both energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on-the-fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal with respect to both energy consumption and runtime. On all three hardware architectures considered, a speed-up of ~2 and an energy saving of ~2 are observed for the highly compute-intensive algorithms compared to the memory-intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed, and the GPU shows an energy saving of ~5 compared to the same algorithm on a CPU node.
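
A toy Python sketch of the two algorithm classes being compared (not the paper's solver): the first variant stores the discretised derivative in a global work array before the update, while the second re-evaluates it on the fly inside the update using only local values, trading extra arithmetic for less memory traffic. The 1D periodic advection setup, central differences, and sizes are stand-ins.

# Stored-derivative vs on-the-fly evaluation for one explicit update step.
import math

n, dx, dt, c = 1024, 1e-3, 1e-4, 1.0
u = [math.sin(2 * math.pi * i / n) for i in range(n)]

def step_stored(u):
    m = len(u)
    dudx = [0.0] * m                      # derivatives kept in a global work array
    for i in range(m):
        dudx[i] = (u[(i + 1) % m] - u[i - 1]) / (2 * dx)
    return [u[i] - c * dt * dudx[i] for i in range(m)]

def step_on_the_fly(u):
    m = len(u)
    # Derivative re-evaluated inside the update; only local scalars are used.
    return [u[i] - c * dt * (u[(i + 1) % m] - u[i - 1]) / (2 * dx) for i in range(m)]

u = step_stored(u)
u = step_on_the_fly(u)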

https://hgpu.org/?p=18370

FluidFFT: common API (C++ and Python) for Fast Fourier Transform HPC libraries

(Ashwin Vishnu Mohanan, Cyrille Bonamy, Pierre Augier)

#GPU #OpenCL #CUDA #CFD #FluidDynamics #FFT #MPI #HPC #Python #Package

The Python package fluidfft provides a common Python API for performing Fast Fourier Transforms (FFT) sequentially, in parallel, and on GPUs with different FFT libraries (FFTW, P3DFFT, PFFT, cuFFT). fluidfft is a comprehensive FFT framework which allows Python users to easily and efficiently perform FFTs and the associated tasks, such as computing linear operators and energy spectra. We describe the architecture of the package, composed of C++ and Cython FFT classes, Python "operator" classes, and Pythran functions. The package supplies utilities to easily test itself and to benchmark the different FFT solutions for a particular case on a particular machine. We present a performance scaling analysis on three different computing clusters and a microbenchmark showing that fluidfft is an interesting solution for writing efficient Python applications using FFT.
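
A hedged sketch of the general pattern, not fluidfft's actual interface: a thin common API that selects an FFT backend by name and exposes the same methods regardless of which library does the work. Here only numpy and scipy backends are wired up, whereas fluidfft dispatches to FFTW, P3DFFT, PFFT or cuFFT; the class and method names are illustrative.

# Common 2D FFT API over interchangeable backends.
import numpy as np
import scipy.fft

_BACKENDS = {"numpy": np.fft, "scipy": scipy.fft}

class CommonFFT2D:
    def __init__(self, backend: str = "numpy"):
        self._impl = _BACKENDS[backend]

    def fft(self, field):
        return self._impl.fft2(field)

    def ifft(self, spectrum):
        return self._impl.ifft2(spectrum).real

    def energy_spectrum(self, field):
        spectrum = self.fft(field)
        return 0.5 * np.abs(spectrum) ** 2 / field.size

op = CommonFFT2D("scipy")
field = np.random.default_rng(0).standard_normal((64, 64))
roundtrip = op.ifft(op.fft(field))
print(np.allclose(field, roundtrip))   # True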

https://hgpu.org/?p=18368