HGPU group
High performance computing on graphics processing units

HGPU group's posts

An Efficient Parallel Data Clustering Algorithm Using Isoperimetric Number of Trees

(Ramin Javadi, Saleh Ashkboos)

#GPU #CUDA #Clustering

We propose a parallel graph-based data clustering algorithm implemented in CUDA on the GPU, based on exact clustering of the minimum spanning tree with respect to a minimum isoperimetric criterion. We also provide a comparative performance analysis against related algorithms, which demonstrates the general superiority of our parallel algorithm over competing approaches in terms of accuracy and speed.
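The paper's isoperimetric criterion and CUDA kernels are not reproduced here, but the core idea of clustering via a minimum spanning tree can be sketched sequentially: build the MST over pairwise distances, then cut its heaviest edges so the remaining forest defines the clusters. A minimal, hypothetical stand-in:

```python
# Simplified CPU sketch of MST-based clustering: build a minimum spanning
# tree over pairwise squared distances (Prim's algorithm), then cut the
# k-1 heaviest edges so the remaining forest yields k clusters.
import heapq

def mst_edges(points):
    """Return MST edges as (weight, u, v) via Prim's algorithm."""
    n = len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(points[a], points[b]))
    visited = {0}
    heap = [(dist(0, j), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while len(visited) < n:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        edges.append((w, u, v))
        for j in range(n):
            if j not in visited:
                heapq.heappush(heap, (dist(v, j), v, j))
    return edges

def mst_cluster(points, k):
    edges = sorted(mst_edges(points))
    if k > 1:
        edges = edges[:-(k - 1)]          # drop the k-1 heaviest edges
    parent = list(range(len(points)))     # union-find over kept edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, u, v in edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(len(points))]

labels = mst_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
# the two tight pairs end up in two distinct clusters
```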

Trie Compression for GPU Accelerated Multi-Pattern Matching

(Xavier Bellekens, Amar Seeam, Christos Tachtatzis, Robert Atkinson)

#GPU #CUDA #Compression #Algorithms

Graphics Processing Units allow massively parallel applications to offload computationally intensive work from the CPU; however, GPUs have a limited amount of memory. In this paper, a trie compression algorithm for massively parallel pattern matching is presented that requires 85% less space than the original highly efficient Parallel Failure-less Aho-Corasick, while sustaining over 22 Gbps throughput. The algorithm takes advantage of compressed row storage matrices as well as shared and texture memory on the GPU.
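The compression scheme itself is the paper's contribution, but the failure-less matching it builds on is easy to sketch: store the patterns in a trie and, as the Parallel Failure-less Aho-Corasick approach does with one GPU thread per offset, start an independent traversal at every input position, so no failure links are needed. A simplified sequential sketch:

```python
# Minimal sketch of failure-less multi-pattern matching: build a plain
# trie, then start one independent traversal at every input offset (on
# the GPU, each offset would be one thread); no failure links required.
def build_trie(patterns):
    trie = [{}]            # node index -> {char: child index}
    output = [None]        # node index -> matched pattern, or None
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in trie[node]:
                trie.append({})
                output.append(None)
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        output[node] = p
    return trie, output

def match_all(text, trie, output):
    hits = []
    for start in range(len(text)):     # one independent walk per offset
        node = 0
        for ch in text[start:]:
            node = trie[node].get(ch)
            if node is None:
                break
            if output[node] is not None:
                hits.append((start, output[node]))
    return hits

trie, out = build_trie(["he", "she", "his"])
hits = match_all("ushers", trie, out)
# finds "she" starting at offset 1 and "he" starting at offset 2
```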

MapSQ: A MapReduce-based Framework for SPARQL Queries on GPU

(Jiaying Feng, Xiaowang Zhang, Zhiyong Feng)

#GPU #CUDA #MapReduce #Databases

In this paper, we present MapSQ, a MapReduce-based framework for efficiently evaluating SPARQL queries over large-scale RDF datasets on the GPU. First, we develop a MapReduce-based join algorithm to handle SPARQL queries in parallel. Second, we present a coprocessing strategy for query evaluation in which the CPU assigns subqueries and the GPU computes the joins of subqueries. Finally, we implement the proposed framework and evaluate it against two popular, up-to-date SPARQL query engines, gStore and gStoreD, on the LUBM benchmark. The experiments demonstrate that MapSQ is highly efficient and effective (up to 50% speedup).
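MapSQ's actual GPU join is not shown in the abstract, but the map/join idea behind it can be illustrated on toy data: match each triple pattern against the RDF triples, bucket the resulting bindings by the shared variable (the "map" phase), then combine bindings within each bucket (the join). The data and helper names below are hypothetical:

```python
# Toy sketch of a MapReduce-style join of two SPARQL-like triple
# patterns: map candidate bindings into buckets keyed by the join
# variable, then combine left/right bindings within each bucket.
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
]

def match(pattern, triple):
    """Return variable bindings if triple matches pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            binding[p] = t
        elif p != t:
            return None
    return binding

def join(pat1, pat2, var):
    buckets = defaultdict(lambda: ([], []))
    for t in triples:                    # map: bucket bindings by join key
        for i, pat in enumerate((pat1, pat2)):
            b = match(pat, t)
            if b is not None:
                buckets[b[var]][i].append(b)
    results = []                         # reduce: join within each bucket
    for left, right in buckets.values():
        for l in left:
            for r in right:
                results.append({**l, **r})
    return results

# evaluate:  ?x knows ?y  joined with  ?y knows ?z  on ?y
res = join(("?x", "knows", "?y"), ("?y", "knows", "?z"), "?y")
# yields the single chain alice -> bob -> carol
```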

Best Practice Guide Intel Xeon Phi v2.0

(Emanouil Atanassov, Michaela Barth, Mikko Byckling, Vali Codreanu, Nevena Ilieva, Tomas Karasek, Jorge Rodriguez, Sami Saarinen, Ole Widar Saastad, Michael Schliephake, Martin Stachon, Janko Strassburg, Volker Weinberg (Editor))

#XeonPhi #MIC #KNC #Intel #OpenMP #OpenACC

This Best Practice Guide provides information about Intel’s Many Integrated Core (MIC) architecture and the programming models for the first-generation Intel Xeon Phi coprocessor, code-named Knights Corner (KNC), to help programmers achieve good performance from their applications. The guide covers a wide range of topics, from a description of the Intel Xeon Phi hardware, through the basic programming models and advice on porting programs, to tools and strategies for analysing and improving application performance. Thanks to its highly parallel architecture and use of high-bandwidth memory, the MIC architecture allows higher performance than traditional CPUs for many types of scientific applications. The guide is based on the PRACE-3IP Intel Xeon Phi Best Practice Guide; new is the inclusion of information about applications, benchmarks, and European Intel Xeon Phi based systems.

Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose

(Shaohuai Shi, Pengfei Xu, Xiaowen Chu)

#GPU #CUDA #Performance #DeepLearning #BLAS #LinearAlgebra #MatrixMultiplication #Caffe #Package

Fully connected networks are widely used in deep learning, and their computational efficiency benefits greatly from cuBLAS matrix multiplication on the GPU. However, we found that cuBLAS has some drawbacks when calculating the product of matrix $\textbf{A}$ and the transpose of matrix $\textbf{B}$ (i.e., the NT operation). To reduce the impact of the NT operation in cuBLAS, we exploit an out-of-place transpose of matrix $\textbf{B}$ to avoid the NT operation, and then apply our method to Caffe, a popular deep learning tool. Our contribution is two-fold. First, we propose a naive method (TNN) and a model-based method (MTNN) to increase performance when calculating $\textbf{A}\times \textbf{B}^T$, achieving about 4.7 times higher performance in our tested cases on a GTX 1080 card. Second, we integrate the MTNN method into Caffe to improve the efficiency of training fully connected networks, achieving about 70% speedup over the original Caffe for our configured fully connected networks on a GTX 1080 card.
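The trick is simple to state: instead of asking the BLAS library for the NT product $\textbf{A}\times \textbf{B}^T$ directly, first materialize $\textbf{B}^T$ out of place (extra memory, paid once), then run the plain NN product. A tiny pure-Python sketch of the equivalence (the paper does this with cuBLAS on the GPU):

```python
# Sketch of the out-of-place transpose trick: compute A·Bᵀ as a plain
# NN product A·(Bᵀ), avoiding the NT GEMM case entirely.
def transpose(M):
    """Out-of-place transpose: allocates a new matrix."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Plain (NN) matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]          # we want A · Bᵀ

BT = transpose(B)             # one extra buffer, written once
C = matmul(A, BT)             # NN product replaces the NT product
# C == [[1*5+2*6, 1*7+2*8], [3*5+4*6, 3*7+4*8]] == [[17, 23], [39, 53]]
```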

Improved Lossless Image Compression Model Using Coefficient Based Discrete Wavelet Transform

(T. Velumani, S. Sukumaran)

#GPU #OpenCL #ImageProcessing #Compression #Algorithms

Compression is used in storage-related applications for audio/video, executable programs, text, source code, and so on. When compressing images into as little space as possible, the constraint lies in the multispectral form of the data with continuous images. In such a scenario, efficient lossless image compression is required so that the compression ratio can be improved and the computational complexity reduced. In this paper, we propose a model called Coefficient-based Discrete Wavelet Transform (CDWT) for lossless image compression, which improves the compression ratio and reduces the computational complexity involved in the transformation. The CDWT first partitions the image into coefficients to decide which coefficient values to consider for encoding. Next, a probability-based transformation for lossless compression of continuous images applies probability-based encoding to further reduce the computational complexity of the transformation. Extensive experiments carried out on the Waterloo colour images reveal the strong performance of the proposed CDWT model when benchmarked against well-established state-of-the-art schemes: CDWT achieves a significant increase in compression ratio, reducing the total error while compressing with minimal computational complexity compared with the other methods considered.
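The paper's coefficient-selection and probability-based encoding stages are not reproduced here, but the lossless wavelet building block they rest on can be sketched: an integer (S-transform) variant of the one-level Haar DWT, which is exactly invertible and therefore suitable for lossless coding. A minimal sketch:

```python
# One-level integer Haar (S-) transform: exactly invertible, so it can
# serve as the lossless front end of a wavelet image coder. Works on a
# 1-D row of even length; 2-D transforms apply it to rows then columns.
def haar_forward(x):
    """Split x into integer averages (low) and differences (high)."""
    low = [(a + b) // 2 for a, b in zip(x[0::2], x[1::2])]
    high = [a - b for a, b in zip(x[0::2], x[1::2])]
    return low, high

def haar_inverse(low, high):
    """Exact integer inverse of haar_forward."""
    out = []
    for l, h in zip(low, high):
        a = l + (h + 1) // 2
        out.extend([a, a - h])
    return out

row = [12, 10, 9, 9, 200, 198, 50, 52]
low, high = haar_forward(row)
restored = haar_inverse(low, high)      # bit-exact reconstruction
```

Smooth regions produce small `high` coefficients, which is what makes the subsequent entropy-coding stage effective.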

cellGPU: massively parallel simulations of dynamic vertex models

(Daniel M. Sussman)

#GPU #CUDA #Physics #Biology #Package

Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations.
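The geometry-dependent force laws mentioned above typically derive from a per-cell energy of the standard 2D vertex-model form $E = K_A(A - A_0)^2 + K_P(P - P_0)^2$, with cell area $A$ and perimeter $P$ computed from the polygon's vertices. A hedged single-cell sketch (cellGPU evaluates such terms for all cells in parallel; the parameter names here are illustrative):

```python
# Per-cell energy of a 2D vertex model, E = K_A(A-A0)^2 + K_P(P-P0)^2,
# for one polygonal cell: area via the shoelace formula, perimeter as
# the sum of edge lengths.
import math

def cell_energy(verts, A0, P0, KA=1.0, KP=1.0):
    edges = list(zip(verts, verts[1:] + verts[:1]))
    area = 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in edges))
    perim = sum(math.dist(v, w) for v, w in edges)
    return KA * (area - A0) ** 2 + KP * (perim - P0) ** 2

# A unit square at its preferred area and perimeter costs zero energy;
# changing the preferred area A0 makes the same shape cost (1 - A0)^2.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
e_relaxed = cell_energy(square, A0=1.0, P0=4.0)
e_frustrated = cell_energy(square, A0=2.0, P0=4.0)
```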

Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC

(Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, Debbie Marr)

#GPU #CUDA #FPGA #ASIC #NeuralNetworks #DeepLearning #Performance

Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracies. Binarized neural networks (BNNs) are a recently proposed, optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or -1, which is representable in one bit. This leads to dramatic improvements in algorithmic efficiency, due to the reduction in memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration. We first propose a BNN hardware accelerator design. We then implement the proposed accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compare them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Our evaluation shows that the FPGA provides superior efficiency over the CPU and GPU. Even though the CPU and GPU offer high peak theoretical performance, they are not as efficiently utilized, since BNNs rely on binarized bit-level operations that are better suited to custom hardware. Finally, even though the ASIC is still more efficient, the FPGA can provide orders of magnitude of efficiency improvement over software, without locking into a fixed ASIC solution.
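The bit-level operation that makes BNNs so hardware-friendly is worth spelling out: with weights and activations in {-1, +1} packed one bit per value into machine words, a dot product reduces to XNOR (count agreeing bits) plus popcount. A small sketch of that kernel, in software:

```python
# XNOR + popcount dot product, the core kernel of binarized neural
# networks: pack {-1,+1} vectors into integers, then each agreeing bit
# contributes +1 and each disagreeing bit contributes -1.
def pack(bits_pm1):
    """Pack a list of +1/-1 values into an int, one bit per value."""
    word = 0
    for i, v in enumerate(bits_pm1):
        if v == 1:
            word |= 1 << i
    return word

def bin_dot(a, b, n):
    """Dot product of two n-long {-1,+1} vectors given as packed words."""
    xnor = ~(a ^ b) & ((1 << n) - 1)   # bit set where inputs agree
    matches = bin(xnor).count("1")     # popcount
    return 2 * matches - n             # agree -> +1, disagree -> -1

x = [1, -1, 1, 1]
w = [1, 1, -1, 1]
d = bin_dot(pack(x), pack(w), 4)
# agreements at positions 0 and 3, so d == 2*2 - 4 == 0
```

One XNOR and one popcount replace n multiply-accumulates, which is why custom FPGA/ASIC datapaths exploit this far better than floating-point CPU or GPU units.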