Abstracts and Talk Materials
High Performance Computing and Emerging Architectures
January 10-14, 2011

Lorena Barba - Boston University

Reproducible results and open source code
Keller Hall 3-180

January 12, 2011 2:00 pm - 3:00 pm


The basis and perspectives of an exascale algorithm: our ExaFMM project
January 12, 2011 11:00 am - 12:00 pm

Linearly scaling algorithms will be crucial for the problem sizes that will be tackled in capability exascale systems. It is interesting to note that many of the most successful algorithms are hierarchical in nature, such as multi-grid methods and fast multipole methods (FMM). We have been leading development efforts for open-source FMM software for some time, and recently produced GPU implementations of the various computational kernels involved in the FMM algorithm. Most recently, we have produced a multi-GPU code, and performed scalability studies showing high parallel efficiency in strong scaling. These results have pointed to several features of the FMM that make it a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape. We propose that the FMM algorithm offers exceptional opportunities to enable exascale applications. Among its exascale-suitable features are: (i) it has intrinsic geometric locality, and access patterns are made local via particle indexing techniques; (ii) we can achieve temporal locality via an efficient queuing of GPU tasks before execution, and at a fine level by means of memory coalescing based on the natural index-sorting techniques; (iii) global data communication and synchronization, often a significant impediment to scalability, is a soft barrier for the FMM, where the most time-consuming kernels are, respectively, purely local (particle-to-particle interactions) and "hierarchically synchronized" (multipole-to-local interactions, which happen simultaneously at every level of the tree). 
In addition, we suggest a strategy for achieving the best algorithmic performance, based on two key ideas: (i) hybridize the FMM with treecode by choosing on-the-fly between particle-particle, particle-box, and box-box interactions, according to a work estimate; (ii) apply a dynamic error-control technique, effected on the treecode by means of a variable "box-opening angle" and on the FMM by means of a variable order of the multipole expansion. We have carried out preliminary implementation of these ideas/techniques, achieving a 14x speed-up with respect to our current published version of the FMM. Considering that this effort was only exploratory, we are certain to possess the potential for unprecedented performance with these algorithms.
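
The abstract does not say which particle indexing scheme is used. A common choice in tree codes is Morton (Z-order) indexing, so the following is only a minimal sketch under that assumption; the function names `interleave_bits` and `morton_sort` are hypothetical and are not taken from the ExaFMM code. Sorting particles by the Morton key of their leaf box places particles that are close in space close together in memory.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


def morton_sort(particles, level):
    """Sort 2D particles in the unit square by the Morton key of their leaf box.

    `level` is the tree depth; the leaf grid has 2**level boxes per side.
    This is an illustrative assumption, not the authors' implementation.
    """
    n = 1 << level

    def key(p):
        ix = min(int(p[0] * n), n - 1)  # clamp so that coordinate 1.0 stays in range
        iy = min(int(p[1] * n), n - 1)
        return interleave_bits(ix, iy, level)

    return sorted(particles, key=key)
```

After such a sort, all particles in one box occupy a contiguous range of the array, which is the property that makes coalesced memory access possible on a GPU.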

Richard Brower - Boston University

Algorithms for Lattice Field Theory at Extreme Scales
January 11, 2011 4:30 pm - 6:00 pm

Increases in computational power allow lattice field theories to resolve smaller scales, but to realize the full benefit for scientific discovery, new multi-scale algorithms must be developed to maximize efficiency. Examples of new trends in algorithms include adaptive multigrid solvers for the quark propagator and an improved symplectic Force Gradient integrator for the Hamiltonian evolution used to include the quark contribution to vacuum fluctuations in the quantum path integral. Future challenges to algorithms and software infrastructure targeting many-core GPU accelerators and heterogeneous extreme scale computing are discussed.


Mixed precision computing
Lind Hall 401

January 12, 2011 2:00 pm - 3:00 pm

Cris Cecka - Stanford University

Application of Assembly of Finite Element Methods on Graphics Processors for Real-Time Elastodynamics
January 12, 2011 8:30 am - 9:30 am

Keywords of the presentation: Finite Elements, FEM, GPGPU, GPU, CUDA, High Performance Computing, HPC, Mechanics

We discuss multiple strategies to perform general computations on unstructured grids using a GPU, with specific application to the assembly of systems of equations in finite element methods (FEMs). For each method, we discuss the GPU hardware's limiting resources, optimizations, key data structures, and dependence of the performance with respect to problem size, element size, and GPU hardware generation. These methods are applied to a nonlinear hyperelastic material model to develop a large-scale real-time interactive elastodynamic visualization. By performing the assembly, solution, update, and visualization stages solely on the GPU, the simulation benefits from speed-ups in each stage and avoids costly GPU-CPU transfers of data.


Jonathan Cohen - NVIDIA Corporation

Thinking parallel: sparse iterative solvers with CUDA
January 11, 2011 8:30 am - 9:30 am

Keywords of the presentation: domain decomposition, data parallelism, multi-level methods, gpu computing, linear solvers

Iterative sparse linear solvers are a critical component of a scientific computing platform. Developing effective preconditioning strategies is the main challenge in developing iterative sparse solvers on massively parallel systems. As computing systems become increasingly power-constrained, memory hierarchies for massively parallel systems will become deeper and more hierarchical. Parallel algorithms with all-to-all communication patterns that assume uniform memory access times will be inefficient on these systems. In this talk, I will outline the challenges of developing good parallel preconditioners, and demonstrate that domain decomposition methods have communication patterns that match emerging parallel platforms. I will present recent work to develop restricted additive Schwarz (RAS) preconditioners as part of the open source 'cusp' library of sparse parallel algorithms. On 2d Poisson problems, a RAS preconditioner is consistently faster than diagonal preconditioning in time-to-solution. Detailed analysis demonstrates that the communication pattern of RAS matches the on-chip bandwidths of a Fermi GPU. Line smoothing, which requires solving a large number of small tridiagonal linear systems in local memory, is another preconditioning approach with similar communication patterns. I will conclude with a roadmap for developing a range of preconditioners, smoothers, and linear solvers on massively parallel hardware based on the domain decomposition and line smoothing approaches.
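
To make the line smoothing step concrete, here is a minimal serial sketch of solving many small, independent tridiagonal systems, one per grid line. It uses the standard Thomas algorithm as a stand-in; the abstract does not state which tridiagonal solver is used on the GPU, and the names `solve_tridiagonal` and `line_smooth` are hypothetical rather than part of the 'cusp' API.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused), diag[i] multiplies x[i],
    upper[i] multiplies x[i+1] (upper[-1] is unused). No pivoting is done, so
    the matrix is assumed to be diagonally dominant or otherwise safe.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def line_smooth(lines):
    """Solve one independent tridiagonal system per grid line.

    Each entry of `lines` is a (lower, diag, upper, rhs) tuple. The systems
    do not depend on each other, so each could be assigned to its own
    thread block in a parallel implementation.
    """
    return [solve_tridiagonal(*line) for line in lines]
```

The independence of the per-line systems is the point of the example: all communication stays inside one small system, which is why the approach maps well to local on-chip memory.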


Jack Dongarra - University of Tennessee

Architecture-aware Algorithms and Software for Scalable Performance and Resilience on Heterogeneous Architectures
January 10, 2011 11:00 am - 12:00 pm

In this talk we examine how high performance computing has changed over the last 10 years and look toward the future in terms of trends. These changes have had and will continue to have a major impact on our software. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder. We will look at five areas of research that will have an important impact on the development of software and algorithms. We will focus on the following themes:
  • Redesign of software to fit multicore architectures
  • Automatically tuned application software
  • Exploiting mixed precision for performance
  • The importance of fault tolerance
  • Communication avoiding algorithms

Anne Elster - Norwegian University of Science and Technology (NTNU)

Medical Imaging on the GPU Using OpenCL: 3D Surface Extraction and 3D Ultrasound Reconstruction
January 11, 2011 4:30 pm - 6:00 pm

Collaborators: Frank Linseth, Holger Ludvigsen, Erik Smistad and Thor Kristian Valgerhaug

GPUs offer a lot of compute power, enabling real-time processing of images. This poster depicts some of our group's recent work on image processing for medical applications on GPUs, including 3D surface extraction using marching cubes and 3D ultrasound reconstruction. We have previously developed Cg and CUDA codes for wavelet transforms and CUDA codes for surface extraction for seismic images.


Real-Time Medical and Geological Processing on GPU-based Systems: Experiences and Challenges
January 14, 2011 9:30 am - 10:30 am

Keywords of the presentation: GPUs, real-time applications, medical imaging, geological imaging, seismic processing

GPUs are now massive floating-point stream processors that offer a source of energy-efficient compute power on our laptops and desktops. Recent development of tools such as CUDA and OpenCL has made it much easier to utilize the computational power these systems offer. However, in order to optimally harness the power of these GPU-based systems, there are still many challenges to overcome. In this talk, several issues related to our experiences with medical and geological processing applications that can benefit from real-time processing of data on GPUs will be discussed. These include real-time medical imaging, e.g. for ultrasound-guided discovery and surgery, real-time seismic CT image enhancement, and using GPUs for real-time compression of seismic data in order to lower I/O latency. This talk will highlight work our research group has been involved in from 2006 through today.


Allan Engsig-Karup - Technical University of Denmark

Development of a new massively parallel tool for nonlinear free surface wave simulation
January 11, 2011 4:30 pm - 6:00 pm

The research objective of this work is to develop a new dedicated and massively parallel tool for efficient simulation of unsteady nonlinear free surface waves. The tool will be used for applications in coastal and offshore engineering, e.g. in connection with prediction of wave kinematics and forces at or near human-made structures. The tool is based on a unified potential flow formulation which can account for fully nonlinear and dispersive wave motion over uneven depths under the assumptions of nonbreaking waves, irrotational and inviscid flow. This work is a continuation of earlier work and will continue to contribute to advancing state-of-the-art for efficient wave simulation. The tool is expected to be orders of magnitude faster than current tools due to efficient algorithms and utilization of available hardware resources.


Development of Desktop Computing Applications and Engineering Tools on GPUs
January 11, 2011 4:30 pm - 6:00 pm

GPULab, a competence center and laboratory for research and collaboration within academia and partners in industry, was established in 2008 at the Section for Scientific Computing, DTU Informatics, Technical University of Denmark. In GPULab we focus on the utilization of Graphics Processing Units (GPUs) for high-performance computing applications and software tools in science and engineering, inverse problems, visualization, imaging, and dynamic optimization. The goals are to contribute to the development of new state-of-the-art mathematical models and algorithms for maximum throughput performance, improved performance profiling tools and assimilation of results to academic and industrial partners in our network. Our approach calls for multi-disciplinary skills and understanding of hardware, software development, profiling tools and tuning techniques, analytical methods for analysis and development of new approaches, together with expert knowledge in specific application areas within science and engineering. We anticipate that our research will, in the near future, bring new algorithms and insight in engineering and science applications targeting practical engineering problems.

Geoffrey Fox - Indiana University

Clouds MapReduce and HPC
January 13, 2011 3:00 pm - 4:00 pm

Keywords of the presentation: MPI MapReduce Clouds Grids

1) We analyze the different tradeoffs and goals of Grid, Cloud and parallel (cluster/supercomputer) computing. 2) They trade off performance, fault tolerance, ease of use (elasticity), cost, and interoperability. 3) Different application classes (characteristics) fit different architectures and we describe a hybrid model with Grids for data, traditional supercomputers for large scale simulations and clouds for broad based "capacity computing" including many data intensive problems. 4) We discuss the impressive features of cloud computing platforms and compare MapReduce and MPI. 5) We take most of our examples from the life science area. 6) We conclude with a description of FutureGrid -- a TeraGrid system for prototyping new middleware and applications.


Martin Gander - Universite de Geneve

A Domain Decomposition Method that Converges in Two Iterations for any Subdomain Decomposition and PDE
January 11, 2011 4:30 pm - 6:00 pm

Joint work with Felix Kwok.

All domain decomposition methods are based on a decomposition of the physical domain into many subdomains and an iteration, which uses subdomain solutions only (and maybe a coarse grid), in order to compute an approximate solution of the problem on the entire domain. We show in this poster that it is possible to formulate such an iteration, only based on subdomain solutions, which converges in two steps to the solution of the underlying problem, independently of the number of subdomains and the PDE solved. This method is mainly of theoretical interest, since it contains sophisticated non-local operators (and a natural coarse grid component), which need to be approximated in order to obtain a practical method.

Gaurav Gaurav - University of Minnesota, Twin Cities

Efficient Uncertainty Quantification using GPUs
January 11, 2011 4:30 pm - 6:00 pm

Joint work with Steven F. Wojtkiewicz (Department of Civil Engineering, University of Minnesota, Minneapolis, MN 55414, USA. bykvich@umn.edu).

Graphics processing units (GPUs) have emerged as a much more economical and highly competitive alternative to CPU-based parallel computing. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing equivalents by up to two orders of magnitude in certain applications. Moreover, the portability of the GPUs enables even a desktop computer to provide a teraflop (10^12 floating point operations per second) of computing power. This study presents the gains in computational efficiency obtained using the GPU-based implementations of five types of algorithms frequently used in uncertainty quantification problems arising in the analysis of dynamical systems with uncertain parameters and/or inputs.

Mike Giles - University of Oxford

OP2: an open-source library for unstructured grid applications
January 13, 2011 11:00 am - 12:00 pm

Keywords of the presentation: parallel computing, GPU, CUDA, unstructured grids

Based on an MPI library written over 10 years ago, OP2 is a new open-source library which is aimed at application developers using unstructured grids. Using a single API, it targets a variety of backend architectures, including both manycore GPUs and multicore CPUs with vector units. The talk will cover the API design, key aspects of the parallel implementation on the different platforms, and preliminary performance results on a small but representative CFD test code.


Leopold Grinberg - Brown University

Brain Perfusion: Multi-scale Simulations and Visualization
January 11, 2011 4:30 pm - 6:00 pm

Joint work with J. Insley, M. Papka, and G. E. Karniadakis.

Interactions of blood flow in the human brain occur between different scales, determined by flow features in the large arteries (above 0.5mm diameter), arterioles, and the capillaries (of 5E-3 mm). To simulate such multi-scale flow we develop mathematical models, numerical methods, scalable solvers and visualization tools. Our poster will present NektarG - a research code developed at Brown University for continuum and atomistic simulations. NektarG is based on a high-order spectral/hp element discretization featuring multi-patch domain decomposition for continuum flow simulations, and modified DPD-LAMMPS for mesoscopic simulations. The continuum and atomistic solvers are coupled via Multi-level Communicating Interface to exchange data required by interface conditions. The visualization software is based on ParaView and NektarG utilities accessed through the ParaView GUI. The new visualization software allows data computed in coupled (multi-scale) simulations to be presented simultaneously. The software automatically synchronizes the display of time evolution of solutions at multiple scales.


Ultraparallel solvers for multi-scale brain blood flow simulations on exascale computers
January 13, 2011 2:00 pm - 3:00 pm

Keywords of the presentation: Multi-scale and multi-physics solvers; heterogeneous computer architecture; exaflop computing; blood flow simulations

Solvers for coupled multi-scale (multi-physics) problems may be constructed by coupling an array of existing and well tested parallel numerical solvers, each designed to tackle a problem at a different spatial and temporal scale. Each solver can be optimized/designed for a different computer architecture. Future supercomputers may be composed of heterogeneous processing units, i.e., CPU/GPU. To make efficient use of computational resources, the coupled solvers must support topology-aware mapping of tasks to the processing units where the best parallel efficiency can be achieved.

Arterial blood circulation is a multi-scale process where time and space scales range from nanoseconds (nanometers) to seconds (meters), respectively. The macro-vascular scales describing the flow dynamics in larger vessels are coupled to the meso-vascular scales unfolding dynamics of individual blood cells. The meso-vascular events are coupled to the micro-vascular ones accounting for blood perfusion, clot formation, adhesion of the blood cells to the arterial walls, etc. Besides the multi-scale nature of the problem, its size often presents a substantial computational challenge even for simulations considering a single scale.

In this talk we will try to envision the design of a multi-scale solver for blood flow simulations, tailored to heterogeneous computer architecture.

Dominik Göddeke - Universität Dortmund
Robert Strzodka - Max-Planck-Institut für Informatik

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers
January 11, 2011 4:30 pm - 6:00 pm

We present efficient fine-grained parallelization techniques for robust multigrid solvers and Krylov subspace schemes, in particular for numerically strong smoothing and preconditioning operators. We apply them to sparse ill-conditioned linear systems of equations that arise from grid-based discretization techniques like finite differences, volumes and elements; the systems are notoriously hard to solve due to severe anisotropies in the underlying mesh and differential operator. These strong smoothers are characterized by sequential data dependencies, and do not parallelize in a straightforward manner. For linewise preconditioners, exact parallel algorithms exist, and we present a novel, efficient implementation of a cyclic reduction tridiagonal solver. For other preconditioners, traditional wavefront techniques can be applied, but their irregular and limited parallelism makes them a bad match for GPUs. Therefore, we discuss multicoloring techniques to recover parallelism in these preconditioners, by decoupling some of the dependencies at the expense of at first reduced numerical performance. However, by carefully balancing the coupling strength (more colors) with the parallelization benefits, the multicolored variants retain almost all of the sequential numerical performance. Further improvements are achieved by merging the tridiagonal and Gauß-Seidel approach into a smoothing operator that combines their advantages, and by employing an alternating direction implicit scheme to gain independence of the numbering of the unknowns. Due to their advantageous numerical properties, multigrid solvers equipped with strong smoothers are between four and eight times more efficient than with simple Gauß-Seidel preconditioners, and we achieve speedup factors between six and 18 with the GPU implementations over carefully tuned CPU variants.
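
The multicoloring idea can be illustrated with its simplest instance, two-color (red-black) Gauss-Seidel for the 1D Poisson problem. This is a sketch under simplifying assumptions (uniform grid, Dirichlet boundaries, only two colors); it is not the authors' smoother, and the function name `red_black_gauss_seidel` is hypothetical.

```python
def red_black_gauss_seidel(u, f, h, sweeps):
    """Two-color Gauss-Seidel for -u'' = f on a uniform 1D grid.

    u[0] and u[-1] hold fixed Dirichlet boundary values and are never updated.
    Points of one color depend only on points of the other color, so each
    inner loop could run fully in parallel.
    """
    n = len(u)
    for _ in range(sweeps):
        for start in (1, 2):  # color 1: odd indices, color 2: even indices
            for i in range(start, n - 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u
```

In a lexicographic Gauss-Seidel sweep, point i must wait for point i-1. Coloring removes that chain: within one color no point reads another point of the same color, at the price of a different update order and hence possibly different convergence behavior.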

Michael Heroux - Sandia National Laboratories

Exascale programming models
Lind Hall 409

January 12, 2011 2:00 pm - 3:00 pm


Emerging Programming and Machine Models: Opportunities for Numerical Algorithms R&D
January 14, 2011 11:00 am - 12:00 pm

Keywords of the presentation: high performance computing, numerical linear algebra, parallel computing, manycore architectures, iterative methods

After 15-20 years of architectural stability, we are in the midst of a dramatic change in high performance computing systems design. In this talk we discuss the commonalities across the viable systems of today, and look at opportunities for numerical algorithms research and development. In particular, we explore possible programming and machine abstractions and how we can develop effective algorithms based on these abstractions, addressing, among other things, robustness issues for preconditioned iterative methods and resilience of algorithms in the presence of soft errors.

Elizabeth Jessup - University of Colorado

The Build to Order Compiler for Matrix Algebra Optimization
January 11, 2011 4:30 pm - 6:00 pm

The performance of many high performance computing applications is limited by data movement from memory to the processor. Often their cost is more accurately expressed in terms of memory traffic rather than floating-point operations and, to improve performance, data movement must be reduced. One technique to reduce memory traffic is the fusion of loops that access the same data. We have built the Build to Order (BTO) compiler to automate the fusion of loops in matrix algebra kernels. Loop fusion often produces speedups proportional to the reduction in memory traffic, but it can also lead to negative effects in cache and register use. We present the results of experiments with BTO that help us to understand the workings of loop fusion.
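
As a hand-written illustration of what loop fusion means here (not output of the BTO compiler, and with hypothetical function names), consider computing both r = A x and s = A^T y. The unfused version reads every entry of A twice; the fused version reads each entry once and uses it for both products.

```python
def unfused(A, x, y):
    """Compute r = A x and s = A^T y with two separate passes over A."""
    rows, cols = len(A), len(A[0])
    r = [0.0] * rows
    for i in range(rows):
        for j in range(cols):
            r[i] += A[i][j] * x[j]
    s = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            s[j] += A[i][j] * y[i]
    return r, s


def fused(A, x, y):
    """Compute the same r and s, reading each entry of A only once."""
    rows, cols = len(A), len(A[0])
    r = [0.0] * rows
    s = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            a = A[i][j]  # single load feeds both products
            r[i] += a * x[j]
            s[j] += a * y[i]
    return r, s
```

The fused loop halves the traffic for A but keeps both r and s live at once, which hints at the cache and register pressure the abstract mentions as a possible downside.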

David Keyes - King Abdullah University of Science & Technology, Columbia University

The Exascale: Why and How
January 10, 2011 9:30 am - 10:30 am

Sustained floating-point computation rates on real applications, as tracked by the ACM Gordon Bell Prize, increased by three orders of magnitude from 1988 (1 Gigaflop/s) to 1998 (1 Teraflop/s), and by another three orders of magnitude to 2008 (1 Petaflop/s). Computer engineering provided only a couple of orders of magnitude of improvement for individual cores over that period; the remaining factor came from concurrency, which is approaching one million-fold.

Algorithmic improvements contributed meanwhile to making each flop more valuable scientifically. As the semiconductor industry now slips relative to its own roadmap for silicon-based logic and memory, concurrency, especially on-chip many-core concurrency and GPGPU SIMD-type concurrency, will play an increasing role in the next few orders of magnitude, to arrive at the ambitious target of 1 Exaflop/s, extrapolated for 2018. An important question is whether today's best algorithms are efficiently hosted on such hardware and how much co-design of algorithms and architecture will be required.

From the applications perspective, we illustrate eight reasons why today's computational scientists have an insatiable appetite for such performance: resolution, fidelity, dimension, artificial boundaries, parameter inversion, optimal control, uncertainty quantification, and the statistics of ensembles.

The paths to the exascale summit are debated, but all are narrow and treacherous, constrained by fundamental laws of physics, cost, power consumption, programmability, and reliability. Drawing on recent reports, workshops, vendor projections, and experiences with scientific codes on contemporary platforms, we propose roles for today's researchers in one of the great global scientific quests of the next decade.

Andreas Klöckner - New York University

High-order DG Wave Propagation on GPUs: Infrastructure, Implementation, Method Improvements
January 12, 2011 3:00 pm - 4:00 pm

Having recently shown that high-order unstructured discontinuous Galerkin (DG) methods are a discretization method for systems of hyperbolic conservation laws that is well-matched to execution on GPUs, in this talk I will explore both core and supporting components of high-order DG solvers for their suitability for and performance on modern, massively parallel architectures. Components examined range from software components facilitating implementation to strategies for automated tuning and, time permitting, numerical tweaks to the method itself. In concluding, I will present a selection of further design considerations and performance data.

Matthew Knepley - University of Chicago

GPU programming from higher level representations
January 10, 2011 4:30 pm - 5:30 pm

We discuss the construction and execution of GPU kernels from higher level specifications. Examples will be shown using low-order finite elements and the fast multipole method.

Hugo Leclerc - École Normale Supérieure de Cachan

Global symbolic manipulations and code generation for Finite Elements on SIM[DT] hardware
January 11, 2011 4:30 pm - 6:00 pm

Tools have been developed to generate code to solve partial differential equations from high level descriptions (manipulation of files, global operators, ...). The successive symbolic transformations lead to a macroscopic description of the code to be executed, which can thus be translated into x86 (SSEx), C++ or CUDA code. The point emphasized here is that the different processes can be adapted to the target hardware, taking into account the ratio gflops / gbps (making e.g. the choice between re-computations or cache), the SIM[DT] abilities, ... The poster will present the gains (compared to classical CPU/GPU implementations) for two implementations of a 3D unstructured FEM solver, using respectively a conjugate gradient and a domain decomposition method with repetitive patterns.

Miriam Leeser - Northeastern University

GPU Acceleration in a Modern Problem Solving Environment: SCIRun's Linear System Solvers
January 11, 2011 4:30 pm - 6:00 pm

This research demonstrates the incorporation of GPU's parallel processing architecture into the SCIRun biomedical problem solving environment with minimal changes to the environment or user experience. SCIRun, developed at the University of Utah, allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time consuming algorithms present in these simulations. Specifically, we target the linear solver module, which contains multiple solvers that benefit from GPU hardware. We have created a class to accelerate the conjugate gradient, Jacobi and minimal residual linear solvers; the results demonstrate that the GPU can provide acceleration in this environment. A principal focus was to remain transparent by retaining the user friendly experience to the scientist using SCIRun's graphical user interface. NVIDIA's CUDA C language is used to enable performance on NVIDIA GPUs. Challenges include manipulating the sparse data processed by these algorithms and communicating with the SCIRun interface amidst computation. Our solution makes it possible to implement GPU versions of the existing SCIRun algorithms easily and can be applied to other parallel algorithms in the application. The GPU executes the matrix and vector arithmetic to achieve acceleration performance of up to 16x on the algorithms in comparison to SCIRun's existing multithreaded CPU implementation. The source code will contain single and double precision versions to utilize a wide variety of GPU hardware and will be incorporated and publicly available in future versions of SCIRun.


The Challenges of Writing Portable, Correct and High Performance Libraries for GPUs or How to Avoid the Heroics of GPU Programming
January 10, 2011 3:00 pm - 4:00 pm

Keywords of the presentation: GPU programming, CUDA, correctness, Biomedical Imaging, Floating Point, GPU Libraries

We live in the age of heroic programming for scientific applications on Graphics Processing Units (GPUs).  Typically a scientist chooses an application to accelerate and a target platform, and through great effort maps their application to that platform.   If they are a true hero, they achieve two or three orders of magnitude speedup for that application and target hardware pair.  The effort required includes a deep understanding of the application,  its implementation and the target architecture.  When a new, higher performance architecture becomes available additional heroic acts are required.  There is another group of scientists who prefer to spend their time focused on the application level rather than lower levels.  These scientists would like to use GPUs for their applications, but would prefer to have parameterized library components available that deliver high performance without requiring heroic efforts on their part.  The library components should be easy to use and should support a wide range of user input parameters.  They should exhibit good performance on a range of different GPU platforms, including future architectures.  Our research focuses on creating such libraries.  We have been investigating parameterized library components for use with Matlab/Simulink and with the SCIRun Biomedical Problem Solving Environment from the University of Utah.  In this talk I will discuss our library development efforts and challenges to achieving high performance across a range of both application and architectural parameters. I will also focus on issues that arise in achieving correct behavior of GPU kernels.  One issue is  correct behavior with respect to thread synchronization.  Another is knowing whether or not your scientific application that uses floating point is correct when the results differ depending on the target architecture and order of computation.  


David Mayhew - Advanced Micro Devices

I See GPU Shapes in the Clouds
January 14, 2011 8:30 am - 9:30 am

Fusion (the integration of CPU and GPU into a single processing entity) is here. Cloud based software services are here. Large processing clusters are running massively parallel Hadoop programs now. Can large-scale, commercial, enterprise, server solutions be dynamically repurposed to run HPC problem sets? The future of HPC may well be a massive set of virtual machines running in "curve of the earth" sized data centers. The cost of HPC processing sponges (HPC problem sets that consume otherwise wasted processing cycles in scale-out server clusters) will probably make all but the most extreme purpose-built HPC systems obsolete.

Dan Negrut - University of Wisconsin, Madison

Large Scale Frictional Contact Dynamics on the GPU
January 11, 2011 11:00 am - 12:00 pm

Keywords of the presentation: many-body dynamics, friction and contact, GPU computing, engineering applications

This talk summarizes an effort at the Modeling, Simulation and Visualization Center at the University of Wisconsin-Madison to model and simulate large scale discrete dynamics problems. This effort is motivated by a desire to address unsolved challenges posed by granular dynamics problems, mobility of tracked and wheeled vehicles on granular terrain, and digging into granular material, to name a few. In the context of simulating the dynamics of large systems of interacting rigid bodies, we briefly outline a method for solving large cone complementarity problems by means of a fixed-point iteration algorithm. The method is an extension of the Gauss-Jacobi algorithms with over-relaxation for symmetric convex complementarity problems. Convergent under fairly standard assumptions, the method is implemented in a scalable parallel computational framework by using a single instruction multiple data (SIMD) execution paradigm supported by the Compute Unified Device Architecture (CUDA) library for programming on the graphical processing unit (GPU). The simulation framework developed supports the analysis of problems with more than one million rigid bodies that interact through contact and friction forces, and whose dynamics are constrained by either unilateral or bilateral kinematic constraints. Simulation thus becomes a viable tool for investigating in the near future the dynamics of complex systems such as the Mars Rover operating on granular terrain, powder composites, and granular material flow. The talk concludes with a short summary of other applications that stand to benefit from the computational power available on today’s GPUs.
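
The flavor of such a fixed-point iteration can be sketched on a much simpler problem. The code below applies a projected Jacobi iteration with a relaxation factor to a linear complementarity problem, where the projection is onto the nonnegative orthant rather than onto friction cones. The simplification, the function name `projected_jacobi`, and the default value of `omega` are all assumptions made for illustration; this is not the method presented in the talk.

```python
def projected_jacobi(N, q, omega=0.8, iterations=200):
    """Projected Jacobi for: x >= 0, N x + q >= 0, x . (N x + q) = 0.

    N is a dense symmetric matrix given as a list of rows, assumed here to be
    strictly diagonally dominant with a positive diagonal. `omega` is the
    relaxation factor (an illustrative choice, not a recommended value).
    """
    n = len(q)
    x = [0.0] * n
    for _ in range(iterations):
        # Jacobi: every component is updated from the previous iterate only,
        # so all n updates are independent and could run in parallel.
        residual = [sum(N[i][j] * x[j] for j in range(n)) + q[i] for i in range(n)]
        # Relaxed step followed by projection onto the feasible set x >= 0.
        x = [max(0.0, x[i] - omega * residual[i] / N[i][i]) for i in range(n)]
    return x
```

Because each component update reads only the previous iterate, the same instruction can be applied to every unknown at once, which is what makes Jacobi-type schemes a natural fit for a SIMD execution model.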


Nayda Santiago - University of Puerto Rico

Hyperspectral Image Analysis for Abundance Estimation using GPUs
January 11, 2011 4:30 pm - 6:00 pm

Hyperspectral images can be used for abundance estimation and anomaly detection; however, the algorithms involved tend to be I/O intensive. Parallelizing these algorithms can enable their use in real-time applications. One way of overcoming these limitations is to select parallelizable algorithms and implement them on GPUs. GPUs are designed as throughput engines, built to process large amounts of dense data in a parallel fashion. RX detectors and abundance estimators will be parallelized and tested for correctness and performance.

Olaf Schenk - Universität Basel

A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures
January 11, 2011 9:30 am - 10:30 am

Keywords of the presentation: stencil operators, manycores, autotuning

Stencil calculations comprise an important class of kernels in many scientific computing applications ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such types of solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial in order to reduce the time to solution. However, in the current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full compute power. We present a code generation and auto-tuning framework PATUS for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units, which makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and leverages the autotuning methodology to optimize strategy-dependent parameters for the given hardware architecture.
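
To make the idea of tuning strategy-dependent parameters concrete, here is a minimal sketch under stated assumptions. It is not PATUS and does not use its specification language: the functions `stencil_sweep` and `autotune`, the 5-point averaging stencil, and the candidate tile sizes are hypothetical choices for illustration. The stencil is traversed in square tiles, and a brute-force search times each candidate tile size on the current machine and keeps the fastest.

```python
import time


def stencil_sweep(grid, tile):
    """One 5-point averaging sweep over the interior, visited in tile x tile blocks."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are carried over unchanged
    for ii in range(1, n - 1, tile):
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


def autotune(grid, candidates=(4, 8, 16, 32), repeats=3):
    """Time every candidate tile size and return the fastest one on this machine."""
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            stencil_sweep(grid, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile


grid = [[float(i * j) for j in range(64)] for i in range(64)]
best = autotune(grid)
```

The tile size changes only the traversal order, not the result, so it can be tuned freely. A real autotuner would search a larger parameter space (loop unrolling, thread block shapes, and so on) with a smarter search strategy than exhaustive timing.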

Bertil Schmidt - Nanyang Technological University

Algorithms and Tools for Bioinformatics on GPUs
January 13, 2011 8:30 am - 9:30 am

Keywords of the presentation: CUDA, Bioinformatics, Next-generation sequencing

The enormous growth of biological sequence data has moved bioinformatics rapidly towards becoming a data-intensive computational science. As a result, the computational power needed by bioinformatics applications is growing rapidly as well. The recent emergence of parallel accelerator technologies such as GPUs has made it possible to significantly reduce the execution times of many bioinformatics applications. In this talk I will present the design and implementation of scalable GPU algorithms based on the CUDA programming model in order to accelerate important bioinformatics applications. In particular, I will focus on algorithms and tools for next-generation sequencing (NGS) using error correction as an example. Detection and correction of sequencing errors is an important but time-consuming pre-processing step for de-novo genome assembly or read mapping. In this talk, I discuss the parallel algorithm design used for the CUDA-EC and DecGPU tools. I will also give an overview of other CUDA-enabled tools developed by my research group.
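
As a rough illustration of what read error correction involves, the following sketch uses a generic k-mer spectrum approach. It is an assumption for illustration only and is not taken from CUDA-EC or DecGPU: the function names, the count threshold `min_count`, and the toy reads are hypothetical. A k-mer seen at least `min_count` times across all reads is treated as trusted ("solid"), and a read containing untrusted k-mers is repaired by the first single-base substitution that makes all of its k-mers solid.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Return the set of k-mers occurring at least min_count times in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, count in counts.items() if count >= min_count}


def correct_read(read, solid, k):
    """Repair a read with one substitution if that makes every k-mer solid."""
    def all_solid(seq):
        return all(seq[i:i + k] in solid for i in range(len(seq) - k + 1))

    if all_solid(read):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if all_solid(candidate):
                return candidate
    return read  # no single substitution helps; leave the read unchanged


# Toy data: five error-free copies and one read with a single wrong base.
reads = ["ACGTTGCAAGC"] * 5 + ["ACGTTCCAAGC"]
solid = solid_kmers(reads, k=4)
fixed = correct_read("ACGTTCCAAGC", solid, k=4)
```

Every read can be corrected independently once the k-mer spectrum is built, which is what makes this kind of pre-processing step a natural candidate for GPU parallelization.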

Mark Stock - Applied Scientific Research

A GPU-accelerated Boundary Element Method and Vortex Particle Method
January 11, 2011 4:30 pm - 6:00 pm

Vortex particle methods, when combined with multipole-accelerated boundary element methods (BEM), become a complete tool for direct numerical simulation (DNS) of internal or external vortex-dominated flows. In previous work, we presented a method to accelerate the vorticity-velocity inversion at the heart of vortex particle methods by performing a multipole treecode N-body method on parallel graphics hardware. The resulting method achieved a 17-fold speedup over a dual-core CPU implementation. In the present work, we will demonstrate both an improved algorithm for the GPU vortex particle method that outperforms an 8-core CPU by a factor of 43, and a GPU-accelerated multipole treecode method for the boundary element solution. The new BEM solves for the unknown source, dipole, or combined strengths over a triangulated surface using all available CPU cores and GPUs. Problems with up to 1.4 million unknowns can be solved on a single commodity desktop computer in one minute, and at that size the hybrid CPU/GPU outperforms a quad-core CPU alone by 22.5 times. The method is exercised on DNS of impulsively-started flow over spheres at Re=500, 1000, 2000, and 4000.


Algorithmic Fluid Art – Influences, Process, and Works
January 13, 2011 9:30 am - 10:30 am

In addition to my research into vortex particle methods, parallel N-body methods, and GPU programming, I create artwork using these same computer programs. The work consists of imagery and animations of fluid forms and other shapes and patterns in nature. Using relatively simple algorithms reflecting the origins of their underlying processes, many of these patterns can be recreated and their inherent beauty exposed. In this talk, I will discuss the technical aspects of my work, but mainly plan to distract attention with the works themselves.


Mark Stock earned his PhD from Aerospace Engineering at the University of Michigan in 2006, and has been working for Applied Scientific Research in Santa Ana, CA since then. He has been creating computer imagery and numerical simulations for over 25 years, and started exhibiting his artwork in 2001.

Robert Strzodka - Max-Planck-Institut für Informatik

Everyday Parallelism
January 10, 2011 2:00 pm - 3:00 pm

Parallelism is largely seen as a necessary evil to cope with the power restrictions on a chip, and most programmers would prefer to continue writing sequential programs rather than deal with alien and error-prone parallel programming. This talk will question this view and point out how the allegedly unfamiliar parallel processing is utilized by millions of people every day. Parallelism appears as a curse only when looking at it from the crooked illusion of sequential processing. Admittedly, there are critical decisions associated with specialization, data movement or synchronization, but we also have lots of experience in taking them because they are performed every day. Presented results will demonstrate that the drawn analogies are not just theoretical.

Keita Teranishi - CRAY Inc

Locally-Self-Consistent Multiple-Scattering code (LSMS) for GPUs
January 11, 2011 4:30 pm - 6:00 pm

Locally-Self-Consistent Multiple-Scattering (LSMS) is one of the major petascale applications and is highly tuned for supercomputer systems like the Cray XT5 Jaguar. We present our recent effort on porting and tuning the major computational routine of LSMS to GPU-based systems to demonstrate the feasibility of LSMS beyond petaflops. In particular, we discuss the techniques used, including auto-tuning of dense matrix kernels and computation-communication overlap.

Jonas Tölke - Ingrain

Digital rocks physics: fluid flow in rocks
January 11, 2011 4:30 pm - 6:00 pm

We show how Ingrain's digital rock physics technology works to predict fluid flow properties in rocks. NVIDIA CUDA technology delivers significant acceleration for this technology. The simulator on NVIDIA hardware enables us to perform pore-scale multi-phase (oil-water-matrix) simulations in natural porous media and to predict important rock properties like absolute permeability, relative permeabilities, and capillary pressure.


Lattice Boltzmann Multi-Phase Simulations in Porous Media using GPUs
January 12, 2011 9:30 am - 10:30 am

Keywords of the presentation: CUDA, lattice Boltzmann

We present a very efficient implementation of a multiphase lattice Boltzmann method (LBM) based on CUDA. This technology delivers significant benefits for predictions of properties in rocks. The simulator on NVIDIA hardware enables us to perform pore-scale multi-phase (oil-water-matrix) simulations in natural porous media and to predict important rock properties like absolute permeability, relative permeabilities, and capillary pressure. We will show videos of these simulations in complex real-world porous media and rocks.

Ulrike Yang - Lawrence Livermore National Laboratory

Preparing Algebraic Multigrid for Exascale
January 11, 2011 4:30 pm - 6:00 pm

Algebraic Multigrid (AMG) solvers are an essential component of many large-scale scientific simulation codes. Their continued numerical scalability and efficient implementation are critical for preparing these codes for exascale. Our experiences on modern multi-core machines show that significant challenges must be addressed for AMG to perform well on such machines. We discuss our experiences and describe the techniques we have used to overcome scalability challenges for AMG on hybrid architectures in preparation for exascale.

Rio Yokota - Boston University

Fast Multipole Methods on large cluster of GPUs
January 11, 2011 4:30 pm - 6:00 pm

The combination of algorithmic acceleration and hardware acceleration can have tremendous impact. The FMM is a fast algorithm for calculating matrix-vector multiplications in O(N) time, and it runs very fast on GPUs. Its combination of a high degree of parallelism and O(N) complexity makes it an attractive solver for the peta-scale and exa-scale era. It has a wide range of applications, e.g. quantum mechanics, molecular dynamics, electrostatics, acoustics, structural mechanics, fluid mechanics, and astrophysics.
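
The source of the algorithmic acceleration can be hinted at with a very small sketch. It is an illustration under stated assumptions, not the FMM itself: it uses a 1/r kernel, keeps only the lowest-order (monopole) term, and omits the tree hierarchy and the higher-order expansions that a real FMM relies on. The function names and the example charges are hypothetical. The point is that a well-separated cluster of sources can be replaced by a single aggregate term, so a distant target needs one evaluation instead of one per source.

```python
import math


def direct_potential(target, sources):
    """Sum q / r over every source: one kernel evaluation per source."""
    return sum(q / math.dist(target, pos) for pos, q in sources)


def monopole_potential(target, sources):
    """Approximate a distant cluster by its total charge at its charge-weighted centre."""
    total = sum(q for _, q in sources)
    centre = tuple(
        sum(q * pos[d] for pos, q in sources) / total for d in range(len(target))
    )
    return total / math.dist(target, centre)


# Four unit charges clustered near the origin, evaluated at a distant target.
cluster = [((0.1, 0.0, 0.0), 1.0), ((-0.1, 0.0, 0.0), 1.0),
           ((0.0, 0.1, 0.0), 1.0), ((0.0, -0.1, 0.0), 1.0)]
target = (10.0, 0.0, 0.0)
exact = direct_potential(target, cluster)
approx = monopole_potential(target, cluster)
relative_error = abs(exact - approx) / exact
```

The approximation improves as the target moves further from the cluster relative to the cluster's size, and keeping more terms of the expansion reduces the error at a fixed separation.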