Reception and Poster Session
Poster submissions welcome from all participants
Instructions: /visitor-folder/contents/workshop.html#poster
Tuesday, January 11, 2011 - 4:30pm - 6:00pm
- Algorithms for Lattice Field Theory at Extreme Scales
Richard Brower (Boston University)
Increases in computational power allow lattice field theories to resolve
smaller scales, but to realize the full benefit for scientific discovery,
new multi-scale algorithms must be developed to maximize efficiency.
Examples of new trends in algorithms include adaptive multigrid solvers
for the quark propagator and an improved symplectic Force Gradient
integrator for the Hamiltonian evolution used to include the quark
contribution to vacuum fluctuations in the quantum path integral. Future
challenges to algorithms and software infrastructure targeting many-core
GPU accelerators and heterogeneous extreme scale computing are discussed.
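As a rough illustration of what "symplectic" buys in Hamiltonian evolution, the sketch below uses the basic second-order leapfrog scheme on a one-dimensional harmonic oscillator. This is not the Force Gradient integrator named in the abstract, which is a higher-order refinement. The function names and the toy Hamiltonian H = p^2/2 + q^2/2 are assumptions made for illustration only.

```python
def leapfrog(q, p, force, dt, steps):
    """Kick-drift-kick leapfrog for H = p^2/2 + V(q), where force(q) = -dV/dq."""
    for _ in range(steps):
        p += 0.5 * dt * force(q)  # half-step momentum update (kick)
        q += dt * p               # full-step position update (drift)
        p += 0.5 * dt * force(q)  # second half-step momentum update (kick)
    return q, p


def energy(q, p):
    """Energy of the toy harmonic oscillator (illustrative assumption)."""
    return 0.5 * p * p + 0.5 * q * q


def harmonic_force(q):
    return -q


if __name__ == "__main__":
    q0, p0 = 1.0, 0.0
    q1, p1 = leapfrog(q0, p0, harmonic_force, dt=0.01, steps=1000)
    print("energy drift:", abs(energy(q1, p1) - energy(q0, p0)))
```

The energy error stays bounded rather than growing with the number of steps, and the scheme is time-reversible. Both properties matter when the trajectory feeds an accept/reject step.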
- Medical Imaging on the GPU Using OpenCL:
3D Surface Extraction and 3D Ultrasound Reconstruction
Anne Elster (Norwegian University of Science and Technology (NTNU))
Collaborators: Frank Linseth, Holger Ludvigsen, Erik Smistad and Thor Kristian Valgerhaug
GPUs offer a lot of compute power enabling real-time processing of images.
This poster depicts some of our group's recent work on image processing for medical
applications on GPUs including 3D surface extraction using marching cubes and 3D ultrasound
reconstruction. We have previously developed Cg and CUDA codes for wavelet transforms and
CUDA codes for surface extraction for seismic images.
- Fast Multipole Methods on large cluster of GPUs
Rio Yokota (Boston University)
The combination of algorithmic acceleration and hardware acceleration can have tremendous impact. The FMM is a fast algorithm for calculating matrix-vector multiplications in O(N) time, and it runs very fast on GPUs. Its combination of a high degree of parallelism and O(N) complexity makes it an attractive solver for the Peta-scale and Exa-scale era. It has a wide range of applications, e.g. quantum mechanics, molecular dynamics, electrostatics, acoustics, structural mechanics, fluid mechanics, and astrophysics.
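The idea behind multipole acceleration can be sketched in a few lines: a well-separated cluster of sources is replaced by a single equivalent source. The example below is not an FMM. It shows only the lowest-order (monopole) far-field approximation for an assumed 1/r kernel in one dimension, and the function names are hypothetical.

```python
def direct_potential(target, sources, charges):
    """Exact sum over all sources: one row of the dense matrix-vector product."""
    return sum(q / abs(target - s) for s, q in zip(sources, charges))


def monopole_potential(target, sources, charges):
    """Far-field approximation: lump all charge at the charge-weighted center.

    Assumes the total charge is nonzero and the target is far from the cluster.
    """
    total = sum(charges)
    center = sum(q * s for s, q in zip(sources, charges)) / total
    return total / abs(target - center)


if __name__ == "__main__":
    sources = [-0.5, 0.0, 0.5]
    charges = [1.0, 2.0, 1.0]
    exact = direct_potential(100.0, sources, charges)
    approx = monopole_potential(100.0, sources, charges)
    print("relative error:", abs(exact - approx) / exact)
```

A full FMM applies such approximations hierarchically, with higher-order expansions and local translations, which is where the O(N) cost comes from.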
- A Domain Decomposition Method that Converges in Two Iterations for any
Subdomain Decomposition and PDE
Martin Gander (Universite de Geneve)
Joint work with Felix Kwok.
All domain decomposition methods are based on a decomposition of the
physical domain into many subdomains and an iteration, which uses
subdomain solutions only (and maybe a coarse grid), in order to
compute an approximate solution of the problem on the entire domain.
We show in this poster that it is possible to formulate such an
iteration, only based on subdomain solutions, which converges in two
steps to the solution of the underlying problem, independently of the
number of subdomains and the PDE solved.
This method is mainly of theoretical interest, since it contains
sophisticated non-local operators (and a natural coarse grid
component), which need to be approximated in order to obtain a
- Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers
Dominik Göddeke (Universität Dortmund), Robert Strzodka (Max-Planck-Institut für Informatik)
We present efficient fine-grained parallelization techniques for robust multigrid solvers and Krylov subspace schemes, in particular for numerically strong smoothing and preconditioning operators. We apply them to sparse ill-conditioned linear systems of equations that arise from grid-based discretization techniques like finite differences, volumes and elements; the systems are notoriously hard to solve due to severe anisotropies in the underlying mesh and differential operator. These strong smoothers are characterized by sequential data dependencies, and do not parallelize in a straightforward manner. For linewise preconditioners, exact parallel algorithms exist, and we present a novel, efficient implementation of a cyclic reduction tridiagonal solver. For other preconditioners, traditional wavefront techniques can be applied, but their irregular and limited parallelism makes them a bad match for GPUs. Therefore, we discuss multicoloring techniques to recover parallelism in these preconditioners, by decoupling some of the dependencies at the expense of initially reduced numerical performance. However, by carefully balancing the coupling strength (more colors) with the parallelization benefits, the multicolored variants retain almost all of the sequential numerical performance. Further improvements are achieved by merging the tridiagonal and Gauß-Seidel approach into a smoothing operator that combines their advantages, and by employing an alternating direction implicit scheme to gain independence of the numbering of the unknowns. Due to their advantageous numerical properties, multigrid solvers equipped with strong smoothers are between four and eight times more efficient than with simple Gauß-Seidel preconditioners, and we achieve speedup factors between six and 18 with the GPU implementations over carefully tuned CPU variants.
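For readers unfamiliar with cyclic reduction, the following is a minimal sequential sketch of the textbook algorithm, not the authors' GPU implementation. It assumes a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with a[0] = 0 and c[n-1] = 0, and assumes no pivoting is needed (for example, a diagonally dominant matrix). The function name is hypothetical.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive cyclic reduction.

    a, b, c are the sub-, main and super-diagonals (a[0] and c[-1] must be 0);
    d is the right-hand side. Returns the solution as a list of floats.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: eliminate even-indexed unknowns from each odd-indexed equation.
    # Every iteration of this loop is independent of the others, which is the
    # source of parallelism on wide hardware.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = -a[i] / b[i - 1]
        has_right = i + 1 < n
        gamma = -c[i] / b[i + 1] if has_right else 0.0
        ra.append(alpha * a[i - 1])
        rb.append(b[i] + alpha * c[i - 1] + (gamma * a[i + 1] if has_right else 0.0))
        rc.append(gamma * c[i + 1] if has_right else 0.0)
        rd.append(d[i] + alpha * d[i - 1] + (gamma * d[i + 1] if has_right else 0.0))

    # Solve the half-size tridiagonal system for the odd-indexed unknowns.
    odd = cyclic_reduction(ra, rb, rc, rd)

    x = [0.0] * n
    for k, i in enumerate(range(1, n, 2)):
        x[i] = odd[k]

    # Back-substitution: each even-indexed unknown depends only on its
    # already-known odd neighbours, so this loop is independent as well.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


if __name__ == "__main__":
    n = 7
    a = [0.0] + [-1.0] * (n - 1)
    b = [2.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [0.0] * (n - 1) + [8.0]
    print(cyclic_reduction(a, b, c, d))
```

The recursion depth is logarithmic in the number of unknowns, in contrast to the inherently sequential sweep of the Thomas algorithm.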
- Global symbolic manipulations and code generation for Finite Elements on SIM[DT] hardware
Hugo Leclerc (École Normale Supérieure de Cachan)
Tools have been developed to generate code to solve partial differential equations from high level descriptions (manipulation of files, global operators, ...). The successive symbolic transformations lead to a macroscopic description of the code to be executed, which can thus be translated into x86 (SSEx), C++ or CUDA code. The point emphasized here is that the different processes can be adapted to the target hardware, taking into account the ratio gflops / gbps (making e.g. the choice between re-computations or cache), the SIM[DT] abilities, ... The poster will present the gains (compared to classical CPU/GPU implementations) for two implementations of a 3D unstructured FEM solver, using respectively a conjugate gradient and a domain decomposition method with repetitive patterns.
- Efficient Uncertainty Quantification using GPUs
Gaurav Gaurav (University of Minnesota, Twin Cities)
Joint work with Steven F. Wojtkiewicz (Department of Civil Engineering, University of Minnesota, Minneapolis, MN 55414, USA).
Graphics processing units (GPUs) have emerged as a much more economical and highly competitive alternative to CPU-based parallel computing. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing equivalents by up to two orders of magnitude in certain applications. Moreover, the portability of the GPUs enables even a desktop computer to provide a teraflop (10^12 floating point operations per second) of computing power. This study presents the gains in computational efficiency obtained using the GPU-based implementations of five types of algorithms frequently used in uncertainty quantification problems arising in the analysis of dynamical systems with uncertain parameters and/or inputs.
- Brain Perfusion: Multi-scale Simulations and Visualization
Leopold Grinberg (Brown University)
Joint work with J. Insley, M. Papka, and G. E. Karniadakis.
Interactions of blood flow in the human brain occur between different
scales, determined by flow features in the large arteries (above 0.5mm
diameter), arterioles, and the capillaries (of 5E-3 mm). To model
multi-scale flow, we develop mathematical models, numerical methods,
scalable solvers and visualization tools. Our poster will present NektarG,
a research code developed at Brown University for continuum and
atomistic simulations. NektarG is based on a high-order spectral/hp
element discretization featuring multi-patch domain decomposition for
continuum flow simulations, and modified DPD-LAMMPS for mesoscopic
simulations. The continuum and atomistic solvers are coupled via a
Multi-level Communicating Interface to exchange data required by interface
conditions. The visualization software is based on ParaView and NektarG
utilities accessed through the ParaView GUI. The new visualization
software can simultaneously present data computed in coupled
(multi-scale) simulations. The software automatically synchronizes the
display of time evolution of solutions at multiple scales.
- The Build to Order Compiler for Matrix Algebra Optimization
Elizabeth Jessup (University of Colorado)
The performance of many high performance computing applications is
limited by data movement from memory to the processor. Often their cost is more
accurately expressed in terms of memory traffic rather than
floating-point operations and, to improve performance, data movement
must be reduced. One technique to reduce memory traffic is the fusion of loops
that access the same data. We have built the Build to Order (BTO) compiler to automate the
fusion of loops in matrix algebra kernels. Loop fusion often produces speedups
proportional to the reduction in memory traffic, but it can also lead to
negative effects in cache and register use. We present the results of experiments
with BTO that help us to understand the workings of loop fusion.
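A small hand-written example of the kind of loop fusion described above. It is not BTO output, and the kernel pair and function names are assumptions chosen for illustration. It computes y = A x and z = A^T w. The unfused version reads the matrix twice, while the fused version reads each entry once and uses it for both results.

```python
def unfused(A, x, w):
    """Two separate kernels: each makes a full pass over the matrix A."""
    n, m = len(A), len(A[0])
    y = [sum(A[i][j] * x[j] for j in range(m)) for i in range(n)]  # pass 1 over A
    z = [sum(A[i][j] * w[i] for i in range(n)) for j in range(m)]  # pass 2 over A
    return y, z


def fused(A, x, w):
    """One fused loop nest: each A[i][j] is loaded once and used twice."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    z = [0.0] * m
    for i in range(n):
        row = A[i]
        acc = 0.0
        for j in range(m):
            aij = row[j]          # single read of the matrix entry
            acc += aij * x[j]     # contribution to y = A x
            z[j] += aij * w[i]    # contribution to z = A^T w
        y[i] = acc
    return y, z


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(fused(A, [1.0, 1.0], [1.0, 0.0, 2.0]))
```

In pure Python the interpreter overhead hides the memory-traffic effect. The example only shows the structural transformation. Note that the fused loop keeps both x and z live in the inner loop, which hints at the cache and register pressure mentioned in the abstract.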
- Digital rocks physics: fluid flow in rocks
Jonas Tölke (Ingrain)
We show how Ingrain's digital rock physics technology works to predict fluid flow properties in rocks.
NVIDIA CUDA technology delivers significant acceleration for this technology.
The simulator on NVIDIA hardware enables us to perform pore scale multi-phase (oil-water-matrix) simulations
in natural porous media and to predict important rock properties like absolute permeability, relative permeabilities, and capillary pressure.
- Hyperspectral Image Analysis for Abundance Estimation using GPUs
Nayda Santiago (University of Puerto Rico)
Hyperspectral images can be used for abundance estimation and anomaly
detection; however, the algorithms involved tend to be I/O intensive.
Parallelizing these algorithms can enable their use in real-time
applications. A method of overcoming these limitations involves
selecting parallelizable algorithms and implementing them using GPUs.
GPUs are designed as throughput engines, built to process large
amounts of dense data in a parallel fashion. RX detectors and
abundance estimators will be parallelized and tested for
correctness and performance.
- Locally-Self-Consistent Multiple-Scattering code (LSMS) for GPUs
Keita Teranishi (CRAY Inc)
Locally-Self-Consistent Multiple-Scattering (LSMS) is one of the major petascale applications and is highly tuned for supercomputer systems such as the Cray XT5 Jaguar. We present our recent effort on porting and tuning the major computational routine of LSMS to GPU-based systems to demonstrate the feasibility of LSMS beyond petaflops. In particular, we discuss the techniques, including auto-tuning of dense matrix kernels and computation-communication overlap.
- GPU Acceleration in a Modern Problem Solving Environment: SCIRun's Linear System Solvers
Miriam Leeser (Northeastern University)
This research demonstrates the incorporation of the GPU's parallel processing architecture into the SCIRun biomedical problem solving environment with minimal changes to the environment or user experience. SCIRun, developed at the University of Utah, allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms present in these simulations. Specifically, we target the linear solver module, which contains multiple solvers that benefit from GPU hardware. We have created a class to accelerate the conjugate gradient, Jacobi and minimal residual linear solvers; the results demonstrate that the GPU can provide acceleration in this environment. A principal focus was to remain transparent by retaining the user-friendly experience for the scientist using SCIRun's graphical user interface. NVIDIA's CUDA C language is used to enable performance on NVIDIA GPUs. Challenges include manipulating the sparse data processed by these algorithms and communicating with the SCIRun interface amidst computation. Our solution makes it possible to implement GPU versions of the existing SCIRun algorithms easily and can be applied to other parallel algorithms in the application. The GPU executes the matrix and vector arithmetic to achieve a speedup of up to 16x on the algorithms in comparison to SCIRun's existing multithreaded CPU implementation. The source code will contain single and double precision versions to utilize a wide variety of GPU hardware and will be incorporated and publicly available in future versions of SCIRun.
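To make concrete which "matrix and vector arithmetic" a conjugate gradient solver consists of, here is a minimal dense, CPU-only sketch of the standard algorithm. It is not SCIRun's code or its GPU class. The helper names are hypothetical, and the matrix is assumed to be symmetric positive definite.

```python
def matvec(A, v):
    """Matrix-vector product: the dominant cost, and what a GPU would offload."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)                       # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs                            # keeps directions A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Each iteration is one matrix-vector product plus a handful of dot products and vector updates, all of which are data-parallel. A production solver would use a sparse matrix format rather than nested lists.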
- Development of Desktop Computing Applications and Engineering Tools on GPUs
Allan Engsig-Karup (Technical University of Denmark)
GPULab, a competence center and laboratory for research and collaboration
within academia and partners in industry, was established in 2008 at the
Section for Scientific Computing, DTU Informatics, Technical University of
Denmark. In GPULab we focus on the utilization of Graphics Processing
Units (GPUs) for high-performance computing applications and software
tools in science and engineering, inverse problems, visualization,
imaging, and dynamic optimization. The goals are to contribute to the
development of new state-of-the-art mathematical models and algorithms for
maximum throughput performance, improved performance profiling tools and
assimilation of results to academic and industrial partners in our
network. Our approach calls for multi-disciplinary skills and
understanding of hardware, software development, profiling tools and
tuning techniques, analytical methods for analysis and development of new
approaches, together with expert knowledge in specific application areas
within science and engineering. We anticipate that our research in the near
future will bring new algorithms and insight in engineering and science
applications targeting practical engineering problems.
- Development of a new massively parallel tool for nonlinear free surface
Allan Engsig-Karup (Technical University of Denmark)
The research objective of this work is to develop a new dedicated and
massively parallel tool for efficient simulation of unsteady nonlinear
free surface waves. The tool will be used for applications in coastal and
offshore engineering, e.g. in connection with prediction of wave
kinematics and forces at or near human-made structures. The tool is based
on a unified potential flow formulation which can account for fully
nonlinear and dispersive wave motion over uneven depths under the
assumptions of nonbreaking waves, irrotational and inviscid flow.
This work is a continuation of earlier work and will continue to
contribute to advancing state-of-the-art for efficient wave simulation.
The tool is expected to be orders of magnitude faster than current tools
due to efficient algorithms and utilization of available hardware
- Preparing Algebraic Multigrid for Exascale
Ulrike Yang (Lawrence Livermore National Laboratory)
Algebraic Multigrid (AMG) solvers are an essential component of many large-scale
scientific simulation codes. Their continued numerical scalability and efficient
implementation are critical for preparing these codes for exascale.
Our experiences on modern multi-core machines show that significant challenges
must be addressed for AMG to perform well on such machines. We discuss our
experiences and describe the techniques we have used to overcome scalability
challenges for AMG on hybrid architectures in preparation for exascale.
- A GPU-accelerated Boundary Element Method and Vortex Particle Method
Mark Stock (Applied Scientific Research)
Vortex particle methods, when combined with multipole-accelerated
boundary element methods (BEM), become a complete tool for direct
numerical simulation (DNS) of internal or external vortex-dominated
flows. In previous work, we presented a method to accelerate the
vorticity-velocity inversion at the heart of vortex particle methods by
performing a multipole treecode N-body method on parallel graphics
hardware. The resulting method achieved a 17-fold speedup over a
dual-core CPU implementation. In the present work, we will demonstrate
both an improved algorithm for the GPU vortex particle method that
outperforms an 8-core CPU by a factor of 43 and a GPU-accelerated
multipole treecode method for the boundary element solution. The new BEM
solves for the unknown source, dipole, or combined strengths over a
triangulated surface using all available CPU cores and GPUs. Problems
with up to 1.4 million unknowns can be solved on a single commodity
desktop computer in one minute, and at that size the hybrid CPU/GPU
outperforms a quad-core CPU alone by 22.5 times. The method is exercised
on DNS of impulsively-started flow over spheres at Re=500, 1000, 2000,