Reception and Poster Session

Poster submissions welcome from all participants

Instructions: /visitor-folder/contents/workshop.html#poster

Tuesday, January 11, 2011 - 4:30pm - 6:00pm
Lind 400
  • Algorithms for Lattice Field Theory at Extreme Scales
    Richard Brower (Boston University)
    Increases in computational power allow lattice field theories to resolve
    smaller scales, but to realize the full benefit for scientific discovery,
    new multi-scale algorithms must be developed to maximize efficiency.
    Examples of new trends in algorithms include adaptive multigrid solvers
    for the quark propagator and an improved symplectic Force Gradient
    integrator for the Hamiltonian evolution used to include the quark
    contribution to vacuum fluctuations in the quantum path integral. Future
    challenges to algorithms and software infrastructure targeting many-core
    GPU accelerators and heterogeneous extreme scale computing are discussed.
  • Medical Imaging on the GPU Using OpenCL: 3D Surface Extraction and 3D Ultrasound Reconstruction
    Anne Elster (Norwegian University of Science and Technology (NTNU))
    Collaborators: Frank Linseth, Holger Ludvigsen, Erik Smistad and Thor Kristian Valgerhaug

    GPUs offer a lot of compute power enabling real-time processing of images.
    This poster depicts some of our group's recent work on image processing for medical
    applications on GPUs including 3D surface extraction using marching cubes and 3D ultrasound
    reconstruction. We have previously developed Cg and CUDA codes for wavelet transforms and
    CUDA codes for surface extraction for seismic images.
  • Fast Multipole Methods on large cluster of GPUs
    Rio Yokota (Boston University)
    The combination of algorithmic acceleration and hardware acceleration can have tremendous impact. The fast multipole method (FMM) is a fast algorithm for calculating matrix-vector multiplications in O(N) time, and it runs very fast on GPUs. Its combination of a high degree of parallelism and O(N) complexity makes it an attractive solver for the petascale and exascale era. It has a wide range of applications, e.g. quantum mechanics, molecular dynamics, electrostatics, acoustics, structural mechanics, fluid mechanics, and astrophysics.
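
    For context, the dense matrix-vector product that the FMM accelerates can be written as a direct pairwise sum, which costs O(N^2). The sketch below shows only that baseline, not the FMM itself. The 1/r kernel, the function name direct_potential, and the 3D point layout are illustrative assumptions, not details taken from the poster.

```python
import math


def direct_potential(points, charges):
    """O(N^2) direct evaluation of phi_i = sum over j != i of q_j / |x_i - x_j|.

    Hypothetical baseline for illustration: this is the dense kernel-matrix
    times charge-vector product that an FMM approximates in O(N) time.
    It is not an FMM implementation.
    """
    n = len(points)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:  # skip the singular self-interaction
                r = math.dist(points[i], points[j])
                phi[i] += charges[j] / r
    return phi


# Two unit charges a distance 2 apart.
example_points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(direct_potential(example_points, [1.0, 1.0]))
```

    Every target visits every source here, which is the quadratic cost that motivates hierarchical methods such as the FMM.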
  • A Domain Decomposition Method that Converges in Two Iterations for any Subdomain Decomposition and PDE
    Martin Gander (Université de Genève)
    Joint work with Felix Kwok.

    All domain decomposition methods are based on a decomposition of the
    physical domain into many subdomains and an iteration, which uses
    subdomain solutions only (and maybe a coarse grid), in order to
    compute an approximate solution of the problem on the entire domain.
    We show in this poster that it is possible to formulate such an
    iteration, only based on subdomain solutions, which converges in two
    steps to the solution of the underlying problem, independently of the
    number of subdomains and the PDE solved.
    This method is mainly of theoretical interest, since it contains
    sophisticated non-local operators (and a natural coarse grid
    component), which need to be approximated in order to obtain a
    practical method.
  • Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers
    Dominik Göddeke (Universität Dortmund), Robert Strzodka (Max-Planck-Institut für Informatik)
    We present efficient fine-grained parallelization techniques for robust multigrid solvers and Krylov subspace schemes, in particular for numerically strong smoothing and preconditioning operators. We apply them to sparse ill-conditioned linear systems of equations that arise from grid-based discretization techniques like finite differences, finite volumes and finite elements; the systems are notoriously hard to solve due to severe anisotropies in the underlying mesh and differential operator. These strong smoothers are characterized by sequential data dependencies, and do not parallelize in a straightforward manner. For linewise preconditioners, exact parallel algorithms exist, and we present a novel, efficient implementation of a cyclic reduction tridiagonal solver. For other preconditioners, traditional wavefront techniques can be applied, but their irregular and limited parallelism makes them a bad match for GPUs. Therefore, we discuss multicoloring techniques to recover parallelism in these preconditioners, by decoupling some of the dependencies at the expense of initially reduced numerical performance. However, by carefully balancing the coupling
    strength (more colors) with the parallelization benefits, the multicolored variants retain almost all of the sequential numerical
    performance. Further improvements are achieved by merging the tridiagonal and Gauß-Seidel approach into a smoothing operator that
    combines their advantages, and by employing an alternating direction implicit scheme to gain independence of the numbering of the unknowns. Due to their advantageous numerical properties, multigrid solvers equipped with strong smoothers are between four and eight times more efficient than those with simple Gauß-Seidel preconditioners, and we achieve speedup factors between six and 18 with the GPU implementations over carefully tuned CPU variants.
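
    The multicoloring idea can be illustrated on a much simpler problem than the anisotropic systems targeted by the poster. The sketch below assumes a 1D Poisson equation -u'' = f with fixed boundary values and two colors (red-black ordering); the function name and the problem setup are hypothetical. Points of one color depend only on points of the other color, so each inner loop could in principle be executed fully in parallel.

```python
def red_black_gauss_seidel(u, f, h, sweeps):
    """Two-color Gauss-Seidel for -u'' = f on a uniform 1D grid.

    u      : list of grid values; u[0] and u[-1] hold the boundary values
    f      : right-hand side sampled at the same grid points
    h      : grid spacing
    sweeps : number of full (red + black) sweeps

    Illustrative sketch only: a 1D model problem, not the poster's solver.
    """
    n = len(u)
    for _ in range(sweeps):
        # start = 1 updates odd interior points, start = 2 updates even ones.
        for start in (1, 2):
            # Each update reads only neighbours of the other color (or the
            # boundary), so the iterations of this loop are independent.
            for i in range(start, n - 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


# Laplace problem with u(0) = 0 and u(1) = 1; the solution is a straight line.
grid = [0.0] * 9
grid[-1] = 1.0
print(red_black_gauss_seidel(grid, [0.0] * 9, 1.0 / 8.0, 200))
```

    With more than two colors the same pattern applies: one parallel loop per color, executed color by color.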
  • Global symbolic manipulations and code generation for Finite Elements on SIM[DT] hardware
    Hugo Leclerc (École Normale Supérieure de Cachan)
    Tools have been developed to generate code to solve partial differential equations from high-level descriptions (manipulation of files, global operators, ...). The successive symbolic transformations lead to a macroscopic description of the code to be executed, which can thus be translated into x86 (SSEx), C++ or CUDA code. The point emphasized here is that the different processes can be adapted to the target hardware, taking into account the ratio gflops / gbps (making e.g. the choice between re-computations or cache), the SIM[DT] abilities, ... The poster will present the gains (compared to classical CPU/GPU implementations) for two implementations of a 3D unstructured FEM solver, using respectively a conjugate gradient and a domain decomposition method with repetitive patterns.
  • Efficient Uncertainty Quantification using GPUs
    Gaurav Gaurav (University of Minnesota, Twin Cities)
    Joint work with Steven F. Wojtkiewicz (Department of Civil Engineering, University of Minnesota, Minneapolis, MN 55414, USA).

    Graphics processing units (GPUs) have emerged as a much more economical and highly competitive alternative to CPU-based parallel computing. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing equivalents by up to two orders of magnitude in certain applications. Moreover, the portability of the GPUs enables even a desktop computer to provide a teraflop (10^12 floating point operations per second) of computing power. This study presents the gains in computational efficiency obtained using the GPU-based implementations of five types of algorithms frequently used in uncertainty quantification problems arising in the analysis of dynamical systems with uncertain parameters and/or inputs.
  • Brain Perfusion: Multi-scale Simulations and Visualization
    Leopold Grinberg (Brown University)
    Joint work with J. Insley, M. Papka, and G. E. Karniadakis.

    Interactions of blood flow in the human brain occur between different
    scales, determined by flow features in the large arteries (above 0.5mm
    diameter), arterioles, and the capillaries (of 5E-3 mm). To simulate such
    multi-scale flow we develop mathematical models, numerical methods,
    scalable solvers and visualization tools. Our poster will present NektarG,
    a research code developed at Brown University for continuum and
    atomistic simulations. NektarG is based on a high-order spectral/hp
    element discretization featuring multi-patch domain decomposition for
    continuum flow simulations, and modified DPD-LAMMPS for mesoscopic
    simulations. The continuum and atomistic solvers are coupled via a
    Multi-level Communicating Interface to exchange data required by interface
    conditions. The visualization software is based on ParaView and NektarG
    utilities accessed through the ParaView GUI. The new visualization
    software allows data computed in coupled (multi-scale) simulations to be
    presented simultaneously. The software automatically synchronizes the
    display of time evolution of solutions at multiple scales.
  • The Build to Order Compiler for Matrix Algebra Optimization
    Elizabeth Jessup (University of Colorado)
    The performance of many high performance computing applications is
    limited by data movement from memory to the processor. Often their cost is more
    accurately expressed in terms of memory traffic rather than
    floating-point operations and, to improve performance, data movement
    must be reduced. One technique to reduce memory traffic is the fusion of loops
    that access the same data. We have built the Build to Order (BTO) compiler to automate the
    fusion of loops in matrix algebra kernels. Loop fusion often produces speedups
    proportional to the reduction in memory traffic, but it can also lead to
    negative effects in cache and register use. We present the results of experiments
    with BTO that help us to understand the workings of loop fusion.
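
    As a small illustration of the idea (not BTO's generated code), the sketch below fuses two matrix-vector products, y = A x and z = A^T w, that read the same matrix. The function names and the choice of this kernel pair are assumptions made for the example. The fused version touches each element of A once instead of twice; in pure Python this only shows the loop structure, not the memory-traffic savings a compiled kernel would see.

```python
def unfused(A, x, w):
    """Compute y = A x and z = A^T w with two separate passes over A."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    z = [0.0] * m
    for i in range(n):
        for j in range(m):
            z[j] += A[i][j] * w[i]
    return y, z


def fused(A, x, w):
    """Same results, but both products share a single pass over A."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    z = [0.0] * m
    for i in range(n):
        for j in range(m):
            a = A[i][j]  # loaded once, used by both products
            y[i] += a * x[j]
            z[j] += a * w[i]
    return y, z


A = [[1.0, 2.0], [3.0, 4.0]]
print(unfused(A, [1.0, 1.0], [1.0, 2.0]))
print(fused(A, [1.0, 1.0], [1.0, 2.0]))
```

    The fused loop keeps two output vectors live at once, which hints at the cache and register pressure mentioned above.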

  • Digital rocks physics: fluid flow in rocks
    Jonas Tölke (Ingrain)
    We show how Ingrain's digital rock physics technology works to predict fluid flow properties in rocks.
    NVIDIA CUDA technology delivers significant acceleration for this technology.
    The simulator on NVIDIA hardware enables us to perform pore scale multi-phase (oil-water-matrix) simulations
    in natural porous media and to predict important rock properties like absolute permeability, relative permeabilities, and capillary pressure.
  • Hyperspectral Image Analysis for Abundance Estimation using GPUs
    Nayda Santiago (University of Puerto Rico)
    Hyperspectral images can be used for abundance estimation and anomaly
    detection; however, the algorithms involved tend to be I/O intensive.
    Parallelizing these algorithms can enable their use in real-time
    applications. A method of overcoming these limitations involves
    selecting parallelizable algorithms and implementing them using GPUs.
    GPUs are designed as throughput engines, built to process large
    amounts of dense data in a parallel fashion. RX detectors and
    abundance estimators will be parallelized and tested for
    correctness and performance.
  • Locally-Self-Consistent Multiple-Scattering code (LSMS) for GPUs
    Keita Teranishi (CRAY Inc)
    Locally-Self-Consistent Multiple-Scattering (LSMS) is one of the major petascale applications and is highly tuned for supercomputer systems like the Cray XT5 Jaguar. We present our recent effort on porting and tuning the major computational routine of LSMS to GPU-based systems to demonstrate the feasibility of LSMS beyond petaflops. In particular, we discuss the techniques, including auto-tuning of dense matrix kernels and computation-communication overlap.
  • GPU Acceleration in a Modern Problem Solving Environment: SCIRun's Linear System Solvers
    Miriam Leeser (Northeastern University)
    This research demonstrates the incorporation of the GPU's parallel processing architecture into the SCIRun biomedical problem solving environment with minimal changes to the environment or user experience. SCIRun, developed at the University of Utah, allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms present in these simulations. Specifically, we target the linear solver module, which contains multiple solvers that benefit from GPU hardware. We have created a class to accelerate the conjugate gradient, Jacobi and minimal residual linear solvers; the results demonstrate that the GPU can provide acceleration in this environment. A principal focus was to remain transparent by retaining the user-friendly experience for the scientist using SCIRun's graphical user interface. NVIDIA's CUDA C language is used to enable performance on NVIDIA GPUs. Challenges include manipulating the sparse data processed by these algorithms and communicating with the SCIRun interface during computation. Our solution makes it possible to implement GPU versions of the existing SCIRun algorithms easily and can be applied to other parallel algorithms in the application. The GPU executes the matrix and vector arithmetic to achieve speedups of up to 16x on the algorithms in comparison to SCIRun's existing multithreaded CPU implementation. The source code will contain single and double precision versions to utilize a wide variety of GPU hardware and will be incorporated and publicly available in future versions of SCIRun.
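
    The conjugate gradient method named above consists almost entirely of matrix-vector products, dot products, and vector updates, which is the kind of matrix and vector arithmetic the abstract describes running on the GPU. The following CPU-only sketch is not SCIRun code: it assumes a small dense symmetric positive definite matrix stored as nested lists, whereas the poster deals with sparse data, and all function names are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product; the dominant cost per iteration."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A (illustrative sketch)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break        # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

    Each iteration needs one matrix-vector product and a handful of vector operations, all of which are data-parallel.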
  • Development of Desktop Computing Applications and Engineering Tools on GPUs
    Allan Engsig-Karup (Technical University of Denmark)
    GPULab, a competence center and laboratory for research and collaboration
    within academia and with partners in industry, was established in 2008 at
    the Section for Scientific Computing, DTU Informatics, Technical University of
    Denmark. In GPULab we focus on the utilization of Graphics Processing
    Units (GPUs) for high-performance computing applications and software
    tools in science and engineering, inverse problems, visualization,
    imaging, and dynamic optimization. The goals are to contribute to the
    development of new state-of-the-art mathematical models and algorithms for
    maximum throughput performance, improved performance profiling tools and
    assimilation of results to academic and industrial partners in our
    network. Our approach calls for multi-disciplinary skills and
    understanding of hardware, software development, profiling tools and
    tuning techniques, analytical methods for analysis and development of new
    approaches, together with expert knowledge in specific application areas
    within science and engineering. We anticipate that our research in the near
    future will bring new algorithms and insight into engineering and science
    applications targeting practical engineering problems.
  • Development of a new massively parallel tool for nonlinear free surface wave simulation
    Allan Engsig-Karup (Technical University of Denmark)
    The research objective of this work is to develop a new dedicated and
    massively parallel tool for efficient simulation of unsteady nonlinear
    free surface waves. The tool will be used for applications in coastal and
    offshore engineering, e.g. in connection with prediction of wave
    kinematics and forces at or near human-made structures. The tool is based
    on a unified potential flow formulation which can account for fully
    nonlinear and dispersive wave motion over uneven depths under the
    assumptions of nonbreaking waves, irrotational and inviscid flow.
    This work is a continuation of earlier work and will continue to
    contribute to advancing the state of the art for efficient wave simulation.
    The tool is expected to be orders of magnitude faster than current tools
    due to efficient algorithms and utilization of available hardware.

  • Preparing Algebraic Multigrid for Exascale
    Ulrike Yang (Lawrence Livermore National Laboratory)
    Algebraic Multigrid (AMG) solvers are an essential component of many large-scale
    scientific simulation codes. Their continued numerical scalability and efficient
    implementation are critical for preparing these codes for exascale.
    Our experiences on modern multi-core machines show that significant challenges
    must be addressed for AMG to perform well on such machines. We discuss our
    experiences and describe the techniques we have used to overcome scalability
    challenges for AMG on hybrid architectures in preparation for exascale.
  • A GPU-accelerated Boundary Element Method and Vortex Particle Method
    Mark Stock (Applied Scientific Research)
    Vortex particle methods, when combined with multipole-accelerated
    boundary element methods (BEM), become a complete tool for direct
    numerical simulation (DNS) of internal or external vortex-dominated
    flows. In previous work, we presented a method to accelerate the
    vorticity-velocity inversion at the heart of vortex particle methods by
    performing a multipole treecode N-body method on parallel graphics
    hardware. The resulting method achieved a 17-fold speedup over a
    dual-core CPU implementation. In the present work, we will demonstrate
    both an improved algorithm for the GPU vortex particle method that
    outperforms an 8-core CPU by a factor of 43 and a GPU-accelerated
    multipole treecode method for the boundary element solution. The new BEM
    solves for the unknown source, dipole, or combined strengths over a
    triangulated surface using all available CPU cores and GPUs. Problems
    with up to 1.4 million unknowns can be solved on a single commodity
    desktop computer in one minute, and at that size the hybrid CPU/GPU
    solver outperforms a quad-core CPU alone by a factor of 22.5. The method is exercised
    on DNS of impulsively-started flow over spheres at Re=500, 1000, 2000,
    and 4000.