BEGIN:VCALENDAR
VERSION:2.0
X-WR-TIMEZONE:America/Chicago
PRODID:-//Apple Inc.//iCal 3.0//EN
CALSCALE:GREGORIAN
X-WR-CALNAME:3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
METHOD:PUBLISH
BEGIN:VTIMEZONE
TZID:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
DTSTART:20070311T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
TZNAME:CDT
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
DTSTART:20071104T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
TZNAME:CST
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
SEQUENCE:2
DTSTART;TZID=America/Chicago:20101116T140000
DESCRIPTION:ABSTRACT: Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high\, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute\, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and 1D-temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism\, and scales near-linearly with the SIMD width and multiple cores. We are faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. For the 7-point stencil\, we are 1.5X faster on CPUs and 1.8X faster on GPUs for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods\, we are 2.1X faster on CPUs.
UID:pap249@sc10.supercomputing.org
SUMMARY:3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
DTEND;TZID=America/Chicago:20101116T143000
LOCATION:391-392
END:VEVENT
END:VCALENDAR
