ABSTRACT: Performance per watt is the new performance. In today's power-limited regime, GPU computing offers significant advantages in performance and energy efficiency. In this regime, performance derives from parallelism and efficiency derives from locality. Current GPUs provide both, with up to 512 cores per chip and an explicitly managed memory hierarchy. This talk will review the current state of GPU computing and discuss how we plan to address the challenges of ExaScale computing. Achieving an ExaFLOPS of sustained performance within a 20 MW power envelope requires power reductions well beyond what technology scaling alone will provide. Efficient processor design, combined with aggressive exploitation of locality, is expected to address this power challenge. A focus on vertical rather than horizontal locality simplifies many issues, including load balance, placement, and dynamic workloads. Efficient mechanisms for communication, synchronization, and thread management will be needed to deliver the strong scaling behind the 10^10-thread parallelism required to sustain an ExaFLOPS on reasonably sized problems. Resilience will be achieved through a combination of hardware mechanisms and an API that allows programs to specify when and where protection is required. Programming systems will evolve to improve programmer productivity with a global address space and global data abstractions, while improving efficiency via machine-independent abstractions for locality.
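
The abstract's headline numbers can be checked with a quick back-of-envelope calculation. The sketch below is illustrative only: the per-thread throughput figure is an assumption chosen for round numbers, not a figure from the talk.

```python
# Back-of-envelope arithmetic behind the ExaScale targets in the abstract.

EXAFLOPS = 1e18          # target sustained rate, FLOP/s
POWER_BUDGET_W = 20e6    # 20 MW power envelope

# Required energy efficiency: 10^18 FLOP/s in 20 MW means
# 50 GFLOPS/W, i.e. an all-in budget of 20 pJ per operation.
flops_per_watt = EXAFLOPS / POWER_BUDGET_W
pj_per_flop = 1e12 / flops_per_watt
print(f"{flops_per_watt / 1e9:.0f} GFLOPS/W, {pj_per_flop:.0f} pJ/FLOP")

# Thread count: assume each thread sustains ~10^8 FLOP/s
# (hypothetical figure: ~1 GHz issue rate, heavily interleaved
# with other threads to hide memory latency).
ASSUMED_FLOPS_PER_THREAD = 1e8
threads = EXAFLOPS / ASSUMED_FLOPS_PER_THREAD
print(f"~{threads:.0e} concurrent threads")  # ~1e+10
```

Under these assumptions the 20 pJ/FLOP budget must cover not just the arithmetic unit but all data movement, which is why the abstract emphasizes locality as the source of efficiency.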