Multidimensional Reduction Algorithm on Multicore Architectures
SESSION: Student Poster Reception
EVENT TYPE: Poster, ACM Student Poster
TIME: 5:15PM - 7:00PM
ABSTRACT: Today’s applications are driven toward larger simulations and larger datasets. This has pushed the size of high performance computers from tens of thousands of cores into the realm of hundreds of thousands. More and more of the algorithms these simulations rely on depend upon highly efficient collective communications. Another trend is that the power and thermal barriers have caused CPU vendors to abandon traditional frequency improvements and to rely on multicore architectures to sustain the expected performance progression. This type of architecture imposes another burden upon the programmer.
The MPI library provides the programmer not only with the tools necessary for passing messages within the system; during initialization, MPI also creates a mapping between each process and a physical core. Although this is a convenient feature that relieves the programmer of the mapping work, something most programmers are glad not to have to consider, problems remain. MPI is unaware not only of the structure of the computation that will take place but also of the physical topology of the system. The provided mapping can therefore, at times, prevent an HPC application from reaching its full potential.
It is easy to see that the physical location of the cores influences communication performance. For example, two cores that share a cache and are located on the same processor should communicate faster than two cores located on different machines and connected through a network. Depending on the size of the messages and the structure of the communication needed for the computation, it may be advantageous to create a new mapping of the processes.
An integral part of our research in diskless checkpointing has focused on optimizing two forms of collective communication (reduce and broadcast) using MPI on systems with a multicore architecture. We have further narrowed our focus to the pipelined version, under the assumption that the nature of diskless checkpointing yields messages of considerable size. Testing has revealed that the standard mapping provided by MPI initialization on a multicore system is not optimal, because it does not take into account how many times a specific message must pass beyond the boundary of a single node. Our implementation shows that by distinguishing between inter- and intra-node communication, and by minimizing the number of times a message has to cross the border of a node, significant improvements can be achieved. In the one-dimensional case the solution, mapping the processes such that they fill up the cores of one node at a time, is fairly obvious and provides the best performance. When the reduction or broadcast is performed over an n-dimensional grid, the mapping becomes more complex.
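The one-dimensional idea can be sketched as follows. This is an illustrative model, not the poster’s implementation: the node count, core count, and the assumption that the default mapping is round-robin across nodes are all hypothetical, chosen only to show why filling one node at a time reduces inter-node traffic in a linear pipeline.

```python
# Sketch (hypothetical parameters): compare a round-robin process
# placement with a blocked placement that fills one node at a time,
# counting how many pipeline messages (rank i -> rank i+1) must
# cross a node boundary.

def block_mapping(num_procs, cores_per_node):
    """Map rank -> (node, core), filling the cores of one node at a time."""
    return {rank: (rank // cores_per_node, rank % cores_per_node)
            for rank in range(num_procs)}

def inter_node_hops(mapping, num_procs):
    """Count pipeline hops rank i -> i+1 whose endpoints sit on different nodes."""
    return sum(1 for i in range(num_procs - 1)
               if mapping[i][0] != mapping[i + 1][0])

# 16 processes, 4 nodes with 4 cores each (assumed figures).
round_robin = {rank: (rank % 4, rank // 4) for rank in range(16)}
blocked = block_mapping(16, cores_per_node=4)

print(inter_node_hops(round_robin, 16))  # every hop crosses nodes: 15
print(inter_node_hops(blocked, 16))      # only at node boundaries: 3
```

With the blocked mapping, a large pipelined message crosses the network only between nodes 0–1, 1–2, and 2–3; every other hop stays inside a node’s shared memory.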
We have introduced the concept of tiling, where each node is viewed and interpreted as a tile. The shape of the tile depends not only on the type of communication but also on the size and shape of the grid in which the communication occurs. We thereby reduce the problem to determining the tile shape that covers the grid with optimal communication performance. We have implemented an algorithm that automatically re-maps the processes, providing an optimal communicator given the size and structure of the grid and the number of processes. The implementation automatically determines the number of cores per node and finds an optimal shape for a single tile. Our implementation is built on Open MPI, and the experimental evaluation demonstrates significant improvements. The research has so far explored only a very limited case; it needs to be expanded to include other MPI collective communications as well as additional message-passing structures beyond the plain linear case, such as the binomial tree.
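A minimal sketch of the tile-shape search for the two-dimensional case is given below. It is not the poster’s algorithm: the cost function, the number of cut grid edges, is only an illustrative proxy for the number of messages that cross a node boundary, and the grid and core counts are assumed.

```python
# Sketch (illustrative proxy, not the poster's implementation): choose a
# tile shape a x b with a * b == cores_per_node that covers a p x q
# process grid while cutting as few nearest-neighbor grid edges as
# possible. Each cut edge stands for communication that must leave a node.

def candidate_tiles(cores_per_node):
    """All factorizations a * b == cores_per_node."""
    return [(a, cores_per_node // a)
            for a in range(1, cores_per_node + 1)
            if cores_per_node % a == 0]

def cut_edges(p, q, a, b):
    """4-neighbor grid edges that cross a tile boundary when a x b tiles cover p x q."""
    return p * (q // b - 1) + q * (p // a - 1)

def best_tile(p, q, cores_per_node):
    """Pick the tile shape that covers the grid exactly and cuts the fewest edges."""
    tiles = [(a, b) for a, b in candidate_tiles(cores_per_node)
             if p % a == 0 and q % b == 0]
    return min(tiles, key=lambda t: cut_edges(p, q, *t))

# An 8x8 process grid with 4 cores per node: a square 2x2 tile
# cuts fewer grid edges than the 1x4 or 4x1 alternatives.
print(best_tile(8, 8, 4))  # -> (2, 2)
```

Under this proxy the 2×2 tile cuts 48 edges versus 64 for either elongated shape, which matches the intuition that a compact tile keeps more of each process’s neighbors on the same node.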
The poster highlights the problem with the MPI initialization by showing a non-optimal process mapping on a two-dimensional grid. It further illustrates the tiling and process re-mapping algorithm. Initial results for the re-mapping in the two-dimensional case will be presented, together with ideas for applying it to other cases.