Program-level performance optimizations for multicore-multiprocessors

The memory system of many of today's multicore multiprocessors (systems built with multiple multicore processors) has two main performance problems. The first concerns main memory accesses. Because each processor in these systems has an on-chip memory controller (MC), a part of the main memory is directly connected to each processor. Since every processor must also be able to access the main memory attached to the other processors, the processors are connected by a cross-chip interconnect (IC). Transferring data through this interconnect adds overhead relative to going through the local on-chip memory controller, so remote main memory accesses have a much higher latency than local main memory accesses. The second problem concerns caching. Although every core can access all last-level caches in the machine, obtaining the latest copy of a piece of data from a remote cache incurs a significant overhead relative to obtaining it from the local cache, because remote cache accesses also travel over the cross-chip interconnect.
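
As a concrete illustration, the following minimal sketch (assuming Linux with libnuma; calls such as numa_alloc_onnode are libnuma functions and are not part of the text above) places a buffer in the memory attached to one particular processor. Whether an access to this buffer is local or remote then depends on which node the accessing thread runs on.

    // Sketch (assumes Linux with libnuma; compile with -lnuma): placing a buffer
    // on a specific NUMA node. A thread running on a core of node 0 reaches this
    // buffer through the local on-chip memory controller; a thread on another
    // node has to cross the inter-processor interconnect, which is what makes
    // the access "remote" and slower.
    #include <numa.h>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "this system has no NUMA support\n");
            return 1;
        }
        const std::size_t size = 64u * 1024 * 1024;   // 64 MiB buffer
        void* buf = numa_alloc_onnode(size, 0);       // physically backed by node 0
        if (buf == nullptr) return 1;
        std::memset(buf, 0, size);                    // touch the pages so they are really allocated
        std::printf("buffer placed on node 0 of %d node(s)\n", numa_max_node() + 1);
        numa_free(buf, size);
        return 0;
    }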

For good performance of multithreaded programs, the factors that increase memory access latency must be reduced, that is, both the number of remote main memory accesses and the number of remote cache accesses must be kept low. A good way to reduce the number of remote memory accesses of a program is programmer intervention: if the programmer distributes data and computations in the system so that each computation accesses only data that are local to it, high-latency remote memory accesses are avoided and performance improves. Unfortunately, current system software support for programming multicore multiprocessors is limited; the project described below therefore targets enhancing memory allocation and task scheduling for these systems to achieve better performance.
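
A minimal sketch of such programmer intervention, again assuming Linux with libnuma (the partitioning and the worker function are illustrative only, not taken from the text): one thread per NUMA node, each pinned to its node and computing only on a partition allocated from that node's local memory, so the computation itself causes no remote accesses.

    // Sketch (Linux + libnuma; compile with -lnuma -pthread): co-locating data
    // and computation. Each worker is pinned to one NUMA node and works only on
    // a partition that is allocated from that node's local memory.
    #include <numa.h>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void worker(int node, std::size_t n) {
        numa_run_on_node(node);                              // pin this thread to cores of `node`
        double* part = static_cast<double*>(numa_alloc_onnode(n * sizeof(double), node));
        if (part == nullptr) return;
        for (std::size_t i = 0; i < n; ++i) part[i] = 1.0;   // initialize locally
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += part[i];  // all accesses stay on-node
        std::printf("node %d: partial sum %.1f\n", node, sum);
        numa_free(part, n * sizeof(double));
    }

    int main() {
        if (numa_available() < 0) return 1;
        const int nodes = numa_num_configured_nodes();
        std::vector<std::thread> threads;
        for (int node = 0; node < nodes; ++node)
            threads.emplace_back(worker, node, std::size_t(1) << 20);
        for (auto& t : threads) t.join();
        return 0;
    }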

The goal of this project is to extend an existing task-parallel framework with support for today's multicore multiprocessors. The extension allows programmers to express the preferred processor on which parallel tasks are executed. The task-parallel framework can then decide at run time whether to honor the programmer's preferences, balance load instead, or do both. A possible starting point for this project is the Intel Threading Building Blocks (TBB) task-parallel framework.
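
For illustration only, here is a minimal sketch of how such a per-node preference could look to the programmer, written against the NUMA-aware task_arena API of recent oneTBB releases (tbb::info::numa_nodes and task_arena::constraints are oneTBB names, not part of the project description; NUMA detection additionally requires oneTBB to find hwloc at run time). A run-time system as envisioned in the project could honor such hints or override them to balance load.

    // Sketch: expressing a NUMA-node preference for where tasks run, using
    // oneTBB's NUMA-constrained task_arena API. This shows one possible shape
    // of the "preferred processor" hint, not the project's actual design.
    #include <oneapi/tbb/info.h>
    #include <oneapi/tbb/task_arena.h>
    #include <oneapi/tbb/task_group.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // One arena per NUMA node; worker threads of arena i are constrained to node i.
        std::vector<tbb::numa_node_id> nodes = tbb::info::numa_nodes();
        std::vector<tbb::task_arena> arenas(nodes.size());
        std::vector<tbb::task_group> groups(nodes.size());
        for (std::size_t i = 0; i < nodes.size(); ++i)
            arenas[i].initialize(tbb::task_arena::constraints(nodes[i]));

        // The "preference": submit the work for partition i to the arena of node i.
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            arenas[i].execute([&, i] {
                groups[i].run([i] {
                    std::printf("partition %zu runs on its preferred node\n", i);
                });
            });
        }
        // Wait for all per-node task groups to finish.
        for (std::size_t i = 0; i < nodes.size(); ++i)
            arenas[i].execute([&, i] { groups[i].wait(); });
        return 0;
    }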
