Although my previous CUDA posts have compared CPU and GPU performance, an interesting alternative is to split your problem up and use both chips at the same time. Even if your task is not ideally suited to the GPU, you can still cut your total elapsed time by offloading some of the work to it. The trick is to figure out which parts to run where, and to do so automatically. This is similar to the problem of automatic parallelization, which Wikipedia describes as “a grand challenge”, but with the added complication of heterogeneous systems.
Below are the abstract of, and a link to, a paper describing an attempt to do exactly that automatically:
Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping
Chi-Keung Luk
Sunpyo Hong
Hyesoon Kim
Abstract
Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify the mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental programming system called Qilin. Our results show that, by judiciously distributing works over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption than static mappings on average for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
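To make the idea of choosing a CPU/GPU split concrete, here is a minimal sketch of one way a mapping could be picked from measurements. It is not Qilin's code or API. The function names (fit_line, predicted_time, best_split) are hypothetical, and the sketch assumes that each device's running time grows roughly linearly with the amount of work, that timings come from a few small trial runs, and that the two devices run concurrently, so the elapsed time is whichever finishes last.

```python
def fit_line(sizes, times):
    """Least-squares fit of times ~ a + b * size; returns (a, b)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predicted_time(model, n):
    """Predicted time for n work items; a device given no work costs nothing."""
    a, b = model
    return a + b * n if n > 0 else 0.0


def best_split(cpu_model, gpu_model, total, steps=1000):
    """Scan CPU fractions in [0, 1] and return (cpu_fraction, elapsed).

    Assumes both devices work at the same time, so the elapsed time
    is the maximum of the two predicted times.
    """
    best = None
    for i in range(steps + 1):
        frac = i / steps
        cpu_n = frac * total
        gpu_n = total - cpu_n
        elapsed = max(predicted_time(cpu_model, cpu_n),
                      predicted_time(gpu_model, gpu_n))
        if best is None or elapsed < best[1]:
            best = (frac, elapsed)
    return best


# Hypothetical trial timings (work items -> seconds); not real measurements.
cpu_model = fit_line([100, 200, 400], [0.30, 0.60, 1.20])
gpu_model = fit_line([100, 200, 400], [0.15, 0.25, 0.45])
cpu_fraction, elapsed = best_split(cpu_model, gpu_model, total=10_000)
```

Because the models are refitted from fresh timings, rerunning the fit on a different machine or for a different problem size yields a different split, which is the "adaptive" part. The brute-force scan is chosen for clarity; with two straight lines the crossing point could also be solved for directly.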