clOpenCL - cluster OpenCL
The project "Portability and Performance in Heterogeneous Many Core Systems", Ref. PTDC/EIA-EIA/100035/2008, was an R&D project funded by the Portuguese Foundation for Science and Technology (FCT) and developed from May 2010 to June 2013. The main contributions initially planned were: i) a better understanding of the addressed computing paradigms (for CPUs and specialized coprocessors) and their suitability for interactive global illumination (IGI); ii) development and assessment of a performance model and scheduling mechanism for heterogeneous many-core distributed systems; iii) a specification of, and support for, a distributed memory model on top of OpenCL; iv) design, evaluation and dissemination of an efficient, adaptive and portable OpenCL IGI engine. Contribution iii) was the main target of a specific task of the project (T5 - Distributed Memory), and clOpenCL (cluster OpenCL) is the main byproduct of that task.
Clusters of heterogeneous computing nodes provide an opportunity to significantly increase the performance of parallel and High-Performance Computing (HPC) applications, by combining traditional multi-core CPUs with accelerator devices, interconnected by high-throughput, low-latency networking technologies. However, developing efficient applications for clusters that integrate GPUs and other accelerators often requires great effort, since programmers must follow complex development methodologies to adapt algorithms and applications to the new heterogeneous parallel environment. OpenCL is an open programming standard for heterogeneous computing. It suffers, however, from a major limitation: applications can only use the local compute devices present on a single machine. clOpenCL (cluster OpenCL) extends the original single-node OpenCL model by enabling the deployment and execution of OpenCL applications in clusters of heterogeneous nodes. Compared with similar projects, clOpenCL has two advantages: i) it is able to take full advantage of commodity networking hardware through Open-MX, and ii) programmers/users need neither special privileges nor exclusive access to scarce resources (e.g., accelerators) to deploy the desired running environment.
Figure 1 - clOpenCL (a) operation model, (b) host application layers.
An OpenCL application comprises a host program and a set of kernels intended to run on compute devices. Figure 1 presents a) the clOpenCL operation model, where a single host program interacts with multiple compute devices (local or remote), and b) the software/hardware layers present in the host program. Every call to an OpenCL primitive is intercepted by the clOpenCL wrapper library, which redirects its execution either to a specific clOpenCL daemon at a cluster node or to the local OpenCL runtime. clOpenCL daemons are simple OpenCL programs that handle remote calls and interact with local devices. Each user spawns their own set of daemons, possibly sharing particular compute devices with other cluster users. A typical clOpenCL application starts at a particular cluster node and creates OpenCL objects in cluster nodes. For each object, the clOpenCL wrapper library returns a "fake pointer" used as a global identifier, and stores the real pointer along with the corresponding daemon location. The exchange of data between the wrapper library and remote daemons uses Open-MX, an open-source message passing stack over generic Ethernet, which provides low-level communication mechanisms in user space and achieves low-latency communication with low CPU overhead.
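The "fake pointer" bookkeeping described above can be illustrated with a minimal C sketch. This is not the clOpenCL source: the table layout, the names `object_entry`, `register_object`, `resolve_object`, `release_object`, the fixed capacity, and the use of an integer node index as the "daemon location" are all assumptions made for illustration. The sketch only shows why a global identifier is useful: two daemons on different nodes may hand back the same real pointer value, so the wrapper needs its own unique handle per object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJECTS 64     /* hypothetical fixed capacity */
#define LOCAL_NODE  (-1)   /* hypothetical marker: object lives in the local OpenCL runtime */

/* One entry per OpenCL object created through the (hypothetical) wrapper. */
typedef struct {
    int       in_use;
    int       node;        /* daemon location: node index, or LOCAL_NODE */
    uintptr_t real_ptr;    /* pointer value as seen by the owning runtime or daemon */
} object_entry;

static object_entry object_table[MAX_OBJECTS];

/* Find the live entry whose address equals the fake pointer, or NULL. */
static object_entry *find_entry(const void *fake) {
    for (size_t i = 0; i < MAX_OBJECTS; i++) {
        if ((const void *)&object_table[i] == fake && object_table[i].in_use) {
            return &object_table[i];
        }
    }
    return NULL;
}

/* Record a new object and return its fake pointer (NULL if the table is full).
 * The fake pointer is the address of the table entry: unique within the host
 * process, and never dereferenced by the application itself. */
void *register_object(int node, uintptr_t real_ptr) {
    assert(node >= LOCAL_NODE);
    for (size_t i = 0; i < MAX_OBJECTS; i++) {
        if (!object_table[i].in_use) {
            object_table[i].in_use = 1;
            object_table[i].node = node;
            object_table[i].real_ptr = real_ptr;
            return &object_table[i];
        }
    }
    return NULL;
}

/* Map a fake pointer back to (node, real pointer); NULL if unknown. */
const object_entry *resolve_object(const void *fake) {
    return find_entry(fake);
}

/* Forget an object; returns 0 on success, -1 if the fake pointer is unknown. */
int release_object(void *fake) {
    object_entry *entry = find_entry(fake);
    if (entry == NULL) {
        return -1;
    }
    entry->in_use = 0;
    return 0;
}
```

In such a design, an intercepted call would first pass each object argument through `resolve_object`, then either forward the real pointer to the daemon at `node` or call the local runtime when `node` is `LOCAL_NODE`. The forwarding step itself (over Open-MX in clOpenCL) is deliberately left out of the sketch.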
Using clOpenCL
Evaluation Testbed
All cluster systems run Linux Rocks [Gro12d] (version 5.4.3). The specific OpenCL platform and GPU driver versions used were AMD SDK 2.6 with driver 11.12, and CUDA 4.1.28 with driver 285.05.33. Open-MX 1.5.2 was used with the SysKonnect NICs, which provide better performance than the on-board Intel NICs.
Case Study
Figure 2 - deployment scenario for the MPI-with-OpenCL approach.
Figure 3 - deployment scenario for the clOpenCL approach.
Figure 4 - multiplication times (seconds) for square matrices of order 8k, 16k and 24k.
Last Revision: July 2013