CPU-to-GPU data transfers exceed 15GB/s using APU zero copy path
At the AMD Fusion Developer Summit (AFDS), AMD introduced the AMD A-Series processors, formerly known as the “Sabine” platform and “Llano” APU. The AMD A-Series APU is a capable GPU compute device, the A8-3850 with Radeon™ HD 6550D Graphics is capable of up to480 GFLOPS of performance. In order to fully benefit from this performance, it is essential to be able to transfer data efficiently between the host CPU and the GPU. In this blog I discuss the zero copy data transfer path that has been introduced to support the tighter coupling between CPU and GPU memory afforded by the APU architecture.
As you may be aware, in the AMD A-Series APU the CPU and GPU share a common DDR3 memory, partitioning the memory between the CPU and the GPU. Unlike a discrete GPU, this means that data may be transferred more efficiently between the CPU and the GPU as there is no longer a constraint imposed by a PCIe bus. In AMD APP SDK 2.5 we introduce a zero copy transfer path for buffers defined using the standard OpenCLTM memory buffer flags CL_MEM_ALLOC_HOST_PTR and CL_MEM_READ_ONLY when used on an APU.
For maximum performance transfers from CPU to GPU, you can create a memory buffer using the clCreateBuffer command with the flags CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY. This buffer will be created in host accessible GPU memory and provides a true zero copy path for transferring data between CPU and GPU. As host accessible memory, this buffer may be mapped to the host CPU and then written to efficiently. To make this buffer accessible by the host CPU use the clMapBuffer command to map the buffer into host memory; control over the memory buffer is transferred logically between the devices. The CPU may now write directly to this buffer, and when done, return its contents to the GPU using the clEnqueueUnmapMemObject command. Now, when the GPU accesses this buffer, as initiated by a clEnqueueNDRangeKernel command, it is read directly by the GPU. In this way data is transferred between the CPU and the GPU substantially faster than is possible to a discrete GPU over a PCIe bus; using multiple CPU cores, transfer rates of over 15 GB/s (1) have been achieved for mid-range AMD A-Series platforms. When the GPU no longer needs the buffer, it may be returned to the CPU using the clEnqueueMapMemObject command, so that additional input data can be provided to the GPU.
In my next blog I will discuss our BufferBandwidth example with Marc Romankewicz, one of our OpenCL performance engineers, and how it leverages this enhancement among others to illustrate how to effectively move data between the CPU and GPU.
For more details on best practice in creating high performing code on AMD APUs and GPUs I refer you to the AMD APP SDK OpenCL Programming guide (http://goo.gl/f1HZj).
1 A8-3800 with Radeon™ HD 6550D graphics, 8GB DDR3-1333 Share this post on CPU-to-GPU data transfers exceed 15GB/s using APU zero copy path