Global memory - pinned/unpinned
Pinned Host Memory
Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.
Because pinned memory serves as a staging area for transfers between the host and the device, we can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory. Allocate pinned host memory in CUDA C/C++ using cudaMallocHost() or cudaHostAlloc(), and deallocate it with cudaFreeHost(). It is possible for pinned memory allocation to fail, so you should always check for errors. The following code excerpt demonstrates allocation of pinned memory with error checking.
cudaError_t status = cudaMallocHost((void**)&h_aPinned, bytes);
if (status != cudaSuccess)
  printf("Error allocating pinned host memory\n");
Data transfers using host pinned memory use the same cudaMemcpy() syntax as transfers with pageable memory. We can use the following “bandwidthtest” program (also available on Github) to compare pageable and pinned transfer rates.
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <string.h>  // memcpy, memset
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n",
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}
void profileCopies(float        *h_a,
                   float        *h_b,
                   float        *d,
                   unsigned int  n,
                   const char   *desc)
{
  printf("\n%s transfers\n", desc);

  unsigned int bytes = n * sizeof(float);

  // events for timing
  cudaEvent_t startEvent, stopEvent;

  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );

  checkCuda( cudaEventRecord(startEvent, 0) );
  checkCuda( cudaMemcpy(d, h_a, bytes, cudaMemcpyHostToDevice) );
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  float time;
  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  printf("  Host to Device bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  checkCuda( cudaEventRecord(startEvent, 0) );
  checkCuda( cudaMemcpy(h_b, d, bytes, cudaMemcpyDeviceToHost) );
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  printf("  Device to Host bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  // verify the round trip
  for (unsigned int i = 0; i < n; ++i) {
    if (h_a[i] != h_b[i]) {
      printf("*** %s transfers failed ***\n", desc);
      break;
    }
  }

  // clean up events
  checkCuda( cudaEventDestroy(startEvent) );
  checkCuda( cudaEventDestroy(stopEvent) );
}
int main()
{
  unsigned int nElements = 4*1024*1024;
  const unsigned int bytes = nElements * sizeof(float);

  // host arrays
  float *h_aPageable, *h_bPageable;
  float *h_aPinned, *h_bPinned;

  // device array
  float *d_a;

  // allocate and initialize
  h_aPageable = (float*)malloc(bytes);                    // host pageable
  h_bPageable = (float*)malloc(bytes);                    // host pageable
  checkCuda( cudaMallocHost((void**)&h_aPinned, bytes) ); // host pinned
  checkCuda( cudaMallocHost((void**)&h_bPinned, bytes) ); // host pinned
  checkCuda( cudaMalloc((void**)&d_a, bytes) );           // device

  for (unsigned int i = 0; i < nElements; ++i) h_aPageable[i] = i;
  memcpy(h_aPinned, h_aPageable, bytes);
  memset(h_bPageable, 0, bytes);
  memset(h_bPinned, 0, bytes);

  // output device info and transfer size
  cudaDeviceProp prop;
  checkCuda( cudaGetDeviceProperties(&prop, 0) );
  printf("\nDevice: %s\n", prop.name);
  printf("Transfer size (MB): %d\n", bytes / (1024 * 1024));

  // perform copies and report bandwidth
  profileCopies(h_aPageable, h_bPageable, d_a, nElements, "Pageable");
  profileCopies(h_aPinned, h_bPinned, d_a, nElements, "Pinned");

  printf("\n");

  // cleanup
  cudaFree(d_a);
  cudaFreeHost(h_aPinned);
  cudaFreeHost(h_bPinned);
  free(h_aPageable);
  free(h_bPageable);

  return 0;
}
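Note that the error checking in checkCuda() is compiled in only when DEBUG or _DEBUG is defined (for example, by passing -DDEBUG to nvcc); in a release build the wrapper simply returns the result unchecked.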
The data transfer rate can depend on the type of host system (motherboard, CPU, and chipset) as well as the GPU. On my laptop, which has an Intel Core i7-2620M CPU (2.7GHz, 2 Sandy Bridge cores, 4MB L3 Cache) and an NVIDIA NVS 4200M GPU (1 Fermi SM, Compute Capability 2.1, PCI-e Gen2 x16), running BandwidthTest produces the following results. As you can see, pinned transfers are more than twice as fast as pageable transfers.
Device: NVS 4200M
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 2.308439
Device to Host bandwidth (GB/s): 2.316220
Pinned transfers
Host to Device bandwidth (GB/s): 5.774224
Device to Host bandwidth (GB/s): 5.958834
On my desktop PC with a much faster Intel Core i7-3930K CPU (3.2 GHz, 6 Sandy Bridge cores, 12MB L3 Cache) and an NVIDIA GeForce GTX 680 GPU (8 Kepler SMs, Compute Capability 3.0) we see much faster pageable transfers, as the following output shows. This is presumably because the faster CPU (and chipset) reduces the host-side memory copy cost.
Device: GeForce GTX 680
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 5.368503
Device to Host bandwidth (GB/s): 5.627219
Pinned transfers
Host to Device bandwidth (GB/s): 6.186581
Device to Host bandwidth (GB/s): 6.670246
You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.
Batching Small Transfers
Due to the overhead associated with each transfer, it is preferable to batch many small transfers together into a single transfer. This is easy to do by using a temporary array, preferably pinned, and packing it with the data to be transferred.
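As an illustration of this pattern (not part of the bandwidthtest program above), the following sketch packs three small host arrays into a single pinned staging buffer and issues one large cudaMemcpy(); the names N, h_x, h_y, h_z, h_staging, and d_buf are hypothetical, and checkCuda() is the helper defined earlier.

// Hypothetical sketch: pack three small host arrays of N floats each into one
// pinned staging buffer and send them to the device in a single transfer.
const unsigned int N = 1024;
float h_x[N], h_y[N], h_z[N];           // small source arrays (assumed already filled)
float *h_staging, *d_buf;

checkCuda( cudaMallocHost((void**)&h_staging, 3 * N * sizeof(float)) ); // pinned
checkCuda( cudaMalloc((void**)&d_buf, 3 * N * sizeof(float)) );

memcpy(h_staging,         h_x, N * sizeof(float));  // pack the small arrays
memcpy(h_staging +     N, h_y, N * sizeof(float));
memcpy(h_staging + 2 * N, h_z, N * sizeof(float));

// one large transfer instead of three small ones
checkCuda( cudaMemcpy(d_buf, h_staging, 3 * N * sizeof(float),
                      cudaMemcpyHostToDevice) );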
For two-dimensional array transfers, you can use cudaMemcpy2D().
cudaMemcpy2D(dest, dest_pitch, src, src_pitch, w, h, cudaMemcpyHostToDevice)
The arguments here are a pointer to the first destination element and the pitch of the destination array, a pointer to the first source element and the pitch of the source array, the width and height of the submatrix to transfer, and the memcpy kind. There is also a cudaMemcpy3D() function for transfers of rank-three array sections.
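For example, the following sketch (illustrative only; the names h_mat, d_mat, hostCols, w, and h are hypothetical) copies a w x h submatrix of floats from a pinned host matrix into pitched device memory allocated with cudaMallocPitch(). Note that cudaMemcpy2D() takes the width in bytes and the height in rows.

// Illustrative sketch: copy a w x h submatrix of floats from a host matrix
// with hostCols floats per row into pitched device memory.
const size_t hostCols = 512, w = 256, h = 128;

float *h_mat;                       // pinned host matrix, hostCols floats per row
checkCuda( cudaMallocHost((void**)&h_mat, hostCols * h * sizeof(float)) );

float *d_mat;
size_t d_pitch;                     // row pitch (in bytes) chosen by the runtime
checkCuda( cudaMallocPitch((void**)&d_mat, &d_pitch, w * sizeof(float), h) );

checkCuda( cudaMemcpy2D(d_mat, d_pitch,                   // destination and its pitch
                        h_mat, hostCols * sizeof(float),  // source and its pitch
                        w * sizeof(float), h,             // width (bytes) and height (rows)
                        cudaMemcpyHostToDevice) );

Kernels that read d_mat should then index rows using d_pitch rather than w * sizeof(float), since the runtime may pad each row.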