Wednesday, July 16, 2014

Testing unified memory

I just tested CUDA 6's unified memory. The following code worked as I expected.

#include <iostream>
#include <string.h>

#include <cuda_runtime.h>

__global__ void set(char* buf, unsigned int num)
{
 int i = blockDim.x * blockIdx.x + threadIdx.x;
 if (i < num)
 {
  buf[i] = 1;
 }
}

int main(void)
{
 unsigned int num = 16384;
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //##1## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##2## H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //##3## D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data transfer.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //##4## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##5## H->D(4096bytes).
 cudaFree(buf);
 cudaDeviceReset();
 return 0;
}

I ran the code under a profiler and added comments where data transfers occurred. Whenever host code touches managed memory that has not yet been migrated, a D->H transfer occurs, not only before a read but also before a write. My understanding is that this is needed for correctness: if the runtime skipped the D->H copy on a write, and you later read an address near the one you just modified, the D->H transfer triggered by that read would overwrite your modification. The downside is some unnecessary data transfer, as at ##1##.
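If all I need is a zero-filled buffer, I would guess the round trip at ##1## can be avoided by initializing on the device instead of the host, for example with cudaMemset(). I have not profiled this variant, so this is only a sketch of what I expect (it reuses the set kernel from above):

int main(void)
{
 unsigned int num = 16384;
 char *buf;
 cudaMallocManaged(&buf, num);
 cudaMemset(buf, 0, num);                     //Zeroed on the device: no host touch, so no D->H migration expected.
 set<<<4, 128>>>(buf, num);                   //Data should already be resident on the device, so no H->D either.
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //First host read: D->H as before.
 cudaFree(buf);
 cudaDeviceReset();
 return 0;
}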
You can also see that at ##5## only 4096 bytes are transferred, not the full 16384 bytes. This is because the runtime keeps track of which pages you have written (I asked an NVIDIA engineer about this).
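If the driver really tracks dirty data at 4 KB page granularity, then dirtying two separate pages on the host should make the next launch transfer two pages, 8192 bytes. I have not measured this; it is just what I would expect given the observation above:

 buf[0] = 5;                                  //D->H(16384bytes) on the first fault, as before; dirties the first 4 KB page.
 buf[8192] = 7;                               //Dirties a second 4 KB page; already resident on the host, so no new fault expected.
 set<<<4, 128>>>(buf, num);                   //Expected H->D: 2 x 4096 = 8192 bytes, not the full 16384.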

I then modified the main function a little and created a second buffer, buf2.
int main(void)
{
 unsigned int num = 16384;

 //Below code adds D->H(16384bytes), H->D(16384bytes). 1 page fault.
 char *buf2;
 cudaMallocManaged(&buf2, num);
 memset(buf2, 0, num);

 //Same as above.
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data migration.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(4096bytes).
 cudaFree(buf);
 cudaFree(buf2);
 cudaDeviceReset();
 return 0;
}

In this code I allocated unified memory for buf2 and called memset() on it, but buf2 is not used anywhere else. The memset() adds a D->H transfer of 16384 bytes, which is expected, but it also adds an H->D transfer of 16384 bytes. We know the kernel never touches buf2, but it is too hard for CUDA to know that, so before the first kernel launch it migrates buf2 as well, which is another unnecessary data transfer. My conclusion is that unified memory makes the program simpler but not as efficient as I had expected. I have heard that Pascal will have cleverer memory management; I will wait and see whether it is good enough.
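One workaround I want to try (I have not profiled it yet) is the attachment flags that CUDA 6 provides for exactly this situation. If I read the documentation correctly, allocating buf2 with cudaMemAttachHost tells the driver that only the host touches it for now, so it should not be migrated before unrelated kernel launches, and cudaStreamAttachMemAsync() can attach it to a specific stream later if a kernel does need it. A sketch of the change:

 char *buf2;
 cudaMallocManaged(&buf2, num, cudaMemAttachHost);  //Hint: host-only for now; no H->D before unrelated kernel launches expected.
 memset(buf2, 0, num);

 //If a kernel later needs buf2, attach it to that kernel's stream first:
 //cudaStream_t s;
 //cudaStreamCreate(&s);
 //cudaStreamAttachMemAsync(s, buf2, 0, cudaMemAttachSingle);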
