Sunday, May 18, 2014

CPU/GPU memory abstraction

When you write a program that uses the GPU (with either CUDA or OpenCL), you may want to implement both a CPU version and a GPU version, and run only one of them depending on the user's choice. For this reason I wanted to know how to treat different kinds of memory generically, i.e. how to abstract host and device memory.

The easiest way is to follow the approach of (or just use) Thrust. It has host_vector and device_vector, whose interfaces are quite similar to std::vector. When you have one of them, you can copy it to the other by assignment; the host/device copy is done automatically under the hood.

I wanted a different kind of abstraction to make the program easier to debug. What interested me was a mechanism where the input data may be in either host or device memory, and the code does not have to know where it is. After some trial and error I implemented one like this:

template <typename T>
class Buffer
{
public:
    T* get(MemoryType mType); //Pointer to the host or device memory.
    const T* get(MemoryType mType) const;
    void setClean(MemoryType mType, bool isClean=true); //Mark the data in mType as up to date (or not).
    void sync(MemoryType mType) const; //Bring the data in mType up to date, copying only if needed.
    void allocate(MemoryType mType) const;
    void free(MemoryType mType);

private:
    mutable MemoryType m_cleanState; //Clean state. Bitwise OR of HOST and DEVICE.
    mutable void* m_addrs[2]; //Host and device.
};

It can hold both host and device memory, and it knows whether the data stored in each of them is up to date. If the copy in host memory is up to date and the one in device memory is not, calling sync(DEVICE) copies the data from host memory to device memory. If the device memory is already up to date, sync(DEVICE) does nothing.
You can use this class like this:
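The clean-state bookkeeping described above can be sketched in plain C++. This is only an illustration under assumptions, not the actual implementation: SimBuffer, the NONE value, the numeric values of HOST and DEVICE, and copyCount() are all hypothetical names, the "device" memory is simulated with a second host allocation so the sketch runs without a GPU, and setClean(X) is assumed to mean "X now holds the only up-to-date copy".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bit flags, so a clean state can be HOST, DEVICE, or both (assumed values).
enum MemoryType { NONE = 0, HOST = 1, DEVICE = 2 };

// Hypothetical stand-in for Buffer. Both "memories" are host vectors here.
template <typename T>
class SimBuffer
{
public:
    explicit SimBuffer(std::size_t count)
        : m_count(count), m_cleanState(NONE), m_copyCount(0) {}

    // Return the storage for mType, allocating it on first use.
    T* get(MemoryType mType) { allocate(mType); return storage(mType).data(); }
    const T* get(MemoryType mType) const { allocate(mType); return storage(mType).data(); }

    // After a write to mType: mark it as the only clean copy (assumption).
    // With isClean == false: mark mType as stale.
    void setClean(MemoryType mType, bool isClean = true)
    {
        m_cleanState = isClean ? mType : (m_cleanState & ~mType);
    }

    // Bring mType up to date, copying from the other side only if needed.
    void sync(MemoryType mType) const
    {
        if (m_cleanState & mType) return; // already up to date: do nothing
        const MemoryType other = (mType == HOST) ? DEVICE : HOST;
        assert((m_cleanState & other) && "no up-to-date copy to sync from");
        // With CUDA, this would be a cudaMemcpy in the matching direction.
        storage(mType) = storage(other);
        m_cleanState |= mType; // both copies now agree
        ++m_copyCount;
    }

    // Number of copies actually performed (for demonstration only).
    int copyCount() const { return m_copyCount; }

private:
    void allocate(MemoryType mType) const
    {
        if (storage(mType).size() != m_count) storage(mType).resize(m_count);
    }
    std::vector<T>& storage(MemoryType mType) const
    {
        return m_storage[mType == HOST ? 0 : 1];
    }

    std::size_t m_count;
    mutable int m_cleanState;            // bitwise OR of HOST and DEVICE
    mutable int m_copyCount;
    mutable std::vector<T> m_storage[2]; // [0] = host, [1] = simulated device
};
```

In this sketch, writing through get(HOST) and calling setClean(HOST) leaves the device side stale, so the first sync(DEVICE) performs one copy and a second sync(DEVICE) performs none. The members are mutable so that sync() can stay const, which mirrors the declaration above.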

void someCalculationCpu(const Buffer<float>* input, Buffer<float>* output)
{
    input->sync(HOST); //Make sure the host copy of the input is up to date.
    const float* ip = input->get(HOST);
    float* op = output->get(HOST);
    op[0] = ip[0]; /*Do some calculation with CPU*/
    output->setClean(HOST); //Tell the buffer that the data stored in the host memory is up to date.
}

void someCalculationGpu(const Buffer<float>* input, Buffer<float>* output)
{
    input->sync(DEVICE); //Make sure the device copy of the input is up to date.
    const float* ip = input->get(DEVICE);
    float* op = output->get(DEVICE);
    /*Do some calculation with GPU, e.g. launch a kernel that reads ip and writes op.*/
    output->setClean(DEVICE); //Tell the buffer that the data stored in the device memory is up to date.
}

void anotherCalculationCpu(const Buffer<float>* input, Buffer<float>* output)
{
    /*Same style as someCalculationCpu() with another calculation.*/
}

void anotherCalculationGpu(const Buffer<float>* input, Buffer<float>* output)
{
    /*Same style as someCalculationGpu() with another calculation.*/
}

Now all of these combinations are valid:

   someCalculationCpu(input, output);
   anotherCalculationCpu(input, output);

   someCalculationCpu(input, output);
   anotherCalculationGpu(input, output);

   someCalculationGpu(input, output);
   anotherCalculationCpu(input, output);

   someCalculationGpu(input, output);
   anotherCalculationGpu(input, output);

I've already implemented this, so I'll keep using it, but I wonder whether there is already a tool for this, or a way to do it with Thrust. Please leave a comment if you know.