I don't know if anyone is reading this, but here are some tips for lowering CPU load when doing processing with OpenCL on nVidia GPUs.
1. Always use pinned memory when you transfer data between the device and the host. Without pinned memory, buffer reads and writes always block and waste CPU time.
2. Never use a blocking command. With nVidia GPUs, every blocking command involves a busy wait that wastes CPU time. Once you avoid them, you have to poll and sleep manually in order to wait until a command finishes.
3. If you don't use blocking commands, you need to flush the command queue manually with clFlush. Blocking commands flush the queue implicitly; non-blocking ones do not.
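
For what it's worth, here is a rough C sketch of how the three tips could fit together. It is only an illustration under some assumptions, not tested driver-specific code: all the function names (poll_and_sleep, countdown_done, event_done, flush_and_wait, alloc_pinned, read_without_spinning) and the HAVE_OPENCL macro are made up for this example, it assumes a POSIX system for nanosleep() and the usual CL/cl.h header location, and it assumes that a buffer created with CL_MEM_ALLOC_HOST_PTR and then mapped ends up in pinned memory, which is common with the nVidia driver but not guaranteed by the OpenCL spec.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep(); assumes a POSIX system */
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 1 when the work has finished,
   0 while it is still running, and a negative value on error. */
typedef int (*done_fn)(void *ctx);

/* Tip 2: poll and sleep instead of calling a blocking command.
   Returns how many times it slept, or -1 if the callback reported an error. */
long poll_and_sleep(done_fn is_done, void *ctx, long interval_us)
{
    long sleeps = 0;
    for (;;) {
        int r = is_done(ctx);
        if (r < 0) return -1;
        if (r > 0) return sleeps;
        struct timespec ts;
        ts.tv_sec = interval_us / 1000000L;
        ts.tv_nsec = (interval_us % 1000000L) * 1000L;
        nanosleep(&ts, NULL);
        ++sleeps;
    }
}

/* Stand-in callback for trying the loop without a GPU:
   counts an int down to zero, then reports completion. */
int countdown_done(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) { --*remaining; return 0; }
    return 1;
}

#ifdef HAVE_OPENCL  /* hypothetical macro: define it when building against an OpenCL SDK */
#include <CL/cl.h>

/* Non-blocking completion check for an OpenCL event. */
int event_done(void *ctx)
{
    cl_int status;
    cl_int err = clGetEventInfo((cl_event)ctx, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof(status), &status, NULL);
    if (err != CL_SUCCESS || status < 0) return -1;
    return status == CL_COMPLETE;
}

/* Tips 3 and 2: flush the queue so the command is submitted, then poll
   and sleep until its event completes. Releases the event in all cases. */
int flush_and_wait(cl_command_queue q, cl_event ev, long interval_us)
{
    int ok = clFlush(q) == CL_SUCCESS &&
             poll_and_sleep(event_done, ev, interval_us) >= 0;
    clReleaseEvent(ev);
    return ok ? 0 : -1;
}

/* Tip 1: ask the driver for host-side memory (assumed to be pinned) and
   map it with a non-blocking map. The caller keeps *out_buf to unmap and
   release it later. Returns NULL on failure. */
void *alloc_pinned(cl_context c, cl_command_queue q, size_t size,
                   cl_mem *out_buf, long interval_us)
{
    cl_int err;
    cl_event ev;
    void *p;
    *out_buf = clCreateBuffer(c, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                              size, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    p = clEnqueueMapBuffer(q, *out_buf, CL_FALSE, CL_MAP_READ | CL_MAP_WRITE,
                           0, size, 0, NULL, &ev, &err);
    if (err != CL_SUCCESS || flush_and_wait(q, ev, interval_us) != 0) {
        clReleaseMemObject(*out_buf);
        return NULL;
    }
    return p;
}

/* Non-blocking read (CL_FALSE) from a device buffer into the pinned host
   pointer, followed by flush + poll + sleep. Returns 0 on success. */
int read_without_spinning(cl_command_queue q, cl_mem dev_buf, void *pinned_ptr,
                          size_t size, long interval_us)
{
    cl_event ev;
    if (clEnqueueReadBuffer(q, dev_buf, CL_FALSE, 0, size, pinned_ptr,
                            0, NULL, &ev) != CL_SUCCESS)
        return -1;
    return flush_and_wait(q, ev, interval_us);
}
#endif /* HAVE_OPENCL */
```

The polling interval is a trade-off: a longer sleep means less CPU load but more latency before you notice that the command has finished, so something like the expected kernel or transfer time divided by a small factor is probably a reasonable starting point. Without HAVE_OPENCL defined, only the generic loop and the countdown stand-in are compiled, which is enough to try the waiting logic on a machine with no GPU.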