I have been super busy recently with various things, including but not limited to a trip home for the first time in six years. But that doesn’t mean I haven’t had time for a little home coding.
I had a quick go at porting Ericsson’s ETC-Pack to use OpenCL. It really was a rushed hack job to see what kind of speed-ups I could expect, so I have only ported the SLOW option when converting from .ppm to .ktx. The results were quite good.
Converting the supplied “elina.ppm” sample takes 112.74 seconds with the standard version on my i7 920. The OpenCL version takes 24.1 seconds on the same CPU, and on a GTX 460 it comes down to only 6.0 seconds, almost 19 times faster than the original.
I have attached the CL kernel; it needs to run inside compressImageFile(…) in place of the 2D loop near the bottom, something like this:
void compressImageFile(uint8 *img, int width, int height, char *dstfile, int expandedwidth, int expandedheight, int action)
{
	// original source here ...
	int countblocks = 0;
#ifdef TRY_CL // new OpenCL path
	//--------------------------------------
	// OPENCL STUFF
	//--------------------------------------
	cl_int error;
	// Upload the source pixels in one copy from host memory.
	cl_mem src = clCreateBuffer(m_context.m_handle, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 3 * width * height, img, &error);
	// Each 4x4 block compresses to two 32-bit words (8 bytes).
	int writesize = (expandedheight / 4) * (expandedwidth / 4) * 2;
	cl_mem dst = clCreateBuffer(m_context.m_handle, CL_MEM_WRITE_ONLY, sizeof(unsigned int) * writesize, NULL, &error);
	unsigned int dimx = expandedwidth / 4;
	unsigned int dimy = expandedheight / 4;
	// Associate the device buffers and dimensions with the kernel arguments.
	error  = clSetKernelArg(m_kernel, 0, sizeof(cl_mem), (void*)&src);
	error |= clSetKernelArg(m_kernel, 1, sizeof(cl_int), (void*)&expandedwidth);
	error |= clSetKernelArg(m_kernel, 2, sizeof(cl_int), (void*)&expandedheight);
	error |= clSetKernelArg(m_kernel, 3, sizeof(cl_int), (void*)&dimx);
	error |= clSetKernelArg(m_kernel, 4, sizeof(cl_int), (void*)&dimy);
	error |= clSetKernelArg(m_kernel, 5, sizeof(cl_mem), (void*)&dst);
	// One work item per 4x4 block.
	m_workSize[0] = expandedheight / 4;
	m_workSize[1] = expandedwidth / 4;
	error |= clEnqueueNDRangeKernel(m_context.m_cmdQueue, m_kernel, 2, NULL, m_workSize, NULL, 0, NULL, NULL);
	// Blocking read: waits for the kernel to finish, then copies the results back.
	unsigned int *op = new unsigned int[writesize];
	clEnqueueReadBuffer(m_context.m_cmdQueue, dst, CL_TRUE, 0, sizeof(unsigned int) * writesize, op, 0, NULL, NULL);
	for (y = 0; y < writesize; y += 2)
	{
		write_big_endian_4byte_word(&op[y], f);
		write_big_endian_4byte_word(&op[y + 1], f);
	}
	delete[] op;
	clReleaseMemObject(src);
	clReleaseMemObject(dst);
#else // original code
	for (y = 0; y < expandedheight / 4; y++)
	{
		for (x = 0; x < expandedwidth / 4; x++)
		{
			// cut for brevity
		}
	}
#endif
	fclose(f);
	// you get the idea
}
The kernel: compress.txt. If you are going to run it on the GPU in its current state, you’ll probably need to change your Timeout Detection and Recovery (TDR) settings, since a single long kernel launch can trip Windows’ GPU watchdog.
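On Windows the watchdog is controlled by registry values under GraphicsDrivers. A sketch of raising the timeout, using the documented TdrDelay key (the 10-second figure is just an example value I picked, and a reboot is needed for it to take effect):

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; Seconds the GPU may run a single workload before the watchdog resets it
; (0x0000000a = 10 seconds; the default is 2)
"TdrDelay"=dword:0000000a
```

Remember to put the default back afterwards; a long TDR delay means a genuinely hung kernel will freeze your display for that long.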