In my current project I am using the Prefix Sum to help me pack data processed on the GPU into an OpenGL buffer that is used as the position input for instanced drawing. I am doing this because it lets me pack all the data I will use at the start of the buffer and render only the valid, on-screen instances.
The basics of Prefix Sum are covered in the link above, and a simple example of how to implement one in parallel can be found here. A more complex, GPU-efficient explanation can be found on the NVIDIA site. The NVIDIA link uses CUDA, but if, like me, you are using OpenCL, a version of it can be found in the OpenCL Code Examples in the NVIDIA GPU Computing SDK, under “Scan”.
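To make the idea concrete, here is a minimal CPU-side sketch in Python of a prefix sum written in a parallel-friendly style (doubling the offset each pass, reading from the previous pass). It is not the code from any of the links above and not the NVIDIA SDK sample; the function names are my own, and a real GPU version would run each pass as a kernel over all elements at once.

```python
def inclusive_scan_parallel_style(values):
    """Scan in log2(n) passes. Within a pass every element update is
    independent, which is what makes the scheme suitable for a GPU.
    Here the passes are simply simulated with a loop."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # Read from a copy of the previous pass (double buffering) so that
        # updates within the same pass do not interfere with each other.
        previous = list(data)
        for i in range(offset, len(data)):
            data[i] = previous[i] + previous[i - offset]
        offset *= 2
    return data


def exclusive_scan(values):
    """Exclusive prefix sum: element i holds the sum of all elements before i."""
    inclusive = inclusive_scan_parallel_style(values)
    if not inclusive:
        return []
    return [0] + inclusive[:-1]
```

For example, scanning the validity flags [1, 0, 1, 1, 0, 1] with exclusive_scan gives [0, 1, 1, 2, 3, 3], which are exactly the output slots for the valid entries.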
Packing of the data is covered here, though I think the explanation is a little unclear: shouldn’t “if P[i]==P[i+1]” be “if P[i]!=P[i+1]”? In fact, I think there are some mistakes in the link from the same site mentioned above too, but it covers the basics in an easy-to-understand manner.
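Here is a small Python sketch of the packing step as I understand it, assuming P is the exclusive prefix sum of the 0/1 validity flags with the total appended as a final element (so P[i+1] exists for the last entry). The names compact and is_valid are made up for illustration. Under that assumption, P[i] != P[i+1] is true exactly when entry i is valid, and P[i] is the slot it should be written to.

```python
from itertools import accumulate


def compact(items, is_valid):
    """Pack the valid items to the front, keeping their order.
    Returns the packed list and the number of valid items."""
    flags = [1 if is_valid(item) else 0 for item in items]
    # P[i] = number of valid items before index i; P[len(items)] = total count.
    P = [0] + list(accumulate(flags))
    packed = [None] * P[-1]
    for i, item in enumerate(items):
        # The sum only increases across a valid entry, so a change between
        # P[i] and P[i + 1] marks entry i as one to keep.
        if P[i] != P[i + 1]:
            packed[P[i]] = item
    return packed, P[-1]
```

Calling compact([5, -1, 7, -3, 2], lambda x: x >= 0) returns ([5, 7, 2], 3). On the GPU the loop body would be one work item per entry, and each write goes to a distinct slot, so no synchronisation is needed within the scatter.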
After processing the data on the GPU, counting the number of valid entries, and packing the data, the number of instances I need to render must be passed to OpenGL. Currently I read the number back to the CPU and then pass it to OpenGL via glDrawElementsInstanced, though it seems the need to read back the data is removed by glDrawElementsIndirect in GL4, or the GL_ARB_draw_indirect extension in GL3.1, which I will try out later (along with moving my instanced data from a Texture Buffer to a Vertex Buffer; see GL_ARB_instanced_arrays).
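I have not tried the indirect path yet, so treat the following as a sketch of the data layout only. As far as I can tell from the GL_ARB_draw_indirect specification, glDrawElementsIndirect reads five 32-bit values from a buffer bound to GL_DRAW_INDIRECT_BUFFER: count, primCount, firstIndex, baseVertex and reservedMustBeZero. The Python below just packs those bytes on the CPU to show the layout; the helper name and the little-endian assumption are mine. The point of the indirect draw is that the GPU kernel could write primCount (the number of valid instances) into that buffer itself, so nothing has to be read back.

```python
import struct

# Four unsigned 32-bit ints and one signed 32-bit int (baseVertex),
# tightly packed. Little-endian is assumed here.
INDIRECT_COMMAND_FORMAT = "<IIIiI"


def pack_draw_elements_indirect_command(index_count, instance_count,
                                        first_index=0, base_vertex=0):
    """Return the 20 bytes of one indirect draw command."""
    return struct.pack(
        INDIRECT_COMMAND_FORMAT,
        index_count,     # count: indices per instance
        instance_count,  # primCount: instances to draw (the packed count)
        first_index,     # firstIndex: offset into the index buffer
        base_vertex,     # baseVertex: added to each index
        0,               # reservedMustBeZero
    )
```

With this layout the instance count sits at byte offset 4 of the command, which is where a packing kernel would need to store its result.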
Finally, I read this interesting article today on a newer extension for drawing using data on the GPU: http://rastergrid.com/blog/2011/06/multi-draw-indirect-is-here/