glDrawElements vs glDrawArrays efficiency - iPhone

I'm making a game for iOS and Android.
I've seen in a lot of places that drawing with indices is more efficient than drawing a plain triangle array.
The thing is that I'm using lossy compressed vertices (like the MD2 file format), and they take less space than the indices alone:
Array: N * 3 (xyz) * 1 (uchar) + translate (12 bytes) + scale (12 bytes).
Element: N * 3 (xyz) * 4 (uint) + Array / ~10
It seems like the array is an even better choice than indexed, compressed elements, although Apple's OpenGL profiler tool says that I should use glDrawElements.
Does the OpenGL implementation prefer indexed arrays, or is it just that an indexed array contains less data than a regular uncompressed array?
P.S.
I'm using OpenGL ES 2.0, and the vertex shader decompresses the vertices.

Are you using Vertex Arrays or Vertex Buffer Objects? If you are using Vertex Arrays, then you might want to consider looking into glDrawRangeElements rather than glDrawElements.
Unless all of your polys have distinct vertices (pretty unlikely), glDrawElements will be faster than glDrawArrays, because glDrawElements can take advantage of the fact that repeated vertices are held in the GPU cache and do not have to be loaded again for each poly that shares the vertex. glDrawArrays, by contrast, iterates through the array and cannot associate identical vertices, so it has to load them multiple times.
However, it always depends on your specific situation. Try using a profiler or getting a framerate using each of these methods. It shouldn't be too hard to switch between them. If one is clearly better, use that one. If neither is outstandingly better, then it probably doesn't matter too much.
Remember, premature optimization is the root of all evil. Good luck! :)
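The cache argument above can be illustrated with a toy model. The sketch below is a simulation under loud assumptions, not a description of any real GPU: it models the post-transform vertex cache as a small FIFO (the size of 16 entries and the helper names are hypothetical) and counts how often the vertex shader would run for a grid of quads drawn as an indexed triangle list versus an unindexed one.

```python
from collections import deque

def grid_indices(cols, rows):
    """Index list for a cols x rows grid of quads, two triangles per quad."""
    indices = []
    stride = cols + 1  # number of vertices in one row of the grid
    for y in range(rows):
        for x in range(cols):
            a = y * stride + x
            b = a + 1
            c = a + stride
            d = c + 1
            indices += [a, b, c, b, d, c]
    return indices

def shader_runs_indexed(indices, cache_size=16):
    """Count vertex shader runs with a FIFO cache keyed by index (toy model)."""
    cache = deque(maxlen=cache_size)  # oldest entry is evicted automatically
    runs = 0
    for i in indices:
        if i not in cache:
            runs += 1
            cache.append(i)
    return runs

indices = grid_indices(8, 8)
unindexed_runs = len(indices)       # glDrawArrays-style: every vertex is processed
indexed_runs = shader_runs_indexed(indices)
unique_vertices = len(set(indices))
print(unindexed_runs, indexed_runs, unique_vertices)
```

In this model the indexed count always lands between the number of unique vertices and the unindexed count; how close it gets to the lower bound depends on the cache size and on the order of the triangles, which is why measuring on the real device is still the deciding step.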

Using glDrawElements() will save a substantial amount of memory and you can specify the array elements in any order.
With glDrawArrays(), however, your only option is to iterate sequentially over the list.
You have to experiment.

Related

plans and subset of rows in pyfftw

I want to use pyfftw to repeatedly compute the discrete Fourier transform of a subset of rows of a two-dimensional array. I do not know in advance which rows I need to transform; that depends on the output from the previous round. I do know that doing it for all rows is wasteful.
It is my understanding that a 'plan' in FFTW3 is associated with the type of transform (c2c, r2c, etc) and the input/output length, which is always a vector in the 1D case. In pyfftw it looks like a 'plan' is associated to the type of transform and the input/output shape, so my interpretation is that it uses the same FFTW3 plan for every row.
My question is: is it possible to use the same FFTW3 plan for some of the rows, without creating separate pyfftw.FFTW objects for all possible combinations of rows?
On a different note, I am wondering how pyfftw uses multiple cores: does it use multiple cores for each row (this appears natural in view of FFTW3 documentation) or does it farm out different rows to different cores (which was my initial assumption)?

If you can create a numpy array from a view, you can plan for it with pyFFTW - all valid numpy arrays should work just fine.
This means several things:
Your array needs to have regular strides, but those strides can be arbitrary.
ND arrays are planned as ND transforms, with the selected axes being used.
You can probably do something cunning with stride tricks and it will probably work (but might not do what you expect if you do something too nefarious like overlapping rows and then use threads).
One solution that I've used quite a bit is to copy the rows that you want to transform into an interim array, and transforming that. You might well find that's the fastest option (particularly when you can allow for getting the byte offset correct).
Obviously, this doesn't work directly if the number of rows keeps changing. You might still find that planning for the largest number of rows you will transform, and then copying in a subset, is faster than the alternatives.
The problem you're going to come up against, even if you go down to the C level, is that the planning overhead might well dominate if you're changing your transform sizes often.
You could also try pyfftw.interfaces.numpy_fft, which is normally faster than numpy and has the ability to cache repeated transform sizes.
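The "copy the rows into an interim array" idea can be sketched as follows. This is a minimal illustration, not the answerer's code: the helper name `fft_selected_rows`, the scratch buffer and the sizes are all hypothetical, and the sketch falls back to `numpy.fft` when pyFFTW is not installed. The scratch array has a fixed shape, so the transform shape never changes between rounds and a cached plan can be reused.

```python
import numpy as np

try:
    # Drop-in replacement for numpy.fft, if pyFFTW is installed.
    import pyfftw
    from pyfftw.interfaces import numpy_fft as fft
    pyfftw.interfaces.cache.enable()  # keep plans around for repeated shapes
except ImportError:
    from numpy import fft  # fallback so the sketch still runs

def fft_selected_rows(data, rows, scratch):
    """Copy the chosen rows into the front of `scratch` and transform it.

    `scratch` has the fixed shape (max_rows, n); unused rows are zeroed
    and their (wasted) transforms are discarded.
    """
    k = len(rows)
    scratch[:k] = data[rows]
    scratch[k:] = 0
    return fft.fft(scratch, axis=1)[:k]

rng = np.random.default_rng(0)
data = rng.standard_normal((64, 128)) + 0j
max_rows = 16  # assumed upper bound on rows needed per round
scratch = np.zeros((max_rows, data.shape[1]), dtype=np.complex128)

rows = [3, 10, 42]
result = fft_selected_rows(data, rows, scratch)
```

Whether transforming a padded buffer beats transforming exactly the needed rows depends on how far the typical row count falls below `max_rows`, so it is worth timing both.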

Can an interleaved index buffer be used for skeletal animation and can the algorithm be optimized?

This is two questions, that deal with the same topic.
I recently made an OBJ loader that creates an interleaved index buffer from the OBJ data. It works fine, but with large models it can take minutes to load a single mesh. A key part of this buffer format is that non-unique indexes can be referenced multiple times in the index array, so you don't need to add the same index data twice. The problem is that testing whether an index is unique involves comparing it against the other index data, and with large files this can take minutes to calculate. Is there a way to speed this up? Should I just skip the unique index checking? Or should I take this code and use it to create my own files based on the OBJ, so that I can just dump the data straight into the program?
In the future I'd like to add an animation element to my program (using a library to import COLLADA data), but I'm having trouble getting my head around animated meshes. It was always my belief that a vertex is manipulated within the vertex shader based on n weights, so couldn't we tell each index which bones influence it and update it in the shader? Or am I misunderstanding the process?

If you're really sure that the slowdown is the unique index checking, I suggest you have your pre-processing code write out its result as binary data to a file which you load instead of the OBJ file -- you can just copy-paste your pre-processing code to a Command-Line Utility, put the resulting file in your project, and then use NSData to get its contents and set your VBO to the data. Without code it's hard to say if it could be sped up -- how do you perform the check? You could perhaps use a dictionary to get O(1) lookups on each check, or if the geometry is really big, maybe multithreading would help.
You are correct about how skeletal animation is commonly implemented in a shader. Below is some code from a shader I wrote which might help you. Note that it only supports 16 bones, and only three bones can influence each point.
As far as I know, the best COLLADA animation tutorial on the web is this one (http://www.wazim.com/Collada_Tutorial_1.htm), though the author's English isn't the best, and his code is C++ and C#. As for the influence, you would probably make that a vertex attribute and add it to your interleaved VBO.
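uniform mat4 modelViewProjectionMatrix;
uniform mat4 boneMatrices[16]; // assumed declaration: 16 bones, per the note above
attribute vec3 boneIndex;      // assumed declaration: indices stored as floats, cast with int() below
attribute vec3 boneWeights;
attribute vec4 position;
void main()
{
    // Blend the position as transformed by each of the three influencing bones.
    vec4 animatedPosition;
    animatedPosition = (position * boneMatrices[int(boneIndex[0])]) * boneWeights[0];
    animatedPosition += (position * boneMatrices[int(boneIndex[1])]) * boneWeights[1];
    animatedPosition += (position * boneMatrices[int(boneIndex[2])]) * boneWeights[2];
    gl_Position = modelViewProjectionMatrix * animatedPosition;
}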
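On the first question, the dictionary suggestion can be sketched like this. It is a minimal illustration in Python rather than the asker's loader: the function name `build_index_buffer` and the representation of a face corner as a (position, texcoord, normal) tuple are assumptions. Each corner is looked up in a hash map, so the uniqueness check costs O(1) on average instead of a scan over all previously seen vertices.

```python
def build_index_buffer(corners):
    """Deduplicate face corners into a vertex list plus an index list.

    `corners` is an iterable of hashable (position, texcoord, normal)
    tuples, one per face corner, in drawing order.
    """
    lookup = {}    # corner tuple -> index of its slot in `vertices`
    vertices = []
    indices = []
    for corner in corners:
        index = lookup.get(corner)
        if index is None:          # first time this combination is seen
            index = len(vertices)
            lookup[corner] = index
            vertices.append(corner)
        indices.append(index)
    return vertices, indices

# Two triangles sharing an edge: 6 corners, 4 unique vertices.
p = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
t = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
n = (0.0, 0.0, 1.0)
corners = [(p[i], t[i], n) for i in (0, 1, 2, 1, 3, 2)]
vertices, indices = build_index_buffer(corners)
```

The resulting `vertices` list can then be flattened into the interleaved buffer, and the bone indices and weights discussed above would simply become two more members of each corner tuple.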

How is a bitmapped vector trie faster than a plain vector?

It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How is this possible?

Bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything. It depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However a bitmapped vector trie will be much faster for other operations (thanks to structural sharing) - for example creating a new copy with a single changed element without affecting the original data structure is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation for why you might want these kinds of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
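The difference between the two operations can be made concrete with a small sketch. This is an illustrative toy, not Clojure's or Scala's implementation: it uses 4-way nodes (2 bits per level) instead of 32-way nodes so the tree stays readable, and the names `build`, `get` and `assoc` are chosen for this example. Lookup walks one level per group of index bits; `assoc` copies only the nodes on the path to the changed element and shares everything else.

```python
BITS = 2             # Clojure uses 5 (32-way nodes); 2 keeps the example small
WIDTH = 1 << BITS
MASK = WIDTH - 1

def build(items):
    """Build a trie from a list; returns (root, shift, count)."""
    level = [list(items[i:i + WIDTH]) for i in range(0, len(items), WIDTH)] or [[]]
    shift = 0
    while len(level) > 1:  # group nodes into parents until one root remains
        level = [level[i:i + WIDTH] for i in range(0, len(level), WIDTH)]
        shift += BITS
    return level[0], shift, len(items)

def get(root, shift, index):
    """Follow BITS bits of the index per level down to the leaf."""
    node = root
    while shift > 0:
        node = node[(index >> shift) & MASK]
        shift -= BITS
    return node[index & MASK]

def assoc(node, shift, index, value):
    """Return a new root with `value` at `index`, copying only the path to it."""
    copy = list(node)  # shallow copy: untouched children stay shared
    slot = (index >> shift) & MASK
    if shift == 0:
        copy[slot] = value
    else:
        copy[slot] = assoc(node[slot], shift - BITS, index, value)
    return copy

root, shift, count = build(list(range(20)))
new_root = assoc(root, shift, 17, "x")
print(get(root, shift, 17), get(new_root, shift, 17))
```

The old root still sees the original value, and every subtree off the copied path is the same object in both versions, which is the structural sharing the answer refers to.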

There is a lot of good stuff in the other answers, but nobody answers your question. PersistentVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer, the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always gets a range from memory. If you have a big array it does not fit the cache. So if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector that's different. There are lots of small arrays floating around, and the JVM is smart about that and puts them close to each other in memory. So for random accesses this is fast; if you run through it sequentially it's much slower.
I have to say that I'm not an expert on hardware or how the JVM handles cache locality and I have never benchmarked this myself; I am just retelling stuff I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about Functional Data-Structures; skip to the last part if you are only interested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala

A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure that has been implemented in Scala since 2010 (v 2.8). It is its clever bitwise indexing strategy that allows for highly efficient access and modification of large data sets.
From Understanding Clojure's Persistent Vectors:
Mutable vectors and ArrayLists are generally just arrays which grows
and shrinks when needed. This works great when you want mutability,
but is a big problem when you want persistence. You get slow
modification operations because you'll have to copy the whole array
all the time, and it will use a lot of memory. It would be ideal to
somehow avoid redundancy as much as possible without losing
performance when looking up values, along with fast operations. That
is exactly what Clojure's persistent vector does, and it is done
through balanced, ordered trees.
The idea is to implement a structure which is similar to a binary
tree. The only difference is that the interior nodes in the tree have
a reference to at most two subnodes, and does not contain any elements
themselves. The leaf nodes contain at most two elements. The elements
are in order, which means that the first element is the first element
in the leftmost leaf, and the last element is the rightmost element in
the rightmost leaf. For now, we require that all leaf nodes are at the
same depth. As an example, take a look at the tree below: It has
the integers 0 to 8 in it, where 0 is the first element and 8 the
last. The number 9 is the vector size:
If we wanted to add a new element to the end of this vector and we
were in the mutable world, we would insert 9 in the rightmost leaf
node, like this:
But here's the issue: We cannot do that if we want to be persistent.
And this would obviously not work if we wanted to update an element!
We would need to copy the whole structure, or at least parts of it.
To minimize copying while retaining full persistence, we perform path
copying: We copy all nodes on the path down to the value we're about
to update or insert, and replace the value with the new one when we're
at the bottom. A result of multiple insertions is shown below. Here,
the vector with 7 elements share structure with a vector with 10
elements:
The pink coloured nodes are shared between the vectors, whereas the
brown and blue are separate. Other vectors not visualized may also
share nodes with these vectors.
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript, from which the screenshot below was taken. Or if you really want to dive deeply into the technical details, see also Phil Bagwell's Ideal Hash Trees, which was the paper upon which Hickey's initial Clojure implementation was based.

What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.

Short explanation: it exploits the fact that the JVM optimizes array reads, writes, and copies very heavily. The key aspect, IMO, is that once your vector grows beyond a certain size, index management becomes a bottleneck. This is where the very clever algorithm of the persistent vector comes into play: on very large collections it outperforms the standard variant. So basically it is a functional data structure that performs so well only because it is built on small, mutable, highly optimized JVM data structures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673

Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).

Where do I find the memory requirements of a MATLAB function?

I have a 3D array of values (0 or 1), which is very large (approx 2300x2300x11). I want to fit a surface to these values using, for example, interp3, but when I try, MATLAB runs out of memory. Thus, I've decided to reduce the size of my array enough for MATLAB to accommodate it in memory.
Now, the smaller I make the reduced array, the worse my results will be (the surface fitting is part of a measurement process with high precision requirements), so I want to reduce the array as little as possible.
Is there any way to determine beforehand how much memory a certain array size will demand and how much memory is available, and then use this information to resize the array enough to avoid out-of-memory exceptions, but not more?

I don't know the answer to this, but I wonder if you can have your cake and eat it, too.
If your data set is too big, why not do a piecewise fit? Do it in chunks rather than omitting data points.
Or be smarter about how you omit data points. You want them in areas of high curvature - where your data is changing fastest. Leave out points in areas far away from the action, where nothing interesting is happening. You might have to do a fit, look at the surface, add and remove more points and try again.
It might be an iterative process, but I'll bet you'll be able to get a nice fit with a little luck and effort.

You can look at the maximum array sizes that are supported on different platforms. In general, if you have a PxQxR sized 3D array of doubles, then the size of your array in bytes is P*Q*R*8. For your matrix, the size is ~444 MB. You can also try converting it to single precision using single(A). single uses 4 bytes per element, so this reduces the size of your array by a factor of 2.
I haven't really poked into the inner workings of interp3, but the exact memory requirements will depend on the interpolation option you choose. So, you can first try to convert it to single and see if it works. If not, try with 80% (90%) of the number of rows and columns. This way you have a good chunk of the original array, but the memory requirement is only 64% (81%) of the original.
If that doesn't help, duffymo's suggestion is what you should be looking into.
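The arithmetic in this answer can be checked with a short sketch. It is written in Python rather than MATLAB purely for illustration, the 300 MiB budget is a hypothetical figure, and it only estimates the storage of the array itself; the working memory interp3 needs on top of that is not modelled. Because only rows and columns are scaled, a scale factor s changes the memory by s squared, which is where the 64% and 81% figures come from.

```python
import math

def array_bytes(shape, bytes_per_element):
    """Storage needed for a dense array of the given shape."""
    return math.prod(shape) * bytes_per_element

shape = (2300, 2300, 11)
double_bytes = array_bytes(shape, 8)  # 465,520,000 bytes, about 444 MiB
single_bytes = array_bytes(shape, 4)  # half of that

def row_col_scale(needed_bytes, budget_bytes):
    """Factor to apply to rows and columns so the array fits the budget.

    Scaling rows and columns by s scales memory by s**2, because the
    third dimension is left unchanged.
    """
    return min(1.0, math.sqrt(budget_bytes / needed_bytes))

budget = 300 * 1024**2  # hypothetical memory budget in bytes
scale = row_col_scale(double_bytes, budget)
reduced_shape = (math.floor(shape[0] * scale), math.floor(shape[1] * scale), shape[2])
print(double_bytes, round(scale, 3), reduced_shape)
```

In practice the budget would have to leave headroom for the interpolation's temporaries, so the computed scale is an upper bound rather than a guarantee.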

Is using Vertex Buffer Objects for very dynamic data a good idea performance-wise?

I have many particles whose vertices change every frame. The vertices are currently being drawn using a vertex array in 'client' memory. What performance characteristics can I expect if I use a vertex buffer object?
Since I have to use a number of glBufferSubData calls to update the particle vertices, I am transferring the vertices to video memory every frame anyway, right (like I would if I used a regular vertex array)? Is there any benefit to VBOs in this case?
This is for iOS devices. The actual draw call: glDrawElements(GL_POINTS,num_particles,GL_UNSIGNED_SHORT,pindices);
Should I use GL_STREAM_DRAW or GL_DYNAMIC_DRAW?

Apple's documentation appears to recommend VBOs in all situations. If you're using ES 2.x then the GL_STREAM_DRAW vertex buffer type is explicitly for "when your application needs to create transient geometry that is rendered a small number of times and then discarded. This is most useful when your application must dynamically change vertex data every frame in a way that cannot be performed in a vertex shader." Use of glBufferSubData is then directly advocated.
Logically, I guess the only difference between supplying the data completely afresh and sending it to an existing GL_STREAM_DRAW or GL_DYNAMIC_DRAW buffer is that your space in the memory map (GPU or CPU, depending on the chip — MBXs don't really do VBOs but Apple supports them for other performance reasons) can be allocated once rather than allocated and released every frame.
Using the alignment and packing tips given in that document is likely to give a better improvement than a switch to VBOs, since otherwise the CPU just has to unpack and repack data upon glDrawElements. Though quite probably you're already aware of that and I appreciate that it isn't directly part of the question — I mainly throw it in as a comparative guess about performance benefits.

By setting up VBOs properly, you are using the optimal way of transferring data to the GPU, and you might skip some driver processing. The only way to see how much improvement you get is to measure; it differs from card to card.
For a VBO how-to, see this: VBO tutorial
EDIT
Forgot to answer the question: yes, it is a good idea. But measure first.