What is an array-backed data structure? - scala

What is an "array-backed" data structure?
I googled it; maybe it is an array implemented as a linked list,
which can easily be appended to and prepended to.
Please correct me and share more information about the Saddle library.

An array-backed data structure is any data structure where the underlying values are stored in an array. For example, a fixed-size ring buffer can be backed by an array of that size (along with start and end indices). Image data can have pixel values packed into an array. Matrices (in the mathematical sense) fit naturally in arrays.
Other choices for data structures include linked lists, tries, maps (hashmaps or others); all of these have various tradeoffs. Array-backed data structures generally work well for going through sizable chunks of data sequentially, but not for random insertion and removal of elements.
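To illustrate the ring example above, here is a minimal Python sketch of a fixed-size ring buffer backed by a plain list. The class name RingBuffer and its methods are made up for this illustration, and it tracks a start index plus a size rather than separate start and end indices; it also assumes a capacity greater than zero.

```python
class RingBuffer:
    """Fixed-size ring buffer backed by a preallocated list (hypothetical example)."""

    def __init__(self, capacity):
        self._data = [None] * capacity  # the backing array, allocated once
        self._start = 0                 # index of the oldest element
        self._size = 0                  # number of valid elements

    def append(self, value):
        cap = len(self._data)
        end = (self._start + self._size) % cap  # slot just past the newest element
        self._data[end] = value
        if self._size == cap:
            # Buffer is full: the oldest element was overwritten, so advance start.
            self._start = (self._start + 1) % cap
        else:
            self._size += 1

    def to_list(self):
        cap = len(self._data)
        return [self._data[(self._start + i) % cap] for i in range(self._size)]


rb = RingBuffer(3)
for v in [1, 2, 3, 4, 5]:
    rb.append(v)
```

After the loop, rb.to_list() holds the three most recent values; the backing list never grows or shrinks.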

Related

Multidimensional array or cell array in Matlab?

I want to create a multidimensional array A in Matlab of dimension NxMxG with N,M,G very large (e.g. 10^6).
Then I need to access A in a loop as
for g=1:G
Atemp=A(:,:,g);
%etc etc
end
What is more convenient in terms of speed and memory between storing the values of A in a multidimensional array or in a cell array?
If you always loop over slices in the same way and process them one at a time, as your code seems to suggest, then the performance should be roughly equivalent.
If you really intend to store 1e6x1e6x1e6 doubles, Matlab is probably not your tool. However, if the slices are sparse, it's probably a bit more efficient to store them as a cell array, so that Matlab does not have to search the full 3D space when "cutting" the slice, and Atemp=A{g}; simply copies a sparse matrix.
If you are working with full (non-sparse) slices, then you should probably save each slice to disk and use a function or support class that loads it from file instead: Atemp=A(g);. Mind that loading text takes much more time than loading a binary file, so choose your file format carefully!
If you are storing numbers, a multidimensional array is the right thing to use. A cell array also allows other types, so it is less optimised for purely numeric data. Because you are using very large arrays, a sparse matrix may be appropriate for you.
First, note that neither option will let you handle 10^18 values. You don't have exabytes of storage, let alone memory.
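The arithmetic behind that statement, as a quick Python check. It assumes the values are stored as 8-byte doubles (Matlab's default numeric type) and uses decimal exabytes (10^18 bytes).

```python
# Dimensions from the question: N x M x G with each around 10^6.
N = M = G = 10**6

values = N * M * G              # total number of elements
bytes_needed = values * 8       # assuming 8 bytes per double
exabytes = bytes_needed / 10**18

print(values, bytes_needed, exabytes)
```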
If you will ONLY ever use it as Atemp = A(:,:,g); with N and M the same size for all g, choosing a multidimensional array or a cell array shouldn't change anything meaningful as far as performance goes. N-D will probably be a bit faster, but nothing significant.
Obviously, if you ever want to do computations where the sizes N and M depend on g, you need to pick a cell array. And if you want to do computations with, say, Atemp = squeeze(A(:,g,:));, an N-D array is the clear choice.
So the choice most likely depends on whether you prefer writing A(:,:,g) or A{g};, which depends on the meaning of your data. For example, if you have weather data and g is the height, A(:,:,g) is clearly more sensible, even if you currently only care about what happens at a specific height (not what happens between the layers): you may well need inter-layer calculations at some point. But if g instead indexes different measurement sites gathering data, A{g} should be used to pick the site, since some sites will likely end up larger or smaller than others.

Efficient way to store single matrices generated in a loop in Matlab?

I would like to know whether there is a way to reduce the amount of memory used by the following piece of code in Matlab:
n=3;
T=100;
r=T*2;
b=80;
BS=1000;
bsuppostmp_=cell(1,BS);
bslowpostmp_=cell(1,BS);
bsuppnegtmp_=cell(1,BS);
bslownegtmp_=cell(1,BS);
for w=1:BS
bsuppostmp_{w}= randi([0,1],n*T,2^(n-1),r,b);
bslowpostmp_{w}=randi([0,3],n*T,2^(n-1),r,b);
bsuppnegtmp_{w}=randi([0,4],n*T,2^(n-1),r,b);
bslownegtmp_{w}=randi([0,2],n*T,2^(n-1),r,b);
end
I have decided to use cells of matrices because after this loop I need to call separately each single matrix in another loop.
If I run this code I get the error message "Your system has run out of application memory".
Do you know a more efficient (in terms of memory) way to store each single matrix?
Let's refer to the documentation page on Strategies for Efficient Use of Memory:
Because simple numeric arrays (comprising one mxArray) have the least overhead, you should use them wherever possible. When data is too complex to store in a simple array (or matrix), you can use other data structures.
Cell arrays are comprised of separate mxArrays for each element. As a result, cell arrays with many small elements have a large overhead.
I doubt that the overhead for cell array is really large ...
Let me give a possible explanation. What if Matlab cannot use the swap file when the 4D arrays are stored in a cell array? When storing large numeric arrays, there is no out-of-memory error because Matlab uses the swap file to cache each variable when the memory in use becomes too big. Whereas if each 4D array is stored in one big cell array, Matlab sees it as a single variable and cannot split it between RAM and the swap file. OK, I don't work at MathWorks, so I don't know if I'm right or not; it's just an idea on this topic, and I would be glad to know what the real reason is.
So my advice is the same as in the other comments: try to free matrices as soon as you are done with them. There are not many ways to store many dense arrays: one big array (NOT recommended here, since it will run out of memory sooner because Matlab makes it contiguous), a cell array, or a struct array (and if I understand the documentation correctly, the overhead can be equivalent). In all cases, the total amount of data across all the 4D arrays is really large, so the best thing to do is to keep memory usage as low as possible at all times by discarding data once it has been used and keeping in memory only the results of the computation (provided they take up less memory ...).
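To put a number on "really large", here is a back-of-the-envelope estimate in Python using the sizes from the question. It assumes each element is an 8-byte double, which is what randi returns by default, and ignores any per-array overhead.

```python
# Parameters copied from the question.
n, T, b, BS = 3, 100, 80, 1000
r = T * 2

# Each matrix has size (n*T) x 2^(n-1) x r x b.
elements_per_array = (n * T) * 2 ** (n - 1) * r * b
arrays_total = 4 * BS  # four cell arrays with BS matrices each

bytes_per_array_double = elements_per_array * 8              # 8 bytes per double
total_double_gb = bytes_per_array_double * arrays_total / 1e9
total_uint8_gb = elements_per_array * 1 * arrays_total / 1e9  # 1 byte per uint8

print(elements_per_array, total_double_gb, total_uint8_gb)
```

Under these assumptions the four cell arrays need roughly 614 GB as doubles. Since the values are small integers, requesting an integer class from randi (for example 'uint8') would cut that by a factor of 8, but about 77 GB is still more than most machines have, so processing one matrix at a time remains the practical fix.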

How is a bitmapped vector trie faster than a plain vector?

It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How is this possible?
Bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything. It depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However, a bitmapped vector trie will be much faster for other operations (thanks to structural sharing). For example, creating a new copy with a single changed element without affecting the original data structure is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation for why you might want these kinds of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
There is a lot of good stuff in the other answers, but nobody answers your question. PersistentVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer, the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always gets a range from memory. If you have a big array it does not fit the cache. So if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector that's different. There are lots of small arrays floating around, and the JVM is smart about that and puts them close to each other in memory. So for random accesses this is fast; if you run through it sequentially it's much slower.
I have to say that I'm not an expert on hardware or how the JVM handles cache locality and I have never benchmarked this myself; I am just retelling stuff I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about Functional Data-Structures; skip to the last part if you are only interested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure, and it has been implemented in Scala since 2010 (v 2.8). Its clever bitwise indexing strategy is what allows for highly efficient access and modification of large data sets.
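As a rough illustration of the bitwise indexing idea, the Python sketch below builds a 32-way trie out of nested lists and looks up an element by consuming the index five bits at a time. This is a simplified sketch, not the actual Clojure or Scala implementation; the helper names build and get are invented here, and it assumes a non-empty input.

```python
BITS = 5
WIDTH = 1 << BITS   # 32 slots per node
MASK = WIDTH - 1    # 0b11111, selects one 5-bit chunk of the index


def build(items):
    """Pack items into leaves of up to 32, then group nodes until one root remains.

    Returns (root, shift), where shift is the bit offset used at the root level.
    """
    level = [list(items[i:i + WIDTH]) for i in range(0, len(items), WIDTH)]
    shift = 0
    while len(level) > 1:
        level = [level[i:i + WIDTH] for i in range(0, len(level), WIDTH)]
        shift += BITS
    return level[0], shift


def get(root, shift, index):
    """Walk from the root, using 5 bits of the index per level to pick a child."""
    node = root
    while shift > 0:
        node = node[(index >> shift) & MASK]
        shift -= BITS
    return node[index & MASK]


root, shift = build(list(range(100_000)))
```

With 100,000 elements the trie is four levels deep (shift starts at 15), so get(root, shift, 70_000) touches four small arrays rather than one huge one.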
From Understanding Clojure's Persistent Vectors:
Mutable vectors and ArrayLists are generally just arrays which grows
and shrinks when needed. This works great when you want mutability,
but is a big problem when you want persistence. You get slow
modification operations because you'll have to copy the whole array
all the time, and it will use a lot of memory. It would be ideal to
somehow avoid redundancy as much as possible without losing
performance when looking up values, along with fast operations. That
is exactly what Clojure's persistent vector does, and it is done
through balanced, ordered trees.
The idea is to implement a structure which is similar to a binary
tree. The only difference is that the interior nodes in the tree have
a reference to at most two subnodes, and does not contain any elements
themselves. The leaf nodes contain at most two elements. The elements
are in order, which means that the first element is the first element
in the leftmost leaf, and the last element is the rightmost element in
the rightmost leaf. For now, we require that all leaf nodes are at the
same depth. As an example, take a look at the tree below: It has
the integers 0 to 8 in it, where 0 is the first element and 8 the
last. The number 9 is the vector size:
If we wanted to add a new element to the end of this vector and we
were in the mutable world, we would insert 9 in the rightmost leaf
node, like this:
But here's the issue: We cannot do that if we want to be persistent.
And this would obviously not work if we wanted to update an element!
We would need to copy the whole structure, or at least parts of it.
To minimize copying while retaining full persistence, we perform path
copying: We copy all nodes on the path down to the value we're about
to update or insert, and replace the value with the new one when we're
at the bottom. A result of multiple insertions is shown below. Here,
the vector with 7 elements share structure with a vector with 10
elements:
The pink coloured nodes are shared between the vectors, whereas the
brown and blue are separate. Other vectors not visualized may also
share nodes with these vectors.
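The path copying described in the quoted passage can be sketched in a few lines of Python, using the same simplified two-way branching as the quote. Nodes are immutable tuples; build, get and update are hypothetical helper names for this illustration, and the input is assumed to be non-empty.

```python
def build(items):
    """Leaves hold up to two elements; interior nodes hold up to two children."""
    level = [tuple(items[i:i + 2]) for i in range(0, len(items), 2)]
    depth = 0
    while len(level) > 1:
        level = [tuple(level[i:i + 2]) for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


def get(node, depth, index):
    """Follow one bit of the index per level, from the most significant bit down."""
    while depth > 0:
        node = node[(index >> depth) & 1]
        depth -= 1
    return node[index & 1]


def update(node, depth, index, value):
    """Return a new root; only the nodes on the path to index are copied."""
    slot = (index >> depth) & 1
    if depth == 0:
        new_child = value
    else:
        new_child = update(node[slot], depth - 1, index, value)
    return node[:slot] + (new_child,) + node[slot + 1:]


v1, depth = build(list(range(9)))   # the integers 0 to 8, as in the quote
v2 = update(v1, depth, 5, "x")      # new version; v1 is left untouched
```

Here v1 still returns 5 at index 5 while v2 returns "x", and every subtree off the updated path (for example v1[1] and v2[1]) is the very same object in both versions.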
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript, from which the screenshot below was taken. Or if you really want to dive deeply into the technical details, see also Phil Bagwell's Ideal Hash Trees, which was the paper upon which Hickey's initial Clojure implementation was based.
What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.
Short explanation: it exploits the fact that the JVM heavily optimizes reading, writing, and copying arrays. The key aspect, IMO, is that once your vector grows beyond a certain size, index management becomes a bottleneck. This is where the very clever algorithm of the persistent vector comes into play: on very large collections it outperforms the standard variant. So basically it is a functional data structure that performs so well only because it is built on small, mutable, highly optimized JVM data structures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673
Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).

Storing characters in MATLAB arrays

I want to store a character along with numbers. Is using cells the only way?
Yes, unless you store the ASCII values but I don't think it would be very useful.
Edit: Or an array of structures?
a.num = [1 2 3]
a.char = 'A'
I don't know exactly what you are trying to achieve…
This is a classic Computer Science 101 sort of question. An array holds 1 type of data traditionally. In matlab the term gets abused.
Here are some things to know:
An array of characters is called a string
An array can only store one data type
The size of an array can’t change
But Matlab has an abstraction on top of all this so that the engineer who didn't study programming for a year can still get the job done. Whilst Matlab lets you change the size of a 1D matrix, it still won't let you have different types of data in the same array. Keep in mind that, because of this, Matlab 1D arrays aren't strictly arrays. The same goes for arrays of arrays that differ in size. Matlab doesn't allow different data types in one array for optimization reasons.
This question arises from not knowing the containers which are available.
List: a container of indexed elements (great for sorting and adding elements quickly)
Set: for a collection of unique elements (great for ensuring that there are no duplicates)
Map: Great for quickly retrieving elements based on a unique identifier
Java has implementations of these, and you can use them within Matlab if you want, which is generally how it is done if you need a collection other than a matrix. I don't think Matlab bothered to wrap these classes themselves because they would be exactly the same anyway.
In general it's not a great idea to store different data types in these collections. If you can avoid it, do so; otherwise, so be it.
PS I don't think structs should ever be used because there is no way to know what members they have without debugging them.
If you do
a.num = [1 2 3]
a.char = 'A'
Unless you tell everyone a.num and a.char exist there is no way of knowing that a has char and num, without running code. Bad bad practice.

Learning decision trees on huge datasets

I'm trying to build a binary classification decision tree out of huge (i.e. which cannot be stored in memory) datasets using MATLAB. Essentially, what I'm doing is:
Collect all the data
Try out n decision functions on the data
Pick out the best decision function to separate the classes within the data
Split the original dataset into 2
Recurse on the splits
The data has k attributes and a classification, so it is stored as a matrix with a huge number of rows, and k+1 columns. The decision functions are boolean and act on the attributes assigning each row to the left or right subtree.
Right now I'm considering storing the data in files, in chunks that can be held in memory, and assigning an ID to each row, so that the decision to split is made by reading all the files sequentially and the future splits are identified by the ID numbers.
Does anyone know how to do this in a better fashion?
EDIT: The number of rows m is around 5e8 and k is around 500
At each split, you are breaking the dataset into smaller and smaller subsets. Start with the single data file. Open it as a stream and just process one row at a time to figure out which attribute you want to split on. Once you have your first decision function, split the original data file into 2 smaller data files that each hold one branch of the split data. Recurse. The data files should become smaller and smaller until you can load them in memory. That way, you don't have to tag rows and keep jumping around in a huge data file.
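One split step of this approach could look like the Python sketch below. It assumes the data is stored as CSV with one row per line; the function name split_file and the decide callback are hypothetical, and the decision function itself is assumed to have been chosen already in an earlier streaming pass.

```python
import csv


def split_file(src_path, left_path, right_path, decide):
    """Stream rows from src_path and write each one to the left or right file.

    decide(row) is a boolean decision function; True sends the row right.
    Only one row is held in memory at a time. Returns (left_count, right_count).
    """
    counts = [0, 0]
    with open(src_path, newline="") as src, \
         open(left_path, "w", newline="") as left, \
         open(right_path, "w", newline="") as right:
        writers = (csv.writer(left), csv.writer(right))
        for row in csv.reader(src):
            side = 1 if decide(row) else 0
            writers[side].writerow(row)
            counts[side] += 1
    return tuple(counts)
```

Calling split_file again on each output file gives the recursion; once a file is small enough, it can be loaded into memory and processed normally.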