Storing characters in MATLAB arrays - matlab

I want to store a character along with numbers ? Is using Cells the only way ?

Yes, unless you store the ASCII values but I don't think it would be very useful.
Edit: Or an array of structures?
a.num = [1 2 3]
a.char = 'A'
I don't know exactly what you are trying to achieve…

This is a classic Computer Science 101 sort of question. An array holds 1 type of data traditionally. In matlab the term gets abused.
Here are some things to know:
An array of characters is called a string
An array can only store one data type
The size of an array can’t change
But matlab has an abstraction on top of all this so the engineer that didn't study programming for a year can still get the job done. Whilst matlab lets you change the size of a 1D matrix, it still won't let you have different types of data in the same array. Keep in mind that matlab 1D arrays aren't strictly arrays because this fact. Similarly with arrays of arrays with differing in size. Matlab doesn’t allow different data structures for optimization reasons.
This question arises from not knowing the containers which are available.
List: A container indexed elements (great for sorting and adding elements quickly)
Set: for a collection of unique elements (great for ensuring that there are no duplicates)
Map: Great for quickly retrieving elements based on a unique identifier
Java has implementations for these and you can use these within matlab if you want which is the general way it is done if you need a collection other than a matrix. I don’t think matlab bothered to wrap these classes themselves because they would be exactly the same anway.
In general its not a great idea to store different data types in these collections if you can avoid it, do so but otherwise so be it.
PS I don't think structs should ever be used because there is no way to know what members they have without debugging them.
If you do
a.num = [1 2 3]
a.char = 'A'
Unless you tell everyone a.num and a.char exist there is no way of knowing that a has char and num, without running code. Bad bad practice.

Related

How can I address the SHA3 state vector in programming terms?

I've been working on an implementation of SHA3, and I'm getting a bit muddled on this particular aspect of the algorithm. The addressing scheme of the state vector is given by the following diagram:
My issue with the above is: How does one go about addressing this in terms of actual code? I am using a 3 dimensional array to express the state vector, but this leads to obvious issues since the conventional mapping of an array (0 index is first) differs from the above convention used in SHA3.
For example, if I wanted to address the (0,0,0) bit in the SHA3 state array, the following expression would achieve this:
state_vector[2][2][0]
I find this highly cumbersome however because when implementing the actual round algorithms, the intended x and y values do not directly map to the array indices. Addressing state_vector[0][0][0] would return the very first index in the array instead of the (0,0,0) bit in the SHA3 state array.
Is there a way I can get around this in code?
Sorry, I know this is probably a stupid question.
The way this is customarily implemented is as a 5×5 array of 64-bit words, an array of 25 64-bit words or, if you believe your architecture (say, AArch64) will have a lot of registers, as 25 individual 64-bit words. (I prefer the second option because it's simpler to work with.) Typically they are indeed ordered in the typical order for arrays, and one simply rewrites things accordingly.
Usually this isn't a problem, because the operations are specified in terms of words in relation to each other, such as in the theta and chi steps. It's common to simply code rho and pi together such that it involves reading a word, rotating it, and storing it in the destination word, and in such a case you can simply just reorder the rotation constants as you need to.
If you want to get very fancy, you can write this as an SIMD implementation, but I think it's easier to see how it works in a practical implementation if you write it as a one- or two-dimensional array of words first.

Multidimensional array or cell array in Matlab?

I want to create a multidimensional array A in Matlab of dimension NxMxG with N,M,G very large (e.g. 10^6).
Then I need to access Ain a loop as
for g=1:G
Atemp=A(:,:,g);
%etc etc
end
What is more convenient in terms of speed and memory between storing the values of A in a multidimensional array or in a cell array?
If you always loop on slices in the same way, and process them one at a time, as your bit of code seem to suggest, then the performance should be roughly equivalent.
If you really intend to store 1e6x1e6x1e6 double's, Matlab is definitely probably not your tool. However, if slices are sparse, then it's probably a bit more efficient to store them as a cell array, so Matlab does not have to search the full 3D space when "cutting" the slice, and Atemp=A{g}; simply copies a sparse matrix.
If you are working on full (nonsparse) slices then probably you should load/save your slice to disk and use instead a function/support class which loads from file: Atemp=A(g);. Mind that text loading takes up much more time than loading a binary file: so choose your file format carefully!
If you use numbers, a multidimensional array is the right thing to use. A cell array also allows other types, so is less optimised for numbers only. Because you are using very large arrays, maybe a sparse matrix may be appropriate for you.
First, note that neither pick will let you handle 10^18 values. You don't have exabytes of storage, let alone memory.
If you will ONLY ever use it as Atemp = A(:,:,g); with N and M always the same size for all g, having it multi-dimensional or cell shouldn't change anything meaningful as far as performance goes. N-D will be probably a bit faster, but nothing significant.
Obviously, if you ever want to have computation with different sizes of N, M depending on g, you need to pick cell array. And if you want to have computation with say Atemp = squeeze(A(:,g,:)); N-D array is clear choice here.
So, choice most likely depends if you prefer doing A(:,:,g) or A{g};, which depends on your meaning of data. Say if you have weather data and currently only care about what happens at specific height (not what happens between the layers), A(:,:,g) is clearly more sensible. It is possible you will require inter-layer calculations at some point. But if you have instead g meaning different measurement sites gathering data, A{g} should be used to pick the site. You will likely have some sites larger or smaller eventually.

Efficient way to store single matrices generated in a loop in Matlab?

I would like to know whether there is a way to reduce the amount of memory used by the following piece of code in Matlab:
n=3;
T=100;
r=T*2;
b=80;
BS=1000
bsuppostmp_=cell(1,BS);
bslowpostmp_=cell(1,BS);
bsuppnegtmp_=cell(1,BS);
bslownegtmp_=cell(1,BS);
for w=1:BS
bsuppostmp_{w}= randi([0,1],n*T,2^(n-1),r,b);
bslowpostmp_{w}=randi([0,3],n*T,2^(n-1),r,b);
bsuppnegtmp_{w}=randi([0,4],n*T,2^(n-1),r,b);
bslownegtmp_{w}=randi([0,2],n*T,2^(n-1),r,b);
end
I have decided to use cells of matrices because after this loop I need to call separately each single matrix in another loop.
If I run this code I get the message error "Your system has run out of application memory".
Do you know a more efficient (in terms of memory) way to store each single matrix?
Let's refer the page about Strategies for Efficient Use of Memory:
Because simple numeric arrays (comprising one mxArray) have the least overhead, you should use them wherever possible. When data is too complex to store in a simple array (or matrix), you can use other data structures.
Cell arrays are comprised of separate mxArrays for each element. As a result, cell arrays with many small elements have a large overhead.
I doubt that the overhead for cell array is really large ...
Let me give a possible explanation. What if Matlab cannot use the swap file in case of storing the 4D arrays into a cell array? When storing large numeric arrays, there is no out-of-memory error because Matlab uses the swap file for caching each variable when the used memory becomes too big. Whereas if each 4D array is stored in a super cell array, Matlab sees it as a single variable and cannot fragment it part in the RAM and part in the swap file. Ok I don't work at Mathworks so I don't know if I'm right or not, it's just an idea about this topic so I would be glad to know what is the real reason.
So my advice is the same as other comments: try to free matrices as soon as you've done with them. There is not so many possibilities to store many dense arrays: one big array (NOT recommended here, will reach out-of-memory sooner because Matlab makes it contiguous), cell array or struct array (and if I correctly understand the documentation, the overhead can be equivalent). In all cases, the data amount over all 4D arrays is really large, so the best thing to do is to care about keeping the memory constantly as low as possible by discarding some data once they are used and keep in memory only the results of computation (in case they take lower memory usage ...).

Hash function for an array of integers

What would be a good hash function for an array of integers?
For example I have two arrays [1,2,3] and [1,5]. What hash functions should I adopt to seperate both these arrays?
I thought of adding up each element after raising it to the power of 2 but this has a large cost associated with it due to multiple multiplications. Is there any simple hashing function for this scenario?
For that particular data set, just subtract one from the second-to-last item, that will give you a perfect minimal hash, with buckets 0 and 1 produced :-)
More seriously, the choice of a good hashing function does depend a great deal on the sort of data so that should be taken into consideration. It's hard to suggest something without knowing the properties of the data you'll be storing.
I would start simply by choosing an arbitrary function such as adding all the items in the array then adding the array length to that, and reducing it modulo some value:
numbuckets = 97
bucket = array.length() % numbuckets
for index in range (array.length()):
bucket = (bucket + array[index]) % numbuckets
Then examine the results (across a great many real data sets) to make sure there's not too many collisions. If there are, choose another function.
It's the same as with optimisation: measure, don't guess! Actually monitor the collisions and usage and act if it gets bad.

How is a bitmapped vector trie faster than a plain vector?

It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How this is possible?
bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything. It depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However a bitmapped vector trie will be much faster for other operations (thanks to structural sharing) - for example creating a new copy with a single changed element without affecting the original data structure is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation of why you might want these kind of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
There is a lot of good stuff in the other answers but nobdy answers your question. The PersistenVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer, the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always gets a range from memory. If you have a big array it does not fit the cache. So if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector that's diffrent. There are lots of small arrays floating around and the JVM is smart about that and puts them close to each other in memory. So for random accesses this is fast; if you run through it sequentially it's much slower.
I have to say that I'm not an expert on hardware or how the JVM handles cache locality and I have never benchmarked this myself; I am just retelling stuff I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about Functional Data-Structures, skip to the last part if you are only intrested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure, that has been implementated in Scala since 2010 (v 2.8). It is its clever bitwise indexing strategy that allows for highly efficient access and modification of large data sets.
From Understanding Clojure's Persistent Vectors :
Mutable vectors and ArrayLists are generally just arrays which grows
and shrinks when needed. This works great when you want mutability,
but is a big problem when you want persistence. You get slow
modification operations because you'll have to copy the whole array
all the time, and it will use a lot of memory. It would be ideal to
somehow avoid redundancy as much as possible without losing
performance when looking up values, along with fast operations. That
is exactly what Clojure's persistent vector does, and it is done
through balanced, ordered trees.
The idea is to implement a structure which is similar to a binary
tree. The only difference is that the interior nodes in the tree have
a reference to at most two subnodes, and does not contain any elements
themselves. The leaf nodes contain at most two elements. The elements
are in order, which means that the first element is the first element
in the leftmost leaf, and the last element is the rightmost element in
the rightmost leaf. For now, we require that all leaf nodes are at the
same depth2. As an example, take a look at the tree below: It has
the integers 0 to 8 in it, where 0 is the first element and 8 the
last. The number 9 is the vector size:
If we wanted to add a new element to the end of this vector and we
were in the mutable world, we would insert 9 in the rightmost leaf
node, like this:
But here's the issue: We cannot do that if we want to be persistent.
And this would obviously not work if we wanted to update an element!
We would need to copy the whole structure, or at least parts of it.
To minimize copying while retaining full persistence, we perform path
copying: We copy all nodes on the path down to the value we're about
to update or insert, and replace the value with the new one when we're
at the bottom. A result of multiple insertions is shown below. Here,
the vector with 7 elements share structure with a vector with 10
elements:
The pink coloured nodes are shared between the vectors, whereas the
brown and blue are separate. Other vectors not visualized may also
share nodes with these vectors.
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript, from which the screenshot below was taken. Or if you really want to dive deeply into the technical details, see also Phil Bagwell's Ideal Hash Trees, which was the paper upon which Hickey's initial Clojure implementation was based.
What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.
Short explanation: it uses the fact that the JVM optimizes so hard on read/write/copy array data structures. The key aspect IMO is that if your vector grows to a certain size index management becomes a  bottleneck . Here comes the very clever algorithm from persisted vector into play, on very large collections it outperforms the standard variant. So basically it is a functional data-structure which only performed so well because it is built up on small mutable highly optimizes JVM datastructures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673
Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).