Compute similarity between columns in a fast manner MATLAB - matlab

Given a (rather) big matrix A :
A [m*n] (m = 7, n = 15000)
where the columns are items and the rows are attributes, I would like to compute the similarity between every pair of items and store it in an array. The similarity matrix could be sim, where each row contains [item1_id, item2_id, similarity]. This process needs to be very fast.
I am considering options like bsxfun, which uses multi-threading for this purpose. Other ideas are also appreciated (I am totally open to other efficient approaches). It would be appreciated if you could suggest some approaches and report how long the operation takes for you. This process may be scaled up in the future. Thank you very much.
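One possible starting point is sketched below. The question does not name a similarity measure, so cosine similarity is an assumption here, and n is reduced to 2000 so the example runs comfortably. Note that for n = 15000 the full n-by-n result already holds 225 million doubles (about 1.8 GB), and the [item1_id, item2_id, similarity] list of unique pairs would need roughly 2.7 GB.

```matlab
% Sketch only: cosine similarity is an assumed measure, not stated in the question.
m = 7;
n = 2000;                                 % smaller than 15000 to keep memory modest
A = rand(m, n);                           % placeholder data: columns are items

colNorms = sqrt(sum(A.^2, 1));            % 1-by-n Euclidean norm of each column
An = bsxfun(@rdivide, A, colNorms);       % scale every column to unit length
S = An' * An;                             % n-by-n matrix of cosine similarities

mask = triu(true(n), 1);                  % keep each unordered pair once (i < j)
[item1, item2] = find(mask);              % pair indices, column-major order
sim = [item1, item2, S(mask)];            % S(mask) uses the same ordering
```

The heavy lifting is a single matrix product, which MATLAB hands to its multithreaded linear algebra routines, so no explicit loop over pairs is needed.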

Related

Generate subset of data with known mean

I have a dataset of n observations (nx1 vector) and would like to create a subset of this data, whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, i.e. where the mean of the data subset is within a range about the known mean).
Can someone please help me with the code to do this in MATLAB?
Note, I don't want to use the rand function to create random data as I already have my data collected.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5-4.5 for example):
Then the answer might be datasubset=[7;3;2;4] but the answer could also be datasubset=[6;4;2;4] or datasubset=[6;4;3;4].
It doesn't matter if there are several possible solutions, I just need one of them, though I'd like to know the alternative solutions also.
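A minimal sketch of one way to approach this, using the example data above: draw random index sets with randperm (which only selects from the existing observations, it does not create data) and keep the first subset whose mean falls in the range. The variable names, the retry limit, and the brute-force listing of alternatives via nchoosek are assumptions for illustration; the exhaustive listing is only practical for small datasets.

```matlab
data = [8;7;4;6;9;6;4;7;3;2;1;1];
k = 4;                         % subset size
lo = 3.5;  hi = 4.5;           % accepted range for the subset mean
maxTries = 10000;              % hypothetical safety limit for the random search

datasubset = [];
for t = 1:maxTries
    idx = randperm(numel(data), k);       % k distinct positions, chosen at random
    candidate = data(idx);
    mu = mean(candidate);
    if mu >= lo && mu <= hi
        datasubset = candidate;           % first acceptable subset found
        break
    end
end

% All alternative solutions (feasible here: nchoosek(12,4) = 495 combinations)
combos = nchoosek(1:numel(data), k);      % each row is one set of positions
means = mean(data(combos), 2);            % data(combos) has the shape of combos
allSolutions = combos(means >= lo & means <= hi, :);
```

Each row of allSolutions holds the positions of one valid subset, so data(allSolutions(1,:)) returns its values.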

Sparse table in MATLAB, is it possible?

I am dealing with a matrix in MATLAB which is sparse and has many rows and columns. In this case, the rows and columns of the matrix correspond to ids for particular items. Let's call them id1 and id2.
It would be nice if the ids for rows and columns could be embedded so that I can access them easily, without creating extra variables to hold the two ids.
The answer would probably be to use a table data type. Tables are an ideal answer for my need; however, I was wondering whether I could create a table data type for a sparse matrix.
A [m*n] sparse matrix %% m & n are huge
id1 [1*m] , id2 [1*n] %% two vectors containing numeric ids for rows and column
Could we obtain?
T [m*n] sparse table matrix
Thanks for sharing your view with me.
I will address the question and the comments in order to clear some confusion.
The short answer
There is no sparse table class in Matlab. Cannot do. Use sparse() matrices.
The long answer
There is a reason why sparse tables make little sense:
Philosophically speaking, the advantage of having nice row and column labels, is completely lost if you are working with a big panel of data and/or if the data is sparse.
Scrolling through 246829 rows and 33336 columns? That can only be useful at very isolated times, if you are debugging your code and a specific outlier is causing your results to go off. Also, all you might see is just a sea of zeros.
Technically, a table can have more than one column for the same variable, i.e. table(rand(10,2), rand(10,1)) is a valid table. How would you define sparsity on such a table?
Fine, suppose you are working with a matrix-like table, i.e. one element per table cell and same numeric class. Still, none of the algebraic operators are defined on a table(). So you need to extract the content first, in order to be able to perform any operation that spans more than a single column of data. Just to be clear, once the data is extracted, then you have e.g. your double (full) matrix or in an ideal case a double sparse matrix.
Now, a few misconceptions to clear:
Fewer variables implies clearer/cleaner code. Not true. You are probably thinking about the extreme case (in bad practices) of how do I make a series of variables a1, a2, a3, etc.
There is a sweet spot between verbosity and number of variables, amount of comments, and code clarity/maintainability. Only with time and experience do you find the right balance.
Control over data cannot go without visual inspection. This approach does NOT scale with big data, and the sooner you abandon it, the sooner your code will become more reliable. You need to verify your results systematically, rather than relying on visual inspection. The chance of failing to (visually) spot a problem in the data grows exponentially with its dimension, faster than with systematic tests.
Some background info on my work:
I work with high-frequency prices, that's terabytes of data. I also extended the table() class with additional methods and fixes to help me with my work (see https://github.com/okomarov/tableutils), but I do not see how sparsity is a useful feature to add to table().
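As a small illustration of the short answer (use sparse() matrices), the sketch below keeps the two id vectors next to the sparse matrix and looks an entry up by its ids. The id values and matrix contents are made up for the example.

```matlab
% Hypothetical ids and values, for illustration only
id1 = [101 205 309];                      % row ids,    1-by-m
id2 = [11 22 33 44];                      % column ids, 1-by-n
A = sparse([1 3], [2 4], [5 7], numel(id1), numel(id2));

% Translate ids into positions, then index the sparse matrix as usual
[~, r] = ismember(309, id1);              % position of row id 309
[~, c] = ismember(44, id2);               % position of column id 44
value = full(A(r, c));                    % stored entry for (309, 44)
```

If the three variables should travel together, they can be placed in the fields of a single struct without giving up the sparse storage.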

Big numbers and long loops in matlab?

How can I store a matrix with 2^100 rows in MATLAB? It is my search space and I really need to do it.
In your opinion, is it possible? If yes, please tell me how I can do it.
2^100 is about 10^30, which is much too large for you to fit in memory, so you won't be able to store this matrix.
A couple of alternatives that you might want to think about -
Are many of the entries in the matrix zero? If so, you could consider using a sparse matrix which is much more memory efficient.
Do you need to be able to access the rows in an arbitrary order, or sequentially? If sequentially, you can generate the rows on an as-needed basis (perhaps in blocks of ten thousand at a time)
Do you need to look at all the rows at all? If not, perhaps you can define a function which generates the entries on the fly, as they are requested.
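The last two suggestions can be sketched as follows, under the assumption (not stated in the question) that each row of the search space is the binary expansion of its row index. The example uses 20 bits, because a double can only index rows exactly up to 2^53; a real 100-bit space would need a different index representation. The names rowsAt and blockSize are hypothetical.

```matlab
% Assumption: row k (0-based) of the search space is the binary expansion of k.
nBits = 20;                               % 100 bits would not fit a double index
weights = 2.^(nBits-1:-1:0);              % place values, most significant first

% Generate the rows for any column vector of indices on demand
rowsAt = @(k) rem(floor(bsxfun(@rdivide, k(:), weights)), 2);

% Walk through the space sequentially in blocks instead of storing it
blockSize = 10000;
nBlocks = 3;                              % only the first few blocks, as a demo
for b = 0:nBlocks-1
    k = (b*blockSize : (b+1)*blockSize - 1)';
    block = rowsAt(k);                    % blockSize-by-nBits matrix of 0s and 1s
    % ... evaluate the rows in this block here ...
end
```

Only one block is ever held in memory, so the memory cost depends on blockSize rather than on the size of the search space.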

Efficient Access of elements in Matrix in Matlab

I have an m x n matrix of integers, where m and n are fairly big (m and n ~1000). I want to iterate through all of the cells and perform some operations, like accessing a particular cell and assigning a value to a particular cell.
However, at least in my implementation, this is rather inefficient, as I have two for loops with Matrix(a,b) = Matrix(a,b+1) or something along these lines. Is there any other way to do this? My current implementation takes a long time to traverse about 100,000 cells and perform some operations.
Thank you
In matlab, it's almost always possible to avoid loops.
If you want to do Matrix(a,b)=Matrix(a,b+1), you should just do Matrix2=Matrix(:,2:end);
If you are more precise about what you do inside the loop, I can help you more.
MATLAB uses column-major ordering of matrices in memory (unlike C). Are you sure you are iterating over the indices in the correct order? If not, try switching them and see if performance improves.
If you can't get rid of the for loops, one possibility would be to rewrite the expensive operations in C and create a MEX file as described here.
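To make the first two suggestions concrete, here is a small sketch that assumes the loop body really is Matrix(a,b) = Matrix(a,b+1). Unlike Matrix(:,2:end), this variant keeps the original size and leaves the last column unchanged. The loop version is included only to show the index order that matches column-major storage.

```matlab
M = magic(4);                             % small stand-in for the ~1000x1000 matrix

% Vectorized: shift every column one step to the left in a single assignment
shifted = M;
shifted(:, 1:end-1) = M(:, 2:end);

% Equivalent loops, with the row index innermost so that consecutive
% iterations touch neighbouring memory locations (column-major order)
looped = M;
for b = 1:size(M, 2) - 1
    for a = 1:size(M, 1)
        looped(a, b) = looped(a, b + 1);
    end
end
```

Both variants produce the same matrix; the vectorized form avoids the per-element loop overhead entirely.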

Merge sensor data for clustering/neural net usage

I have several datasets, i.e. matrices that have 2 columns: one with a MATLAB date number and a second one with a double value. Here is an example set from one of them:
>> S20_EavesN0x2DEAir(1:20,:)
ans =
1.0e+05 *
7.345016409722222 0.000189375000000
7.345016618055555 0.000181875000000
7.345016833333333 0.000177500000000
7.345017041666667 0.000172500000000
7.345017256944445 0.000168750000000
7.345017465277778 0.000166875000000
7.345017680555555 0.000164375000000
7.345017888888889 0.000162500000000
7.345018104166667 0.000161250000000
7.345018312500001 0.000160625000000
7.345018527777778 0.000158750000000
7.345018736111110 0.000160000000000
7.345018951388888 0.000159375000000
7.345019159722222 0.000159375000000
7.345019375000000 0.000160625000000
7.345019583333333 0.000161875000000
7.345019798611111 0.000162500000000
7.345020006944444 0.000161875000000
7.345020222222222 0.000160625000000
7.345020430555556 0.000160000000000
Now that I have those different sensor values, I need to combine them into one matrix so that I can perform clustering, train a neural net and so on. The only problem is that the sensor data was taken with slightly different timings or timestamps, and there is nothing I can do about that from a data collection point of view.
My first thought was interpolation to make one sensor data set fit another one, but that seems like a messy approach, and I was thinking maybe I am missing something: a toolbox or function that would enable me to do this more quickly without fiddling around. To complicate things even more, the number of sensors grew over time, so I am looking at different start dates as well.
Does someone have a good idea on how to go about this? Thanks
I think your first thought about interpolation was the correct one, at least if you plan to use NNs. Another option would be to use approaches which are designed to deal with missing data, like http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory for example.
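The interpolation idea can be sketched with interp1 by resampling every sensor onto one shared time grid. The timestamps and values below are made up, and restricting the grid to the overlapping time range (to avoid NaN from extrapolation) is one possible choice, not the only one.

```matlab
% Made-up readings from two sensors with slightly different timestamps
t1 = [0; 1.0; 2.1; 3.0];   v1 = [10; 11; 12; 13];
t2 = [0.5; 1.4; 2.6; 3.4]; v2 = [5; 6; 7; 8];

% Common grid limited to the time range covered by both sensors
tStart = max(t1(1), t2(1));
tEnd   = min(t1(end), t2(end));
tGrid  = linspace(tStart, tEnd, 6)';      % 6 evenly spaced sample times

% Linear interpolation of each sensor onto the grid; one column per sensor
X = [interp1(t1, v1, tGrid), interp1(t2, v2, tGrid)];
```

Each row of X then describes all sensors at the same instant, which is the layout clustering and neural network routines usually expect.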
It's hard to give an answer for the clustering part, because I have no idea what you're looking for in the data.
For the neural network, besides interpolating, there are at least two other methods that come to mind:
training separate networks for each matrix
feeding them all together to the same network, with a flag specifying which matrix the data is coming from, i.e. something like: input (timestamp, flag_m1, flag_m2, ..., flag_mN) => target (value) where the flag_m* columns are mutually exclusive boolean values - i.e. flag_mK is 1 iff the line comes from matrix K, 0 otherwise.
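The second option could be assembled roughly as below. The two small matrices are placeholders with a timestamp in column 1 and a value in column 2, and the variable names are hypothetical.

```matlab
% Placeholder sensor matrices: column 1 = timestamp, column 2 = value
S1 = [1.0 0.5; 2.0 0.6];
S2 = [1.5 0.9; 2.5 1.0; 3.5 1.1];
sets = {S1, S2};
N = numel(sets);

inputs = [];                              % rows: [timestamp, flag_m1, ..., flag_mN]
targets = [];                             % rows: value
for k = 1:N
    nRows = size(sets{k}, 1);
    flags = zeros(nRows, N);
    flags(:, k) = 1;                      % flag_mK is 1 only for rows of matrix K
    inputs  = [inputs;  sets{k}(:, 1), flags];
    targets = [targets; sets{k}(:, 2)];
end
```

No timestamps need to be aligned in this layout, since every original reading becomes one training sample.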
These are the only things I can safely say with the amount of information you provided.