Index storing in MATLAB

I've got this code:
minimum_value_1000 = chl_september_ind;
minimum_value_logic = (chl_september_ind < 0.001);
minimum_value_1000(minimum_value_logic) = 0;
There are several rows and columns in it, and what I really need is to know, for every row, in which column the values start to be 0, and to store that column number for each row. I know it is a pretty simple operation, but I'm getting frustrated and have tried several ways to do it.
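One possible sketch (not from the original thread, and assuming the zeros only come from the thresholding above): build a logical matrix of zero entries and use the second output of max along the rows, which returns the index of the first occurrence in each row.
is_zero = (minimum_value_1000 == 0);              % true wherever the value is 0
[has_zero, first_zero_col] = max(is_zero, [], 2); % index of the first zero in each row
first_zero_col(~has_zero) = NaN;                  % rows with no zeros at all get NaN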

MATLAB spending an incredible amount of time writing a relatively small matrix

I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to run what should be achievable in at most a few seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on input, but in all usage is smaller than 1000x1000.
The code is as follows:
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there are better ways to do this, but since I just bashed the code together to quickly format some data, I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. The new code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. It's a little disconcerting, and it would be good to be able to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D is an n x 2 array. However, your call of s = max(D,1) actually generates another n x 2 array. Consulting the documentation for max, this is what happens when you call max the way you did:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value 1, so what you actually get is just a copy of D in the end. Using this as input to zeros has rather undefined behaviour. What will actually happen is that for each row of s, it will allocate a temporary zeros matrix of that size and toss the temporary result. Only the dimensions given by the last row of s are recorded. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Each parameter to zeros must be a scalar, yet your call to produce s produces a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and finding the overall maximum. If you do this, your code should run faster.
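As a quick illustration (with a small made-up D), you can see the difference between the two calls at the command line:
D = [2 5; 7 0; 4 3];
max(D, 1)     % elementwise comparison with 1: returns another 3 x 2 matrix, [2 5; 7 1; 4 3]
max(D(:))     % overall maximum: returns the scalar 7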
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that doing zeros(n,n) is in fact slow and there are several neat tricks to initializing an array of zeros. One way is to accomplish this by empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment, that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of data. You can replace that for loop with a more efficient accumarray call. This also avoids explicitly allocating an array of zeros; accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case takes all pairs of row and column coordinates, stored in D(i,1) and D(i,2) + 1 for i = 1, 2, ..., size(D,1), and places every pair that shares the same row and column coordinates into the same 2D bin. It then adds up the occurrences, so the output at each 2D bin is the total tally of values that mapped to that row and column coordinate.
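As a small sanity check (with made-up data, not from the question), the accumarray call and the original loop produce the same result:
D = [1 0; 1 0; 2 3; 1 2];                        % hypothetical input
dataLoop = zeros(max(D(:,1)), max(D(:,2)) + 1);
for i = 1:size(D,1)
    dataLoop(D(i,1), D(i,2)+1) = dataLoop(D(i,1), D(i,2)+1) + 1;
end
dataAccum = accumarray([D(:,1) D(:,2)+1], 1);
isequal(dataLoop, dataAccum)                     % returns true (logical 1)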

Comparing columns of matrix and giving boolean output

I have checked other questions but didn't find my answer. I have a matrix of size n x 2. I want to compare the 1st and 2nd columns and, based on which is greater, assign 0/1 to the respective index. Suppose I have
a = 1 2
    4 3
    7 8
I want the output like this
out = 0 1
      1 0
      0 1
I did this :
o1 = a(:,1) > a (:,2)
o2 = not(o1)
out = [o1, o2]
This does the job, but I am sure there's a better way to do this. Need suggestions on that.
Forgot to mention, the datatype is float in the matrix.
A more generic solution that can handle matrices with more than two columns:
out = bsxfun(@eq, a, max(a,[],2));
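For instance, with a hypothetical 3-column matrix, every row maximum gets a 1 and everything else a 0:
a = [1 2 0; 4 3 9; 7 8 2];
out = bsxfun(@eq, a, max(a, [], 2))
% out =
%      0     1     0
%      0     0     1
%      0     1     0
On newer releases (R2016b and later), implicit expansion lets you write out = a == max(a,[],2) directly, without bsxfun.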
What you did is good. The number of lines doesn't really matter; what matters is the complexity of the operation in each line. Following the comments, I think you could gain some time as well by avoiding copies and multiple allocations:
out = false(size(a)); out(:,1) = (a(:,1) > a(:,2)); out(:,2) = ~out(:,1);
It is good practice to preallocate in Matlab, and in general to avoid copies in any programming language.
Optimizing the runtime of this further by using different operations is pointless IMO. If you really need speed, you could MEX it to spare one iteration through the rows (the second assignment); it's literally a dozen lines of C, although you'd have to be careful about how you write the loop (the naive way would cause a cache miss at each iteration).

recording 'bursts' of samples at 300 samples per sec

I am recording voltage changes over a small circuit; this records mouse feeding. When the mouse is eating, the circuit voltage changes. I convert that into ones and zeroes, and all is well.
BUT- I want to calculate the number and duration of 'bursts' of feeding- that is, instances of circuit closing that occur within 250 ms (75 samples) of one another. If the gap between closings is larger than 250ms I want to count it as a new 'burst'
I guess I am looking for help in asking MATLAB to compare the sample number of each 1 in the digital file with the sample number of the next 1 down. If the difference is more than 75, call the first 1 the end of one bout and the second one the start of another bout, classifying the difference as a gap. If it is NOT, keep the sample number of the first 1 and compare it against the next, and the next, and the next, until there is a 75-sample difference.
I can compare each 1 to the next 1 down:
n=1; m=2;
for i = 1:length(bouts4)-1
    if bouts4(i+1) - bouts4(i) >= 75 % 250 msec gap at a sample rate of 300
        boutend4(n) = bouts4(i);
        boutstart4(m) = bouts4(i+1);
        m = m+1;
        n = n+1;
    end
end
I don't really want to iterate through i for both variables though...
any ideas??
-DB
You can try the following code
time_diff = diff(bouts4);
new_feeding = time_diff > 75;
boutend4 = bouts4(new_feeding);
boutstart4 = [0; bouts4(find(new_feeding) + 1)];
That's actually not too bad. We can actually make this completely vectorized. First, let's start with two signals:
A version of your voltages untouched
A version of your voltages that is shifted in time by 1 step (i.e. it starts at time index = 2).
Now the basic algorithm is really:
Go through each element and see if the difference is above a threshold (in your case 75).
Enumerate the locations of each one in separate arrays
Now onto the code!
%// Make those signals
bouts4a = bouts4(1:end-1);
bouts4b = bouts4(2:end);
%// Ensure column vectors - you'll see why soon
bouts4a = bouts4a(:);
bouts4b = bouts4b(:);
%// Step #1
loc = find(bouts4b - bouts4a >= 75);
%// Step #2
boutend4 = [bouts4(loc); 0];
boutstart4 = [0; bouts4(loc + 1)];
Aside:
Thanks to tail.b.lo, you can also use diff. It performs the same difference operation as the two shifted copies of the vector I made above; diff works the same way. However, I decided not to use it so you can see exactly how the code you wrote translates into a vectorized form. Only way to learn, right?
Back to it!
Let's step through this slowly. The first two lines of code make those signals I was talking about: the original one (up to length(bouts4) - 1) and another one of the same length but shifted over by one time index. Next, we use find to locate the places where the difference was >= 75. After that, we use these locations to access the bouts4 array. The ending array accesses the original array at those locations, while the starting array accesses the same locations but moved over by one time index.
The reason why we need to make these two signals column vectors is the way I am appending information to the starting vector. I am not sure whether your data comes in rows or columns, so to make this completely independent of orientation, I'm going to make sure that your data is in columns. This is because if I append a 0 to a row vector I have to use a space to denote going to the next column, whereas for a column vector I have to use a semicolon to go to the next row. To completely avoid checking whether it's a row or column vector, I make sure that it's a column vector no matter what.
Looking at your code, m = 2. This means that when you start writing into this array, the first location is 0. As such, I've artificially placed a 0 at the beginning of this array and followed that up with the rest of the values.
Hope this helps!
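A tiny made-up example (sample indices invented for illustration) shows what the vectorized version returns; the only gap larger than 75 samples is between 160 and 400:
bouts4 = [100; 130; 160; 400; 430];
bouts4a = bouts4(1:end-1);
bouts4b = bouts4(2:end);
loc = find(bouts4b - bouts4a >= 75);   % loc == 3
boutend4 = [bouts4(loc); 0]            % [160; 0]
boutstart4 = [0; bouts4(loc + 1)]      % [0; 400]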

strcmp files - Very large file size output

I'm reading in a csv file that is about 80MB - data_O3. It's about 250,000 x 5 in size. I created E, which is a little bit larger because it has all the days (data_O3 is missing some days). I want to compare the two so that if the date (saved in variable d3) and siteID (d4) are the same, the data point (column 5) is placed in E.
for j = 1:size(data_O3,1)
E(strcmp(d3,data_O3{j,3})&d4 == data_O3{j,4},5) = data_O3(j,5);
end
This script works fine, but for some reason, running it takes longer than expected. I've run the same code for other data that were only slightly smaller with no problem. Is this an issue with the strcmp code or something else?
The script and files used can be found here: https://www.dropbox.com/sh/7bzq3m1ixfeuhu6/i4oOvxHPkn
There are certainly a number of ways to speed this up significantly.
First of all, read all numeric data in as numbers. MATLAB is not optimized to work with strings, and even cells should generally be avoided as much as possible. If you want to keep everything as strings, use another language (Python or Perl).
Once you have the state, county and site read in as numbers, then create a number instead of a string for the siteID. One approach would be to use the formula:
siteID = siteNum + 1e4*countyCode + 1e7*stateCode
That would generate unique siteIDs for all sites.
Use datenum to convert the date field into a number.
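A rough sketch of those two steps (variable names here are made up, assuming you have already parsed the state code, county code, site number and date strings into their own vectors):
siteID = siteNum + 1e4*countyCode + 1e7*stateCode;   % one unique number per site
dateNum = datenum(dateStrings);                      % serial date numbers; pass a format string if parsing fails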
You are now in a position where the data_O3 defined on line 79 can be a purely numeric array (no cells!), as can your E matrix. That alone will make the process many times faster.
You also might want to define E as something other than NaN. Maybe give it values of -1.
There may be more optimizations you can do in the comparison, but do the above first and I expect you will see a huge improvement.
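As one possible example of such an optimization (a sketch only, with assumed column positions: date number in column 1 and siteID in column 2 of both arrays), the whole loop can collapse into a single rows-based ismember lookup once everything is numeric:
keyE  = [E(:,1) E(:,2)];                 % [dateNum siteID] for every row of E
keyO3 = [data_O3(:,1) data_O3(:,2)];     % the same key for every row of data_O3
[found, loc] = ismember(keyE, keyO3, 'rows');
E(found, 5) = data_O3(loc(found), 5);    % copy the data points for matching rows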

MATLAB decimal places in array

Consider the following example:
Bathymetry = [0,4134066;
3,3817906;
6,3343666;
9,2978725;
12,2742092;
14,2584337;
16,2415355;
18,2228054;
20,2040753;
23,1761373;
26,1514085];
Depth = [0;1;2;3;5;8;10;11.6;15];
newDepth = min(Bathymetry(:,1)):0.1:max(Bathymetry(:,1));
From this I want to find which column of 'newDepth' corresponds to 'Depth'. For example:
dd = find(newDepth==Depth(1))
dd =
1
Showing that Depth == 0 is located in the first column of newDepth. When I apply this to all of the entries of 'Depth':
for i = 1:length(Depth);
dd(i) = find(newDepth == Depth(i));
end
I receive an error:
Improper assignment with rectangular empty matrix.
Initially I couldn't understand why, but by looking at the array for newDepth, especially column 117 where newDepth == 11.6, I noticed that the value isn't equal to 11.6 but to 11.600000000000001, and thus different from Depth(8). How can I fix this? And why does MATLAB not just write the value as 11.6? Nowhere have I specified to include the .000000000000001.
This is because there is no exact representation of 0.1 in binary. Read the Wikipedia article on floating-point arithmetic for more background. In binary, representing 0.1 is something like trying to write out all the decimals of one-third:
1/3 == 0.333333333333333333...
it will never be exact, no matter how many 3's you add.
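You can see the effect directly in the command window; for example:
fprintf('%.20f\n', 0.1)    % prints 0.10000000000000000555..., not exactly 0.1
0.1 + 0.2 == 0.3           % returns logical 0 (false), since neither side is stored exactly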
For this reason (and many others), I'd suggest you do not use == (which is a very stringent demand), but rather use
for ii = 1:length(Depth);
[~,dd(ii)] = min( abs(newDepth-Depth(ii)) );
end
This problem is to do with floating-point arithmetic, which is quite complicated. I recommend you Google it and read a bit; there is plenty out there explaining it. Here is a good start: http://blogs.mathworks.com/loren/2006/08/23/a-glimpse-into-floating-point-accuracy/
To solve it in your case, I would suggest rounding:
newDepth = round(newDepth * 10) / 10
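Using the values from the question, the rounded vector then compares equal to the literal 11.6:
newDepth = 0:0.1:26;
newDepth(117)                    % 11.600000000000001
newDepth = round(newDepth * 10) / 10;
newDepth(117) == 11.6            % now returns true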
The 11.600000000000001 is because the number 11.6 is not exactly representable in binary floating point notation. This is to do with the way the hardware works rather than any limitation of Matlab.
You want to change your comparison to something like
dd(i) = find(abs(newDepth - Depth(i))<.0000001);