MATLAB - autocorrelation of a huge amount of values

I have a txt file containing a huge amount of values (~4 million values, one per line) and I would like to use the MATLAB function autocorr to calculate the autocorrelation of this series of values.
My problem is that MATLAB does not let me create a vector with as many elements as I need; instead the vector size seems to be limited to something like ~25000 elements (on a 64-bit OS).
What would be a clever way to proceed? Many thanks in advance!

The simplest approach is probably breaking the file data into chunks, computing the autocorrelation for each chunk, and then aggregating the results.
Expanding a little on this, you could also use moving windows instead of discrete chunks (for example: observations 1-30, then 2-31, then 3-32, ...).
But let's stick with the first approach for the moment. Here is a function that lets you specify a chunk length and then read the file in chunks of that size:
function res = readFileChunks(file,chunkSize)
    % Read a text file of numeric values (one per line) in chunks of at
    % most chunkSize elements. Returns a cell array of column vectors.
    fid = fopen(file,'r');
    if (fid < 0)
        error('Cannot open file "%s".',file);
    end
    res = {};
    while (~feof(fid))
        res{end+1} = fscanf(fid,'%f',[1 chunkSize])';
    end
    fclose(fid);
end
Example:
res = readFileChunks('data.txt',2500);
Now all you have to do is sanitize the result (removing, for example, empty cells caused by empty file lines) and compute the autocorrelation in a loop, once for each vector.
Since MATLAB loops are quite expensive, you could also calculate the autocorrelation values directly in the loop that reads the file chunks. This way you receive the final result directly.
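As a rough illustration (not part of the original answer), here is a minimal sketch of that idea, assuming the Econometrics Toolbox function autocorr is available and that averaging the per-chunk autocorrelations is an acceptable aggregation; the file name, chunk size, and number of lags are placeholders:
chunkSize = 25000;            % placeholder chunk length
numLags   = 20;               % placeholder number of lags to estimate
fid = fopen('data.txt','r');
acSum   = zeros(numLags+1,1); % running sum of per-chunk autocorrelations
nChunks = 0;
while (~feof(fid))
    chunk = fscanf(fid,'%f',[1 chunkSize])';
    if numel(chunk) > numLags              % skip chunks too short to estimate
        acSum   = acSum + autocorr(chunk,numLags);
        nChunks = nChunks + 1;
    end
end
fclose(fid);
acMean = acSum / nChunks;     % averaged autocorrelation across all chunks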

Related

Read large number of .h5 datasets

I'm working with h5 files that have tens of thousands of datasets, each containing a vector of numerical values, all of the same size. My goal is to read the datasets and create one large matrix from these vectors. The datasets are named from "0" to "xxxxx" (some large number). I was able to read them and get the matrix, but it takes forever to do so. I was wondering if you could take a look at my code and suggest a way to make it run faster.
Here is how I do it right now:
t = [];
for i = 0:40400              % there are 40401 datasets in this particular file
    j = int2str(i);
    p = '/mesh/';            % the parent group
    s = strcat(p,j);         % full path of a dataset, e.g. '/mesh/0'
    r = h5read('temp.h5',s); % the file name is temp and s has the dataset path
    t = [t;r];
end
In this particular case there are 40401 datasets, each holding an 80802x1 vector of numerical values, so eventually I want to create an 80802x40401 matrix. This code takes over a day to finish. I think one reason it is slow is that on every iteration MATLAB accesses the h5 file. I would appreciate it if some of you have tips for speeding up the code.
When I copied your code into an editor, I got the red underline under t with the warning:
The variable t appears to change size on every loop iteration. Consider preallocating for speed.
You should allocate the final memory of t before starting the loop, with the function zeros:
t = zeros(80802,40401);
You should also read this: Programming Patterns: Maximizing Code Performance by Optimizing Memory Access:
Preallocate arrays before accessing them within loops
Store and access data in columns
Avoid creating unnecessary variables
p = '/mesh/'; does not change, so it is pointless inside the loop and can be moved outside it. It could be even better to drop p altogether and directly do s = strcat('/mesh/',j); A minimal sketch combining these suggestions follows.
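Putting the suggestions together might look like the following. Note that with a preallocated t the concatenation t = [t;r] has to become an indexed assignment into a column; the file and group names are taken from the question:
t = zeros(80802,40401);                 % preallocate the final matrix
for i = 0:40400
    s = strcat('/mesh/', int2str(i));   % dataset path, e.g. '/mesh/0'
    t(:,i+1) = h5read('temp.h5', s);    % each dataset goes into its own column
end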

Matlab preallocating arrays when final array-size is unknown [duplicate]

I am trying to speed up a script that I have written in MATLAB that dynamically allocates memory to a matrix (basically it reads a line of data from a file and writes it into a matrix, then reads another line and allocates more memory for a larger matrix to store the next line). The reason I did this instead of preallocating memory using zeros() or something similar is that I don't know the exact size the matrix needs to be to hold all of the data. I also don't know the maximum size of the matrix, so I can't just preallocate a maximum size and then get rid of the memory that I didn't use. This was fine for small amounts of data, but now I need to scale my script up to read many millions of data points and this implementation of dynamic allocation is just much too slow.
So here is my attempt to speed up the script: I tried to allocate memory in large blocks using the zeros function, then once a block is filled I allocate another large block. Here is some sample code:
data = [];
count = 0;
for ii = 1:num_filelines
    if mod(count, 1000) == 0
        data = [data; zeros(1000)]; % after 1000 lines are read, allocate another 1000 lines
    end
    data(ii, :) = line_read(file);  % line_read reads a line of data from 'file'
end
Unfortunately this doesn't work. When I run it I get an error saying "Error using vertcat: Dimensions of matrices being concatenated are not consistent."
So here is my question: Is this method of allocating memory in large blocks actually any faster than incremental dynamic allocation, and also why does the above code not run? Thanks for the help.
If you know the number of lines and can guess a large enough number of columns, what I recommend is using a sparse matrix.
% create a sparse matrix
mat = sparse(numRows,numCols)
A sparse matrix does not store all of the zero elements; it only stores the indices and values of the non-zero entries. This can save a lot of space, and sparse matrices are used and accessed the same way as any other matrix. That is only if you really need the data in matrix format from the beginning.
If not, you can just do everything as a cell. Preallocate a cell array with as many elements as there are lines in your file.
data = cell(1,numLines);
for i = 1:numLines
    % get the data from line i of the file
    data{i} = lineData; % lineData is a placeholder for whatever your line-reading code returns
end
data = cell2mat(data);
This method will put everything into a cell array, which can grow "dynamically" and then be converted to a regular matrix.
Addition
If you are using the sparse matrix method, the matrix will likely end up larger than necessary, but once you are done you can trim it down easily and then cast it to a regular matrix.
[val,~] = max(sum(mat ~= 0,2)); % widest row = number of columns actually used
mat(:,val+1:size(mat,2)) = [];  % drop the unused columns after it (note val+1, so the last used column is kept)
mat = full(mat); % use this only if you really need the full matrix
This will remove the unnecessary columns and then cast the result to a full matrix that includes the zero elements. I would not recommend casting it to a full matrix, as this requires far more space, but if you truly need it, use it.
UPDATE
To get the number of lines in a file easily, use MATLAB's Perl interpreter.
Create a file called countlines.pl and paste in the two lines below:
while (<>) {};
print $.,"\n";
Then you can run this script on your file as follows
numLines = str2double(perl('countlines.pl','data.csv'));
Problem solved.
From MATLAB forum thread here
Remember it is always best to preallocate everything beforehand, because with Shai's method you are technically reallocating large amounts of memory many times, especially if it is a large file.
To solve your error, simply use this syntax when allocating
data = [data; zeros(1000, size(data,2))];
You might want to read the first line outside the loop so you'll know the number of columns and make the first allocation for data.
If you want to stick to your code as written, I would substitute your initialization of data, data = [], with
data = zeros(1,1000);
Keep in mind, though, the warning from @MZimmerman6: zeros(1000) generates a 1000 x 1000 array. You may want to change all of your zeros statements to zeros( ... , Nc), where Nc is the number of values per line.
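For completeness, here is a minimal sketch of the block-growing idea with consistent dimensions, assuming the number of columns can be learned by reading the first line; line_read comes from the question and more_lines_available is a hypothetical placeholder for however you detect the end of the file:
blockSize = 1000;
firstLine = line_read(file);               % read one line to learn the column count
Nc = numel(firstLine);
data = zeros(blockSize, Nc);               % first block
data(1,:) = firstLine;
count = 1;
while more_lines_available(file)           % placeholder end-of-file test
    count = count + 1;
    if count > size(data,1)
        data = [data; zeros(blockSize, Nc)]; % grow by one block at a time
    end
    data(count,:) = line_read(file);
end
data = data(1:count,:);                    % trim the unused preallocated rows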

Quickest way to search txt/bin/etc file for numeric data greater than specified value

I have a 37,000,000x1 double array saved in a MAT-file under a structure labelled r. I can point to this file using matfile(...) and then just use the find(...) command to find all values above a threshold val.
This finds all the values greater than or equal to 0.004, but given the size of my data this takes some time.
I want to reduce the time, and have considered using bin files (apparently they are better than txt files in terms of not losing precision?) etc., however I'm not knowledgeable about the syntax/method.
I've managed to save the data into the bin file, but what is the quickest way to search through this large file?
The only output data I want are the actual values greater than my specified value.
Is using a bin file the best? Or a MAT-file? Etc.
I don't want to load the entire file into MATLAB. I want to conserve MATLAB memory, as other programs may need the space and I don't want memory errors again.
As @OlegKomarov points out, a 37,000,000-element array of doubles is not very big. Your real problem may be that you don't have enough RAM and/or are using a 32-bit version of MATLAB. The find function will require additional memory for the input and for the output array of indices.
If you want to load and process your data in chunks, you can use the matfile function. Here's a small example:
fname = [tempname '.mat'];                % use a temp directory file for the example
matObj = matfile(fname,'Writable',true);  % create MAT-file
matObj.r = rand(37e4,1e2);                % write random data to variable r in the file
szR = size(matObj,'r');                   % get dimensions of variable r in the file
idx = [];
for i = 1:szR(2)
    idx = [idx; find(matObj.r(:,i) > 0.999)]; % indices of r greater than 0.999 in this column
end
delete(fname);                            % delete example file
This will save you memory, but it is definitely not faster than storing everything in memory and calling find once. File access is always slower (though an SSD will help a bit). The code above uses dynamic memory allocation for the idx variable, but the memory is only re-allocated a few times in large chunks, which can be quite fast in current versions of MATLAB.
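If the growing idx still bothers you, one alternative (a sketch, not from the original answer) is to collect the per-column results in a preallocated cell array and concatenate once at the end:
idxCell = cell(1,szR(2));                     % one cell per column of r
for i = 1:szR(2)
    idxCell{i} = find(matObj.r(:,i) > 0.999); % column-local indices above the threshold
end
idx = vertcat(idxCell{:});                    % single concatenation at the end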

Splitting an audio file in Matlab

I'm trying to split an audio file into 30 millisecond disjoint intervals using Matlab. I have the following code at the moment:
clear all
close all
% load the audio file and get its sampling rate
[y, fs] = audioread('JFK_ES156.wav');
for m = 1 : 6000
    [t(m), fs] = audioread('JFK_ES156.wav', [(m*(0.03)*fs) ((m+1)*(0.03)*fs)]);
end
But the problem is that I get the following error:
In an assignment A(I) = B, the number of elements in B and I must be the same.
Error in splitting (line 12)
[t(m), fs] = audioread('JFK_ES156.wav', [(m*(0.03)*fs) ((m+1)*(0.03)*fs)]);
I don't see why there's a mismatch in the number of elements in B and I and how to solve this. How can I get past this error? Or is there just an easier way to split the audio file (maybe another function I don't know about or something)?
I think the easiest way to split the audio is to just load it and use the vec2mat function, so you would have something like this:
[X,Fs] = audioread('JFK_ES156.wav');
% Calculate how many samples you need to capture 30 ms of audio
matSize = Fs*0.03;
% Pay attention to the apostrophe: it makes sure samples are stored in
% columns rather than rows.
output = vec2mat(X,matSize)';
% Your audio is now split up into the different columns of the matrix,
% which you can access with normal column indexing.
% Plot the first 30 ms of audio:
plot(output(:,1));
% You can join the audio back together using this command:
output = output(:);
Hope that helps. Another good thing about this method is that it keeps all your data in one place!
Edit: One thing I thought of: you may get a problem with this depending on your vector size, but I think vec2mat actually zero-pads your vector. Not a big deal, but if you're moving back and forth between the two representations, it might be a good idea to keep another variable that stores the original length of your signal.
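For example, a small sketch (under the assumption that vec2mat does zero-pad) of keeping the original length so the padding can be removed again:
origLen = length(X);            % remember the original signal length
output  = vec2mat(X,matSize)';
rebuilt = output(:);            % rejoin the columns
rebuilt = rebuilt(1:origLen);   % drop the zero-padding added by vec2mat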
You should just use the variable y and reshape it to form your split audio. For example,
chunk_size = fs*0.03;                    % samples per 30 ms chunk
y_chunks = reshape(y, chunk_size, 6000); % requires numel(y) == chunk_size*6000
That will give you a matrix with each column a 30 ms chunk. This code will also be faster than reading small segments from file in a loop.
As hiandbaii suggested, you could also use a cell array. Make sure you clear your existing variables before doing that; not clearing the array t is probably the reason you got the error "Cell contents assignment to a non-cell array object."
Your original error occurs because you cannot assign a vector to a single, scalar-indexed element. That is, m is a scalar, but your audioread call returns a vector; this is what the error means by the mismatch in the sizes of I and B. You could also fix that by making t a 2-D array and using an assignment like
[t(m,:), fs] =
Each 30 ms segment is not equal to one sample, and that would be the only case where your code works as written, i.e. 0.03*fs would have to equal 1.
You could try using cells instead, i.e. replace t(m) with t{m}.
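For illustration, here is a minimal sketch of that cell-array variant, keeping the question's file name and using audioinfo to get the total number of samples. As the reshape answer above notes, though, re-reading the file inside a loop will be slower than splitting the already-loaded y:
info = audioinfo('JFK_ES156.wav');
fs = info.SampleRate;
chunkLen = round(0.03*fs);                    % samples per 30 ms chunk
numChunks = floor(info.TotalSamples/chunkLen);
t = cell(1,numChunks);
for m = 1:numChunks
    range = [(m-1)*chunkLen+1, m*chunkLen];   % 1-based sample range for this chunk
    t{m} = audioread('JFK_ES156.wav', range); % store each 30 ms chunk in its own cell
end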

Average of values from multiple matrices in Matlab

I have 50 matrices contained in one folder, all of dimension 181 x 360. How do I cycle through that folder and take the average of each corresponding data point across all 50 matrices?
If the matrices are contained within Matlab variables stored using save('filename','VariableName') then they can be opened using load('filename.mat').
As such, you can use the result of filesInDirectory = dir; to get a list of all your files, using a search pattern if appropriate, like files = dir('*.mat');
Next you can use your load command, and then whos to see which variables were loaded. You should consider storing these names so you can easily clear the variables after each iteration of your loop.
Once you have a matrix loaded (one at a time), you can take the averages you need, probably by summing into a running total across the loop iterations and then dividing by a total count you have been keeping (using, perhaps, count = count + size(MatrixVar, dimension);).
If you need all of the matrices loaded at once, then you can modify the above idea to load using a loop, then average outside of the loop. In this case you may need to take care about memory, but 50*181*360 isn't too bad, I suspect.
A brief introduction to the load command can be found at this link. It talks mainly about opening one matrix, then plotting the values, but there are some comments about dealing with headers, if needed, and different ways in which you can open data, if load is insufficient. It doesn't talk about binary files, though.
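For the MAT-file case, a minimal sketch under the assumption that each file stores a single 181x360 variable (the variable name is looked up with fieldnames rather than hard-coded, since the question does not say what it is):
files = dir('*.mat');                    % all MAT-files in the current folder
Total = zeros(181,360);                  % running sum of the matrices
for f = 1:numel(files)
    contents = load(files(f).name);      % load into a struct to avoid name clashes
    names = fieldnames(contents);
    Total = Total + contents.(names{1}); % assumes one 181x360 variable per file
end
Av = Total / numel(files);               % element-wise average over the 50 matrices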
Note on binary files, based on comment to OP's question:
If the file can be opened using
FID = fopen('filename.dat');
fread(FID, 'float');
then you can replace the steps referring to load above, and instead use a loop to find filenames using dir, open the matrices using fopen and fread, then average as needed, finally closing the files and clearing the matrices.
In this case, the file identifier is probably the only part you'll need to change during the loop (although your running total will increase, if that's how you want to average your data).
Reshaping the matrix, or inverting it, might make the code clearer (which is good!), but might not be necessary depending on what you're trying to average - it may be that selecting only a subsection of the matrix is sufficient.
Possible example code?
Assuming that all of the files in the current directory are to be opened, and that no files are elsewhere, you could try something like:
listOfFiles = dir('*.dat');
Total = 0;                           % running sum (initialize before the loop)
Counter = 0;                         % running count of values summed
for f = 1:size(listOfFiles,1)
    FID = fopen(listOfFiles(f).name);
    Data = fread(FID, 'float');
    fclose(FID);                     % close each file after reading it
    % Reshape if needed?
    Total = Total + sum(Data(start:end,:)); % this might vary, depending on what you want to average etc.
    Counter = Counter + (size(Data,1) * size(Data,2)); % this product is the 181*360 you had in the matrix, in this case
end
Av = Total/Counter;