Matlab read heterogenious binary data - matlab

I'd like to read heterogenious binary data into matlab. I do know from the beginning how much it is and in which datatype each segment is. For example:
%double %double %int32 ...
and then this get repeated about a million times. Easy enoug to handle with fread since the know the number of bites for each segment and can therefore calculate the skip value for each row.
But now the data segment looks something like this :
%double %int32%*char %double %double ...
Whereby the int prior to the *char is the length of the said string. This brings the problem that I cannot calculate the skip anymore and I'm stuck be reading in the whole file line by line therefore needing to make a lot more file access and slowing everthing down.
In order to get at least some speed up I wan't to read in all %double %double ... (Around 30 elements) at a time and then use those from a buffer to fill up the array's. In C this would be a rather easy task here, without memcpy and not so direct access to pointers...
Do you know any way to achive this, not using mex files?

You can't solve the problem that the record size is unknown, and thus you don't know how much to read ahead of time. But you can batch up the reads, and if you have a reasonable max size for the string, you can always read that amount, and ignore the unneeded bytes at the end. typecast is the trick:
readlen = 1024;
buf = fread(fid, readlen, '*uint8'); % the asterisk keeps the returned array as uint8
rec.val1 = typecast(buf(1:8), 'double');
string_len = typecast(buf(9:12), 'int32');
rec.str1 = typecast(buf(13:13+string_len-1), 'uint8');
pos = 13+string_len;
rec.val2 = typecast(buf(pos:pos+8-1), 'double');
You might wrap a simple function around this technique to track the current offset automatically.

Related

MATLAB - autocorrelation of a huge amount of values

I have a txt file containing a huge amount of values (~4 millions values, one for each line of the above mentioned txt file) and I would like to use the MATLAB function autocorr in order to calculate the autocorrelation of the above mentioned series of values.
My problem is that MATLAB does not allow me to create a vector that has as many elements as I need but instead the vector size is limited to something like ~25000 elements (on a 64bit OS).
What would be a clever way to proceed? Many thanks in advance!
The simplest approach is probably breaking file data into chunks, computing the autocorrelation for each chunk and then aggregate the results.
Expanding a little on this, you could also use moving windows instead of discrete chunks (for example: observations 1-30, then 2-31, then 3-32, ...).
But let's stick on the first approach for the moment. Here is a function that allows you to specify a block length and then read the file blocks:
function res = readFileChunks(file,chunkSize)
fid = fopen(file,'r');
if (fid < 0)
error('Cannot open file "%s".',file);
end
res = {};
while (~feof(fid))
res{end+1} = fscanf(fid,'%f',[1 chunkSize])';
end
fclose(fid);
end
Example:
res = readFileChunks('data.txt',2500);
Now, all you have to do is sanitize your result (clearing, for example, empty cells caused by empty file lines) and process your autocorrelation into a loop for each vector.
Since Matlab loops are quite expensive, you could also calculate your autocorrelation values directly into the loop that reads file chunks. This way you will directly receive the final result.

Efficient byte pattern search in matlab memory map

I have large binary files (2+ GB) that are arranged with a sync pattern (0xDEADBEEF) followed by a data block of a fixed size.
Example:
0xDE AD BE EF ... 96 bytes of data
0xDE AD BE EF ... 96 bytes of data
... repeat ...
I need to locate the offsets to the start of each packet. Ideally this would just be [1:packetSize:fileSize] However, there is other data that can be interspersed, headers etc. so I need to search the file for the sync pattern.
I am using the following code which is based on Loren from Mathworks findPattern2 but modified a little to use a memory map.
function pattLoc = findPattern(fileName, bytePattern)
%Mem Map file
m = memmapfile(fileName);
% Find candidate locations match the first element in the pattern.
pattLoc = find(m.data==bytePattern(1));
%Remove start values that are too close to the end to possibly match
len = numel(bytePattern);
endVals = pattLoc+len-1;
pattLoc(endVals>length(m.data)) = [];
% loop over elements of Sync Pattern to check possible location validity.
for pattval = 2:len
% check viable locations in array
locs = bytePattern(pattval) == m.data(pattLoc+pattval-1);
pattLoc(~locs) = []; % delete false ones from indices
end
This works pretty well. However, I think there might be room for improvement. First I know my patterns can't be closer than packetSize (100 in this example) but may be farther apart. Seems like I should be able to use this information somehow to speed up the search. Second the initial search on line 5 uses a find for numerical indexing instead of logical. This line takes almost twice as long as leaving it as a logical. However, I tried to rework this function using logical indexing only and failed miserably. The problem arises inside the loop and keeping track of the nested indexing with logicals without using more finds ... or checking more data than needs to be.
So any help speeding this up would be appreciated. Below is some code which will create a simple sample binary file to work with if necessary.
function genSampleFile(numPackets)
pattern = hex2dec({'DE','AD','BE','EF'});
fileName = 'testFile.bin';
fid = fopen(fileName,'w+');
for f = 1:numPackets
fwrite(fid,[pattern; uint8(rand(96,1)*255)],'uint8');
end
fclose(fid);
Searching a file with 10000000 packets took the following:
>> genSampleFile(10000000); %Warning Makes 950+MB file
>> tic;pattLoc = findPattern(fileName, pattern);toc
Elapsed time is 4.608321 seconds.
You can get an immediate boost by using findstr, or better yet strfind instead of find:
pattLoc = strfind(m.data, bytePattern)
This removes the need for any further looping. You just need to clean up a couple of things in the returned array of indices.
You want to remove things that are closer than 100 bytes to the end, not closer than 4 bytes to the end, so set len = 100, instead of len = length(bytePattern).
To filter out elements that are closer than 100 bytes from each other, use diff on the list of indices:
pattLoc[diff(pattLoc) < 100] = []
This should speed up your code by relying more on builtins, which are generally much more efficient than loops.

Matlab preallocating arrays when final array-size is unknown [duplicate]

I am trying to speed up a script that I have written in Matlab that dynamically allocates memory to a matrix (basicallly reads a line of data from a file and writes it into a matrix, then reads another line and allocates more memory for a larger matrix to store the next line). The reason I did this instead of preallocating memory using zeroes() or something was that I don't know the exact size the matrix needs to be to hold all of the data. I also don't know the maximum size of the matrix, so I can't just preallocate a max size and then get rid of memory that I didn't use. This was fine for small amounts of data, but now I need to scale my script up to read many millions of data points and this implementation of dynamic allocation is just much too slow.
So here is my attempt to speed up the script: I tried to allocate memory in large blocks using the zeroes function, then once the block is filled I allocate another large block. Here is some sample code:
data = [];
count = 0;
for ii = 1:num_filelines
if mod(count, 1000) == 0
data = [data; zeroes(1000)]; %after 1000 lines are read, allocate another 1000 line
end
data(ii, :) = line_read(file); %line_read reads a line of data from 'file'
end
Unfortunately this doesn't work, when I run it I get an error saying "Error using vertcat
Dimensions of matrices being concatenated are not consistent."
So here is my question: Is this method of allocating memory in large blocks actually any faster than incremental dynamic allocation, and also why does the above code not run? Thanks for the help.
What I recommend doing, if you know the number of lines and can just guess a large enough number of acceptable columns, use a sparse matrix.
% create a sparse matrix
mat = sparse(numRows,numCols)
A sparse matrix will not store all of the zero elements, it only stores pointers to indices that are non-zero. This can help save a lot of space. They are used and accessed the same as any other matrix. That is only if you really need it in a matrix format from the beginning.
If not, you can just do everything as a cell. Preallocate a cell array with as many elements as lines in your file.
data = cell(1,numLines);
% get matrix from line
for i = 1:numLines
% get matrix from line
data{i} = lineData;
end
data = cell2mat(data);
This method will put everything into a cell array, which can store "dynamically" and then be converted to a regular matrix.
Addition
If you are doing the sparse matrix method, to trim up your matrix once you are done, because your matrix will likely be larger than necessary, you can trim this down easily, and then cast it to a regular matrix.
[val,~] = max(sum(mat ~= 0,2));
mat(:,val:size(mat,2)) = [];
mat = full(mat); % use this only if you really need the full matrix
This will remove any unnecessary columns and then cast it to a full matrix that includes the 0 elements. I would not recommend casting it to a full matrix, as this requires a ton more space, but if you truly need it, use it.
UPDATE
To get the number of lines in a file easily, use MATLAB's perl interpretter
create a file called countlines.pl and paste in the two lines below
while (<>) {};
print $.,"\n";
Then you can run this script on your file as follows
numLines = str2double(perl('countlines.pl','data.csv'));
Problem solved.
From MATLAB forum thread here
remember it is always best to preallocate everything before hand, because technically when doing shai's method you are reallocating large amounts a lot, especially if it is a large file.
To solve your error, simply use this syntax when allocating
data = [data; zeroes(1000, size(data,2))];
You might want to read the first line outside the loop so you'll know the number of columns and make the first allocation for data.
If you want to stick to your code as written I would substitute your initialization of data, data = [] to
data = zeros(1,1000);
Keep in mind though the warning from #MZimmerman6: zeros(1000) generates a 1000 x 1000 array. You may want to change all of your zeros statements to zeros( ... ,Nc), where Nc = length of line in characters.

MATLAB reading large text log files with errors

I'm trying to analyze a large text log file (11 GB). All of the data are numerical values, and a snippit of the data are listed below.
-0.0623 0.0524 -0.0658 -0.0015 0.0136 -0.0063 0.0259 -0.003
-0.0028 0.0403 0.0009 -0.0016 -0.0013 -0.0308 0.0511 0.0187
0.0894 0.0368 0*0243 0.0279 0.0314 -0.0212 0.0582 -0.0403 //<====row 3, weird ASCII char * is present
-0.0548 0.0132 0.0299 0.0215 0.0236 0.0215 0.003 -0.0641
-0.0615 0.0421 0.0009 0.0457 0.0018 -0.0259 0.041 0.031
-0.0793 0.01 //<====row 6, the data is misaligned here
0.0278 0.0053 -0.0261 0.0016 0.0233 0.0719
0.0143 0.0163 -0.0101 -0.0114 -0.0338 -0.0415 0.0143 0.129
-0.0748 -0.0432 0.0044 0.0064 -0.0508 0.0042 0.0237 0.0295
0.040 -0.0232 -0.0299 -0.0066 -0.0539 -0.0485 -0.0106 0.0225
Every set of data consists of 2048 rows, and each row has 8 columns.
Here comes the problem: when the data is transformed from binary files to text files using the logging software, a lot of the data are distorted. Take the data above as an example, row 3 column 3 there is a " * " present in the data. And in row 6, one row of data is broken into two rows, one row has 2 data and the other row has 6 data.
I am currently struggling reading this large text files using MATLAB. Since the file itself is so large, I can only use textscan to read the data.
for example:
C = textscan(fd,'%f%f%f%f%f%f%f%f',1,'Delimiter','\t ');
However, I cannot use '%f' as format since there contains several weird ASCII characters such as " * " or " ! " in the data. These distorted data cannot be treated as floating point numbers.
So I choose to use:
C = textscan(fd,'%s%s%s%s%s%s%s%s',1,'Delimiter','\t ');
and then I transfer those strings into doubles to be processed. However, this encounters the problem of broken lines. When it reaches row 6, it gives:
[-0.0793],[0.01],[],[],[],[],[],[];
[0.0278],[0.0053],[-0.0261],[0.0016],[0.0233],[0.0719],[0.0143],[0.0163];
while it is supposed to look like:
-0.0793 0.01 0.0278 0.0053 -0.0261 0.0016 0.0233 0.0719 ===> one set
0.0143 0.0163 -0.0101 -0.0114 -0.0338 -0.0415 0.0143 0.129 ===> another set
Then the data will be offset by one row and the columns are messed up.
Then I try to do:
C = textscan(fd,'%s',1,'Delimiter','\t ');
to read one element at one time. If this element is NaN, it will textscan the next one until it sees something other than NaN. Once it obtains 2048 non-empty elements, it will store those 2048 data into a matrix to be processed. After being processed, this matrix is cleared.
This method works well for the first 20% of the whole file....BUT,
since the file itself is 11GB which is very large, after reading about 20% of the file, MATLAB shows:
Error using ==> textscan
Out of memory. Type HELP MEMORY for your options.
(some people suggest using %f while doing textscan, but it won't work because there are some ASCII chars which are causing problem)
Any suggestions to deal with this file?
EDIT:
I have tried:
C = textscan(fd,'%s%s%s%s%s%s%s%s',2048,'Delimiter','\t ');
Although the result is incorrect due to the misalignment of data (like row 6), this code indeed does not cause the "Out of memory" problem. Out of memory problem only occurs when I try to use
C= textscan(fd,'%s',1,'Delimiter','\t ').
to read the data one entry by one entry. Anyone has any idea why this memory problem happens?
This might seem silly, but are you preallocating an array for this data? If the only issue (as it seems to be) with your last function is memory, perhaps
C = zeros(2048,8);
will alleviate your problem. Try inserting that line before you call textscan. I know that MATLAB often exhorts programmers to preallocate for speed; this is just a shot in the dark, but preallocating memory may fix your issue.
Edit: also see this MATLAB Central discussion of a similar issue. It may be that you will have to run the file in chunks, and then concatenate the arrays when each chunk is finished.
Try something like the code below. It preallocates space and reads numRow* numColumns from the textfile at a time. If you can initialize the bigData matrix then it shouldn't run out of memory ... I think.
Note: That I used 9 for #rows since your sample data had 9 complete rows you will want to use 2024 I presume. This also might need some end of file checks etc. and some error handling. Also any numbers w/ odd ascii text in the will turn into NaN.
Note 2: This still might not work or be very very slow. I had a similar problem reading large text files (10-20GB) that were slightly more complicated. I had to abandon reading them in Matlab. Instead I used Perl for an initial pass which output to binary. Then used Matlab to read the binary back into data. The 2 step approach ended up saving lots and lots of runtime. Link in case you are interested
function bigData = readData(fileName)
fid = fopen(fileName,'r');
numBlocks = 1; %Somehow determine # of blocks??? not sure if you know of a way to determine this
r = 9; %Replace 9 with your size 2048
c = 8;
bigData = zeros(r*numBlocks,8);
for k = 1:numBlocks
[dataBlock, rFlag] = readDataBlock(fid,r,c);
if rFlag
%Or some kind of error.
break
end
bigData((k-1)*r+1:k*r,:) = dataBlock;
end
fclose(fid);
function [dataBlock, rFlag]= readDataBlock(fid,r,c)
C= textscan(fid,'%s',r*c,'Delimiter','\t '); %replace 9*8 by the size of the data block.
dataBlock = [];
if numel(C{1}) == r*c
dataBlock = reshape(str2double(C{1}),9,8);
rFlag = false;
else
rFlag = true;
% ?? Throw an error or whatever is appropriate
end
While I don't really know how to solve your problems with the broken data, I can give some advice how to process big text data. Read it in batches of multiple lines and write the output directly to the hard drive. In your case the second might be unnecessary, if everything is working you could try to replace data with a variable.
The code was originally written for a different purpose, I deleted the parser for my problem and replaced it with parsed_data=buffer; %TODO;
outputfile='out1.mat';
inputfile='logfile1';
batchsize=1000; %process 1000 lines at once
data=matfile(outputfile,'writable',true); %Simply delete this line if you dant "data" to be a variable in your memory
h=dir(inputfile);
bytes_to_read=h.bytes;
data.out={};
input=fopen(inputfile);
buffer={};
while ftell(input)<bytes_to_read
buffer=[buffer,textscan(input,'%s',batchsize-numel(buffer))];
parsed_data=buffer; %TODO;
data.out(end+1,1)={parsed_data};
buffer={}; %In the default case your empty your buffer her.
%If a incomplete line read here partially, leave it in the buffer.
end
fclose(input);

Splitting an audio file in Matlab

I'm trying to split an audio file into 30 millisecond disjoint intervals using Matlab. I have the following code at the moment:
clear all
close all
% load the audio file and get its sampling rate
[y, fs] = audioread('JFK_ES156.wav');
for m = 1 : 6000
[t(m), fs] = audioread('JFK_ES156.wav', [(m*(0.03)*fs) ((m+1)*(0.03)*fs)]);
end
But the problem is that I get the following error:
In an assignment A(I) = B, the number of elements in B and I
must be the same.
Error in splitting (line 12)
[t(m), fs] = audioread('JFK_ES156.wav', [(m*(0.03)*fs)
((m+1)*(0.03)*fs)]);
I don't see why there's a mismatch in the number of elements in B and I and how to solve this. How can I get past this error? Or is there just an easier way to split the audio file (maybe another function I don't know about or something)?
I think the easiest way to split audio is to just load it and use the vec2mat function. so you would have something like this;
[X,Fs] = audioread('JFK_ES156.wav');
%Calculate how many samples you need to capture 30ms of audio
matSize = Fs*0.3;
%Pay attention to that apostrophe. Makes sure samples are stored in columns
%rather than rows.
output = vec2mat(x,matSize)';
%You can now have your audio split up into the different columns of your matrix.
%You can call them by using the column calling command for matrices.
%Plot first 30ms of audio
plot(output(:,1));
%You can join the audio back together using this command.
output = output(:);
Hope that helps. Another good thing about this method is that it keeps all your data in one place!
Edit : One thing I thought of, you may get a problem with this depending on your vector size. But I think vec2mat actually zeroPads your vector. Not a big thing, but if you're moving back and forth between the two, then it might be a good idea to have another variable that stores the original length of your signal.
You should just use the variable y and reshape it to form your split audio. For example,
chunk_size = fs*0.03;
y_chunks = reshape(y, chunk_size, 6000);
That will give you a matrix with each column a 30 ms chunk. This code will also be faster than reading small segments from file in a loop.
As hiandbaii suggested you could also use cell array. Make sure you clear your existing variables before that. Not clearing the array t is probably the reason you got the error "Cell contents assignment to a non-cell array object."
Your original error is because you cannot assign a vector with scalar indexing. That is, 'm' is a scalar, but your audioread call is returning a vector. This is what the error says about mismatch in size of I and B. You could also fix that by making t a 2-D array and use an assignment like
[t(m,:), fs] =
It appears that each 30 ms segment is not equal to one sample. That would be the only case where your code works. i.e. 0.03*fs != 1.
You could try using cells instead.. i.e. replace t(m) with t{m}