I have large binary files (2+ GB) that are arranged with a sync pattern (0xDEADBEEF) followed by a data block of a fixed size.
Example:
0xDE AD BE EF ... 96 bytes of data
0xDE AD BE EF ... 96 bytes of data
... repeat ...
I need to locate the offsets to the start of each packet. Ideally this would just be [1:packetSize:fileSize]. However, there is other data that can be interspersed (headers etc.), so I need to search the file for the sync pattern.
I am using the following code, which is based on findPattern2 by Loren from The MathWorks, but modified a little to use a memory map.
function pattLoc = findPattern(fileName, bytePattern)
% Memory-map the file
m = memmapfile(fileName);
% Find candidate locations that match the first element of the pattern.
pattLoc = find(m.data==bytePattern(1));
% Remove start values that are too close to the end to possibly match
len = numel(bytePattern);
endVals = pattLoc+len-1;
pattLoc(endVals>length(m.data)) = [];
% loop over the remaining elements of the sync pattern to check possible location validity.
for pattval = 2:len
    % check viable locations in array
    locs = bytePattern(pattval) == m.data(pattLoc+pattval-1);
    pattLoc(~locs) = []; % delete false ones from indices
end
This works pretty well. However, I think there is room for improvement. First, I know my patterns can't be closer together than packetSize (100 in this example), but may be farther apart, and it seems like I should be able to use this information somehow to speed up the search. Second, the initial search on line 5 uses find to get numerical indices instead of keeping a logical index; that line takes almost twice as long as leaving the result as a logical. However, I tried to rework this function using logical indexing only and failed miserably. The problem arises inside the loop: I can't keep track of the nested indexing with logicals without either using more finds or checking more data than needs to be checked.
So any help speeding this up would be appreciated. Below is some code which will create a simple sample binary file to work with if necessary.
function genSampleFile(numPackets)
pattern = hex2dec({'DE','AD','BE','EF'});
fileName = 'testFile.bin';
fid = fopen(fileName,'w+');
for f = 1:numPackets
fwrite(fid,[pattern; uint8(rand(96,1)*255)],'uint8');
end
fclose(fid);
Searching a file with 10000000 packets took the following:
>> genSampleFile(10000000); %Warning Makes 950+MB file
>> tic;pattLoc = findPattern(fileName, pattern);toc
Elapsed time is 4.608321 seconds.
You can get an immediate boost by using findstr, or better yet strfind instead of find:
pattLoc = strfind(m.data, bytePattern)
This removes the need for any further looping. You just need to clean up a couple of things in the returned array of indices.
You want to remove things that are closer than 100 bytes to the end, not closer than 4 bytes to the end, so set len = 100, instead of len = length(bytePattern).
To filter out elements that are closer than 100 bytes from each other, use diff on the list of indices:
pattLoc([false, diff(pattLoc) < 100]) = []
This should speed up your code by relying more on builtins, which are generally much more efficient than loops.
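Putting those pieces together, the search function might look something like this (an untested sketch that keeps your memmapfile approach; findPatternFast and the packetSize argument are just illustrative names):
function pattLoc = findPatternFast(fileName, bytePattern, packetSize)
% Sketch: strfind does the search, then drop matches too near the end of
% the file or within one packet of the previous match.
m = memmapfile(fileName);
pattLoc = strfind(m.data', uint8(bytePattern(:)'));       % row vectors for strfind
pattLoc(pattLoc > numel(m.data) - packetSize + 1) = [];   % too close to the end of the file
pattLoc([false, diff(pattLoc) < packetSize]) = [];        % drop the later of any too-close pair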
I am trying to speed up a script that I have written in Matlab that dynamically allocates memory to a matrix (basically reads a line of data from a file and writes it into a matrix, then reads another line and allocates more memory for a larger matrix to store the next line). The reason I did this instead of preallocating memory using zeros() or something was that I don't know the exact size the matrix needs to be to hold all of the data. I also don't know the maximum size of the matrix, so I can't just preallocate a max size and then get rid of memory that I didn't use. This was fine for small amounts of data, but now I need to scale my script up to read many millions of data points and this implementation of dynamic allocation is just much too slow.
So here is my attempt to speed up the script: I tried to allocate memory in large blocks using the zeros function, then once the block is filled I allocate another large block. Here is some sample code:
data = [];
count = 0;
for ii = 1:num_filelines
    if mod(count, 1000) == 0
        data = [data; zeros(1000)]; % after 1000 lines are read, allocate another 1000 lines
    end
    data(ii, :) = line_read(file); % line_read reads a line of data from 'file'
end
Unfortunately this doesn't work, when I run it I get an error saying "Error using vertcat
Dimensions of matrices being concatenated are not consistent."
So here is my question: Is this method of allocating memory in large blocks actually any faster than incremental dynamic allocation, and also why does the above code not run? Thanks for the help.
What I recommend doing, if you know the number of lines and can just guess a large enough number of acceptable columns, is to use a sparse matrix.
% create a sparse matrix
mat = sparse(numRows,numCols)
A sparse matrix will not store all of the zero elements, it only stores pointers to indices that are non-zero. This can help save a lot of space. They are used and accessed the same as any other matrix. That is only if you really need it in a matrix format from the beginning.
If not, you can just do everything as a cell. Preallocate a cell array with as many elements as lines in your file.
data = cell(1,numLines);
for i = 1:numLines
    % get matrix from line
    data{i} = lineData;
end
data = cell2mat(data);
This method will put everything into a cell array, which can store "dynamically" and then be converted to a regular matrix.
Addition
If you are using the sparse matrix method, your matrix will likely be larger than necessary once you are done, but you can trim it down easily and then cast it to a regular matrix.
[val,~] = max(sum(mat ~= 0,2)); % largest number of non-zero entries in any row
mat(:,val+1:size(mat,2)) = []; % drop the columns after the last one that holds data
mat = full(mat); % use this only if you really need the full matrix
This will remove any unnecessary columns and then cast it to a full matrix that includes the 0 elements. I would not recommend casting it to a full matrix, as this requires a ton more space, but if you truly need it, use it.
UPDATE
To get the number of lines in a file easily, use MATLAB's Perl interpreter.
Create a file called countlines.pl and paste in the two lines below:
while (<>) {};
print $.,"\n";
Then you can run this script on your file as follows
numLines = str2double(perl('countlines.pl','data.csv'));
Problem solved.
From MATLAB forum thread here
Remember it is always best to preallocate everything beforehand, because technically when doing Shai's method you are reallocating large amounts many times, especially if it is a large file.
To solve your error, simply use this syntax when allocating
data = [data; zeros(1000, size(data,2))];
You might want to read the first line outside the loop so you'll know the number of columns and make the first allocation for data.
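For example, a rough sketch of that idea, reusing the names from your snippet (line_read, file and num_filelines are assumed to exist):
firstLine = line_read(file);            % read one line to learn the width
Nc = numel(firstLine);
blockSize = 1000;
data = zeros(blockSize, Nc);            % first block
data(1,:) = firstLine;
for ii = 2:num_filelines
    if ii > size(data,1)                % current block is full - grow by another block
        data = [data; zeros(blockSize, Nc)]; %#ok<AGROW>
    end
    data(ii,:) = line_read(file);
end
data = data(1:num_filelines, :);        % trim the unused rows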
If you want to stick to your code as written, I would replace your initialization of data, data = [], with
data = zeros(1,1000);
Keep in mind though the warning from @MZimmerman6: zeros(1000) generates a 1000 x 1000 array. You may want to change all of your zeros statements to zeros( ... , Nc), where Nc = length of a line in characters.
I have 50 matrices contained in one folder, all of dimension 181 x 360. How do I cycle through that folder and take an average of each corresponding data points across all 50 matrices?
If the matrices are contained within Matlab variables stored using save('filename','VariableName') then they can be opened using load('filename.mat').
As such, you can use the result of filesInDirectory = dir; to get a list of all your files, using a search pattern if appropriate, like files = dir('*.mat');
Next you can use your load command, and then whos to see which variables were loaded. You should consider storing these names so you can clear the variables after each iteration of your loop.
Once you have your matrix loaded (one at a time), you can take averages as you need, probably summing a value across multiple loop iterations, then dividing by a total counter you've been measuring (using perhaps count = count + size(MatrixVar, dimension);).
If you need all of the matrices loaded at once, then you can modify the above idea, to load using a loop, then average outside of the loop. In this case, you may need to take care - but 50*181*360 isn't too bad I suspect.
A brief introduction to the load command can be found at this link. It talks mainly about opening one matrix, then plotting the values, but there are some comments about dealing with headers, if needed, and different ways in which you can open data, if load is insufficient. It doesn't talk about binary files, though.
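A rough sketch of that loop, assuming each .mat file stores its 181 x 360 matrix in a variable named A (substitute whatever name whos reports):
files = dir('*.mat');
Total = zeros(181, 360);
for f = 1:numel(files)
    S = load(files(f).name);       % load into a struct so the variable name is explicit
    Total = Total + S.A;           % 'A' is an assumed variable name
end
AvgMatrix = Total / numel(files);  % element-wise average across all the matrices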
Note on binary files, based on comment to OP's question:
If the file can be opened using
FID = fopen('filename.dat');
fread(FID, 'float');
then you can replace the steps referring to load above, and instead use a loop to find filenames using dir, open the matrices using fopen and fread, then average as needed, finally closing the files and clearing the matrices.
In this case, probably your file identifier is the only part you're likely to need to change during the loop (although your total will increase, if that's how you want to average your data).
Reshaping or transposing the matrix might make the code clearer (which is good!), but might not be necessary depending on what you're trying to average - it may be that selecting only a subsection of the matrix is sufficient.
Possible example code?
Assuming that all of the files in the current directory are to be opened, and that no files are elsewhere, you could try something like:
listOfFiles = dir('*.dat');
Total = 0;
Counter = 0;
for f = 1:size(listOfFiles,1)
    FID = fopen(listOfFiles(f).name);
    Data = fread(FID, 'float');
    fclose(FID);
    % Reshape if needed?
    Total = Total + sum(Data(start:end,:)); % This might vary, depending on what you want to average etc.
    Counter = Counter + (size(Data,1) * size(Data,2)); % This product will be the 181*360 you had in the matrix, in this case
end
Av = Total/Counter;
I'd like to read heterogeneous binary data into Matlab. I know from the beginning how much data there is and which datatype each segment has. For example:
%double %double %int32 ...
and then this gets repeated about a million times. Easy enough to handle with fread, since I know the number of bytes for each segment and can therefore calculate the skip value for each row.
But now the data segment looks something like this :
%double %int32%*char %double %double ...
Here the int prior to the *char is the length of said string. This brings the problem that I cannot calculate the skip anymore, and I'm stuck reading the whole file record by record, therefore needing many more file accesses and slowing everything down.
In order to get at least some speedup, I want to read in all the %double %double ... fields (around 30 elements) at a time and then fill up the arrays from a buffer. In C this would be a rather easy task; here, without memcpy and such direct access to pointers, it is not so straightforward...
Do you know of any way to achieve this without using MEX files?
You can't solve the problem that the record size is unknown, and thus you don't know how much to read ahead of time. But you can batch up the reads, and if you have a reasonable max size for the string, you can always read that amount, and ignore the unneeded bytes at the end. typecast is the trick:
readlen = 1024;
buf = fread(fid, readlen, '*uint8'); % the asterisk keeps the returned array as uint8
rec.val1 = typecast(buf(1:8), 'double');
string_len = typecast(buf(9:12), 'int32');
rec.str1 = typecast(buf(13:13+string_len-1), 'uint8'); % raw string bytes; wrap in char(...) if you want text
pos = 13+string_len;
rec.val2 = typecast(buf(pos:pos+8-1), 'double');
You might wrap a simple function around this technique to track the current offset automatically.
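A minimal sketch of such a wrapper (readField is an illustrative name, not an existing function):
function [val, pos] = readField(buf, pos, nbytes, matlabType)
    % Interpret nbytes of the uint8 buffer starting at offset pos as matlabType,
    % and return the advanced offset for the next field.
    val = typecast(buf(pos:pos+nbytes-1), matlabType);
    pos = pos + nbytes;
end
which would reduce the record parsing to something like:
pos = 1;
[rec.val1, pos] = readField(buf, pos, 8, 'double');
[string_len, pos] = readField(buf, pos, 4, 'int32');
rec.str1 = buf(pos:pos+string_len-1);
pos = pos + double(string_len);
[rec.val2, pos] = readField(buf, pos, 8, 'double');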
There are some good posts on here (such as this one) on how to make a circular buffer in MATLAB. However, from looking at them, I do not believe they fit my application, because what I am seeking is a circular buffer solution in MATLAB that does NOT involve any copying of old data.
To use a simple example, let us say that I am processing 50 samples at a time, and I read in 10 samples each iteration. I would first run through 5 iterations, fill up my buffer, and in the end, process my 50 samples. So my buffer will be
[B1 B2 B3 B4 B5]
, where each 'B' is a block of 10 samples.
Now, I read in the next 10 samples, call them B6. I want my buffer to now look like:
[B2 B3 B4 B5 B6]
The catch is this - I do NOT want to copy the old data, B2, B3, B4, B5 every time, because it becomes expensive in time. (I have very large data sets.)
I am wondering if there is a way to do this without any copying of 'old' data. Thank you.
One way to quickly implement a circular buffer is to use modulus to circle back around to the front. This will slightly modify the order of the data from what you specified, but may be faster and equivalent if you simply replace the oldest data with the newest. So instead of
[B2 B3 B4 B5 B6]
You get
[B6 B2 B3 B4 B5]
By using code like this:
bufferSize = 5;
data = nan(1, bufferSize);
for ind = 1:bufferSize+2
    data(mod(ind-1, bufferSize)+1) = ind
end
This works for arbitrarily sized data.
If you're not familiar with modulo, the mod function effectively returns the remainder of a division operation. So mod(3,5) returns 3, mod(6,5) returns 1, mod(7,5) returns 2, and so on until you reach mod(10,5), which equals 0 again. This allows us to 'wrap around' the vector by moving back to the start every time we reach the end. The +1 and -1 in the code are there because MATLAB starts its vector indices at 1 rather than 0, so to get the math to work out right you have to remove 1 before doing the mod and then add it back in to get the right index. The result is that when you try to write the 6th element to your vector, it goes and writes it to the 1st position in the vector.
My idea would be to use a cell-array with 5 entries and use a variable to index the sub-array which should be overwritten in the next step. E.g. something like
a = {ones(10),2*ones(10),3*ones(10),4*ones(10),5*ones(10)};
index = 1;
in the next step you could then write into the sub-array:
a{index} = 6*ones(10);
and increase the index like
index = index+1
Obviously, you need some sort of wrap-around check:
if(index > 5)
    index = 1;
end
Would that work for you?
EDIT: One more thing to look at is the ordering of the entries, which will always be rotated by some offset; depending on how you go on to use the data, you could e.g. shift the way you read the data based on the variable index, as in the sketch below.
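For example, if you ever need the blocks back in chronological order, one indexing operation gives it to you without moving any stored data (this assumes index points at the slot that will be overwritten next, i.e. the oldest block):
chronological = a([index:numel(a), 1:index-1]);  % oldest block first, newest last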
EDIT2: I've got another idea: what about using classes in MATLAB? You could use a handle class to hold your data, so the buffer only stores references to the data. This might make it a bit faster, depending on what data you hold (how large the datasets are etc.) and how many shifts you'll have in your code. See e.g. here: Matlab -- handle objects
You could use a simple handle class:
classdef Foo < handle
    properties (SetAccess = public, GetAccess = public)
        x
    end
    methods
        function obj = Foo(x)
            % constructor
            obj.x = x;
        end
    end
end
Store the data in it:
data = [1 2 3 4];
foo = Foo(data); % handle object
and then only store the object reference in the circular buffer. In the posted link the answer shows that the assignment bar = foo doesn't copy the object but really only holds the reference:
bar = foo; % no copy - bar is just another reference to the same object
foo.x = [3 4]
disp(bar.x) % would be [3 4]
but as stated, I don't know if that will be faster because of the OOP overhead; it might be, depending on your data... And here is some more information about it: http://www.matlabtips.com/how-to-point-at-in-matlab/
I just uploaded my solution for a fast circular buffer, which does not copy old data, to the File Exchange:
http://www.mathworks.com/matlabcentral/fileexchange/47025-circvbuf-m
The main idea of this circular buffer is constant, fast performance, and avoiding copy operations when using the buffer in a program:
% create a circular vector buffer
bufferSz  = 1000;
vectorLen = 7;
cvbuf = circVBuf(int64(bufferSz), int64(vectorLen));
% fill buffer with 99 vectors
vecs = zeros(99, vectorLen, 'double');
cvbuf.append(vecs);
% loop over lastly appended vectors of the circVBuf:
new = cvbuf.new;
lst = cvbuf.lst;
vec = zeros(vectorLen, 1);
for ix = new:lst
    vec(:) = cvbuf.raw(:,ix);
end
% or direct array operation on lastly appended vectors in the buffer (no copy => fast)
new = cvbuf.new;
lst = cvbuf.lst;
avg = mean(cvbuf.raw(3:7, new:lst)); % avoid shadowing the builtin mean
Check the screenshot to see that this circular buffer has advantages if the buffer is large but the size of the data to append each time is small, as the performance of circVBuf does NOT depend on the buffer size, unlike a simple copy buffer.
The double buffering guarantees a predictable time for an append, depending only on the amount of data to append, in any situation. In the future this class shall give you a choice of double buffering yes or no - things will speed up if you do not need the guaranteed time.
I'm writing a simulation in Matlab.
I will eventually run this simulation hundreds of times.
In each simulation run, there are millions of simulation cycles.
In each of these cycles, I calculate a very complex function, which takes ~0.5 sec to finish.
The function input is a long bit array (>1000 bits) - which is an array of 0 and 1.
I hold the bit arrays in a matrix of 0 and 1, and for each one of them I only run the function once: I save the result in a different array (res) and check whether the bit array is already in the matrix before running the function:
for i=1:1000000000
    %pick a bit array somehow
    [~,indx] = ismember(bit_array,bit_matrix,'rows');
    if indx == 0
        indx = length(res) + 1;
        bit_matrix(indx,:) = bit_array;
        res(indx) = complex_function(bit_array);
    end
    result = res(indx);
    %do something with result
end
I have two questions, really:
Is there a more efficient way to find the index of a row in a matrix than ismember?
Since I run the simulation many times, and there is a big overlap in the bit-arrays I'm getting, I want to cache the matrix between runs so that I don't recalculate the function over the same bit-arrays over and over again. How do I do that?
The answer to both questions is to use a map. There are a few steps to do this.
First you will need a function to turn your bit_array into either a number or a string. For example, turn [0 1 1 0 1 0] into '011010'. (Matlab only supports scalar or string keys, which is why this step is required.)
Define a map object
cachedRunMap = containers.Map; %See edit below for more on this
To check if a particular case has been run, use iskey.
cachedRunMap.isKey('011010');
To add the results of a run use the appending syntax
cachedRunMap('011010') = [0 1 1 0 1]; %Or whatever your result is.
To retrieve cached results, use the getting syntax
tmpResult = cachedRunMap.values({'011010'});
This should efficiently store and retrieve values until you run out of system memory.
Putting this together, now your code would look like this:
%Hacky magic function to convert an array into a string of '0' and '1'
strFromBits = @(x) char((x(:)'~=0)+48);
%Initialize the map
cachedRunMap = containers.Map;
%Loop, computing and storing results as needed
for i=1:1000000000
    %pick a bit array somehow
    strKey = strFromBits(bit_array);
    if cachedRunMap.isKey(strKey)
        result = cachedRunMap(strKey);
    else
        result = complex_function(bit_array);
        cachedRunMap(strKey) = result;
    end
    %do something with result
end
If you want a key which is not a string, that needs to be declared at step 2. Some examples are:
cachedRunMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'double');
Setting a KeyType of 'char' sets the map to use strings as keys. All other types must be scalars.
Regarding issues as you scale this up (per your recent comments)
Saving data between sessions: There should be no issues saving this map to a *.mat file, up to the limits of your system's memory (a small save/load sketch follows this list).
Purging old data: I am not aware of a straightforward way to add LRU features to this map. If you can find a Java implementation you can use it within Matlab pretty easily. Otherwise it would take some thought to determine the most efficient method of keeping track of the last time a key was used.
Sharing data between concurrent sessions: As you indicated, this probably requires a database to perform efficiently. The DB table would be two columns (3 if you want to implement LRU features), the key, value, (and last used time if desired). If your "result" is not a type which easily fits into SQL (e.g. a non-uniform size array, or complex structure) then you will need to put additional thought into how to store it. You will also need a method to access the database (e.g. the database toolbox, or various tools on the Mathworks file exchange). Finally you will need to actually setup a database on a server (e.g. MySql if you are cheap, like me, or whatever you have the most experience with, or can find the most help with.) This is not actually that hard, but it takes a bit of time and effort the first time through.
Another approach to consider (much less efficient, but not requiring a database) would be to break up the data store into a large (e.g. 1000's or millions) number of maps. Save each into a separate *.mat file, with a filename based on the keys contained in that map (e.g. the first N characters of your string key), and then load/save these files between sessions as needed. This will be pretty slow ... depending on your usage it may be faster to recalculate from the source function each time ... but it's the best way I can think of without setting up the DB (clearly a better answer).
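For the first point above (saving data between sessions), a small save/load sketch - cache.mat is just an assumed file name:
% at the end of a session
save('cache.mat', 'cachedRunMap');
% at the start of the next session
if exist('cache.mat', 'file')
    tmp = load('cache.mat');
    cachedRunMap = tmp.cachedRunMap;
else
    cachedRunMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
end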
For a large list, a hand-coded binary search can beat ismember, if maintaining it in sorted order isn't too expensive - assuming that's really your bottleneck; use the profiler to see how much the ismember is really costing you. If there aren't too many distinct values, you could also store them in a containers.Map by packing the bit_matrix into a char array and using it as the key.
If it's small enough to fit in memory, you could store it in a MAT file using save and load. They can store any basic Matlab datatype. Have the simulation save the accumulated res and bit_matrix at the end of its run, and re-load them the next time it's called.
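A minimal sketch of that idea (cache.mat and nBits are assumed names; nBits is the length of your bit arrays):
if exist('cache.mat', 'file')
    load('cache.mat', 'bit_matrix', 'res');   % reuse results from earlier runs
else
    bit_matrix = zeros(0, nBits);
    res = [];
end
% ... run the simulation, growing bit_matrix and res as in the question ...
save('cache.mat', 'bit_matrix', 'res');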
I think that you should use containers.Map() for the purpose of speedup.
The general idea is to hold a map that contains all hash values. If your bit arrays have uniform distribution under the hash function, most of the time you won't need the call to ismember.
Since the key type cannot be an array in Matlab, you can calculate some hash function on your array of bits.
For example:
function s = GetHash(bitArray)
s = mod( sum(bitArray), intmax('uint32'));
end
This is a lousy hash function, but enough to understand the principle.
Then the code would look like:
map = containers.Map('KeyType','uint32','ValueType','any');
for i=1:1000000000
    %pick a bit array somehow
    s = GetHash(bit_array);
    if isKey(map, s)
        %Hash seen before - do the slow check.
        [~,indx] = ismember(bit_array,bit_matrix,'rows');
    else
        %New hash - this bit array cannot be in bit_matrix yet.
        map(s) = 1;
        indx = 0;
    end
    if indx == 0
        indx = length(res) + 1;
        bit_matrix(indx,:) = bit_array;
        res(indx) = complex_function(bit_array);
    end
    result = res(indx);
    %do something with result
end