Speedup processing of larger binary files

Speedup processing of larger binary files - matlab

I have to process thousands of binary files (each of 16MB) by reading pairs of them and creating a bit-level data structure (usually a 1x134217728 array) in order to process them on bit level.
Currently I am doing this the following way:
conv = #(c) uint8(bitget(c,1:32));
measurement = NaN(1,(sizeOfMeasurements*8)) %(1,134217728)
fid = fopen(fileName, 'rb');
byteContent = fread(fid,'uint32');
fclose(fid);
bitRepresentation1 = arrayfun(conv, byteContent, 'UniformOutput', false);
measurement=[bitRepresentation1{:}];
Thus, I replaced fopen with memmapfile as below:
m = memmapfile(fileName,'Format',{'uint32', [4194304 1], 'byteContent'});
byteContent = m.data.byteContent;
byteContent = double(byteContent);
I printed timing information (using tic/toc) for the individual instructions and it turns out that the bottleneck is:
bitRepresentation1 = arrayfun(conv, byteContent, 'UniformOutput', false); % see first line of code for conv
Are there more efficient ways of transforming byteContent into an array that stores a bit per index (i.e. that is a bit representation of byteContent)?

Let looping over all numbers be handled by bitget. You loop over the bits:
fid = fopen(fileName, 'rb');
bitContent = fread(fid,'*ubit64');
fclose(fid);
conv = #(ii) uint8(bitget(bitContent, ii));
bitRepresentation = arrayfun(conv, 1:64, 'UniformOutput', false);
measurement = [bitRepresentation{:}]';
measurement = measurement(:).';
EDIT you can also try a direct loop:
fid = fopen(fileName, 'rb');
bitContent = fread(fid,'*ubit64');
fclose(fid);
sz = 64 * size(bitContent,1);
measurement3 = zeros(1, sz, 'uint8');
weave = 1:64:sz;
for ii = 1:64
measurement3(weave + ii - 1) = uint8(bitget(bitContent, ii)); end
but on my system, that is (surprisingly) slower than arrayfun...but, my MATLAB version is from the stone age, your mileage may be different. Give it a try

Several things that seem to provide further improvement on Rody's suggestion:
(minor:) Using a local function instead of a function handle for conv.
(major:) Converting the result of conv to logical using ~~ instead of uint8.
(major:) cell2mat instead of [bitRepresentation{:}]'.
The result:
function q40863898(filename)
fid = fopen(filename, 'rb');
bitContent = fread(fid,'*ubit64');
fclose(fid);
bitRepresentation = arrayfun(#convert, 1:64, 'UniformOutput', false);
measurement = reshape(cell2mat(bitRepresentation).',[],1).';
function out = convert(ii)
out = ~~(bitget(bitContent, ii, 'uint64'));
end
end
Benchmark result (on MATLAB R2016b, Win10 x64, 14MB file):
Rody's vectorized method: 0.87783
Rody's loop method: 2.37
Dev-iL's method: 0.68387
Benchmark code:
function q40863898(filename)
%% Common code:
fid = fopen(filename, 'rb');
bitContent = fread(fid,'*ubit64');
fclose(fid);
%% Verification:
ref = Rody1();
res = {Rody2(), uint8(Devil1())};
assert(isequal(ref,res{1}));
assert(isequal(ref,res{2}));
%% Benchmark:
disp(['Rody''s vectorized method: ' num2str(timeit(#Rody1))]);
disp(['Rody''s loop method: ' num2str(timeit(#Rody2))]);
disp(['Dev-iL''s method: ' num2str(timeit(#Devil1))]);
%% Functions:
function measurement = Rody1()
conv = #(ii) uint8(bitget(bitContent, ii));
bitRepresentation = arrayfun(conv, 1:64, 'UniformOutput', false);
measurement = [bitRepresentation{:}]';
measurement = measurement(:).';
end
function measurement = Rody2()
sz = 64 * size(bitContent,1);
measurement = zeros(1, sz, 'uint8');
weave = 1:64:sz;
for ii = 1:64
measurement(weave + ii - 1) = uint8(bitget(bitContent, ii));
end
end
function measurement = Devil1()
bitRepresentation = arrayfun(#convert, 1:64, 'UniformOutput', false);
measurement = reshape(cell2mat(bitRepresentation).',[],1).';
function out = convert(ii)
out = ~~(bitget(bitContent, ii, 'uint64'));
end
end
end

Related

Rearrange array into a form suitable for NN training

I am working on a large dataset that I need to convert to a specific format for further processing. I am looking for advice in this regard.
Sample input:
A = [0.99 -0.99
1 -1
0.55 -0.55]
Sample output:
val(:,:,1,1)=0.99
val(:,:,2,1)=-0.99
val(:,:,1,2)=1
val(:,:,2,2)=-1
val(:,:,1,3)=0.55
val(:,:,2,3)=-0.55
While working on this, I found a code inside the CNN toolbox of MATLAB R2018b
function dummifiedOut = dummify(categoricalIn)
% iDummify Convert a categorical input into a dummified output.
%
% dummifiedOut(1,1,i,j)=1 if observation j is in class i, and zero
% otherwise. Therefore, dummifiedOut will be of size [1, 1, K, N],
% where K is the number of categories and N is the number of
% observation in categoricalIn.
% Copyright 2015-2016 The MathWorks, Inc.
numObservations = numel(categoricalIn);
numCategories = numel(categories(categoricalIn));
dummifiedSize = [1, 1, numCategories, numObservations];
dummifiedOut = zeros(dummifiedSize);
categoricalIn = iMakeHorizontal( categoricalIn );
idx = sub2ind(dummifiedSize(3:4), int32(categoricalIn), 1:numObservations);
dummifiedOut(idx) = 1;
end
function vec = iMakeHorizontal( vec )
vec = reshape( vec, 1, numel( vec ) );
end
Can we modify this block of code in such a way to produce the sample output?

Either do what rinkert suggested, or just use permute directly:
>> val = permute(A, [4,3,2,1])
val(:,:,1,1) =
0.9900
val(:,:,2,1) =
-0.9900
val(:,:,1,2) =
1
val(:,:,2,2) =
-1
val(:,:,1,3) =
0.5500
val(:,:,2,3) =
-0.5500
Note that the function which you posted requires categorical data, whereas you have a simple double array. If you insist on "adapting" the existing dummify, you could do:
function dummifiedOut = dummify(categoricalIn)
dummifiedOut = zeros([1,1,size(categoricalIn)]);
dummifiedOut(:) = categoricalIn;
end
(...although, IMHO, this makes little sense.)

Error in MSR Identity toolkit (fopen)

I try to run a demo for speaker verification using MSR Identity toolkit. However it left error after training UBM step. The error is as follow. It looks like fopen return -1 and cause error to fread. I can't understand why it can't read the filenames. I can't attach the code since it involves many functions. I just hope someone that familiar with this toolkit can help me.
Error using fread
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in htkread (line 7)
nframes = fread(fid, 1, 'int32'); % number of frames
Error in mapAdapt>load_data (line 107)
data{ix} = htkread(filenames{ix});
Error in mapAdapt (line 52)
dataList = load_data(dataList);
Error in demo_gmm_ubm (line 69)
gmm_models{spk} = mapAdapt(spk_files, ubm, map_tau, config);
Part of the code where lead to the error as follow:
function data = load_data(datalist)
% load all data into memory
if ~iscellstr(datalist)
fid = fopen(datalist, 'rt');
filenames = textscan(fid, '%s');
fclose(fid);
filenames = filenames{1};
else
filenames = datalist;
end
nfiles = size(filenames, 1);
data = cell(nfiles, 1);
for ix = 1 : nfiles,
data{ix} = htkread(filenames{ix});
end
function [data, frate, feakind] = htkread(filename)
% read features with HTK format (uncompressed)
fid = fopen(filename, 'r','b'); %ERROR HERE
nframes = fread(fid, 1, 'int32'); % number of frames
frate = fread(fid, 1, 'int32'); % frame rate in nano-seconds unit
nbytes = fread(fid, 1, 'short'); % number of bytes per feature value
feakind = fread(fid, 1, 'short'); % 9 is USER
ndim = nbytes / 4; % feature dimension (4 bytes per value)
data = fread(fid, [ndim, nframes], 'float');
fclose(fid);
datalist contains:
'features\fadg0_sa2.htk'
'features\fadg0_si1279.htk'
'features\fadg0_si1909.htk'
'features\fadg0_si649.htk'
'features\fadg0_sx109.htk'
'features\fadg0_sx19.htk'
'features\fadg0_sx199.htk'
'features\fadg0_sx289.htk'
'features\fadg0_sx379.htk'

I use MSR Identity Toolkit without problem, fortunately.
What I have in htkread.m is as follows:
...
fid = fopen(filename, 'rb', 'ieee-be');
nframes = fread(fid, 1, 'int32'); % number of frames
...
Maybe the error you encountered comes from:
big-endian/little-endian issue
the *.htk feature you have is missing
the *.htk is with other format
regards,
tommy

Is your fid returning a negative value? If yes, try specifying entire path where dataList= 'ubm.lst'; is in the code

How to use reshape on a 4D matrix after using fread on a RGB RAW file?

The following code correctly loads an mp4 file and stores it in a 3D matrix.
r = 1;
fileName = testDummyMP4;
readerobj = VideoReader(fileName, 'tag', 'myreader1');
F = get(readerobj, 'numberOfFrames');
tampon = single(read(readerobj,1));
tampon = imresize(tampon(:,:,1),r);
I = zeros(size(tampon,1),size(tampon,2),F,'uint8');
for k = 1:F
disp(['Open: ' num2str(round(100*k/F)) '%'])
vidFrames = single(read(readerobj,k));
I(:,:,k) = imresize(vidFrames(:,:,2),r);
end;
imagesc((I(:,:,1)));
This is the output
I'm trying to reverse engineer this code so that it produces the same sort of result for but for a .raw 8bit rbg file. Following the answers from this question I tried the following:
You can download the 'M1302000245_1436389857.982603.raw' rbg file here
Or Google Drive version here
Ix = 256;
Iy = 256;
SF = 30; % Sample frequency
RecordingTime = 30;
Iz = SF*RecordingTime
testDummy = 'M1302000245_1436389857.982603.raw'
fin = fopen(testDummy, 'r');
I = fread(fin, Ix*Iy*3*Iz, 'uint8');
fclose(fin);
I = reshape(I, [Ix Iy 3 Iz]); % The rbg should be 256x256x3x900
% I've tried each of the following manipulations before calling imagesc to no avail
% I = flipdim(imrotate(I, -90),2);
% I=impixel(I)
% I=I'
imagesc((I(:,:,1))); % view first slice
This gives:
What am I doing wrong?
Additional info:
Recordings are taken using raspberry pi cameras with the following python code
class BrainCamera:
def __init__(self):
self.video_format = "rgb"
#self.video_quality = 5
# Set up the settings of the camera so that
# Exposure and gains are constant.
self.camera = picamera.PiCamera()
self.camera.resolution = (256,256)
self.camera.framerate = 30
sleep(2.0)
self.camera.shutter_speed = self.camera.exposure_speed
self.camera.exposure_mode = 'off'
g = self.camera.awb_gains
self.camera.awb_mode = 'off'
self.camera.awb_gains = g
self.camera.shutter_speed = 30000
self.camera.awb_gains = (1,1)
def start_recording(self, video_name_path):
self.camera.start_recording(video_name_path, format=self.video_format)
self.camera.start_preview()
def stop_recording(self):
self.camera.stop_recording()
self.camera.stop_preview()
# Destructor
def __del__(self):
print ("Closed Camera")
self.camera.close()

I don't get the output you have, but I get some reasonable image:
%your vode from above, ending with I=fread(...)
I=uint8(I)
I2=reshape(I, 3, Ix ,Iy, []);
I2=permute(I2,[3,2,1,4]);
imagesc(I2(:,:,:,1));
implay(I2);

With Daniel's help I got the following working. Additionally a co-worker found out how to get the number of frames of a .raw rbg video. This we needed since it turns out assuming noFrame = SF*RecordingTime was a bad idea.
testDummy = 'M1302000245_1436389857.982603.raw'
testDummyInfo = dir(testDummy);
noFrames = testDummyInfo.bytes/(256*256*3); % In my question Iz = noFrames
fin = fopen(testDummy, 'r');
I = fread(fin, testDummyInfo.bytes, 'uint8');
fclose(fin);
I=uint8(I);
I2=reshape(I, 3, Ix ,Iy, noFrames);
I2=permute(I2,[3,2,1,4]);
imagesc(I2(:,:,2,1)); % Throw out unnecessarry channels r and b. Only want GFP signal
to produce:

Importing data block with Matlab

I have a set of data in the following format, and I would like to import each block in order to analyze them with Matlab.
Emax=0.5/real
----------------------------------------------------------------------
4.9750557 14535
4.9825821 14522
4.990109 14511
4.9976354 14491
5.0051618 14481
5.0126886 14468
5.020215 14437
5.0277414 14418
5.0352678 14400
5.0427947 14372
5.0503211 14355
5.0578475 14339
5.0653744 14321
Emax=1/real
----------------------------------------------------------------------
24.965595 597544
24.973122 597543
24.980648 597543
24.988174 597542
24.995703 597542
25.003229 597542
I have modified this piece of code from MathWorks, but I think, I have problems dealing with the spaces between each column.
Each block of data consist of 3874 rows and is divided by a text (Emax=XX/real) and a line of ----, unfortunately is the only way the software export the data.

Here is one way to import the data:
% read file as a cell-array of lines
fid = fopen('file.dat', 'rt');
C = textscan(fid, '%s', 'Delimiter','');
C = C{1};
fclose(fid);
% remove separator lines
C(strncmp('---',C,3)) = [];
% location of section headers
headInd = [find(strncmp('Emax=', C, length('Emax='))) ; numel(C)+1];
% extract each section
num = numel(headInd)-1;
blocks = struct('header',cell(num,1), 'data',cell(num,1));
for i=1:num
% section header
blocks(i).header = C{headInd(i)};
% data
X = regexp(C(headInd(i)+1:headInd(i+1)-1), '\s+', 'split');
blocks(i).data = str2double(vertcat(X{:}));
end
The result is a structure array containing the data from each block:
>> blocks
blocks =
2x1 struct array with fields:
header
data
>> blocks(2)
ans =
header: 'Emax=1/real'
data: [6x2 double]
>> blocks(2).data(:,1)
ans =
24.9656
24.9731
24.9806
24.9882
24.9957
25.0032

This should work. I don't think textscan() will work with a file like this because of the breaks between blocks.
Essentially what this code does is loop through lines between blocks until it finds a line that matches the data format. The code is naive and assumes that the file will have exactly the number of blocks lines per block that you specify. If there were a fixed number of lines between blocks it would be a lot easier and you could remove the first inner loop and replace with just ~=fgets(fid) once for each line.
function block_data = readfile(in_file_name)
fid = fopen(in_file_name, 'r');
delimiter = ' ';
line_format = '%f %f';
n_cols = 2; % Number of numbers per line
block_length = 3874; % Number of lines per block
n_blocks = 2; % Total number of blocks in file
tline = fgets(fid);
line_data = cell2mat(textscan(tline,line_format,'delimiter',delimiter,'MultipleDelimsAsOne',1));
block_n = 0;
block_data = zeros(n_blocks,block_length,n_cols);
while ischar(tline) && block_n < n_blocks
block_n = block_n+1;
tline = fgets(fid);
if ischar(tline)
line_data = cell2mat(textscan(tline,line_format,'delimiter',delimiter,'MultipleDelimsAsOne',1));
end
while ischar(tline) && isempty(line_data)
tline = fgets(fid);
line_data = cell2mat(textscan(tline,line_format,'delimiter',delimiter,'MultipleDelimsAsOne',1));
end
line_n = 1;
while line_n <= block_length
block_data(block_n,line_n,:) = cell2mat(textscan(tline,line_format,'delimiter',delimiter,'MultipleDelimsAsOne',1));
tline = fgets(fid);
line_n = line_n+1;
end
end
fclose(fid)

MATLAB: One Step Ahead Neural Network Timeseries Forecast

Intro: I'm using MATLAB's Neural Network Toolbox in an attempt to forecast time series one step into the future. Currently I'm just trying to forecast a simple sinusoidal function, but hopefully I will be able to move on to something a bit more complex after I obtain satisfactory results.
Problem: Everything seems to work fine, however the predicted forecast tends to be lagged by one period. Neural network forecasting isn't much use if it just outputs the series delayed by one unit of time, right?
Code:
t = -50:0.2:100;
noise = rand(1,length(t));
y = sin(t)+1/2*sin(t+pi/3);
split = floor(0.9*length(t));
forperiod = length(t)-split;
numinputs = 5;
forecasted = [];
msg = '';
for j = 1:forperiod
fprintf(repmat('\b',1,numel(msg)));
msg = sprintf('forecasting iteration %g/%g...\n',j,forperiod);
fprintf('%s',msg);
estdata = y(1:split+j-1);
estdatalen = size(estdata,2);
signal = estdata;
last = signal(end);
[signal,low,high] = preprocess(signal'); % pre-process
signal = signal';
inputs = signal(rowshiftmat(length(signal),numinputs));
targets = signal(numinputs+1:end);
%% NARNET METHOD
feedbackDelays = 1:4;
hiddenLayerSize = 10;
net = narnet(feedbackDelays,[hiddenLayerSize hiddenLayerSize]);
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
signalcells = mat2cell(signal,[1],ones(1,length(signal)));
[inputs,inputStates,layerStates,targets] = preparets(net,{},{},signalcells);
net.trainParam.showWindow = false;
net.trainparam.showCommandLine = false;
net.trainFcn = 'trainlm'; % Levenberg-Marquardt
net.performFcn = 'mse'; % Mean squared error
[net,tr] = train(net,inputs,targets,inputStates,layerStates);
next = net(inputs(end),inputStates,layerStates);
next = postprocess(next{1}, low, high); % post-process
next = (next+1)*last;
forecasted = [forecasted next];
end
figure(1);
plot(1:forperiod, forecasted, 'b', 1:forperiod, y(end-forperiod+1:end), 'r');
grid on;
Note:
The function 'preprocess' simply converts the data into logged % differences and 'postprocess' converts the logged % differences back for plotting. (Check EDIT for preprocess and postprocess code)
Results:
BLUE: Forecasted Values
RED: Actual Values
Can anyone tell me what I'm doing wrong here? Or perhaps recommend another method to achieve the desired results (lagless prediction of sinusoidal function, and eventually more chaotic timeseries)? Your help is very much appreciated.
EDIT:
It's been a few days now and I hope everyone has enjoyed their weekend. Since no solutions have emerged I've decided to post the code for the helper functions 'postprocess.m', 'preprocess.m', and their helper function 'normalize.m'. Maybe this will help get the ball rollin.
postprocess.m:
function data = postprocess(x, low, high)
% denormalize
logdata = (x+1)/2*(high-low)+low;
% inverse log data
sign = logdata./abs(logdata);
data = sign.*(exp(abs(logdata))-1);
end
preprocess.m:
function [y, low, high] = preprocess(x)
% differencing
diffs = diff(x);
% calc % changes
chngs = diffs./x(1:end-1,:);
% log data
sign = chngs./abs(chngs);
logdata = sign.*log(abs(chngs)+1);
% normalize logrets
high = max(max(logdata));
low = min(min(logdata));
y=[];
for i = 1:size(logdata,2)
y = [y normalize(logdata(:,i), -1, 1)];
end
end
normalize.m:
function Y = normalize(X,low,high)
%NORMALIZE Linear normalization of X between low and high values.
if length(X) <= 1
error('Length of X input vector must be greater than 1.');
end
mi = min(X);
ma = max(X);
Y = (X-mi)/(ma-mi)*(high-low)+low;
end

I didn't check you code, but made a similar test to predict sin() with NN. The result seems reasonable, without a lag. I think, your bug is somewhere in synchronization of predicted values with actual values.
Here is the code:
%% init & params
t = (-50 : 0.2 : 100)';
y = sin(t) + 0.5 * sin(t + pi / 3);
sigma = 0.2;
n_lags = 12;
hidden_layer_size = 15;
%% create net
net = fitnet(hidden_layer_size);
%% train
noise = sigma * randn(size(t));
y_train = y + noise;
out = circshift(y_train, -1);
out(end) = nan;
in = lagged_input(y_train, n_lags);
net = train(net, in', out');
%% test
noise = sigma * randn(size(t)); % new noise
y_test = y + noise;
in_test = lagged_input(y_test, n_lags);
out_test = net(in_test')';
y_test_predicted = circshift(out_test, 1); % sync with actual value
y_test_predicted(1) = nan;
%% plot
figure,
plot(t, [y, y_test, y_test_predicted], 'linewidth', 2);
grid minor; legend('orig', 'noised', 'predicted')
and the lagged_input() function:
function in = lagged_input(in, n_lags)
for k = 2 : n_lags
in = cat(2, in, circshift(in(:, end), 1));
in(1, k) = nan;
end
end

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Speedup processing of larger binary files - matlab

Related

Rearrange array into a form suitable for NN training

Error in MSR Identity toolkit (fopen)

How to use reshape on a 4D matrix after using fread on a RGB RAW file?

Importing data block with Matlab

MATLAB: One Step Ahead Neural Network Timeseries Forecast

Categories

Resources