Processing a large dataset

Processing a large dataset - matlab

My MATLAB program generates N=100 trajectories with T=10^8 time steps in each, i.e.
x = randn(10^8,100);
Ultimately, I want to process this data set and obtain an average autocorrelation of all trajectories:
y = mean(fft(x),2); % output size (10^8, 1)
Now since x is too big to store, my only viable option is to save it on the hard drive in small chunks of 10^6
x1 = randn(10^6, 100);
x2 = randn(10^6, 100);
etc
and then obtain y by processing each trajectory n=1:100 individually and accumulating the result:
for n=1:100
y = y + fft([x1(:,n); x2(:,n); ...; x100(:,n)]);
end
Is there a more elegant way of doing this? I have 100GB of RAM and a pool of 12 workers.

An easier way would be to generate your data once and then chop it into small pieces you save on disk, or, if possible, create the data on the workers themselves.
x = randn(10^8,100);
for ii=1:100
if ii ~=100
tmp = x(ii:ii+1e6)
else
tmp = x(ii:end); %ii+1e6 would result in end+1
end
filename = sprintf('Dataset%i',ii); %create filename
save(filename,tmp,'-v7.3'); %save file to disk in -v7.3 format
end
y = cell(100,1) %initialise output
parfor ii = 1:100
filename = sprintf('Dataset%i',ii); %get the filenames back
load(filename); %load the file
y{ii} = mean(fft(tmp),2); % I called it tmp when saving, so it's called tmp here
end
Now you can accumulate the results from the cell y in the desired manner. You can of course play around with the amount of files you create, as less files will be processed faster due to the overhead of parfor.

Related

MATLAB data file when it overs its memory

There is a Matrix of 500000000 X 5.
And the sample of this data is like this :
1 01 06:0 48407
1 01 06:1 48407
.
.
.
865850 31 23:5 1586884493
Each column means [area_number date hour:minute amount_of_data]
I want to load them entirely, after that make another 865850 X 4464 matrix from their 5th column values. In this new matrix, row insists area_number. And each column means amout_of_data according to time priority.
This is what I wrote.
clear all; close all;
fileID=fopen('data2.txt','r');
Data=fscanf(fileID, '%d %d %d:%d %d',[5 100000]);
Data = Data';
Zeros = zeros(4000, 4464);
DataA = Data(:,1); % indexs
DataB = Data(:,2); % dates
DataC = Data(:,3); % hours
DataD = Data(:,4); % minutes
DataE = Data(:,5); % data
for m=1:40000
r = DataA(m);
c = (DataB(m)-1)*24*6 + DataC(m)*6 + DataD(m);
Zeros(r,c) = DataE(m);
end
I can't finish it because the matrix too big to load it at once.
It overs memory limitation of MATLAB.
Please help me...
Thank you~!

To solve your problem, using the matfile command is probably the best choice. It allows you to write data directly to a mat-file on the filesystem but access it like a variable.
If I understood your data right, all lines with the same index are next to each other, and all data with the same index is small enough to fit your memory.
Read all data with index 1
process it like you did above, creating one row of your intended matrix
Write this row to your matfile
Proceed with the next index until you reach the end

Vectorization of matlab code for faster execution

My code works in the following manner:
1.First, it obtains several images from the training set
2.After loading these images, we find the normalized faces,mean face and perform several calculation.
3.Next, we ask for the name of an image we want to recognize
4.We then project the input image into the eigenspace, and based on the difference from the eigenfaces we make a decision.
5.Depending on eigen weight vector for each input image we make clusters using kmeans command.
Source code i tried:
clear all
close all
clc
% number of images on your training set.
M=1200;
%Chosen std and mean.
%It can be any number that it is close to the std and mean of most of the images.
um=60;
ustd=32;
%read and show images(bmp);
S=[]; %img matrix
for i=1:M
str=strcat(int2str(i),'.jpg'); %concatenates two strings that form the name of the image
eval('img=imread(str);');
[irow icol d]=size(img); % get the number of rows (N1) and columns (N2)
temp=reshape(permute(img,[2,1,3]),[irow*icol,d]); %creates a (N1*N2)x1 matrix
S=[S temp]; %X is a N1*N2xM matrix after finishing the sequence
%this is our S
end
%Here we change the mean and std of all images. We normalize all images.
%This is done to reduce the error due to lighting conditions.
for i=1:size(S,2)
temp=double(S(:,i));
m=mean(temp);
st=std(temp);
S(:,i)=(temp-m)*ustd/st+um;
end
%show normalized images
for i=1:M
str=strcat(int2str(i),'.jpg');
img=reshape(S(:,i),icol,irow);
img=img';
end
%mean image;
m=mean(S,2); %obtains the mean of each row instead of each column
tmimg=uint8(m); %converts to unsigned 8-bit integer. Values range from 0 to 255
img=reshape(tmimg,icol,irow); %takes the N1*N2x1 vector and creates a N2xN1 matrix
img=img'; %creates a N1xN2 matrix by transposing the image.
% Change image for manipulation
dbx=[]; % A matrix
for i=1:M
temp=double(S(:,i));
dbx=[dbx temp];
end
%Covariance matrix C=A'A, L=AA'
A=dbx';
L=A*A';
% vv are the eigenvector for L
% dd are the eigenvalue for both L=dbx'*dbx and C=dbx*dbx';
[vv dd]=eig(L);
% Sort and eliminate those whose eigenvalue is zero
v=[];
d=[];
for i=1:size(vv,2)
if(dd(i,i)>1e-4)
v=[v vv(:,i)];
d=[d dd(i,i)];
end
end
%sort, will return an ascending sequence
[B index]=sort(d);
ind=zeros(size(index));
dtemp=zeros(size(index));
vtemp=zeros(size(v));
len=length(index);
for i=1:len
dtemp(i)=B(len+1-i);
ind(i)=len+1-index(i);
vtemp(:,ind(i))=v(:,i);
end
d=dtemp;
v=vtemp;
%Normalization of eigenvectors
for i=1:size(v,2) %access each column
kk=v(:,i);
temp=sqrt(sum(kk.^2));
v(:,i)=v(:,i)./temp;
end
%Eigenvectors of C matrix
u=[];
for i=1:size(v,2)
temp=sqrt(d(i));
u=[u (dbx*v(:,i))./temp];
end
%Normalization of eigenvectors
for i=1:size(u,2)
kk=u(:,i);
temp=sqrt(sum(kk.^2));
u(:,i)=u(:,i)./temp;
end
% show eigenfaces;
for i=1:size(u,2)
img=reshape(u(:,i),icol,irow);
img=img';
img=histeq(img,255);
end
% Find the weight of each face in the training set.
omega = [];
for h=1:size(dbx,2)
WW=[];
for i=1:size(u,2)
t = u(:,i)';
WeightOfImage = dot(t,dbx(:,h)');
WW = [WW; WeightOfImage];
end
omega = [omega WW];
end
% Acquire new image
% Note: the input image must have a bmp or jpg extension.
% It should have the same size as the ones in your training set.
% It should be placed on your desktop
ed_min=[];
srcFiles = dir('G:\newdatabase\*.jpg'); % the folder in which ur images exists
for b = 1 : length(srcFiles)
filename = strcat('G:\newdatabase\',srcFiles(b).name);
Imgdata = imread(filename);
InputImage=Imgdata;
InImage=reshape(permute((double(InputImage)),[2,1,3]),[irow*icol,1]);
temp=InImage;
me=mean(temp);
st=std(temp);
temp=(temp-me)*ustd/st+um;
NormImage = temp;
Difference = temp-m;
p = [];
aa=size(u,2);
for i = 1:aa
pare = dot(NormImage,u(:,i));
p = [p; pare];
end
InImWeight = [];
for i=1:size(u,2)
t = u(:,i)';
WeightOfInputImage = dot(t,Difference');
InImWeight = [InImWeight; WeightOfInputImage];
end
noe=numel(InImWeight);
% Find Euclidean distance
e=[];
for i=1:size(omega,2)
q = omega(:,i);
DiffWeight = InImWeight-q;
mag = norm(DiffWeight);
e = [e mag];
end
ed_min=[ed_min MinimumValue];
theta=6.0e+03;
%disp(e)
z(b,:)=InImWeight;
end
IDX = kmeans(z,5);
clustercount=accumarray(IDX, ones(size(IDX)));
disp(clustercount);
Running time for 100 images:Elapsed time is 103.947573 seconds.
QUESTIONS:
1.It is working fine for M=50(i.e Training set contains 50 images) but not for M=1200(i.e Training set contains 1200 images).It is not showing any error.There is no output.I waited for 10 min still there is no output.What is the problem?Where i was wrong?

To answer your second question, you can simply 'save' any generated variable as a .mat file in your working directory (Current folder) which can be accessed later. So in your code, if the 'training eigenfaces' is given by the variable 'u', you can use the following:
save('eigenface.mat','u')
This creates a .mat file with the name eigenface.mat which contains the variable 'u', the eigenfaces. Note that this variable is saved in your Current Folder.
In a later instance when you are trying out with your test data, you can simply 'load' this variable:
load('eigenface.mat')
This automatically loads 'u' into your workspace.
You can also save additional variables in the same .mat file if necessary
save('eigenface.mat','u','v'...)
The answer to the first question is that the code is simply not done running yet. Vectorizing the code (instead of using a for loop), as suggested in the comment section above can improve the speed significantly.
[EDIT]
Since the images are not very big, the code is not significantly slowed down by the first for loop. You can improve the performance of the rest of the code by vectorizing the code. Vectorized operations are faster than for-loops. This link might be useful in understanding vectorization:
http://www.mathworks.com/help/matlab/matlab_prog/vectorization.html
For example, the second for-loop can be replaced by the following vectorized form, as suggested in the comments
tempS = double(S);
meanS = mean(S,1);
stdS = std(S,0,1);
S = (tempS - meanS) ./ stdS;
Use MATLAB's timer functions tic and toc for finding how long the first for-loop alone is taking to execute. Add tic before the for loop and toc after it. If the time taken for 50 images is about 104 seconds, then it would be significantly more for 1200 images.

Reading labview binary files in Matlab?

I have large .bin files (10GB-60GB) created by Labview software, the .bin files represent the output of two sensors used from experiments that I have done.
The problem I have is importing the data into Matlab, the only way I have achieved this so far is by converting the .bin files to .txt files in Labview software then Importing the data into MATLAB using the following code:
Nlines = 1e6; % set number of lines to sample per cycle
sample_rate = (1); %sample rate
DECE= 1000;% decimation factor
TIME = (0:sample_rate:sample_rate*((Nlines)-1));%first inctance of time vector
format = '%f\t%f';
fid = fopen('H:\PhD backup\Data/ONK_PP260_G_text.txt');
while(~feof(fid))
C = textscan(fid, format, Nlines, 'CollectOutput', true);
d = C{1}; % immediately clear C at this point you need the memory!
clearvars C ;
TIME = ((TIME(end)+sample_rate):sample_rate:(sample_rate*(size(d,1)))+(TIME(end)));%shift Time along
plot((TIME(1:DECE:end)),(d(1:DECE:end,:)))%plot and decimate
hold on;
clearvars d;
end
fclose(fid);
The basic idea behind my code is to conserve RAM by reading Nlines of data from .txt on disk to Matlab variable C in RAM, plotting C then clearing C. This process occurs in loop so the data is plotted in chunks until the end of the .txt file is reached.
I want to read the .bin file directly into MATLAB rather than converting it to .txt first because it takes hours for the conversion to complete and I have a lot of data. Here are some examples of my data but in manageable sizes:
https://www.dropbox.com/sh/rzut4zbrert9fm0/q9SiZYmrdG
Here is a description of the binary data:
http://forums.ni.com/t5/LabVIEW/Loading-Labview-Binary-Data-into-Matlab/td-p/1107587
Someone has all ready written a Matlab script to import Labveiw .bin files but their script will only work with very small files:
% LABVIEWLOAD Load Labview binary data into Matlab
%
% DATA = LABVIEWLOAD(FNAME,DIM); % Loads the Labview data in binary
% format from the file specified by FNAME, given the NUMBER of
% dimensions (not the actual dimensions of the data in the binary
% file) of dimensions specified by DIM.
%
% LABVIEWLOAD will repeatedly grab data until the end of the file.
% Labview arrays of the same number of dimensions can be repeatedly
% appended to the same binary file. Labview arrays of any dimensions
% can be read.
%
% DATA = LABVIEWLOAD(FNAME,DIM,PREC); % Loads the data with the specified
% precision, PREC.
%
% Note: This script assumes the default parameters were used in the
% Labview VI "Write to Binary File". Labview uses the Big Endian
% binary number format.
%
% Examples:
% D = labviewload('Data.bin',2); % Loads in Data.bin assuming it
% contains double precision data and two dimensions.
%
% D = labviewload('OthereData.bin',3,'int8'); % Loads in
% OtherData.bin assuming it contains 8 bit integer values or
% boolean values.
%
% Jeremiah Smith
% 4/8/10
% Last Edit: 5/6/10
function data = labviewload(fname,dim,varargin)
siz = [2^32 2^16 2^8 1]'; % Array dimension conversion table
% Precision Input
if nargin == 2
prec = 'double';
elseif nargin == 3
prec = varargin{1};
else
error('Too many inputs.')
end
%% Initialize Values
fid = fopen(fname,'r','ieee-be'); % Open for reading and set to big-endian binary format
fsize = dir(fname); % File information
fsize = fsize.bytes; % Files size in bytes
%% Preallocation
rows = [];
columns = [];
I = 0;
while fsize ~= ftell(fid)
dims = [];
for i=1:1:dim
temp = fread(fid,4);
temp = sum(siz.*temp);
dims = [dims,temp];
end
I = I + 1;
% fseek(fid,prod(dims)*8,'cof'); % Skip the actual data
temp = fread(fid,prod(dims),prec,0,'ieee-be'); % Skip the actual data (much faster for some reason)
end
fseek(fid,0,'bof'); % Reset the cursor
data = repmat({NaN*ones(dims)},I,1); % Preallocate space, assumes each section is the same
%% Load and parse data
for j=1:1:I
dims = []; % Actual array dimensions
for i=1:1:dim
temp = fread(fid,4);
temp = sum(siz.*temp);
dims = [dims,temp];
end
clear temp i
temp = fread(fid,prod(dims),prec,0,'ieee-be'); % 11 is the values per row,
% double is the data type, 0 is the bytes to skip, and
% ieee-be specified big endian binary data
%% Reshape the data into the correct array configuration
if dim == 1
temp = reshape(temp,1,dims);
else
evalfunc = 'temp = reshape(temp';
for i=1:1:dim
evalfunc = [evalfunc ',' num2str(dims(dim-i+1))];
end
if dim ~= 2
eval([evalfunc ');'])
else
eval([evalfunc ')'';'])
end
end
data{j} = temp; % Save the data
end
fclose(fid); % Close the file
The code has the following error message, when you try to process even relatively small .bin files:
Error using ones
Maximum variable size allowed by the program is exceeded.
Error in labviewload (line 65)
data = repmat({NaN*ones(dims)},I,1); % Preallocate space, assumes each section is the same
Can you help me modify the code so that I can open large .bin files? Any help will be much appreciated.
Cheers,
Jim

How to read binary file in one block rather than using a loop in matlab

I have this file which is a series of x, y, z coordinates of over 34 million particles and I am reading them in as follows:
parfor i = 1:Ntot
x0(i,1)=fread(fid, 1, 'real*8')';
y0(i,1)=fread(fid, 1, 'real*8')';
z0(i,1)=fread(fid, 1, 'real*8')';
end
Is there a way to read this in without doing a loop? It would greatly speed up the read in. I just want three vectors with x,y,z. I just want to speed up the read in process. Thanks. Other suggestions welcomed.

I do not have a machine with Matlab and I don't have your file to test either but I think coordinates = fread (fid, [3, Ntot], 'real*8') should work fine.

Maybe fread is the function you are looking for.

You're right. Reading data in larger batches is usually a key part of speeding up file reads. Another part is pre-allocating the destination variable zeros, for example, a zeros call.
I would do something like this:
%Pre-allocate
x0 = zeros(Ntot,1);
y0 = zeros(Ntot,1);
z0 = zeros(Ntot,1);
%Define a desired batch size. make this as large as you can, given available memory.
batchSize = 10000;
%Use while to step through file
indexCurrent = 1; %indexCurrent is the next element which will be read
while indexCurrent <= Ntot
%At the end of the file, we may need to read less than batchSize
currentBatch = min(batchSize, Ntot-indexCurrent+1);
%Load a batch of data
tmpLoaded = fread(fid, currentBatch*3, 'read*8')';
%Deal the fread data into the desired three variables
x0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(1:3:end);
y0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(2:3:end);
z0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(3:3:end);
%Update index variable
indexCurrent = indexCurrent + batchSize;
end
Of course, make sure you test, as I have not. I'm always suspicious of off-by-one errors in this sort of work.

Rolling window for averaging using MATLAB

I have the following code, pasted below. I would like to change it to only average the 10 most recently filtered images and not the entire group of filtered images. The line I think I need to change is: Yout(k,p,q) = (Yout(k,p,q) + (y.^2))/2;, but how do I do it?
j=1;
K = 1:3600;
window = zeros(1,10);
Yout = zeros(10,column,row);
figure;
y = 0; %# Preallocate memory for output
%Load one image
for i = 1:length(K)
disp(i)
str = int2str(i);
str1 = strcat(str,'.mat');
load(str1);
D{i}(:,:) = A(:,:);
%Go through the columns and rows
for p = 1:column
for q = 1:row
if(mean2(D{i}(p,q))==0)
x = 0;
else
if(i == 1)
meanvalue = mean2(D{i}(p,q));
end
%Calculate the temporal mean value based on previous ones.
meanvalue = (meanvalue+D{i}(p,q))/2;
x = double(D{i}(p,q)/meanvalue);
end
%Filtering for 10 bands, based on the previous state
for k = 1:10
[y, ZState{k}] = filter(bCoeff{k},aCoeff{k},x,ZState{k});
Yout(k,p,q) = (Yout(k,p,q) + (y.^2))/2;
end
end
end
% for k = 2:10
% subplot(5,2,k)
% subimage(Yout(k)*5000, [0 100]);
% colormap jet
% end
% pause(0.01);
end
disp('Done Loading...')

The best way to do this (in my opinion) would be to use a circular-buffer to store your images. In a circular-, or ring-buffer, the oldest data element in the array is overwritten by the newest element pushed in to the array. The basics of making such a structure are described in the short Mathworks video Implementing a simple circular buffer.
For each iteration of you main loop that deals with a single image, just load a new image into the circular-buffer and then use MATLAB's built in mean function to take the average efficiently.
If you need to apply a window function to the data, then make a temporary copy of the frames multiplied by the window function and take the average of the copy at each iteration of the loop.

The line
Yout(k,p,q) = (Yout(k,p,q) + (y.^2))/2;
calculates a kind of Moving Average for each of the 10 bands over all your images.
This line calculates a moving average of meanvalue over your images:
meanvalue=(meanvalue+D{i}(p,q))/2;
For both you will want to add a buffer structure that keeps only the last 10 images.
To simplify it, you can also just keep all in memory. Here is an example for Yout:
Change this line: (Add one dimension)
Yout = zeros(3600,10,column,row);
And change this:
for q = 1:row
[...]
%filtering for 10 bands, based on the previous state
for k = 1:10
[y, ZState{k}] = filter(bCoeff{k},aCoeff{k},x,ZState{k});
Yout(i,k,p,q) = y.^2;
end
YoutAvg = zeros(10,column,row);
start = max(0, i-10+1);
for avgImg = start:i
YoutAvg(k,p,q) = (YoutAvg(k,p,q) + Yout(avgImg,k,p,q))/2;
end
end
Then to display use
subimage(Yout(k)*5000, [0 100]);
You would do sth. similar for meanvalue

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Processing a large dataset - matlab

Related

MATLAB data file when it overs its memory

Vectorization of matlab code for faster execution

Reading labview binary files in Matlab?

How to read binary file in one block rather than using a loop in matlab

Rolling window for averaging using MATLAB

Categories

Resources