MATLAB data file when it overs its memory - matlab

There is a Matrix of 500000000 X 5.
And the sample of this data is like this :
1 01 06:0 48407
1 01 06:1 48407
.
.
.
865850 31 23:5 1586884493
Each column means [area_number date hour:minute amount_of_data]
I want to load them entirely, after that make another 865850 X 4464 matrix from their 5th column values. In this new matrix, row insists area_number. And each column means amout_of_data according to time priority.
This is what I wrote.
clear all; close all;
fileID=fopen('data2.txt','r');
Data=fscanf(fileID, '%d %d %d:%d %d',[5 100000]);
Data = Data';
Zeros = zeros(4000, 4464);
DataA = Data(:,1); % indexs
DataB = Data(:,2); % dates
DataC = Data(:,3); % hours
DataD = Data(:,4); % minutes
DataE = Data(:,5); % data
for m=1:40000
r = DataA(m);
c = (DataB(m)-1)*24*6 + DataC(m)*6 + DataD(m);
Zeros(r,c) = DataE(m);
end
I can't finish it because the matrix too big to load it at once.
It overs memory limitation of MATLAB.
Please help me...
Thank you~!

To solve your problem, using the matfile command is probably the best choice. It allows you to write data directly to a mat-file on the filesystem but access it like a variable.
If I understood your data right, all lines with the same index are next to each other, and all data with the same index is small enough to fit your memory.
Read all data with index 1
process it like you did above, creating one row of your intended matrix
Write this row to your matfile
Proceed with the next index until you reach the end

Related

Saving values of variable in MATLAB

Hi for my code I would like to know how to best save my variable column. column is 733x1. Ideally I would like to have
column1(y)=column, but I obtain the error:
Conversion to cell from logical is not possible.
in the inner loop. I find it difficult to access these stored values in overlap.
for i = 1:7
for y = 1:ydim % ydim = 436
%execute code %code produces different 'column' on each iteration
column1{y} = column; %'column' size 733x1 %altogether 436 sets of 'column'
end
overlap{i} = column1; %iterates 7 times.
end
Ideally I want overlap to store 7 variables saved that are (733x436).
Thanks.
I'm assuming column is calculated using a procedure where each column is dependent on the latter. If not, then there are very likely improvements that can be made to this:
column = zeros(733, 1); % Might not need this. Depends on you code.
all_columns = zeros(xdim, ydim); % Pre-allocate memory (always do this)
% Note that the first dimension is usually called x,
% and the second called y in MATLAB
overlap = cell(7, 1);
overlap(:) = {zeros(xdim, ydim)}; % Pre-allocate memory
for ii = 1:numel(overlap) % numel is better than length
for jj = 1:ydim % ii and jj are better than i and j
% several_lines_of_code_to_calculate_column
column = something;
all_columns(:, jj) = column;
end
overlap{ii} = all_columns;
end
You can access the variables in overlap like this: overlap{1}(1,1);. This will get the first element in the first cell. overlap{2} will get the entire matrix in the second cell.
You specified that you wanted 7 variables. Your code implies that you know that cells are better than assigning it to different variables (var1, var2 ...). Good! The solution with different variables is bad bad bad.
Instead of using a cell array, you could instead use a 3D-array. This might make processing later on faster, if you can vectorize stuff for instance.
This will be:
column = zeros(733, 1); % Might not need this. Depends on you code.
overlap = zeros(xdim, ydim, 7) % Pre-allocate memory for 3D-matrix
for ii = 1:7
for jj = 1:ydim
% several_lines_of_code_to_calculate_column
column = something;
all_column(:, jj, ii) = column;
end
end

Processing a large dataset

My MATLAB program generates N=100 trajectories with T=10^8 time steps in each, i.e.
x = randn(10^8,100);
Ultimately, I want to process this data set and obtain an average autocorrelation of all trajectories:
y = mean(fft(x),2); % output size (10^8, 1)
Now since x is too big to store, my only viable option is to save it on the hard drive in small chunks of 10^6
x1 = randn(10^6, 100);
x2 = randn(10^6, 100);
etc
and then obtain y by processing each trajectory n=1:100 individually and accumulating the result:
for n=1:100
y = y + fft([x1(:,n); x2(:,n); ...; x100(:,n)]);
end
Is there a more elegant way of doing this? I have 100GB of RAM and a pool of 12 workers.
An easier way would be to generate your data once and then chop it into small pieces you save on disk, or, if possible, create the data on the workers themselves.
x = randn(10^8,100);
for ii=1:100
if ii ~=100
tmp = x(ii:ii+1e6)
else
tmp = x(ii:end); %ii+1e6 would result in end+1
end
filename = sprintf('Dataset%i',ii); %create filename
save(filename,tmp,'-v7.3'); %save file to disk in -v7.3 format
end
y = cell(100,1) %initialise output
parfor ii = 1:100
filename = sprintf('Dataset%i',ii); %get the filenames back
load(filename); %load the file
y{ii} = mean(fft(tmp),2); % I called it tmp when saving, so it's called tmp here
end
Now you can accumulate the results from the cell y in the desired manner. You can of course play around with the amount of files you create, as less files will be processed faster due to the overhead of parfor.

How to store .csv data and calculate average value in MATLAB

Can someone help me to understand how I can save in matlab a group of .csv files, select only the columns in which I am interested and get as output a final file in which I have the average value of the y columns and standard deviation of y axes? I am not so good in matlab and so I kindly ask if someone to help me to solve this question.
Here what I tried to do till now:
clear all;
clc;
which_column = 5;
dirstats = dir('*.csv');
col3Complete=0;
col4Complete=0;
for K = 1:length(dirstats)
[num,txt,raw] = xlsread(dirstats(K).name);
col3=num(:,3);
col4=num(:,4);
col3Complete=[col3Complete;col3];
col4Complete=[col4Complete;col4];
avgVal(K)=mean(col4(:));
end
col3Complete(1)=[];
col4Complete(1)=[];
%columnavg = mean(col4Complete);
%columnstd = std(col4Complete);
% xvals = 1 : size(columnavg,1);
% plot(xvals, columnavg, 'b-', xvals, columnavg-columnstd, 'r--', xvals, columnavg+columstd, 'r--');
B = reshape(col4Complete,[5000,K]);
m=mean(B,2);
C = reshape (col4Complete,[5000,K]);
S=std(C,0,2);
Now I know that I should compute mean and stdeviation inside for loop, using mean()function, but I am not sure how I can use it.
which_column = 5;
dirstats = dir('*.csv');
col3Complete=[]; % Initialise as empty matrix
col4Complete=[];
avgVal = zeros(length(dirstats),2); % initialise as columnvector
for K = 1:length(dirstats)
[num,txt,raw] = xlsread(dirstats(K).name);
col3=num(:,3);
col4=num(:,4);
col3Complete=[col3Complete;col3];
col4Complete=[col4Complete;col4];
avgVal(K,1)=mean(col4(:)); % 1st column contains mean
avgVal(K,2)=std(col4(:)); % 2nd column contains standard deviation
end
%columnavg = mean(col4Complete);
%columnstd = std(col4Complete);
% xvals = 1 : size(columnavg,1);
% plot(xvals, columnavg, 'b-', xvals, columnavg-columnstd, 'r--', xvals, columnavg+columstd, 'r--');
B = reshape(col4Complete,[5000,K]);
meanVals=mean(B,2);
I didn't change much, just initialised your arrays as empty arrays so you do not have to delete the first entry later on and made avgVal a column vector with the mean in column 1 and the standard deviation in column 1. You can of course add two columns if you want to collect those statistics for your 3rd column in the csv as well.
As a side note: xlsread is rather heavy for reading files, since Excel is horribly inefficient. If you want to read a structured file such as a csv, it's faster to use importdata.
Create some random matrix to store in a file with header:
A = rand(1e3,5);
out = fopen('output.csv','w');
fprintf(out,['ColumnA', '\t', 'ColumnB', '\t', 'ColumnC', '\t', 'ColumnD', '\t', 'ColumnE','\n']);
fclose(out);
dlmwrite('output.csv', A, 'delimiter','\t','-append');
Load it using csvread:
data = csvread('output.csv',1);
data now contains your five columns, without any headers.

Average across many files on matlab

I was wondering if I can get help with this. I have many .mat files with an array in each and I want to average each cell individually (average all (1,1)s, all (1,2)s, ... (2,1)s etc.) and store them.
Thanks for the help!
I'm not quite sure how your data is organised, but you can do something like this:
% Assume you know the size of the arrays and that the variables r and c
% hold the numbers of rows and columns respectively.
xTotals = zeros(r, c);
xCount = 0;
% for each file: assume the data is loaded into a variable called x, which is
% r rows by c columns
for ...
xTotals = xTotals + x;
xCount = xCount + 1;
end
xAvg = xTotals / xCount;
And xAvg will contain the average for each array cell. Note that you probably know xCount without having to count each time you go round the loop, but it depends on where you are getting your data. Hopefully you get the idea!

Problem with loop MATLAB

no time scores
1 10 123
2 11 22
3 12 22
4 50 55
5 60 22
6 70 66
. . .
. . .
n n n
Above a the content of my txt file (thousand of lines).
1st column - number of samples
2nd column - time (from beginning to end ->accumulated)
3rd column - scores
I wanted to create a new file which will be the total of every three sample of the scores divided by the time difference of the same sample.
e.g. (123+22+22)/ (12-10) = 167/2 = 83.5
(55+22+66)/(70-50) = 143/20 = 7.15
new txt file
83.5
7.15
.
.
.
n
so far I have this code:
fid=fopen('data.txt')
data = textscan(fid,'%*d %d %d')
time = (data{1})
score= (data{2})
for sample=1:length(score)
..... // I'm stucked here ..
end
....
If you are feeling adventurous, here's a vectorized one-line solution using ACCUMARRAY (assuming you already read the file in a matrix variable data like the others have shown):
NUM = 3;
result = accumarray(reshape(repmat(1:size(data,1)/NUM,NUM,1),[],1),data(:,3)) ...
./ (data(NUM:NUM:end,2)-data(1:NUM:end,2))
Note that here the number of samples NUM=3 is a parameter and can be substituted by any other value.
Also, reading your comment above, if the number of samples is not a multiple of this number (3), then simply discard the remaining samples by doing this beforehand:
data = data(1:fix(size(data,1)/NUM)*NUM,:);
Im sorry, here's a much simpler one :P
result = sum(reshape(data(:,3), NUM, []))' ./ (data(NUM:NUM:end,2)-data(1:NUM:end,2));
%# Easier to load with importdata
data = importdata('data.txt',' ',1);
%# Get the number of rows
n = size(data,1);
%# Column IDs
time = 2;score = 3;
%# The interval size (3 in your example)
interval = 3;
%# Pre-allocate space
new_data = zeros(numel(interval:interval:n),1);
%# For each new element in the new data
index = 1;
%# This will ignore elements past the closest (floor) multiple of 3 as requested
for i = interval:interval:n
%# First and last elements in a batch
a = i-interval+1;
b = i;
%# Compute the new data
new_data(index) = sum( data(a:b,score) )/(data(b,time)-data(a,time));
%# Increment
index = index+1;
end
For what it's worth, here is how you would go about to do that in Python. It is probably adaptable to Matlab.
import numpy
no, time, scores = numpy.loadtxt('data', skiprows=1).T
# here I assume that your n is a multiple of 3! otherwise you have to adjust
sums = scores[::3]+scores[1::3]+scores[2::3]
dt = time[2::3]-time[::3]
result = sums/dt
I suggest you use the importdata() function to get your data into your variable called data. Something like this:
data = importdata('data.txt',' ', 1)
replace ' ' by the delimiter your file uses, the 1 specifies that Matlab should ignore 1 header line. Then, to compute your results, try this statement:
(data(1:3:end,3)+data(2:3:end,3)+data(3:3:end,3))./(data(3:3:end,2)-data(1:3:end,2))
This worked on your sample data, should work on the real data you have. If you figure it out yourself you'll learn some useful Matlab.
Then use save() to write the results back to a file.
PS If you find yourself writing loops in Matlab you are probably doing something wrong.