I am using MATLAB to run some evaluations, and then I want to save the struct holding the results for future use.
Problem: The first thing I noticed was that the execution took very long, maybe 8 hours, and then saving the struct took maybe another 2 hours. After the process blocked several times and I redid it, I finally managed to save a copy of the data. What I find confusing is that the file is 150 GB.
Process: The code works as follows: it iterates over the .csv files in a folder (about 50,000 of them), reads each one in as a csv file, extracts the needed columns, and calculates the results.
My view: I guess that the whole iteration and the extraction of data from each file takes a lot of memory, which could slow the process down as time goes on. But I still don't understand why the final .mat file takes so much space, since in the past, for the same data but different parameters, it didn't need nearly that much space to save the results.
Question(s): Is it possible to reduce the size of the final file without affecting the results? I base this question on the assumption that MATLAB is maybe saving additional information from the process.
Code Schema:
clc; close all; clear all; fclose('all');
result = struct('values_a', [], 'other', []);
counter = 1;
counterFailed = 0;
for i = 1:length(dataNames)
    try
        structRead = ezread(nameOfFile, ','); % read the current .csv file
        values_a = structRead.timestamp;
        for j = 1:length(values_a)
            if (strcmp(values_a(j), 'N'))
                % replace 'N' entries by a neighbour (or the neighbours' average)
                if (j == 1)
                    values_a(j) = values_a(j+1);
                elseif (j == length(values_a))
                    values_a(j) = values_a(j-1);
                else
                    values_a(j) = (values_a(j-1) + values_a(j+1))/2;
                end
            end
            result(counter).values_a(j) = values_a(j);
        end
        counter = counter + 1;
    catch
        counterFailed = counterFailed + 1;
    end
end
save(path2save, 'result', '-v7.3');
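For what it's worth, the two things in this schema I suspect most are that result(counter).values_a is grown one element at a time, and that the timestamps stay as a cell array of strings (which is what the strcmp against 'N' suggests), and save stores cellstr data far less compactly than a plain double array. Would something along these lines give the same results with a smaller file? A rough sketch, assuming structRead.timestamp really is a cellstr and that fillmissing (R2016b+) is available:
raw  = structRead.timestamp;                                 % cell array of strings, e.g. {'12.3','N','12.7',...}
vals = str2double(raw);                                      % non-numeric entries such as 'N' become NaN
vals = fillmissing(vals, 'linear', 'EndValues', 'nearest');  % fill the gaps, much like the neighbour averaging above
result(counter).values_a = vals(:).';                        % store one numeric vector in a single assignment
Also, if the final variable stays below 2 GB, could the default -v7 format instead of -v7.3 make a difference, given that -v7.3 is HDF5-based?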
Related
I'm working with h5 files that have tens of thousands of datasets containing vectors of numerical values, all of the same size. My goal is to read the datasets and create one large matrix from these vectors. The datasets are named from "0" to "xxxxx" (some large number). I was able to read them and get the matrix, but it takes forever to do so. I was wondering if you could take a look at my code and suggest a way to make it run faster.
Here is how I do it right now:
t = [];
for i = 0:40400 % there are 40401 datasets in this particular file
    j = int2str(i);
    p = '/mesh/';           % the parent group
    s = strcat(p, j);       % create the full path of a dataset, e.g. '/mesh/0'
    r = h5read('temp.h5', s); % the file name is temp and s has the dataset path
    t = [t; r];
end
In this particular case there are 40401 datasets, each an 80802x1 vector of numerical values, so eventually I want to create an 80802x40401 matrix. This code takes over a day to finish. I think one reason it is slow is that MATLAB accesses the h5 file in every iteration. I would appreciate any tips on speeding up the code.
When I copied your code into the editor, I got the Code Analyzer warning underline under t with the message:
The variable t appears to change size on every loop iteration. Consider preallocating for speed.
You should preallocate the final size of t before starting the loop, using the function zeros:
t = zeros(80802,40401);
You should also read this: Programming Patterns: Maximizing Code Performance by Optimizing Memory Access:
Preallocate arrays before accessing them within loops
Store and access data in columns
Avoid creating unnecessary variables
The assignment p = '/mesh/'; is probably useless inside the loop and can be moved outside of it, since it doesn't change. It could be even better to drop p entirely and directly do s = strcat('/mesh/',j);.
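Putting these points together, the loop could look roughly like this (untested, since I don't have your temp.h5; the sizes are taken from your description, and each dataset is assumed to come back as an 80802x1 column):
nDatasets = 40401;                       % number of datasets, from your description
t = zeros(80802, nDatasets);             % preallocate: one column per dataset
for i = 0:nDatasets-1
    s = strcat('/mesh/', int2str(i));    % full dataset path, e.g. '/mesh/0'
    t(:, i+1) = h5read('temp.h5', s);    % fill one column instead of growing t
end
Filling one column per iteration also keeps the data access column-wise, as the programming-patterns article recommends.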
I have a piece of MATLAB code that works fine, but I wanted to know whether there is a faster way of performing the same task, where each .csv file is a 768x768 matrix.
Current code:
for k = 1:143
matFileName = sprintf('ang_thresholded%d.csv', k);
matData = load(matFileName);
imshow(matData)
end
Any help in this regard will be very helpful. Thank You!
In general, it's better to separate the loading, the computation, and the graphical output.
If you have enough memory, you should try to change your code to:
n_files = 143;
% If you know the size of your images a priori:
matData = zeros(768, 768, n_files); % preallocate for speed
for k = 1:n_files
    matFileName = sprintf('ang_thresholded%d.csv', k);
    matData(:,:,k) = load(matFileName);
end
seconds = 0.01;
for k = 1:n_files
    %clf; % not needed in your case, but needed if you want to plot more than one thing (hold on)
    imshow(matData(:,:,k));
    pause(seconds); % control the "framerate"
end
Note the use of pause().
Here is another option using MATLAB's datastores, which are designed to work with large datasets or lots of smaller sets. The TabularTextDatastore is made specifically for this kind of text-based data.
Something like the following should work. However, note that since I don't have any test files, it is a somewhat notional example ...
ttds = tabularTextDatastore('.\yourDirPath\*.csv'); % create the data store
while ttds.hasdata % this turns false after reading the last file
    temp = read(ttds); % returns a MATLAB table class
    imshow(temp.Variables)
end
Since it looks like your filenames' numbering is not zero-padded (e.g. 1 instead of 001), the file order might get messed up, so that may need to be addressed as well; one way is sketched below. Anyway, I thought this might be a good alternative approach worth considering, depending on what else you want to do with the data and how much of it there might be.
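If the frame number is the only run of digits in each filename, one way to fix the order is to sort the names numerically yourself and hand the sorted list to the datastore. A rough, untested sketch (the directory path is just the placeholder from above):
files = dir('.\yourDirPath\*.csv');
% pull the numeric part of each name, e.g. 'ang_thresholded12.csv' -> 12
nums = cellfun(@(n) str2double(regexp(n, '\d+', 'match', 'once')), {files.name});
[~, order] = sort(nums);
sortedNames = fullfile('.\yourDirPath', {files(order).name});
ttds = tabularTextDatastore(sortedNames); % the datastore now reads the files in numeric order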
I have a loop as below
for chnum=1:300
PI=....
area=....
save ('Result.mat' ,'chnum' ,'PI' ,'area',' -append') %-append
%% I like to have sth like below
% 1, 1.2,3.7
% 2, 1,8, 7.8
% .....
end
but it doesn't save. Do you have any idea why?
Best
Analysis of the Problem
The MATLAB help page for save states that the -append option appends new variables to the saved file. It does not append new rows to already saved matrices; a variable with the same name simply gets overwritten.
Solution
To achieve what you intended, you have to collect your data in matrices and save the whole matrix with a single call to save().
PI = zeros(300,1);
area = zeros(300,1);
for chnum = 1:300
    PI(chnum) = .... ;
    area(chnum) = .... ;
end
save('Result.mat', 'chnum', 'PI', 'area');
For nicer memory management I have added a pre-allocation of the arrays.
Well, even if it's not part of the question, I don't think you are using a good approach to save your calculations. Read/write operations performed on disk (saving data to a file falls into this case) are very expensive in terms of time. This is why I suggest you proceed as follows:
res = NaN(300,2);
for chnum = 1:300
    PI = ...
    area = ...
    res(chnum,:) = [PI area]; % saving chnum looks a bit like overkill, since you can retrieve it with size(res,1) when you need it...
end
save('Result.mat','res');
Basically, instead of processing a row and saving it to the file, then processing another row and saving it to the file, and so on, you collect all your data in a matrix and save only the final result to file.
I have 360 3D NIfTI files. I want to read all of them and save them into one NIfTI file using the NIfTI/ANALYZE toolbox, which should yield a large 4D file. So far I have written the following lines:
clear all;
clc;
fileFolder = fullfile(pwd, '\functional');
files = dir(fullfile(fileFolder, '*.nii'));
fileNames = {files.name};
for i = 1:length(fileNames)
    fname = fullfile(fileFolder, fileNames{i});
    z(i) = load_nii(fname);
    y = z(i).img;
    temp(:,:,:,i) = make_nii(y);
    save_nii(temp(:,:,:,i), 'myfile.nii')
    fprintf('Iter: %d\n', i)
end
This code produces a variable temp that is a 4D struct array and contains all the images. However, myfile.nii holds just one single image, not all of them, because its size is only about 6 MB when it should be at least 1 GB.
Can someone please have a look and let me know where I am wrong?
The way that you have it written, your loop keeps overwriting myfile.nii, since you're calling save_nii every time through the loop with only the latest data. Instead, you'll want to call save_nii only once, outside of the loop, on the full 4D array.
for k = 1:numel(fileNames)
    fname = fullfile(fileFolder, fileNames{k});
    z(k) = load_nii(fname);   % load the k-th volume
    y(:,:,:,k) = z(k).img;    % stack its image data along the 4th dimension
end
% Create the ND Nifti file
output = make_nii(y);
% Save it to a file
save_nii(output, 'myfile.nii')
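As a side note, growing y inside the loop triggers the same reallocation cost discussed in the other answers here. If all volumes share the same dimensions, you could preallocate the stack after reading the first file; a sketch, assuming load_nii always returns a struct with an img field of that fixed size (info and tmp are just placeholder names):
info = load_nii(fullfile(fileFolder, fileNames{1}));             % read the first volume to learn its size
y = zeros([size(info.img), numel(fileNames)], class(info.img));  % preallocate the full 4D stack
y(:,:,:,1) = info.img;
for k = 2:numel(fileNames)
    tmp = load_nii(fullfile(fileFolder, fileNames{k}));
    y(:,:,:,k) = tmp.img;    % fill a preallocated slot instead of growing y
end
output = make_nii(y);
save_nii(output, 'myfile.nii')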
I have a very large number of large data files. I would like to be able to categorize the data in each file and then save the filename to a cell array, such that at the end I'll have one cell array of filenames for each category of data, which I could then save to a MAT-file so that I can come back later and run analysis on each category. It might look something like this:
MatObj = matfile('listOfCategorizedFilenames.mat');
MatObj.boring = {};
MatObj.interesting = {};
files = dir(directory);
K = numel(files);
for k = 1:K
    load(files(k).name, 'data')
    metric = testfunction(data)
    if metric < threshold
        MatObj.boring{end+1} = files(k).name;
    else
        MatObj.interesting{end+1} = files(k).name;
    end
end
Because the list of files is very long, and testfunction can be slow, I'd like to set this to run unattended overnight or over the weekend (this is a stripped down version, metric might return one of several different categories), and in case of crashes or unforeseen errors, I'd like to save the data on the fly rather than populating a cell array in memory and dumping to disk at the end.
The problem is that matfile does not allow cell indexing, so the save step throws an error. My question is: is there a workaround for this limitation? Is there a better way to incrementally write the filenames to a list that would be easy to retrieve later?
I have no experience with matfile, so I cannot help you with that. As a quick and dirty solution, I would just write the filenames to two different text files. Quick testing suggests that the data is flushed to disk straight away and that the text files are OK even if you close MATLAB without doing an fclose (to simulate a crash). Untested code:
files = dir(directory);
K = numel(files);
boring = fopen('boring.txt', 'w');
interesting = fopen('interesting.txt', 'w');
for k = 1:K
    load(files(k).name, 'data')
    metric = testfunction(data)
    if metric < threshold
        fprintf(boring, '%s\n', files(k).name);
    else
        fprintf(interesting, '%s\n', files(k).name);
    end
end
% be nice and close the files
fclose(boring);
fclose(interesting);
Processing the boring/interesting text files afterwards should be trivial. If you also write the directory listing to a separate file before starting the loop, it should be pretty easy (either by hand or automatically) to figure out where to continue in case of a crash; a sketch is below.
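A minimal sketch of that last idea (filelist.txt is just an example name):
% Write the full file list once, before the loop, so after a crash you can
% compare it against boring.txt/interesting.txt to see where to resume.
listfid = fopen('filelist.txt', 'w');
for k = 1:K
    fprintf(listfid, '%s\n', files(k).name);
end
fclose(listfid);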
MAT-files are probably the most efficient way to store lists of files, but whenever I've had this problem, I make a cell array and save it using xlswrite or fprintf into a document that I can just reload later.
You said the save step throws an error, so I assume this part is okay, right?
for k = 1:K
    load(files(k).name, 'data')
    metric = testfunction(data)
    if metric < threshold
        MatObj.boring{end+1} = files(k).name;
    else
        MatObj.interesting{end+1} = files(k).name;
    end
end
Personally, I just then write,
xlswrite('name.xls', MatObj.interesting, 1, 'A1');
[~, ~, list] = xlsread('name.xls'); % later on
Or if you prefer text,
% I'm assuming here that it's just a single list of text strings.
fid = fopen('name.txt', 'w');
nrows = numel(MatObj.interesting); % number of names to write
for row = 1:nrows
    fprintf(fid, '%s\n', MatObj.interesting{row});
end
fclose(fid);
And then later open it again with fscanf or similar. I just use xlswrite; I've never had a problem with it, and it's not noticeably slow enough to put me off using it. I know my answer is just a workaround rather than a real solution, but I hope it helps.
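For completeness, a minimal sketch of reading the text list back in later (using fgetl here rather than fscanf, since it hands back one whole line at a time):
fid = fopen('name.txt', 'r');
list = {};
line = fgetl(fid);
while ischar(line)      % fgetl returns -1 (a double) at end of file
    list{end+1} = line; %#ok<AGROW> the list of names is small, so growing it is fine
    line = fgetl(fid);
end
fclose(fid);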