I have a problem with insufficient memory (RAM) when reading meteorological data (GRIB files), amounting to 35 GB of data, into a MATLAB cell array.
How can I work around my RAM restrictions when loading big data sets?
I have tried preallocating the cell array, but that does not help; loading stops at about 70% of the data set.
Here is the FOR-loop that errors:
% load grib files
for ii = 1:number_files
    waitbar(ii/number_files, h);
    file_name = [fname, '\', num2str(ii), '.grb'];
    grib_struct = read_grib(file_name, -1);
    Temp{ii} = single(grib_struct(1,1).fltarray);
    Rad_direct{ii} = single(grib_struct(1,2).fltarray);
    Rad_diff{ii} = single(grib_struct(1,3).fltarray);
    fclose('all');
end
Thanks!
You can use the matfile command to work directly on the file system. Any data you assign through a matfile object is written straight to disk instead of being held in RAM. It will be slow, but it is possible.
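A minimal sketch of how that could look for the loop in the question (assuming every fltarray has the same length; nPoints and the file name out.mat are placeholders, the rest of the names come from the question):
m = matfile('out.mat', 'Writable', true);   % backing MAT-file on disk
for ii = 1:number_files
    file_name = [fname, '\', num2str(ii), '.grb'];
    grib_struct = read_grib(file_name, -1);
    % indexed assignment stores each column in the MAT-file rather than in a workspace array
    m.Temp(1:nPoints, ii)       = single(grib_struct(1,1).fltarray);
    m.Rad_direct(1:nPoints, ii) = single(grib_struct(1,2).fltarray);
    m.Rad_diff(1:nPoints, ii)   = single(grib_struct(1,3).fltarray);
    fclose('all');
end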
I have a very big text file (about 11 GB) that I need to load into MATLAB, but when I use the "textread" function an "out of memory" error occurs, and there is no way to reduce the file size. When I type memory, it shows me this:
memory
Maximum possible array: 24000 MB (2.517e+10 bytes) *
Memory available for all arrays: 24000 MB (2.517e+10 bytes) *
Memory used by MATLAB: 1113 MB (1.167e+09 bytes)
Physical Memory (RAM): 16065 MB (1.684e+10 bytes)
* Limited by System Memory (physical + swap file) available.
Does anyone have a solution to this problem?
@Anthony suggested a way to read the file line by line, which is perfectly fine, but more recent (>=R2014b) versions of MATLAB have datastore functionality, which is designed for processing large data files in chunks.
There are several types of datastore available depending on the format of your text file. In the simplest cases (e.g. CSV files), the automatic detection works well and you can simply say
ds = datastore('myCsvFile.csv');
while hasdata(ds)
    chunkOfData = read(ds);
    % ... compute with chunkOfData ...
end
In even more recent (>=R2016b) versions of MATLAB, you can go one step further and wrap your datastore in a tall array. Tall arrays let you operate on data that is too large to fit into memory all at once. (Behind the scenes, tall arrays perform computations in chunks and give you the results only when you ask for them via a call to gather.) For example:
tt = tall(datastore('myCsvFile.csv'));
data = tt.SomeVariable;
result = gather(mean(data)); % Trigger tall array evaluation
According to your clarification of the purpose of your code:
it is a point cloud with XYZRGB columns in a txt file and I need to add another column to this.
What I suggest is to read the text file one line at a time, modify the line, and write the modified line straight to a new text file.
To read one line at a time:
% Open file for reading.
fid = fopen(filename, 'r');
% Get the first line.
line = fgetl(fid);
while ~isnumeric(line)
    % Do something.
    % Get the next line.
    line = fgetl(fid);
end
fclose(fid);
To write the line, you can use fprintf.
Here is a demonstration:
filename = 'myfile.txt';
filename_new = 'myfile_new.txt';
fid = fopen(filename);
fid_new = fopen(filename_new,'w+');
line = fgetl(fid);
while ~isnumeric(line)
    % Make sure you add \r\n at the end of the string;
    % otherwise, your text file will become a one liner.
    fprintf(fid_new, '%s %s\r\n', line, 'new column');
    line = fgetl(fid);
end
fclose(fid);
fclose(fid_new);
I'm working with h5 files that have tens of thousands of datasets, each containing a vector of numerical values, all of the same size. My goal is to read the datasets and create one large matrix from these vectors. The datasets are named from "0" to "xxxxx" (some large number). I was able to read them and build the matrix, but it takes forever to do so. I was wondering if you could take a look at my code and suggest a way to make it run faster.
Here is how I do it right now:
t = [];
for i = 0:40400                 % there are 40401 datasets in this particular file
    j = int2str(i);
    p = '/mesh/';               % the parent group
    s = strcat(p,j);            % the full path of a dataset, e.g. '/mesh/0'
    r = h5read('temp.h5',s);    % the file name is temp and s has the dataset path
    t = [t;r];
end
In this particular case there are 40401 datasets, each holding an 80802x1 vector of numerical values, so eventually I want to create an 80802x40401 matrix. This code takes over a day to finish. I think one of the reasons it is slow is that MATLAB accesses the h5 file in every iteration. I would appreciate any tips on speeding up the code.
When I copied your code into the editor, I got the warning underline under t with the message:
The variable t appears to change size on every loop iteration. Consider preallocating for speed.
You should allocate the final size of t before starting the loop, using the function zeros:
t = zeros(80802,40401);
You should also read this: Programming Patterns: Maximizing Code Performance by Optimizing Memory Access:
Preallocate arrays before accessing them within loops
Store and access data in columns
Avoid creating unnecessary variables
Also, p = '/mesh/'; never changes, so it is useless inside the loop and can be moved outside it. It would be even better to drop p entirely and write s = strcat('/mesh/',j); directly.
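Putting those suggestions together, the loop could look like this minimal sketch (assuming, as stated in the question, that each dataset is an 80802x1 vector in a file called temp.h5):
nDatasets = 40401;                      % number of datasets in the file
nRows = 80802;                          % length of each dataset vector
t = zeros(nRows, nDatasets);            % preallocate the full matrix once
for i = 0:nDatasets-1
    s = strcat('/mesh/', int2str(i));   % dataset path, e.g. '/mesh/0'
    t(:, i+1) = h5read('temp.h5', s);   % each vector goes into its own column
end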
I am trying to speed up a MATLAB script that dynamically allocates memory to a matrix: it reads a line of data from a file and writes it into the matrix, then reads another line and allocates more memory for a larger matrix to store the next line. The reason I did this instead of preallocating with zeros() is that I don't know the exact size the matrix needs to be to hold all of the data. I also don't know its maximum size, so I can't just preallocate a maximum size and then drop the memory I didn't use. This was fine for small amounts of data, but now I need to scale the script up to read many millions of data points, and this kind of incremental allocation is just much too slow.
So here is my attempt to speed up the script: I allocate memory in large blocks using the zeros function, and once a block is filled I allocate another large block. Here is some sample code:
data = [];
count = 0;
for ii = 1:num_filelines
    if mod(count, 1000) == 0
        data = [data; zeros(1000)]; % after 1000 lines are read, allocate another 1000 lines
    end
    data(ii, :) = line_read(file); % line_read reads a line of data from 'file'
end
Unfortunately this doesn't work; when I run it I get an error saying "Error using vertcat: Dimensions of matrices being concatenated are not consistent."
So here is my question: Is this method of allocating memory in large blocks actually any faster than incremental dynamic allocation, and also why does the above code not run? Thanks for the help.
What I recommend: if you know the number of lines and can guess a large enough number of columns, use a sparse matrix.
% create a sparse matrix
mat = sparse(numRows,numCols)
A sparse matrix does not store the zero elements; it stores only the nonzero values together with their indices. This can save a lot of space, and sparse matrices are used and indexed the same way as any other matrix. That is only worthwhile if you really need the data in matrix form from the beginning.
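For a rough sense of the saving, here is a small illustration (the sizes are arbitrary examples, not taken from your data):
S = sparse(10000, 10000);   % 10000x10000 matrix with no nonzeros stored yet
S(1, 1) = 5;                % only nonzero entries take up element storage
whos S                      % tens of kilobytes, versus roughly 800 MB for the full double matrix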
If you don't need a matrix format from the start, you can just do everything with a cell array. Preallocate a cell array with as many elements as there are lines in your file.
data = cell(1,numLines);
for i = 1:numLines
    % get the matrix from this line (lineData stands for whatever you parse out of line i)
    data{i} = lineData;
end
data = cell2mat(data);
This method puts everything into a cell array, whose cells can hold rows of any length "dynamically", and the result can then be converted to a regular matrix.
Addition
If you use the sparse matrix method, the matrix will likely end up larger than necessary; once you are done you can trim it down easily and then, if needed, cast it to a regular matrix.
[val,~] = max(sum(mat ~= 0,2)); % assuming rows fill left to right, this is the last used column
mat(:,val+1:end) = [];          % remove the unused columns after it
mat = full(mat);                % use this only if you really need the full matrix
This will remove any unnecessary columns and then cast it to a full matrix that includes the 0 elements. I would not recommend casting it to a full matrix, as this requires a ton more space, but if you truly need it, use it.
UPDATE
To get the number of lines in a file easily, use MATLAB's Perl interpreter.
Create a file called countlines.pl and paste in the two lines below:
while (<>) {};
print $.,"\n";
Then you can run this script on your file as follows
numLines = str2double(perl('countlines.pl','data.csv'));
Problem solved.
From MATLAB forum thread here
Remember it is always best to preallocate everything beforehand, because with Shai's method you are still reallocating large amounts repeatedly, especially for a large file.
To solve your error, simply use this syntax when allocating:
data = [data; zeros(1000, size(data,2))];
You might want to read the first line outside the loop so you'll know the number of columns and make the first allocation for data.
If you want to stick to your code as written, I would replace your initialization data = [] with:
data = zeros(1,1000);
Keep in mind, though, the warning from @MZimmerman6: zeros(1000) generates a 1000 x 1000 array. You may want to change all of your zeros statements to zeros(...,Nc), where Nc is the length of a line in characters.
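A minimal sketch combining these ideas (line_read, file, and num_filelines come from your question; the block size of 1000 is arbitrary):
firstRow = line_read(file);              % read one line up front to learn the width
Nc = numel(firstRow);                    % number of columns per line
data = zeros(1000, Nc);                  % first block
data(1,:) = firstRow;
for ii = 2:num_filelines
    if ii > size(data,1)
        data = [data; zeros(1000, Nc)];  % grow by another block of 1000 rows
    end
    data(ii,:) = line_read(file);
end
data = data(1:num_filelines, :);         % trim any unused rows at the end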
I have a 37,000,000x1 double array saved in a matfile under a structure labelled r. I can point to this file using matfile(...) and then just use the find(...) command to find all values above a threshold val.
This finds all the values greater than or equal to 0.004, but given the size of my data it takes some time.
I want to reduce that time and have considered using bin files (apparently they are better than txt files in terms of not losing precision?), but I'm not knowledgeable about the syntax or the method.
I've managed to save the data into a bin file, but what is the quickest way to search through this large file?
The only output data I want are the actual values greater than my specified value.
Is using a bin file best? Or a matfile?
I don't want to load the entire file into MATLAB. I want to conserve MATLAB's memory, as other programs may need the space and I don't want memory errors again.
As @OlegKomarov points out, a 37,000,000 element array of doubles is not very big. Your real problem may be that you don't have enough RAM and/or are using a 32-bit version of MATLAB. The find function will require additional memory for the input and the output array of indices.
If you want to load and process your data in chunks, you can use the matfile function. Here's a small example:
fname = [tempname '.mat'];                  % Use temp directory file for example
matObj = matfile(fname,'Writable',true);    % Create MAT-file
matObj.r = rand(37e4,1e2);                  % Write random data to r variable in file
szR = size(matObj,'r');                     % Get dimensions of r variable in file
idx = [];
for i = 1:szR(2)
    idx = [idx; find(matObj.r(:,i)>0.999)]; % Find indices of r greater than 0.999
end
delete(fname);                              % Delete example file
This will save you memory, but it is definitely not faster than storing everything in memory and calling find once. File access is always slower (though an SSD will help a bit). The code above grows the idx variable dynamically, but the memory is only re-allocated a few times in large chunks, which can be quite fast in current versions of MATLAB.
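If the repeated growth of idx ever becomes a concern, one alternative is to collect the per-column results in a cell array and concatenate once at the end (a sketch based on the example above):
idxChunks = cell(1, szR(2));
for i = 1:szR(2)
    idxChunks{i} = find(matObj.r(:,i) > 0.999); % indices of large values in column i
end
idx = vertcat(idxChunks{:});                    % combine all chunks in one step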
Here is the desired workflow:
I want to load 100 images into MATLAB workspace
Run a bunch of my code on the images
Save my output (the output returned by my code is an integer array) in a new array
By the end I should have a data structure storing the output of the code for images 1-100.
How would I go about doing that?
If you know the name of the directory they are in, or if you cd to that directory, then use dir to get the list of image names.
Now it is simply a for loop to load in the images. Store the images in a cell array. For example...
D = dir('*.jpg');
imcell = cell(1,numel(D));
for i = 1:numel(D)
imcell{i} = imread(D(i).name);
end
BEWARE that these 100 images may take up a lot of memory. For example, a single 1Kx1K image requires 3 megabytes to store if it holds uint8 RGB values. This may not seem like a huge amount, but 100 of these images will require 300 MB of RAM. The real issue comes if your operations turn the images into doubles: they will then take up 2.4 gigabytes of memory. This will quickly eat up the RAM you have, especially if you are not using a 64-bit version of MATLAB.
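The rough arithmetic behind those numbers, for reference:
bytesPerImageUint8  = 1024*1024*3;                    % 1Kx1K RGB, 1 byte per channel, ~3 MB
bytesPerImageDouble = 1024*1024*3*8;                  % same image as double, ~24 MB
total100AsDouble    = 100*bytesPerImageDouble / 2^30  % ~2.4 GB for 100 images as doubles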
Assuming that your images are named in a sequential way, you could do this:
N = 100
IMAGES = cell(1,N);
FNAMEFMT = 'image_%d.png';
% Load images
for i=1:N
IMAGES{i} = imread(sprintf(FNAMEFMT, i));
end
% Run code
RESULT = cell(1,N);
for i=1:N
RESULT{i} = someImageProcessingFunction(IMAGES{i});
end
The cell array RESULT then contains the output for each image.
Be aware that depending on the size of your images, prefetching the images might make you run out of memory.
As many have said, this can get pretty big. Is there a reason you need ALL of these in memory when you are done? Could you write the individual results out as files when you are done with them such that you never have more than the input and output images in memory at a given time?
IMWRITE would be good to get them out of memory when you are done.
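A sketch of that one-at-a-time approach, reusing D from the dir example and someImageProcessingFunction from the answer above (the output file names are placeholders; if each result is itself an image, imwrite could replace save):
for i = 1:numel(D)
    img = imread(D(i).name);                        % only one input image in memory at a time
    result = someImageProcessingFunction(img);      % your processing code
    save(sprintf('result_%03d.mat', i), 'result');  % write the result out, then move on
end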