Decimating large data files from disk in MATLAB?

I have very large data files (typically 30 GB to 60 GB) in .txt format. I want to find a way to automatically decimate the files without importing them into memory first.
My .txt files consist of two columns of data, this is an example file:
https://www.dropbox.com/s/87s7qug8aaipj31/RTL5_57.txt
What I have done so far is to import the data into a variable "C" and then downsample it. The problem with this method is that the variable "C" often fills MATLAB's memory capacity before the program has a chance to decimate:
function [] = textscan_EPS(N,D,fileEPS )
%fileEPS: .txt address
%N: number of lines to read
%D: Decimation factor
fid = fopen(fileEPS);
format = '%f\t%f';
C = textscan(fid, format, N, 'CollectOutput', true);% this variable exceeds memory capacity
d = downsample(C{1},D);
plot(d);
fclose(fid);
end
How can I modify this line:
C = textscan(fid, format, N, 'CollectOutput', true);
so that it effectively decimates the data at that point, importing only every other line, or every 3rd line, etc., of the .txt file from disk into the variable "C" in memory.
Any help would be much appreciated.
Cheers,
Jim
PS
An alternative method that I have been playing around with uses "fread", but it encounters the same problem:
function [d] = fread_EPS(N,D,fileEPS)
%N: number of lines to read
%D: decimation factor
%fileEPS: location of .txt file
%read in the data as characters
fid = fopen(fileEPS);
c = fread(fid,N*19,'*char');% each line of the .txt file has 19 characters
%Parse the characters into floating point numbers
f=sscanf(c,'%f');
%Reshape the data into a two column format
format long
d=decimate((flipud(rot90(reshape(f,2,[])))),D); %reshape into 2-column format (reshape, rotate 90, flip vertically), then decimate by factor D
fclose(fid);
end

I believe that textscan is the way to go; however, you may need to take an intermediate step. Here is what I would do, assuming you can easily read N lines at a time:
Read in N lines with textscan(fileID,formatSpec,N)
Sample from these lines, store the result (file or variable) and drop the rest
As long as there are lines left, continue with step 1
Optional, depending on your storage method: combine all stored results into one big sample
It should be possible to read just 1 line at a time and decide whether to keep or discard it. Though this would consume minimal memory, I would read a few thousand lines at a time to get reasonable performance.
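A minimal sketch of that block-read-and-decimate loop (the block size, decimation factor, and file name below are only example values):
nLines = 1e4;                         % lines to read per block
D      = 10;                          % keep every D-th line (choose D large enough that the kept sample fits in memory)
fmt    = '%f\t%f';                    % two tab-separated columns, as in the question
fid    = fopen('RTL5_57.txt');        % example file name
kept   = [];                          % accumulated, decimated sample
while ~feof(fid)
    C = textscan(fid, fmt, nLines, 'CollectOutput', true);
    block = C{1};
    if isempty(block), break; end     % nothing left to read
    kept = [kept; block(1:D:end, :)]; %#ok<AGROW> keep every D-th row, drop the rest
end
fclose(fid);
plot(kept);                           % plot both columns of the decimated data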

I ended up writing the code below based on Dennis Jaheruddin's advice. It appears to work well for large .txt files (10 GB to 50 GB). The code was also inspired by another post:
Memory map file in MATLAB?
Nlines = 1e3; % set number of lines to sample per cycle
sample_rate = (1/1.5e6); %data sample interval in seconds (1/sample rate)
DECE= 1;% decimation factor
start = 40; %start of plot time
finish = 50; % end plot time
TIME = (0:sample_rate:sample_rate*((Nlines)-1));
format = '%f\t%f';
fid = fopen('C:\Users\James Archer\Desktop/RTL5_57.txt');
while(~feof(fid))
C = textscan(fid, format, Nlines, 'CollectOutput', true);
d = C{1}; % copy the data out of C; clear C immediately, you need the memory!
clearvars C ;
TIME = ((TIME(end)+sample_rate):sample_rate:(sample_rate*(size(d,1)))+(TIME(end)));%shift Time along
if ((TIME(end)) > start) && ((TIME(end)) < finish)
plot((TIME(1:DECE:end)),(d(1:DECE:end,:)))%plot and decimate
end
hold on;
clearvars d;
end
fclose(fid);
Older versions of MATLAB do not handle this code well; the following message appears:
Caught std::exception Exception message is: bad allocation
But MATLAB 2013 works just fine

Related

A solution for "out of memory" error in matlab

I have a very big text file (about 11 GB) that needs to be loaded into MATLAB, but when I use the textread function an "out of memory" error occurs, and there is no way to reduce the file size. When I type memory, it shows me this:
memory
Maximum possible array: 24000 MB (2.517e+10 bytes) *
Memory available for all arrays: 24000 MB (2.517e+10 bytes) *
Memory used by MATLAB: 1113 MB (1.167e+09 bytes)
Physical Memory (RAM): 16065 MB (1.684e+10 bytes)
* Limited by System Memory (physical + swap file) available.
Does anyone have a solution to this problem?
@Anthony suggested a way to read the file line by line, which is perfectly fine, but more recent (>=R2014b) versions of MATLAB have datastore functionality, which is designed for processing large data files in chunks.
There are several types of datastore available depending on the format of your text file. In the simplest cases (e.g. CSV files), the automatic detection works well and you can simply say
ds = datastore('myCsvFile.csv');
while hasdata(ds)
chunkOfData = read(ds);
... compute with chunkOfData ...
end
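For a plain text file that is not comma-separated, the relevant options can be spelled out explicitly rather than relying on automatic detection; a minimal sketch, assuming a tab-delimited file with no header row (the file name and ReadSize value are only examples):
ds = datastore('myBigFile.txt', ...
    'Delimiter', '\t', ...            % tab-separated columns
    'ReadVariableNames', false, ...   % the file has no header row
    'ReadSize', 100000);              % rows returned by each call to read
while hasdata(ds)
    chunkOfData = read(ds);           % a table with up to 100000 rows
    % ... compute with chunkOfData ...
end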
In even more recent (>=R2016b) versions of MATLAB, you can go one step further and wrap your datastore in a tall array. Tall arrays let you operate on data that is too large to fit into memory all at once. (Behind the scenes, tall arrays perform computations in chunks and give you the results only when you ask for them via a call to gather.) For example:
tt = tall(datastore('myCsvFile.csv'));
data = tt.SomeVariable;
result = gather(mean(data)); % Trigger tall array evaluation
According to your clarification of the purpose of your code:
it is a point cloud with XYZRGB columns in a txt file and I need to add another column to this.
What I suggest you do is read the text file one line at a time, modify the line, and write the modified line straight to a new text file.
To read one line at a time:
% Open file for reading.
fid = fopen(filename, 'r');
% Get the first line.
line = fgetl(fid);
while ~isnumeric(line)
% Do something.
% get the next line
line = fgetl(fid);
end
fclose(fid);
To write the line, you can use fprintf.
Here is a demonstration:
filename = 'myfile.txt';
filename_new = 'myfile_new.txt';
fid = fopen(filename);
fid_new = fopen(filename_new,'w+');
line = fgetl(fid);
while ~isnumeric(line)
% Make sure you add \r\n at the end of the string;
% otherwise, your text file will become a one liner.
fprintf(fid_new, '%s %s\r\n', line, 'new column');
line = fgetl(fid);
end
fclose(fid);
fclose(fid_new);

Retrieve data from .rec binary file

The question may be naive, but answers could help me.
A measurement is recorded in binary format, with a header that contains all information about the data and the data itself (i.e. a series of doubles).
The measurement data can be exported in csv format from the application, but it takes ages.
What do you have to pay attention to when trying to read data from a binary file? Is this process even feasible in MATLAB (importing the data as an array) or in LabVIEW (exporting as .txt, maybe)?
The binary .rec file format may refer to various things (the audio/video encoding format of Topfield based on MPEG4-TS, proprietary audio encodings, and even MRI scanner data from Philips) ...
If it refers to an MRI scanner, you may find a direct reader on the File Exchange: Matlab PAR REC Reader
If it refers to something else, you can parse the binary file header and data yourself using the low-level routine fread.
Edit
Not knowing the exact file format of your recorded sensor displacement, here is a dummy example that uses fread to read a large .rec file block by block, supposing that the header contains just the length of the data and that the data is just a series of double values:
function [] = DummyReadRec()
%[
% Open rec file for reading
[fid, errmsg] = fopen('dummy.rec', 'r');
if (fid < 0), error(errmsg); end
cuo = onCleanup(@()fclose(fid));
% Read header (here supposing it is only an integer giving length of data)
reclength = fread(fid, 1, 'uint32');
% Read data block-by-block (here supposing it is only double values)
MAX_BLOCK_LENGTH = 512;
blockCount = ceil(reclength / MAX_BLOCK_LENGTH);
for bi = 1:blockCount
% Will read a maximum of 'MAX_BLOCK_LENGTH' (or less if we're on the last block)
[recdata, siz] = fread(fid, [1 MAX_BLOCK_LENGTH], 'double');
% Do something with this block (fft or whatever)
offset = (bi-1)*MAX_BLOCK_LENGTH;
position = (offset+1):(offset+siz);
plot(position, 20*log10(abs(fft(recdata))));
drawnow();
end
%]
end
The answer is going to depend on the format of your binary file and how large it is.
I have done many conversions of various binary files, all with differing layouts. If the file will fit into memory then you can just use fread, as long as you know the layout of the binary file. Below is an example of reading a header and a simple data block; it would of course have to be modified depending on the layout of your file. Depending on the recording equipment and computer type you may also need to make use of the machinefmt ('ieee-le' or 'ieee-be') option of fread ... that has burned me before.
%Open the File for reading
fid = fopen(yourRECfile,'r');
%Read the Header ... your layout will be different
header.MajorRel = fread(fid,1,'uint16'); %Major File Rev #
header.MinorRel = fread(fid,1,'uint16'); %Minor File Rev #
header.IRIGStart = fread(fid,1,'double'); %Start time in secs
header.Flags = fread(fid,1,'uint32'); %Flags
%Read everything else from there until end of file as a series of doubles.
data = fread(fid,inf,'double');
fclose(fid);
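If byte order does turn out to matter, the machinefmt option mentioned above can be set either once at fopen time or per fread call; a short sketch, assuming purely for illustration that the file is big-endian:
% Set the byte order once when opening the file ...
fid = fopen(yourRECfile,'r','ieee-be');
header.MajorRel = fread(fid,1,'uint16');       % now interpreted as big-endian
fclose(fid);
% ... or pass it per call via fread's machinefmt argument:
fid = fopen(yourRECfile,'r');
header.MajorRel = fread(fid,1,'uint16',0,'ieee-be');
fclose(fid);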
If the file does not fit into memory you will either need to process it in blocks or look into using memmapfile.
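As a rough illustration of the memmapfile route, here is a minimal sketch that assumes the same hypothetical layout as above (a 16-byte header, 2 + 2 + 8 + 4 bytes, followed by nothing but doubles):
m = memmapfile(yourRECfile, ...
    'Offset', 16, ...            % skip the assumed 16-byte header
    'Format', 'double', ...      % treat the rest of the file as doubles
    'Writable', false);
firstBlock = m.Data(1:512);      % only this slice is actually pulled from disk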

Read binary matrix from a file in Matlab

I have a binary square matrix with complex values, stored in a .bin format file. I have tried to read this 100-by-100 matrix with a Matlab script:
i = fopen('matrix.bin','r');
A = fread(i,[100 100]);
This code does not correctly read the complex values contained in A. I only get a 100-by-100 matrix of integers.
MATLAB's fread supports ANSI C types, but there is no native ANSI C type that represents complex numbers. Most likely, a complex number is stored as a pair of real and imaginary numbers.
Without information on how the binary file was saved, you can still perform some tests to figure this out. If each complex number is represented as a real part and an imaginary part, both in double precision, then a single complex number takes up 8 + 8 = 16 bytes. We can test this by navigating to the end of the file and seeing how many bytes there are.
fID = fopen('matrix.bin','r')
fseek(fID, 0, 'eof') % Go to the end of file
ftell(fID) % Tell current position in the open file
fclose(fID)
If this number is equal to 16 * 100 * 100 = 160000, then you're in luck: there is no extra stuff saved in this file, and you can simply read the data with this code:
fID = fopen('matrix.bin','r');
data = [];
for ii = 1:10000
data = [data; fread(fID, 2, 'double')']; % append one [real, imag] pair per iteration
end
fclose(fID);
You'll end up with a 10000-by-2 array, with each row representing a complex number. If the file size is 80000, then the real and imaginary parts could be saved in single precision. If the file size is some other number, it probably means additional information is stored in the binary file. You'll have to know what that additional information is so you can read the file correctly.
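For completeness, here is a hedged sketch that assembles those pairs into an actual 100-by-100 complex matrix, assuming the real and imaginary parts are interleaved and the matrix was written column by column (both are assumptions, not facts from the question):
fID = fopen('matrix.bin','r');
raw = fread(fID, [2, 100*100], 'double'); % row 1: real parts, row 2: imaginary parts
fclose(fID);                              % use 'single' instead if the size check pointed to 80000 bytes
A = complex(raw(1,:), raw(2,:));          % 1-by-10000 complex vector
A = reshape(A, 100, 100);                 % back to 100-by-100 (column-major order assumed)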

Textscan on file with large number of lines

I'm trying to analyze a very large file using textscan in MATLAB. The file in question is about 12 GB in size and contains about 250 million lines with seven floating-point numbers on each (delimited by whitespace); because this obviously would not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time (according to the documentation this should allow for processing "arbitrarily large delimited text file[s]"). This only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).
To debug, I attempted to go to several positions in the file using the fseek function, for example like this:
fileInfo = dir(fileName);
fid = fopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof');
textscan(fid,'%f %f %f %f %f %f %f','Delimiter',' ');
I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of my file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I substitute fileInfo.bytes/10 by fileInfo.bytes/2 (i.e. moving it to about 50% of the file) everything breaks down and textscan only returns an empty 1x7 cell.
I looked at the file using a text editor for large files, and this shows that the entire file looks fine, and that there should be no reason for textscan to be confused. The only possible explanation that I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!
EDIT
The relevant part of my code used to look like this:
while ~feof(fid)
data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
%// do some stuff
end
First I tried fixing it using ftell and fseek as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the HeaderLines solution (also suggested below), like this:
i = 0;
while ~feof(fid)
frewind(fid)
data = textscan(fid, FormatString, nLines, 'Delimiter',' ', 'HeaderLines', i*nLines);
%// do some stuff
i = i + 1;
end
This seems to read in the data without producing errors; it is, however, incredibly slow.
I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail, for reasons I don't yet understand.
Using fseek to move the pointer in a file is only a good idea when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you just want to skip some records of known length, but on a text file it is more dangerous and confusing than anything else (unless you are absolutely sure that every line is the same size and every element on the line is at the exact same place/column, which doesn't happen often).
There are several ways to read a text file block by block:
1) Use the HeaderLines option
To simply skip a block of lines on a text file, you can use the HeaderLines parameter of textscan, so for example:
readFormat = '%f %f %f %f %f %f %f' ; %// read format specifier
nLines = 10000 ; %// number of line to read per block
fileInfo = dir(fileName);
%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' '); %// read the first 10000 lines
fclose(fid)
%// Now do something with your "M" data
Then when you want to read the second block:
%// later read the SECOND block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', nLines); %// read lines 10001 to 20000
fclose(fid)
And if you have many blocks, for the Nth block, just adapt:
%// and then for the Nth BLOCK block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', (N-1)*nLines);
fclose(fid)
If necessary (if you have many blocks), just code this last version in a loop.
Note that this is good if you close your file after reading each block (so the file pointer will start at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want files to remain open too long, or to lose the fid if you crash).
2) Read by block (without closing the file)
If the processing of the block is quick and safe enough that you're sure it won't bomb out, you could afford not to close the file. In this case, the textscan file pointer will stay where you stopped, so you could also:
read a block (do not close the file): M = textscan(fid, readFormat, nLines)
Process it then save your result (and release memory)
read the next block with the same call: M = textscan(fid, readFormat, nLines)
In this case you wouldn't need the HeaderLines parameter because textscan will resume reading exactly where it stopped.
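A minimal sketch of this second approach, reusing the readFormat and nLines defined above (the processing step is just a placeholder):
fid = fopen(fileName);                      % open the file once
while ~feof(fid)
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
    if isempty(M{1}), break; end            % nothing left to read
    % ... process M and store the result; M can then be cleared ...
end
fclose(fid);                                % close once, after the last block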
3) Use ftell and fseek
Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.
ftell will return the current position in an open file, so use that to record where you stopped reading last, then use fseek the next time to go straight to that position. Something like:
%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid)
%// do some stuff
%// then read another block:
fid = fopen(fileName);
fseek(fid, lastPosition, 'bof');
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid)
%// and so on ...

MATLAB error message when plotting big data files?

I have written code to plot data from very large .txt files (20 GB to 60 GB). The .txt files contain two columns of data that represent the outputs of two sensors from an experiment I did. The reason the data files are so large is that the data was recorded at 4M samples/s.
The code works well for plotting relatively small .txt files (10 GB); however, when I try to plot my larger data files (60 GB) I get the following error message:
Attempted to access TIME(0); index must be a positive integer or logical.
Error in textscan_loop (line 17)
TIME = ((TIME(end)+sample_rate):sample_rate:(sample_rate*(size(d,1)))+(TIME(end))); %shift Time along
The basic idea behind my code is to conserve RAM by reading Nlines of data from the .txt file on disk into the MATLAB variable C in RAM, plotting C, then clearing C. This process occurs in a loop, so the data is plotted in chunks until the end of the .txt file is reached. The code can be found below:
Nlines = 1e6; % set number of lines to sample per cycle
sample_rate = (1); %sample rate
DECE= 1000;% decimation factor
TIME = (0:sample_rate:sample_rate*((Nlines)-1)); %first instance of time vector
format = '%f\t%f';
fid = fopen('H:\PhD backup\Data/ONK_PP260_G_text.txt');
while(~feof(fid))
C = textscan(fid, format, Nlines, 'CollectOutput', true);
d = C{1}; % copy the data out of C; clear C immediately, you need the memory!
clearvars C ;
TIME = ((TIME(end)+sample_rate):sample_rate:(sample_rate*(size(d,1)))+(TIME(end)));%shift Time along
plot((TIME(1:DECE:end)),(d(1:DECE:end,:)))%plot and decimate
hold on;
clearvars d;
end
fclose(fid);
I think the while loop runs around 110 cycles before the code stops executing and the error message is displayed; I know this because the graph shows around 110e7 data points and the loop processes 1e6 data points at a time.
If anyone knows why this error might be occurring please let me know.
Cheers,
Jim
The error that you encounter is in fact not in the plotting, but in the line referenced by the error message.
Though I have been unable to reproduce the exact error, I suspect it is related to something like this (1:0 creates an empty vector, so indexing its last element fails):
Time = 1:0
Time(end)
In any case, the way forward is clear. You need to run this code with dbstop if error and observe all relevant variables in the line that throws the error.
From there you will likely figure out what is causing the problem; hopefully it is something simple, like your code being unable to deal with a data size that is an exact multiple of 1000 or so.
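For reference, a minimal sketch of that debugging workflow (textscan_loop is the script named in the error message):
dbstop if error     % pause in the debugger at the moment an error is thrown
textscan_loop       % run the failing script; when it stops, inspect TIME, d, size(d,1)
% ... once done:
dbclear if error    % remove the breakpoint condition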
Trying to use plot for big data is problematic, as MATLAB tries to plot every single data point.
Obviously the screen cannot display all of these points (many will overlap), so it is recommended to plot only the relevant points. One could subsample and do this manually, as you seem to have tried, but fortunately there is a ready-to-use solution for this:
The Plot (Big) File Exchange Submission
Here is the introduction:
This simple tool intercepts data going into a plot and reduces it to the smallest possible set that looks identical given the number of pixels available on the screen. It then updates the data as a user zooms or pans. This is useful when a user must plot a very large amount of data and explore it visually.
This works with MATLAB's built-in line plot functions, allowing the functionality of those to be preserved.
Instead of:
plot(t, x);
One could use:
reduce_plot(t, x);
Most plot options, such as multiple series and line properties, can be passed in too, such that 'reduce_plot' is largely a drop-in replacement for 'plot'.
h = reduce_plot(t, x(1, :), 'b:', t, x(2, :), t, x(3, :), 'r--*');
This function works on plots where the "x" data is always increasing, which is the most common, such as for time series.