Read compressed files without unzipping them in Matlab - matlab

I have (very large) comma-separated files compressed in bz2 format. If I uncompress them and read them with
fileID = fopen('file.dat');
X = textscan(fileID,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
fclose(fileID);
everything is fine. But I would like to read them without uncompressing them, something like
fileID = fopen('file.bz2');
X = textscan(fileID,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
fclose(fileID);
which, unfortunately, returns an empty X. Any suggestions? Do I have to uncompress them via the system(' ... ') command?

You could try to use the form of textscan that takes a string instead of a stream. Using the Matlab Java integration, you can leverage Java chained streams to decompress on the fly and read single lines, which can then be parsed:
% Build a stream chain that reads, decompresses and decodes the file into lines
fileStr = javaObject('java.io.FileInputStream', 'file.dat.gz');
inflatedStr = javaObject('java.util.zip.GZIPInputStream', fileStr);
charStr = javaObject('java.io.InputStreamReader', inflatedStr);
lines = javaObject('java.io.BufferedReader', charStr);
% If you know the size in advance you can preallocate the arrays instead
% of just stating the types to allow vcat to succeed
X = { int32([]), int64([]), {}, [], int32([]), [], [], int32([]) };
curL = lines.readLine();
while ischar(curL) % on EOF, readLine returns null, which becomes [] (type double)
    % Parse a single line from the file
    curX = textscan(curL,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
    % Append new line results
    for iCol=1:length(X)
        X{iCol}(end+1) = curX{iCol};
    end
    curL = lines.readLine();
end
lines.close(); % Don't forget this or the file will remain open!
I'm not exactly vouching for the performance of this method, with all the array appending going on, but at least that is how you can read a GZ file on the fly in Matlab/Octave. Also:
If you have a Java stream class that decompresses another format (try e.g. Apache Commons Compress), you can read it the same way. You could read bzip2 or xz files.
There are also classes to access archives, like zip files in the base Java distribution, or tar/RAR/7z and more in Apache Commons Compress. These classes usually have some way of finding files stored within the archive, allowing you to open an input stream to them within the archive and read in the same way as above.
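The per-line appending that the answer is wary of can be avoided by collecting the decompressed lines first and calling textscan once on the joined text. The following is only a sketch of that variation, using the same Java stream chain on a small gzip file that it creates itself. The sample file, the two-column format and the variable names are illustrative assumptions, and the whole decompressed text is held in memory, so this does not suit files that exceed available RAM.

```matlab
% Create a tiny sample CSV and gzip it (illustrative data only)
csvFile = [tempname '.dat'];
fid = fopen(csvFile, 'w');
fprintf(fid, '1,2.5\n3,4.5\n');
fclose(fid);
gzip(csvFile);                 % writes [csvFile '.gz']
gzFile = [csvFile '.gz'];

% Same stream chain as in the answer: read, decompress, decode into lines
fileStr = javaObject('java.io.FileInputStream', gzFile);
inflatedStr = javaObject('java.util.zip.GZIPInputStream', fileStr);
charStr = javaObject('java.io.InputStreamReader', inflatedStr);
lines = javaObject('java.io.BufferedReader', charStr);

% Collect all lines, then parse them in a single textscan call
buf = {};
curL = lines.readLine();
while ischar(curL)             % readLine returns [] at end of file
    buf{end+1} = curL;
    curL = lines.readLine();
end
lines.close();

txt = strjoin(buf, sprintf('\n'));
X = textscan(txt, '%d %f', 'Delimiter', ',');

% Remove the temporary files
delete(csvFile);
delete(gzFile);
```

Here X{1} holds the int32 first column and X{2} the double second column, as with textscan on an uncompressed file.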

On a unix system I would use named pipes and do something like this:
system('mkfifo pipename');
system(['bzcat file.bz2 > pipename &']);
fileID = fopen('pipename', 'r');
X = textscan(fileID,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
fclose(fileID);
system('rm pipename');

Related

I want to read this specific csv file, using Matlab. I used textscan but I failed

csv file:
Date,Open,High,Low,Close,Volume,Adj Close
20170217,64.470001,64.690002,64.300003,64.620003,21234600,64.620003
20170216,64.739998,65.239998,64.440002,64.519997,20524700,64.519997
I used this:
fileID = fopen('table.csv');
C = textscan(fileID,'%s %f %f %f %f %d %f','Delimiter',',');
fclose(fileID);
celldisp(C)
but it does not read anything.
You can use the csvread function to read a csv file.
m=csvread('table.csv',1,0)
The values are stored in a matrix.
Since your file has a header line, you have to tell csvread to start reading from the second row of the file.
You do this by adding two parameters to the call:
the first defines the row from which to start reading (notice that the index is zero-based)
the second defines the column from which to start (in the example, the first column)
If, nevertheless, you want to use textscan, you have to modify your code as follows:
fileID = fopen('table.csv');
% C = textscan(fileID,'%s %f %f %f %f %d %f','Delimiter',',');
C1 = textscan(fileID,'%s',2);
C2 = textscan(fileID,'%d%f%f%f%f%d%f','delimiter',',')
fclose(fileID);
You have to call textscan twice:
the first time to read the first row (the header)
the second time to read the data
Notice the third parameter in the first call: it specifies that the format (%s) has to be applied twice.
This is because the last word in your header row is separated by a space.
Once you've read the header row, you call textscan again to read the numeric values.
A CSV file can also be read with xlsread('File');
If that returns NaN values, call it with three outputs:
[num text all]=xlsread('file');
and loop over the text output.
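As a further option, textscan can skip the header itself through its 'HeaderLines' parameter, so a single call is enough. This is a minimal sketch that first writes the two sample rows from the question to a temporary file. The temporary file name is an assumption made only to keep the example self-contained.

```matlab
% Write the sample data from the question to a temporary file
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Date,Open,High,Low,Close,Volume,Adj Close\n');
fprintf(fid, '20170217,64.470001,64.690002,64.300003,64.620003,21234600,64.620003\n');
fprintf(fid, '20170216,64.739998,65.239998,64.440002,64.519997,20524700,64.519997\n');
fclose(fid);

% Skip one header line, then read seven comma-separated columns
fid = fopen(fname, 'r');
C = textscan(fid, '%d %f %f %f %f %d %f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
delete(fname);
```

C{1} then contains the dates as int32 values and C{6} the volumes.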

Reading large text files with insufficient RAM in Matlab

I want to read a large text file of about 2GB and perform a series of operations on that data. The following approach
tic
fid=fopen(strcat(Name,'.dat'));
C=textscan(fid, '%d%d%f%f%f%d');
fclose(fid);
%Extract cell values
y=C{1}(1:Subsampling:end)/Subsampling;
x=C{2}(1:Subsampling:end)/Subsampling;
%...
Reflectanse=C{6}(1:Subsampling:end);
Overlap=round(Overlap/Subsampling);
fails immediately after reading C (C=textscan(fid, '%d%d%f%f%f%d');) with a strange peak in my memory usage.
What would be the best way to import a file of this size? Is there a way to use textscan() to read individual parts of a text file, or are there any other functions better suited for this task?
Edit: Every column read by textscan holds one field of the 3D points:
width height X Y Z Grayscale
345 453 3.422 53.435 0.234 200
346 453 3.545 52.345 0.239 200
... % and so on for ~40 million points
If you can process each row individually then the following code will allow you to do that. I've included start_line and end_line if you want to specify a block of data.
headerSpec = '%s %s %s %s %s %s';
dataSpec = '%f %f %f %f %f %f';
fid=fopen('data.dat');
% Read Header
names = textscan(fid, headerSpec, 1, 'delimiter', '\t');
k = 1; % the header was line 1
% specify a start and end line for getting a block of data
start_line = 2;
end_line = 3;
while ~feof(fid)
    k = k + 1;
    if k > end_line
        break;
    end
    % get data; rows before start_line are read and discarded
    C = textscan(fid, dataSpec, 1, 'delimiter', '\t');
    if k < start_line
        continue;
    end
    row = [C{1:6}]; % convert data from cell to vector
    % do what you want with the row
end
fclose(fid);
It is also possible to read in the entire file at once, but that depends on how much memory you have available and on any restrictions MATLAB has in place. You can check this by typing memory in the command window.
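Reading one row per textscan call can be slow for tens of millions of points. A middle ground is to pass a larger repeat count so that each call returns a block of rows. The sketch below assumes space-separated data with one header line, as shown in the question's edit. blockSize, nRows and the third sample row are illustrative assumptions, and a real block size would be far larger.

```matlab
% Write a small sample file (first two rows from the question, third made up)
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fprintf(fid, 'width height X Y Z Grayscale\n');
fprintf(fid, '345 453 3.422 53.435 0.234 200\n');
fprintf(fid, '346 453 3.545 52.345 0.239 200\n');
fprintf(fid, '347 453 3.601 52.101 0.241 201\n');
fclose(fid);

blockSize = 2;   % rows per block; hypothetical value for the demo
nRows = 0;       % running count of rows processed

fid = fopen(fname, 'r');
fgetl(fid);      % skip the header line
while ~feof(fid)
    % Apply the format at most blockSize times
    C = textscan(fid, '%d %d %f %f %f %d', blockSize);
    if isempty(C{1})
        break;   % nothing more could be parsed
    end
    % Process the block here (for example subsample it), then discard it
    nRows = nRows + numel(C{1});
end
fclose(fid);
delete(fname);
```

Only one block is held in memory at a time, so peak usage is governed by blockSize rather than by the file size.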

how to concatenate many files into one matrix for plotting

I need to concatenate multiple files' data into one matrix.
So far, the way that I have been testing loading my data is something akin to the following:
fid = fopen('data01.txt', 'r');
raw = textscan(fid, '%d/%d/%d %d:%d:%f %f %f %f %d', 'delimiter', ',');
m = cellfun(@double, raw, 'UniformOutput', false);
value_of_interest = m{:,10}
...But the data set that I have on disk is many files and all exist within a single directory. I'd prefer to refer to a specific path for this directory, rather than placing my script there. How can I modify my script so that it loads all data for all of the files in said folder?
So far I have this:
dirname = uigetdir;
files = dir(dirname);
fileIndex = find(~[files.isdir]);
for i = 1:length(fileIndex)
    fileName = files(fileIndex(i)).name;
    fid = fopen(fileName, 'r');
    raw = textscan(fid, '%d/%d/%d %d:%d:%f %f %f %f %d', 'delimiter', ',');
    time = [m{:,4}, m{:,5}, m{:,6}]; %needs to contain a float
    converted_time = ((m{:,4} * 3600.0) + (m{:,5} * 60.0) + m{:,6}); %hh:mm:ss -> seconds
    values = power(m{:,10}, 2);
    values(values <= thresh) = 0;
    % need to concat into the var 'values' here... also need to accumulate the time variable
end
plot(converted_time, values);
...But I need to put the two together.
EDIT: I should mention that I may run out of memory, which is explained later in comments below to my chosen answer.
First, have another look at how you are defining the fileName of the file to be opened. Instead, you should try fileName = [dirname, '\', files(fileIndex(i)).name];, since the name field of files will not contain the full path. This will solve your problem of referencing a list of files that are not in your current path.
Now, to avoid remembering all the data from all those files, we can do this job per file inside the loop:
...
plot(converted_time, values);
hold('on');
end
The short command hold('on');, often written simply hold on; modifies the plot axes such that subsequent data can be plotted without erasing any previous lines.
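If memory allows and a single matrix is really wanted, one possible pattern is to collect each file's data in a cell array and concatenate once after the loop. The sketch below is not the asker's format: it assumes a simplified two-column "time,value" layout and creates its own sample files in a temporary folder. fullfile builds the path with the separator of the current platform.

```matlab
% Create two small sample files in a temporary folder (illustrative data only)
dirname = tempname;
mkdir(dirname);
fid = fopen(fullfile(dirname, 'data01.txt'), 'w');
fprintf(fid, '1,10\n2,20\n');
fclose(fid);
fid = fopen(fullfile(dirname, 'data02.txt'), 'w');
fprintf(fid, '3,30\n4,40\n');
fclose(fid);

% Read every .txt file in the folder and keep its rows in one cell each
files = dir(fullfile(dirname, '*.txt'));
parts = cell(numel(files), 1);
for i = 1:numel(files)
    fid = fopen(fullfile(dirname, files(i).name), 'r');
    raw = textscan(fid, '%f %f', 'Delimiter', ',');
    fclose(fid);
    parts{i} = [raw{1}, raw{2}];   % columns: time, value
end

% Concatenate once, after the loop, instead of growing a matrix per file
allData = vertcat(parts{:});

% Clean up the temporary folder
rmdir(dirname, 's');
```

The result can then be drawn in one call, for example plot(allData(:,1), allData(:,2)). Unlike the per-file plotting above, this keeps all data in memory at once.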

fscanf file read in matlab for mixed numeric and non-numeric data (textscan not available)

I am trying to read a data file but I have an older version of Matlab that does not include textscan. I am trying to use fscanf but I am unable to figure out how to read the second element which is time format. The txt data looks like this:
20120502,16:30:00,1397.5,1397.5,1397.0,1397.5,1283
20120502,16:32:00,1397.25,1397.5,1397.0,1397.0,582
I have tried this, with different attempts at figuring out the 2nd column which is the time vector, but I am not having any luck.
fid = fopen('C:\matlab\data\GLOBEX.txt','r');
[c] = fscanf(fid, '%f %s %f %f %f %f %f');
Thanks
Try the following:
[c] = fscanf(fid, '%f,%d:%d:%d,%f,%f,%f,%f,%f');
c = reshape(c, 9, length(c)/9)';
Now you have hours, minutes, and seconds in columns 2, 3, and 4.
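As a worked example, the two sample lines from the question can be written to a temporary file and read back with that format. The temporary file and the secondsOfDay variable are assumptions added for illustration.

```matlab
% Write the two sample lines from the question to a temporary file
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '20120502,16:30:00,1397.5,1397.5,1397.0,1397.5,1283\n');
fprintf(fid, '20120502,16:32:00,1397.25,1397.5,1397.0,1397.0,582\n');
fclose(fid);

% fscanf returns one long column vector with nine numbers per line
fid = fopen(fname, 'r');
c = fscanf(fid, '%f,%d:%d:%d,%f,%f,%f,%f,%f');
fclose(fid);
delete(fname);

% One row per line of the file: date, hh, mm, ss, then the five values
c = reshape(c, 9, length(c)/9)';

% Combine hours, minutes and seconds into seconds since midnight
secondsOfDay = c(:,2)*3600 + c(:,3)*60 + c(:,4);
```

For these two lines, secondsOfDay is [59400; 59520], that is 16:30:00 and 16:32:00.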

Abort reading a txt file while it is being updated - Matlab

I have 2 instances of Matlab running. While the first is writing data to a .txt file, the other is reading that data.
Is there a way to verify that the .txt file is being accessed and accordingly throw an exception/error?
I found that the second Matlab instance reads the data anyway, but it generates errors (for example from horzcat) if the .txt file is being updated at the same time.
fName = 'Test.txt' ;
% Matlab Instance1
mat = 1 + (2-1)*randn(100000,5) ; mat = mat.' ;
[fid, fMsg] = fopen(fName, 'at') ;
if fid~=-1, fprintf(fid, '%.10f\t%.10f\t%.10f\t%.10f\t%.10f\r\n', mat(:)) ; end
fclose(fid);
% Matlab Instance2
fid = fopen(fName);
C = textscan(fid, '%f %f %f %f %f', 'Delimiter', '\t');
C=cell2mat(C);
fclose(fid);
On the writing instance, create a file called 'busyWriting.bla' before opening the data file for writing, and delete it after you are done writing. On the reading instance, enclose everything in the clause if(~exist('busyWriting.bla','file')) ... end
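A minimal sketch of this flag-file idea is shown below, with both roles run one after the other in a single script so that it is self-contained. The file names and the error identifier are hypothetical. The check is not atomic: the writer could start between the reader's exist test and its fopen, so this reduces collisions rather than ruling them out.

```matlab
fName = [tempname '.txt'];
flagFile = [fName '.busy'];   % hypothetical name; the answer uses 'busyWriting.bla'

% --- Writing instance ---
fclose(fopen(flagFile, 'w'));          % create the flag before writing
fid = fopen(fName, 'w');
fprintf(fid, '%.10f\t%.10f\n', [1 2; 3 4]');
fclose(fid);
delete(flagFile);                      % remove the flag when done

% --- Reading instance ---
if exist(flagFile, 'file')
    % The writer is still busy: raise an error instead of reading partial data
    error('Reader:fileBusy', 'The file %s is currently being written.', fName);
end
fid = fopen(fName);
C = textscan(fid, '%f %f', 'Delimiter', '\t');
fclose(fid);
C = cell2mat(C);

delete(fName);
```

The reading side could instead wait and retry, for example with pause in a loop, rather than raising an error.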