How to sparsely read a large file in MATLAB?

I ran a simulation which wrote a huge file to disk. The file is a big matrix v. I can't read it all, but I really only need a portion of the matrix, say, 1:100 of the columns and rows. I'd like to do something like
vtag = dlmread('v',1:100:end, 1:100:end);
Of course, that doesn't work. I know I should have only done the following when writing to the file
dlmwrite('vtag',v(1:100:end, 1:100:end));
But I did not, and running everything again would take two more days.
Thanks
Amir

Thankfully, the dlmread function supports specifying a range to read as its third input. So if you want to read all N columns for the first 100 rows, you can specify that with the following command:
startRow = 1;
startColumn = 1;
endRow = 100;
endColumn = N;
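% dlmread ranges are zero-based, hence the "- 1"
% (named readRange to avoid shadowing the built-in rng function)
readRange = [startRow, startColumn, endRow, endColumn] - 1;
vtag = dlmread(filename, ',', readRange);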
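As a small sanity check of the zero-based range, here is a minimal sketch that writes a tiny matrix to a temporary file and reads a block back. The file name and the matrix are made up for illustration; only dlmwrite and dlmread come from the discussion above.

```matlab
% Hypothetical demo file; the name and contents are assumptions for illustration
demoFile = [tempname '.csv'];
M = magic(6);
dlmwrite(demoFile, M);               % comma-delimited by default

% Rows 2..4 and columns 1..3 in MATLAB's usual 1-based terms
r1 = 2; c1 = 1; r2 = 4; c2 = 3;
zeroBased = [r1, c1, r2, c2] - 1;    % dlmread expects [R1 C1 R2 C2], zero-based
sub = dlmread(demoFile, ',', zeroBased);

delete(demoFile);                    % clean up the temporary file
```

If the offsets are right, sub should match M(2:4, 1:3).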
EDIT Based on your clarification
Since you don't want 1:100 rows but rather 1:100:end rows, the following approach should work better for you.
You can use textscan to read chunks of data at a time. You can read a "good" row and then read in the next "chunk" of data to ignore (discarding it in the process), and continue until you reach the end of the file.
The code below is a slight modification of that idea: it uses the HeaderLines option of textscan, which tells the function how many lines to ignore before reading in the data. The first time through the loop no lines are skipped, but on every later pass rows2skip lines are skipped. This lets us "jump" through the file very rapidly without making any additional file operations.
startRow = 1;
rows2skip = 99;
columns = 3000;
fid = fopen(filename, 'rb');
% For now, we'll just assume you're reading in floating-point numbers
format = repmat('%f ', [1 columns]);
count = 1;
lines2discard = startRow - 1;
while ~feof(fid)
% Use "HeaderLines" to skip data before reading in data we care about
row = textscan(fid, format, 1, 'Delimiter', ',', 'HeaderLines', lines2discard);
data{count} = [row{:}];
% After the first time through, set the "HeaderLines" (i.e. lines to ignore)
% to be the # we want to skip between lines (much faster than alternatives!)
lines2discard = rows2skip;
count = count + 1;
end
fclose(fid);
data = cat(1, data{:});
You may need to adjust your format specifier for your own type of input.
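The simpler "read a good row, discard the rest" idea mentioned earlier can also be sketched with fgetl, which makes the row stride explicit and lets you subsample the columns in the same pass. This is only a sketch under assumptions: the demo file, its size, and the names rowStep and colStep are made up for illustration, and it will likely be slower than the textscan version on a truly large file.

```matlab
% Build a small hypothetical test file: 10 rows, 4 columns
demoFile = [tempname '.csv'];
M = reshape(1:40, 4, 10).';
dlmwrite(demoFile, M);

rowStep = 3;   % keep rows 1, 4, 7, 10
colStep = 2;   % keep columns 1, 3

fid = fopen(demoFile, 'r');
kept = {};
lineNo = 0;
nextLine = fgetl(fid);
while ischar(nextLine)                       % fgetl returns -1 at end of file
    if mod(lineNo, rowStep) == 0
        vals = sscanf(nextLine, '%f,').';    % parse one comma-separated row
        kept{end+1} = vals(1:colStep:end);   %#ok<SAGROW>
    end
    lineNo = lineNo + 1;
    nextLine = fgetl(fid);
end
fclose(fid);
delete(demoFile);

vsub = cat(1, kept{:});                      % same as M(1:rowStep:end, 1:colStep:end)
```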

Related

Convert .csv to .out (complex numbers)

I have a csv file that has complex numbers.
This is sample of some numbers I have in the csv file:
(0.12825663763789857+0.20327998150393212j),(0.21890748607218197+0.160563964013564j),(0.28205414129281525+0.09884068776334366j),(0.030927026479380615+0.26334550583848626j)
I want to read this file and then save in (.out) file all the real parts in the first column and all the imaginary parts in the second column (without the imaginary letter j).
Here is one attempt. It is slightly more complicated due to the ( and ) that surround your numbers.
First, use textscan to read the file. Since I guess you don't know how many numbers are in the file, read everything into a single string. This will work with multiple lines, too:
filename = 'data.csv';
fid = fopen(filename);
content = textscan(fid, '%s');
fclose(fid);
For this purpose, content now is a slightly weird cell array (look at the textscan-docs for details). Just initialize the variable nums which will store the numbers and loop through content (if you know a bit more about your csv file, you might pre-allocate nums):
nums = [];
for c1 = 1:numel(content{1})
Next, split the string at every occurrence of ,:
string_list = strsplit(content{1}{c1},',');
This gives another cell array. Loop through it to convert the strings to numbers (and end the outer loop):
for c2 = 1 : numel(string_list)
nums(end+1) = str2num(string_list{c2});
end
end
Last, just store the real and the imaginary part of the numbers in separate columns:
out = [];
out(:,1) = real(nums);
out(:,2) = imag(nums);
and save it to data.out.
Update As you mentioned precision, you could use
dlmwrite('data.out', out, 'precision','%.20f');
However, here you need to understand the floating point representation in Matlab. In particular, try to understand the following:
>> a = 0.12825663763789857
a =
0.1283
>> fprintf('%.20f\n', a)
0.12825663763789857397
>> eps(a)
ans =
2.7756e-17
Note that one could have done this without converting the strings to numbers, but the way above lets you use the data in Matlab instead of just saving it.
Here is an attempt without converting your strings to numbers, so one does not have to deal with precision. It works with negative real and imaginary parts, too. + signs are removed when written to the new file, - signs are preserved:
filename = 'data.csv';
fid = fopen(filename);
content = textscan(fid, '%s');
fclose(fid);
fid = fopen('data.out','w');
% Escape the decimal point so that "." matches only a literal dot
pattern = '(?<real>-{0,1}\d+\.\d+)(?<imag>[+-]\d+\.\d+)j';
for c1 = 1:numel(content{1})
result = regexp(content{1}{c1}, pattern, 'names');
for c2 = 1:numel(result)
fprintf(fid, '%s,%s\n', strrep(result(c2).real,'+',''), strrep(result(c2).imag,'+',''));
end
end
fclose(fid);
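To see what the named tokens return before wiring them into file I/O, here is a minimal sketch on an in-memory string. The sample values are made up, and this version escapes the decimal point in the pattern, which is an assumption about the intended match rather than something taken from the question.

```matlab
% Sample text in the same shape as the question's data (values are made up)
sample = '(0.5+0.25j),(-1.5-2.75j)';

% Named tokens for the real and imaginary parts, with the decimal point escaped
pattern = '(?<real>-?\d+\.\d+)(?<imag>[+-]\d+\.\d+)j';
parts = regexp(sample, pattern, 'names');    % one struct element per match

for k = 1:numel(parts)
    % Drop a leading "+" from the imaginary part, keep a leading "-"
    fprintf('%s,%s\n', parts(k).real, strrep(parts(k).imag, '+', ''));
end
```

Each element of parts carries the two substrings as character vectors, so nothing is converted to double and no precision is lost.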

Read csv files with double quotes and comma delimiter, containing doubles and strings (arbitrary number of rows and columns)

I am reading csv files with the following format:
Header:,Date,Time,"MC2_Y241_TightnessPressValve","MC2_Y243_PressingValve""
Data,2015-09-16,15:41:52;781,"780.000000","0.0034"
Data,2015-09-16,15:41:52;791,"790.000000","0.1255"
Data,2015-09-16,15:41:52;801,"800.000000","1.5123"
At the moment I am using fgetl(fid) to find the headers and all the dates and times. Then I use the knowledge of which rows and columns contain doubles to be able to use csvread() for fast reading. However, to use csvread(), I must first remove the quotes. I am currently doing this with a PowerShell script within Matlab, but it is too time consuming, as I will need to read files of over 200 MB.
Note: textscan with '%q' cannot be used for two reasons:
1) I want all doubles to be read as doubles immediately (converting is too time consuming).
2) The files contain various numbers of rows and columns.
This is for a standalone application.
I truly appreciate all help, I have spent countless hours on making this efficient.
I thought I should post an answer, in case anyone else has the same problem. Acknowledgements to Lee who helped me with this!
% Remove quotes
fid = fopen('temp.csv','w');
s = fileread('myFile.csv');
s = strrep(s,'"','');
fprintf(fid,'%s',s); % pass s as data, so any % or \ in the file is not treated as a format
fclose(fid);
% Open new, quote free, file
fid = fopen('temp.csv','r');
% Get Headers
Headers{1} = fgetl(fid);
Headers = textscan(Headers{1},'%s','Delimiter',',');
Headers = Headers{1};
Headers = Headers(3:end);
% Get date and time
lineIndex = 1;
nextLine = fgetl(fid);
time{lineIndex} = nextLine(6:28);
lineIndex = lineIndex + 1;
nextLine = fgetl(fid);
while ~isequal(nextLine,-1) && ~isempty(nextLine)
time{lineIndex} = nextLine(6:28);
lineIndex = lineIndex + 1;
nextLine = fgetl(fid);
end
% Get doubles
Data = csvread('temp.csv', 1,3);
fclose(fid);
% Convert time with datenum
Time = datenum(time,'yyyy-mm-dd,HH:MM:SS;FFF');
% Put all data in one matrix
Data = [Time, Data];

Matlab Input/output of Several Files

I do Matlab operations on two data files whose entries are complex numbers. For example,
fName = '1corr.txt';
f = dlmread('1EA.txt',',');
fid = fopen(fName);
tline = '';
Then I do matrix and other operations between these two files and write my output which I call 'modTrace' as:
modTrace
fileID = fopen('1out.txt','w');
v = [(0:(numel(modTrace)-1)).' real(modTrace(:)) ].';
fprintf(fileID,'%d %e\n',v);
The question is: if I have, for example, 100 pairs of such data files, like (2corr.txt, 2EA.txt), ....(50corr.txt, 50EA.txt), how can I generalize the input files, and how can I write all the output files in one go?
First of all, use sprintf to get your variable names depending on the current index.
corrName=sprintf('%dcorr.txt',idx);
EAName=sprintf('%dEA.txt',idx);
outName=sprintf('%dout.txt',idx);
This way, you have one variable (idx) which has to be changed.
Finally put everything into a loop:
n=100
for idx=1:n
corrName=sprintf('%dcorr.txt',idx);
EAName=sprintf('%dEA.txt',idx);
outName=sprintf('%dout.txt',idx);
f = dlmread(EAName,',');
fid = fopen(corrName);
tline = '';
modTrace
fileID = fopen(outName,'w');
v = [(0:(numel(modTrace)-1)).' real(modTrace(:)) ].';
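fprintf(fileID,'%d %e\n',v);
fclose(fileID); % close the output file before the next iteration
fclose(fid); % close the corr file opened above
end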
Instead of hardcoding the number 100, you could also use n=numel(dir('*EA.txt')). It counts the files whose names end with EA.txt.
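Counting the files assumes they are numbered 1 to n without gaps. If that is not guaranteed, one option is to recover the indices from the file names themselves. The sketch below is only an illustration: it creates a throwaway directory with three empty files, and the directory and the index values are made up.

```matlab
% Hypothetical demo directory with a gap in the numbering (1, 2, 10)
demoDir = tempname;
mkdir(demoDir);
for idx = [1 2 10]
    fclose(fopen(fullfile(demoDir, sprintf('%dEA.txt', idx)), 'w'));
end

% List the matching files and parse the leading integer from each name
listing = dir(fullfile(demoDir, '*EA.txt'));
ids = zeros(1, numel(listing));
for k = 1:numel(listing)
    ids(k) = sscanf(listing(k).name, '%dEA.txt');
end
ids = sort(ids);          % dir sorts names as text, so sort numerically here

rmdir(demoDir, 's');      % clean up the demo directory
```

The main loop could then run over ids (for idx = ids) instead of 1:n.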

Load multiple tab delimited text files

I have boatloads of tab delimited textfiles that contain numerical data in 1000x2 format.
They're named file00001.txt - file10000.txt
I would like to write a script to load each of these files and make a variable containing ONLY the 400th row of the 2nd column of each of these files.
After that I'm going to try and plot a graph with the data I collected - but that's not important here.
I would be very grateful for your help.
Edit -
My most recent endeavour is:
numfiles = 10;
mydata = cell(1, numfiles);
for k = 1:numfiles
myfilename = sprintf('DM0000%d.txt', k);
mydata{k} = importdata(myfilename);
end
I'm running into a few problems -
1) if numfiles is >9, the 10th file data entry in the mydata variable comes up as []. This may have something to do with the naming method of my files? They're named in this fashion:
DM00000 ...DM00009, DM00010, DM00011, etc.
2) Also this is pretty slow to load, someone said using fopen, if so where should I put it in and how?
I'm guessing it'd be somewhere along the lines of fopen('filename', 'r')?
Based on your edit, this is what I'd recommend:
numfiles = 10;
row = 400;
column = 2;
data = zeros(1, numfiles);
for k = 1:numfiles
filename = sprintf('DM%05d.txt', k);
fid = fopen(filename,'r');
tempdata = textscan(fid, '%f%f');
fclose(fid);
data(k) = tempdata{column}(row);
end
I've updated the formatspec in sprintf to create the filenames correctly (you were missing the padding with zeros). I'm using textscan to import the data as doubles (change the %f to something else if required - check out the formatspec documentation). I also changed data to be a matrix rather than a cell array. You mentioned that you'd want to plot the data, so it'll be easier if it's a matrix and I couldn't see any need to use a cell array here.
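To make the padding difference concrete, here is a short sketch comparing the two format strings. The variable names are made up for illustration.

```matlab
% %05d pads the number with zeros to a total width of five digits
padded7  = sprintf('DM%05d.txt', 7);      % 'DM00007.txt'
padded10 = sprintf('DM%05d.txt', 10);     % 'DM00010.txt'

% A fixed run of zeros followed by %d grows by one character at k = 10
naive10  = sprintf('DM0000%d.txt', 10);   % 'DM000010.txt', which does not exist
```

That extra digit is why the tenth file could not be found with the original format string.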

Problem (bug?) loading hexadecimal data into MATLAB

I'm trying to load the following ascii file into MATLAB using load()
% some comment
1 0xc661
2 0xd661
3 0xe661
(This is actually a simplified file. The actual file I'm trying to load contains an undefined number of columns and an undefined number of comment lines at the beginning, which is why the load function was attractive)
For some strange reason, I obtain the following:
K>> data = load('testMixed.txt')
data =
1 50785
2 58977
3 58977
I've observed that the problem occurs anytime there's a "d" in the hexadecimal number.
Direct hex2dec conversion works properly:
K>> hex2dec('d661')
ans =
54881
importdata seems to have the same conversion issue, and so does the ImportWizard:
K>> importdata('testMixed.txt')
ans =
1 50785
2 58977
3 58977
Is that a bug, am I using the load function in some prohibited way, or is there something obvious I'm overlooking?
Are there workarounds around the problem, save from reimplementing the file parsing on my own?
Edited my input file to better reflect my actual file format. I had oversimplified a bit in my original question.
"GOLF" ANSWER:
This starts with the answer from mtrw and shortens it further:
fid = fopen('testMixed.txt','rt');
data = textscan(fid,'%s','Delimiter','\n','MultipleDelimsAsOne','1',...
'CommentStyle','%');
fclose(fid);
data = strcat(data{1},{' '});
data = sscanf([data{:}],'%i',[sum(isspace(data{1})) inf]).';
PREVIOUS ANSWER:
My first thought was to use TEXTSCAN, since it has an option that allows you to ignore certain lines as comments when they start with a given character (like %). However, TEXTSCAN doesn't appear to handle numbers in hexadecimal format well. Here's another option:
fid = fopen('testMixed.txt','r'); % Open file
% First, read all the comment lines (lines that start with '%'):
comments = {};
position = 0;
nextLine = fgetl(fid); % Read the first line
while strcmp(nextLine(1),'%')
comments = [comments; {nextLine}]; % Collect the comments
position = ftell(fid); % Get the file pointer position
nextLine = fgetl(fid); % Read the next line
end
fseek(fid,position,-1); % Rewind to beginning of last line read
% Read numerical data:
nCol = sum(isspace(nextLine))+1; % Get the number of columns
data = fscanf(fid,'%i',[nCol inf]).'; % Note '%i' works for all integer formats
fclose(fid); % Close file
This will work for an arbitrary number of comments at the beginning of the file. The computation to get the number of columns was inspired by Jacob's answer.
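The '%i' conversion is what does the work here: it picks the base from the prefix, so 0x values are read as hexadecimal. A minimal sketch on an in-memory string, using the values from the question (the column count is hard-coded here as an assumption):

```matlab
% The three data rows from the question, flattened into one string
txt = '1 0xc661 2 0xd661 3 0xe661';
nCol = 2;                                   % assumed known for this sketch

% sscanf fills the output column by column, so read as nCol-by-N and transpose
data = sscanf(txt, '%i', [nCol inf]).';
disp(data)
```

The second column comes out as 50785, 54881 and 58977, which agrees with hex2dec('d661') from the question.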
New:
This is the best I could come up with. It should work for any number of comment lines and columns. You'll have to do the rest yourself if there are strings, etc.
% Define the characters representing the start of the commented line
% and the delimiter
COMMENT_START = '%'; % a single character, so l(1)==COMMENT_START is a scalar test
DELIMITER = ' ';
% Open the file
fid = fopen('testMixed.txt');
% Read each line till we reach the data
l = COMMENT_START;
while(l(1)==COMMENT_START)
l = fgetl(fid);
end
% Compute the number of columns
cols = sum(l==DELIMITER)+1;
% Split the first line
split_l = regexp(l,' ','split');
% Read all the data
A = textscan(fid,'%s');
% Compute the number of rows
rows = numel(A{:})/cols;
% Close the file
fclose(fid);
% Assemble all the data into a matrix of cell strings
DATA = [split_l ; reshape(A{:},[cols rows])'];
% Recognize each column and process accordingly
% by analyzing each element in the first row
numeric_data = zeros(size(DATA));
for i=1:cols
str = DATA(1,i);
% If there is no '0x' present
if isempty(strfind(str{1},'0x')) % strfind replaces the deprecated findstr
% This is a number
numeric_data(:,i) = str2num(char(DATA(:,i)));
else
% This is a hexadecimal number
col = char(DATA(:,i));
numeric_data(:,i) = hex2dec(col(:,3:end));
end
end
% Display the data
format short g;
disp(numeric_data)
This works for data like this:
% Comment 1
% Comment 2
1.2 0xc661 10 0xa661
2 0xd661 20 0xb661
3 0xe661 30 0xc661
Output:
1.2 50785 10 42593
2 54881 20 46689
3 58977 30 50785
OLD:
Yeah, I don't think LOAD is the way to go. You could try:
a = char(importdata('testHexa.txt'));
a = hex2dec(a(:,3:end));
This is based on both gnovice's and Jacob's answers, and is a "best of breed" solution.
For files like:
% this is my comment
% this is my other comment
1 0xc661 123
2 0xd661 456
% surprise comment
3 0xe661 789
4 0xb661 1234567
(where the number of columns within the file MUST be the same, but not known ahead of time, and all comments denoted by a '%' character), the following code is fast and easy to read:
f = fopen('hexdata.txt', 'rt');
A = textscan(f, '%s', 'Delimiter', '\n', 'MultipleDelimsAsOne', '1', 'CollectOutput', '1', 'CommentStyle', '%');
fclose(f);
A = A{1};
data = sscanf(A{1}, '%i')';
data = repmat(data, length(A), 1);
for ctr = 2:length(A)
data(ctr,:) = sscanf(A{ctr}, '%i')';
end