Making a matrix of strings read in from file using feof function in MATLAB

I have an index file (called runnumber_odour.txt) that looks like this:
run00001.txt ptol
run00002.txt cdeg
run00003.txt adef
run00004.txt adfg
I need some way of loading this into a matrix in MATLAB, such that I can search the second column for one of those strings, load the corresponding file, and do some data analysis with it (i.e. if I search for "ptol", it should load run00001.txt and analyse the data in that file).
I've tried this:
clear; clc ;
% load index file - runnumber_odour.txt
runnumber_odour = fopen('Runnumber_odour.txt','r');
count = 1;
lines2skip = 0;
while ~feof(runnumber_odour)
    runnumber_odourmat = zeros(817,2);
    if count <= lines2skip
        count = count + 1;
        [~] = fgets(runnumber_odour); % throw away unwanted line
        continue;
    else
        line = strcat(fgets(runnumber_odour));
        runnumber_odourmat = [runnumber_odourmat; cell2mat(textscan(line, '%f')).'];
        count = count + 1;
    end
end
runnumber_odourmat
But that just produces an 817-by-2 matrix of zeros (i.e. nothing gets written to the matrix), and without the line runnumber_odourmat = zeros(817,2); I get the error "Undefined function or variable 'runnumber_odourmat'".
I have also tried this with strtrim instead of strcat, but that fails in the same way.
So, how do I load that file into a matrix in MATLAB?

You can do all of this pretty easily using a containers.Map object, so you will not have to do any searching at all: the second column becomes the key and the first column the value. The code is as follows:
clc; close all; clear all;
fid = fopen('fileList.txt','r'); %# open file for reading
count = 1;
content = {};
lines2skip = 0;
fileMap = containers.Map();
while ~feof(fid)
    if count <= lines2skip
        count = count + 1;
        [~] = fgets(fid); % throw away unwanted line
    else
        line = strtrim(fgets(fid));
        parts = regexp(line, ' ', 'split');
        if numel(parts) >= 2
            fileMap(parts{2}) = parts{1};
        end
        count = count + 1;
    end
end
fclose(fid);
fileName = fileMap('ptol')
% do what you need to do with this filename
This provides quick access to any element by its key.
You can then do what was described in the answer I provided to your previous question.
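For completeness, here is a minimal sketch of how the returned file name could feed the analysis step; the load call and the mean are placeholders, since the question does not show what the run*.txt files actually contain:
fileName = fileMap('ptol');          % e.g. returns 'run00001.txt'
runData  = load(fileName);           % assumes the run file holds plain numeric columns
result   = mean(runData, 1);         % placeholder for the actual analysis
disp(result);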

Related

Matlab: Error using readtable (line 216) Input must be a row vector of characters or string scalar

I got the error "Error using readtable (line 216) Input must be a row vector of characters or string scalar" when I tried to run this code in MATLAB:
clear
close all
clc
D = 'C:\Users\Behzad\Desktop\New folder (2)';
filePattern = fullfile(D, '*.xlsx');
file = dir(filePattern);
x={};
for k = 1 : numel(file)
baseFileName = file(k).name;
fullFileName = fullfile(D, baseFileName);
x{k} = readtable(fullFileName);
fprintf('read file %s\n', fullFileName);
end
% allDates should be out of the loop because it's not necessary to be in the loop
dt1 = datetime([1982 01 01]);
dt2 = datetime([2018 12 31]);
allDates = (dt1 : calmonths(1) : dt2).';
allDates.Format = 'MM/dd/yyyy';
% 1) pre-allocate a cell array that will store
% your tables (see note #3)
T2 = cell(size(x)); % this should work, I don't know what x is
% the x is xlsx files and have different sizes, so I think it should be in
% a loop?
% creating loop
for idx = 1:numel(x)
T = readtable(x{idx});
% 2) This line should probably be T = readtable(x(idx));
sort = sortrows(T, 8);
selected_table = sort (:, 8:9);
tempTable = table(allDates(~ismember(allDates,selected_table.data)), NaN(sum(~ismember(allDates,selected_table.data)),size(selected_table,2)-1),'VariableNames',selected_table.Properties.VariableNames);
T2 = outerjoin(sort,tempTable,'MergeKeys', 1);
% 3) You're overwriting the variable T2 on each iteration of the loop.
% to save each table, do this
T2{idx} = fillmissing(T2, 'next', 'DataVariables', {'lat', 'lon', 'station_elevation'});
end
x holds each xlsx file read in the first loop. My xlsx files have different column and row sizes. I want the second loop to process all of the xlsx files in the directory.
Do you know what the problem is and how to fix it?
readtable takes a filename as its input and returns a table. In your code you have the following:
x{k} = readtable(fullFileName);
All fine, you are reading the tables and storing the contents in x. Later in your code you continue with:
T = readtable(x{idx});
You already read the table; what you wrote is essentially T = readtable(readtable(fullFileName)). Just use T = x{idx} instead.
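A minimal sketch of the corrected second loop (only the first lines change; the rest of the processing stays as in the question). Renaming the sortrows result also avoids shadowing the built-in sort function:
for idx = 1:numel(x)
    T = x{idx};                 % the table was already read in the first loop
    sortedT = sortrows(T, 8);   % avoid shadowing the built-in sort()
    selected_table = sortedT(:, 8:9);
    % ... the tempTable / outerjoin / fillmissing steps continue as before ...
end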

Reading Irregular Text Files with MATLAB

In short, I'm having a headache in multiple languages trying to read a txt file (linked below). My most familiar language is MATLAB, so I'm using it in this example. I've found a way to read this file in about 5 minutes, but since my instrument measures all day, every 30 seconds, I'll shortly have tons and tons of data, and that reading time just isn't feasible.
I'm looking for a way to quickly read these irregular text files so that going forward I can knock these out with less of a time burden.
You can find my exact data at this link:
http://lb3.pandonia.net/BostonMA/Pandora107s1/L0/Pandora107s1_BostonMA_20190814_L0.txt.bz2
I've been using the readtable function in MATLAB and I have achieved the final product I want, but I'm looking to increase the speed.
Below is my code!
clearvars -except pan day1; % Clearing all variables except for the day and instrument variables.
close all;
clc;
pan_mat = [107 139 155 153]; % Matrix of pandora numbers for file-choosing reasons.
pan = pan_mat(pan); % pandora number I'm choosing
pan = num2str(pan); % Turning Pandora number into a string.
%pan = '107'
pandora = strcat('C:\Users\tadams15\Desktop\Folders\Counts\Pandora_Dta\',pan)
% string that designates file location
%date = '90919'
month = '09'; % Month
day2 = strcat('0',num2str(day1)) % Creating a day name for the figure I ultimately produce
cd(pandora)
d2 = strcat('2019',num2str(month),num2str(day2)); % The final date variable for the figure I produce
%file_pan = 'Pandora107s1_BostonMA_20190909_L0';
file_pan = strcat('Pandora',pan,'s1_BostonMA_',d2,'_L0'); % File name string
%Try reading it in line by line?
% Load in as a string and then convert the lines you want as numbers into
% number.
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); % Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%%
A= array2table(A); % Converting Structure matrix back to table
row_num = 0;
pan_mat_2 = zeros(2359,4126);
datetime_mat = zeros(2359,2);
blank = 0;
%% Converting data to proper matrices
[length width] = size(A);
% The matrix below is going through "A" and writing from it to a new
% matrix, "pan_mat_2" which is my final product as well as singling out the
% rows that contain non-number variables I'd like to keep and adding them
% later.
tic
%flag1
for i = 1:length % Make second number the length of the table, A
    blank = 0;
    b = table2array(A{i,1});
    [rows, columns] = size(b);
    if columns > 4120 && columns < 4140
        row_num = row_num + 1;
        blank = regexp(b(2), 'T', 'split');
        blank2 = regexp(blank{1,1}(2), 'Z', 'split');
        datetime_mat(row_num,1) = str2double(blank{1,1}(1));
        datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
        for j = 1:4126
            pan_mat_2(row_num,j) = str2double(b(j));
        end
    end
end
toc
%flag2
In short, I'm already getting the result I want, but the part of the code between "flag1" and "flag2", where I write to a new array, takes roughly 222 seconds while the entire code takes about 248 seconds. I'd like a better way to build the data there than writing it element by element to a new array and spending all that time.
Any suggestions?
Note:
There are quite a few improvements you can make for speed, but there are also corrections. You preallocate your final output variable with hard-coded values:
pan_mat_2 = zeros(2359,4126);
But later you populate it in a loop which runs for i = 1:length.
length is the full number of lines picked from the file. In your example file there are only 784 lines. So even if all your lines were valid (OK to be parsed), you would only ever fill the first 784 of the 2359 rows you allocated in pan_mat_2. In practice, this file has only 400 valid data lines, so your pan_mat_2 could definitely be smaller.
I know you couldn't know you had only 400 valid lines before you parsed them, but you knew from the beginning that you had only 784 lines to parse (you had that information in the variable length). So in cases like these, pre-allocate to 784 rows and discard the empty rows afterwards.
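A minimal sketch of that pre-allocate-then-trim idea (784 here is just the number of lines in the example file; row_num is the counter from the original loop):
pan_mat_2 = zeros(784, 4126);        % pre-allocate to the number of lines read from the file
% ... fill rows 1..row_num inside the parsing loop, exactly as in the original code ...
pan_mat_2 = pan_mat_2(1:row_num, :); % then discard the unused rows afterwards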
Fortunately, the solution I propose does not need to over-allocate and then discard. The matrices end up the right size from the start.
The code:
%%
file_pan = 'Pandora107s1_BostonMA_20190814_L0.txt' ;
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); %Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%% Remove lines which won't be parsed
% Count the number of elements in each line
nelem = cell2mat( cellfun( @size , A ,'UniformOutput',0) ) ;
nelem(:,1) = [] ;
% find which lines do not have enough elements to be parsed
idxLine2Remove = ~(nelem > 4120 & nelem < 4140) ;
% remove them from the data set
A(idxLine2Remove) = [] ;
%% Remove nesting in cell array
nLinesToParse = size(A,1) ;
A = reshape( [A{:}] , [], nLinesToParse ).' ;
% now you have a cell array of size [400x4126] cells
%% Now separate the columns with different data type
% Column 1 => [String] identifier
% Column 2 => Timestamp
% Column 3 to 4125 => Numeric values
% Column 4126 => empty cell created during the 'split' operation above
% because of a trailing space character.
LineIDs = A(:,1) ;
TimeStamps = A(:,2) ;
Data = A(:,3:end-1) ; % fetch to "end-1" to discard last empty column
%% now extract the values
% You could do that directly:
% pan_mat = str2double(Data) ;
% but this takes a long time. A much computationally faster way (even if it
% uses more complex code) would be:
dat = strjoin(Data) ; % create a single long string made of all the strings in all the cells
nums = textscan( dat , '%f' , Inf ) ; % call textscan on it (way faster than str2double() )
pan_mat = reshape( cell2mat( nums ) , nLinesToParse ,[] ) ; % reshape to original dimensions
%% timestamps
% convert to character array
strTimeStamps = char(TimeStamps) ;
% convert to matlab own datetime numbering. This will be a lot faster if
% you have operations to do on the time stamps later
ts = datenum(strTimeStamps,'yyyymmddTHHMMSSZ') ;
%% If you really want them the way you had it in your example
strTimeStamps(:,9) = ' ' ; % replace 'T' with ' '
strTimeStamps(:,end) = ' ' ; % replace 'Z' characters with ' '
%then same again, merge into a long string, parse then reshape accordingly
strdate = reshape(strTimeStamps.',1,[]) ;
tmp = textscan( strdate , '%d' , Inf ) ;
datetime_mat = reshape( double(cell2mat(tmp)),2,[]).' ;
The performance:
On my machine your original code takes ~102 seconds to execute, with 80% of that (81 s) spent calling the function str2double() 3,302,400 times!
My solution, run on the same input file, takes ~5.5 seconds, with half of that time spent on the 3 calls to strjoin().
When you read the code above, try to see how I limited the repetition of function calls in lengthy loops by keeping everything as vectorised as possible.
Using the profiler, you can see that str2double is called 3,302,400 times in a run, which takes about 80% of the total time on my PC. That's suboptimal, as each call only converts one value, and as far as your code goes you don't need the values as strings again. I added this under your original code:
row_num = 0;
pan_mat_2_b = cell(2359,4126);
datetime_mat_b = cell(2359,2);%not zeros
blank = 0;
tic
%flag1
for i = 1:length % Make second number the length of the table, A
    blank = 0;
    b = table2array(A{i,1});
    [rows, columns] = size(b);
    if columns > 4120 && columns < 4140
        row_num = row_num + 1;
        blank = regexp(b(2), 'T', 'split');
        blank2 = regexp(blank{1,1}(2), 'Z', 'split');
        %datetime_mat(row_num,1) = str2double(blank{1,1}(1));
        %datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
        datetime_mat_b(row_num,1) = blank{1,1}(1);
        datetime_mat_b(row_num,2) = blank2{1,1}(1);
        pan_mat_2_b(row_num,:) = b;
        % for j = 1:4126
        %     pan_mat_2(row_num,j) = str2double(b(j));
        % end
    end
end
datetime_mat_b = datetime_mat_b(~all(cellfun('isempty',datetime_mat_b),2),:);
pan_mat_2_b=pan_mat_2_b(~all(cellfun('isempty',pan_mat_2_b),2),:);
datetime_mat_b=str2double(string(datetime_mat_b));
pan_mat_2_b=str2double(pan_mat_2_b);
toc
Still not great, but better. If you want to speed this up further, I recommend you take a closer look at the readtable part, as you can save quite some time if you read the numeric fields in as doubles right from the beginning.
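As a rough illustration of that direction (only a sketch; the field counts and column layout are assumptions taken from the answers above, not a tuned implementation): one way to avoid calling str2double per element is to keep only the lines with a plausible field count and hand each line's numeric part to a single sscanf call:
file_pan = 'Pandora107s1_BostonMA_20190814_L0.txt'; % same file as above
headerlinesIn = 41;
txt      = fileread(file_pan);                 % whole file as one char vector
txtLines = strsplit(txt, '\n');                % split into lines
txtLines = txtLines(headerlinesIn+1:end);      % drop the header lines
parsed = {};
for k = 1:numel(txtLines)
    parts = strsplit(strtrim(txtLines{k}), ' ');
    if numel(parts) > 4120 && numel(parts) < 4140
        % first token is an identifier, second a timestamp; the rest are numeric
        parsed{end+1} = sscanf(strjoin(parts(3:end), ' '), '%f').'; %#ok<AGROW>
    end
end
% parsed holds one numeric row vector per valid data line; if every line carries
% the same number of values you can stack them with vertcat(parsed{:})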

MATLAB: export scalars within for loop to text file

I have a large number of text files that I have to read, find the max value for a certain column, and the corresponding time. The for loop for finding these values works fine, but my problem is writing a text file that shows the three variables I need (thisfilename, M, and wavetime) for each iteration of the for loop.
Output_FileName_MaxWaveHeights = ['C:\Users\jl44459\Desktop\QGIS_and_Basement\BASEMENT\Mesh_5_2045\Run_A\','MaxWaveHeights.txt'];
writefile = fopen(Output_FileName_MaxWaveHeights,'a');
dinfo = dir('*.dat');
for K = 1 : length(dinfo)
thisfilename = dinfo(K).name; %just the name of the file
fileID = fopen(thisfilename); %creates numerical ID for the file name
thisdata = textscan(fileID,'%f64%f64%f64%f64%f64%f64%f64',500,'HeaderLines',1); %load just this file
thisdataM = cell2mat(thisdata); %transforms file from cell array to matrix
[M,I] = max(thisdataM(:,5)); %finds max WSE and row it's in
wavetime = 2*(I-1); %converts column of max WSE to time
fprintf(writefile,'%s %8.4f %4.0f \r\n',thisfilename,M,wavetime);
fclose(fileID); %closes file to make space for next one
end
The text file ends up just giving me the values for one iteration instead of all of them. I was able to use displaytable as a workaround, but then I have problems writing "thisfilename", which includes non-numerical characters.
Although I am not able to reproduce the issue with the code provided, a possible solution might be to write to the file outside of the loop and to close the file afterwards:
Output_FileName_MaxWaveHeights = ['C:\Users\jl44459\Desktop\QGIS_and_Basement\BASEMENT\Mesh_5_2045\Run_A\','MaxWaveHeights.txt'];
writefile = fopen(Output_FileName_MaxWaveHeights,'a');
s = [];
dinfo = dir('*.dat');
for K = 1 : length(dinfo)
thisfilename = dinfo(K).name; %just the name of the file
fileID = fopen(thisfilename); %creates numerical ID for the file name
thisdata = textscan(fileID,'%f64%f64%f64%f64%f64%f64%f64',500,'HeaderLines',1); %load just this file
thisdataM = cell2mat(thisdata); %transforms file from cell array to matrix
[M,I] = max(thisdataM(:,5)); %finds max WSE and row it's in
wavetime = 2*(I-1); %converts column of max WSE to time
s = [s, sprintf('%s %8.4f %4.0f \r\n',thisfilename,M,wavetime)]; %accumulate the formatted line
fclose(fileID); %closes file to make space for next one
end
fprintf(writefile,'%s',s);
fclose(writefile);
Solved--it was simply me forgetting to close the output file after the loop. Thanks for the help!
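For reference, a minimal sketch of that fix applied to the original loop (the per-file reading and max-finding are elided; only the handling of the output file changes):
writefile = fopen(Output_FileName_MaxWaveHeights, 'a');
dinfo = dir('*.dat');
for K = 1:length(dinfo)
    thisfilename = dinfo(K).name;
    % ... open the file and compute M and wavetime exactly as before ...
    fprintf(writefile, '%s %8.4f %4.0f \r\n', thisfilename, M, wavetime);
end
fclose(writefile); % closing (and thereby flushing) the output file was the missing step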

fprintf Octave - Data corruption

I am trying to write data to .txt files. Each of the files is around 170MB (after writing data to it).
I am using Octave's fprintf function with '%.8f' to write floating-point values to a file. However, I am noticing a very weird error: a subset of entries in some of the files is getting corrupted. For example, one of the lines in a file is this:
0.43529412,0.}4313725,0.43137255,0.33233533,...
that "}" should have been "4". Now how did octave's fprintf write that "}" with '%.8f' option in the first place? What is going wrong?
Another example is,
0.73289\8B987,...
how did that "\8B" get there?
I have to process a very large data set with 360 million points in total. This error in a subset of rows in some files is becoming a big problem. What is causing it?
Also, this corruption doesn't occur at random. For example, if a file has 1.1 million rows, where each row corresponds to a vector representing a data instance, then the problem occurs in at most, say, 100 rows, and those 100 rows are clustered together. For example, they might be distributed from row 8000 to 8150; it is never the case that, out of 100 corrupted rows, the first 50 are located near, say, the 10000th row and the remaining near the 20000th row. They always form a single cluster.
Note: the code block below is responsible for extracting the data and writing it to the files. Some variables in the code, like K_Cell, have been computed earlier and play virtually no role in the data-writing process.
mf = fspecial('gaussian',[5 5], 2);
fidM = fopen('14_01_2016_Go_AeossRight_ClustersM_wLAMRD.txt','w');
fidC = fopen('14_01_2016_Go_AeossRight_ClustersC_wLAMRD.txt','w');
fidW = fopen('14_01_2016_Go_AeossRight_ClustersW_wLAMRD.txt','w');
kIdx = 1;
featMat = [];
% - Generate file names to print the data to
featNo = 0;
fileNo = 1;
filePath = 'wLRD10_Data_Road/featMat_';
fileName = [filePath num2str(fileNo) '.txt'];
fidFeat = fopen(fileName, 'w');
% - Compute the global means and standard deviations
gMean = zeros(1,13); % - Global mean
gStds = zeros(1,13); % - Global variance
gNpts = 0; % - Total number of data points
fidStat = fopen('wLRD10_Data_Road/featStat.txt','w');
for i=1600:10:10000
if (featNo > 1000000)
% - If more than 1m points, close the file and open new one
fclose(fidFeat);
% - Get the new file name
fileNo = fileNo + 1;
fileName = [filePath num2str(fileNo) '.txt'];
fidFeat = fopen(fileName, 'w');
featNo = 0;
end
imgName = [fAddr num2str(i-1) '.jpg'];
img = imread(imgName);
Ir = im2double(img(:,:,1));
Ig = im2double(img(:,:,2));
Ib = im2double(img(:,:,3));
imgR = filter2(mf, Ir);
imgG = filter2(mf, Ig);
imgB = filter2(mf, Ib);
I = im2double(img);
I(:,:,1) = imgR;
I(:,:,2) = imgG;
I(:,:,3) = imgB;
I = im2uint8(I);
[Feat1, Feat2] = funcFeatures1(I);
[Feat3, Feat4] = funcFeatures2(I);
[Feat5, Feat6, Feat7] = funcFeatures3(I);
[Feat8, Feat9, Feat10] = funcFeatures4(I);
ids = K_Cell{kIdx};
pixVec = zeros(length(ids),13); % - Get the local image features
for s = 1:length(ids) % - Extract features
pixVec(s,:) = [Ir(ids(s,1),ids(s,2)) Ig(ids(s,1),ids(s,2)) Ib(ids(s,1),ids(s,2)) Feat1(ids(s,1),ids(s,2)) Feat2(ids(s,1),ids(s,2)) Feat3(ids(s,1),ids(s,2)) Feat4(ids(s,1),ids(s,2)) ...
Feat5(ids(s,1),ids(s,2)) Feat6(ids(s,1),ids(s,2)) Feat7(ids(s,1),ids(s,2)) Feat8(ids(s,1),ids(s,2))/100 Feat9(ids(s,1),ids(s,2))/500 Feat10(ids(s,1),ids(s,2))/200];
end
kIdx = kIdx + 1;
for s=1:length(ids)
featNo = featNo + 1;
fprintf(fidFeat,'%d,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f\n', featNo, pixVec(s,:));
end
% - Compute the mean and variances
for s = 1:length(ids)
gNpts = gNpts + 1;
delta = pixVec(s,:) - gMean;
gMean = gMean + delta./gNpts;
gStds = gStds*(gNpts-1)/gNpts + delta.*(pixVec(s,:) - gMean)/gNpts;
end
end
Note that the code block:
for s=1:length(ids)
featNo = featNo + 1;
fprintf(fidFeat,'%d,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f,%.8f\n', featNo, pixVec(s,:));
end
is the only part of the code that writes the data-points to the files.
The earlier code-block,
if (featNo > 1000000)
% - If more than 1m points, close the file and open new one
fclose(fidFeat);
% - Get the new file name
fileNo = fileNo + 1;
fileName = [filePath num2str(fileNo) '.txt'];
fidFeat = fopen(fileName, 'w');
featNo = 0;
end
opens a new file for writing the data to it, when the currently opened file exceeds the limit of 1 million data-points.
Furthermore, note that the pixVec variable cannot contain anything other than float/double values, or Octave will throw an error.

Reading data from a Text File into Matlab array

I am having difficulty in reading data from a .txt file using Matlab.
I have to create a 200x128 dimension array in Matlab, using the data from the .txt file. This is a repetitive task, and needs automation.
Each row of the .txt file is a complex number of the form a+ib, written as a[space]b. A sample of my text file:
(0)
1.2 2.32222
2.12 3.113
.
.
.
3.2 2.22
(1)
4.4 3.4444
2.33 2.11
2.3 33.3
.
.
.
(2)
.
.
(3)
.
.
(199)
.
.
The row indices (X), surrounded by brackets, appear inside the .txt file. My final matrix should be of size 200x128. After each (X), there are exactly 128 complex numbers.
Here is what I would do. First, delete the "(0)"-type lines from your text file (you could even use a simple shell script for that). I put the result into a file called post2.txt.
% First, load the text file into Matlab:
A = load('post2.txt');
% Create the complex numbers from the two columns of data:
vals = A(:,1) + 1i*A(:,2);
% Then reshape the column of complex numbers into a matrix;
% the transpose makes each "(X)" block of 128 values one row:
mat = reshape(vals, [128, 200]).';
mat will be a 200x128 complex matrix. Obviously, at this point you can put a loop around this to process multiple files.
Hope that helps.
You can read the data in using the following function:
function data = readData(aFilename, m, n)
% if no parameters were passed, use these as defaults:
if ~exist('aFilename', 'var')
    m = 128;
    n = 200;
    aFilename = 'post.txt';
end
% init some stuff:
data = nan(n, m);
formatStr = repmat('%f', 1, 2*m);
% Read in the Data:
fid = fopen(aFilename);
for ind = 1:n
    lineID = fgetl(fid); % consume the "(X)" marker line
    dataLine = fscanf(fid, formatStr);
    dataLineComplex = dataLine(1:2:end) + dataLine(2:2:end)*1i;
    data(ind, :) = dataLineComplex;
end
fclose(fid);
(edit) This function can be improved by including the (1) parts in the format string and throwing them out:
function data = readData(aFilename, m, n)
% if no parameters were passed, use these as defaults:
if ~exist('aFilename', 'var')
    m = 128;
    n = 200;
    aFilename = 'post.txt';
end
% init format stuff: skip the "(X)" marker line, then read m real/imag pairs
formatStr = ['(%*d)\n' repmat('%f%f\n', 1, m)];
% Read in the Data:
fid = fopen(aFilename);
data = fscanf(fid, formatStr);
data = data(1:2:end) + data(2:2:end)*1i;
% reshape so that each "(X)" block of m values becomes one row of the n-by-m result
data = reshape(data, m, n).';
fclose(fid);
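A usage sketch for this second version, assuming the bracketed "(0)", "(1)", ... marker lines are still present in post.txt:
data = readData('post.txt', 128, 200); % 200x128 complex matrix, one row per "(X)" block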