4,7,33 308:0.364759856031284 1156:0.273818346738286 1523:0.17279792082766 9306:0.243665855423149
7,4,33 1156:0.185729429759684 1681:0.104443202690279 5351:0.365670526234034 6212:0.0964006003127458
I have a textfile in the above format. The first 3 columns are labels and need to be extracted into another file keeping the order intact.
As one proceeds through the file, each row has varying number of labels. I have been reading the csvread function in Matlab but that is generic and would not work in the above case.
Next, I need to extract
308:0.364759856031284 1156:0.273818346738286 1523:0.17279792082766 9306:0.243665855423149
such that at column 308 in the Matrix for Row 1, I put in the value 0.364759856031284.
Since your row format changes you need to read the file line by line
fid = fopen(filename, 'rt'); % open file for text reading
resultMatrix = []; % start with empty matrix
while 1
line = fgetl(fid);
if ~ischar(line), break, end
% get the value separators, :
seps = strfind(line, ':');
% get labels
whitespace = iswhite(line(1:seps(1)-1)); % get white space
lastWhite = find(whitespace, 'last'); % get last white space pos
labelString = deblank(line(1:lastWhite)); % string with all labels separated by ,
labelString(end+1) = ','; % add final , to match way of extraction
ix = strfind(labelString, ','); % get pos of ,
labels = zeros(numel(ix),1); % create the labels array
start = 1;
for n1 = 1:numel(labels)
labels(n1,1) = str2num(labelString(start:ix(n1)-1));
end
% continue and do a similar construction for the COL:VALUE pairs
% If you do not know how many rows and columns your final matrix will have
% you need to continuously update the size 1 row for each loop and
% then add new columns if any of the current rows COLs are outside current matrix
cols = zeros(numel(seps,1);
values = cols;
for n1 = 1:numel(seps)
% find out current columns start pos and value end pos using knowledge of
% the separator, :, position and that whitespace separates COL:VALUE pairs
cols(n1,1) = str2num(line(colStart:seps(n1)-1));
values(n1,1) = str2num(line(seps(n1)+1:valueEnd));
end
if isempty(resultMatrix)
resultMatrix = zeros(1,max(cols)); % make the matrix large enough to hold the last col
else
[nR,nC] = size(resultMatrix); % get current size
if max(cols) > nC
% add columns to resultMatrix so we still can fit column data
resultMatrix = [resultMatrix, zeros(nR, max(cols)-nC)];
end
% add row
resultMatrix = [resultMatrix;zeros(1,max(nC,max(cols)))];
end
% loop through cols and values and populate the resultMatrix with wanted data
end
fclose(fid);
Note, the above code is untested so it will contain errors but it should be easy to fix.
I purposely left out minor parts and it should not be too difficult filling that out.
Related
I gave the error Error using readtable (line 216) Input must be a row vector of characters or string scalar when I tried to run this code in Matlab:
clear
close all
clc
D = 'C:\Users\Behzad\Desktop\New folder (2)';
filePattern = fullfile(D, '*.xlsx');
file = dir(filePattern);
x={};
for k = 1 : numel(file)
baseFileName = file(k).name;
fullFileName = fullfile(D, baseFileName);
x{k} = readtable(fullFileName);
fprintf('read file %s\n', fullFileName);
end
% allDates should be out of the loop because it's not necessary to be in the loop
dt1 = datetime([1982 01 01]);
dt2 = datetime([2018 12 31]);
allDates = (dt1 : calmonths(1) : dt2).';
allDates.Format = 'MM/dd/yyyy';
% 1) pre-allocate a cell array that will store
% your tables (see note #3)
T2 = cell(size(x)); % this should work, I don't know what x is
% the x is xlsx files and have different sizes, so I think it should be in
% a loop?
% creating loop
for idx = 1:numel(x)
T = readtable(x{idx});
% 2) This line should probably be T = readtable(x(idx));
sort = sortrows(T, 8);
selected_table = sort (:, 8:9);
tempTable = table(allDates(~ismember(allDates,selected_table.data)), NaN(sum(~ismember(allDates,selected_table.data)),size(selected_table,2)-1),'VariableNames',selected_table.Properties.VariableNames);
T2 = outerjoin(sort,tempTable,'MergeKeys', 1);
% 3) You're overwriting the variabe T2 on each iteration of the i-loop.
% to save each table, do this
T2{idx} = fillmissing(T2, 'next', 'DataVariables', {'lat', 'lon', 'station_elevation'});
end
the x is each xlsx file from the first loop. my xlsx file has a different column and row size. I want to make the second loop process for all my xlsx files in the directory.
did you know what is the problem? and how to fix it?
Readtable has one input argument, a filename. It returns a table. In your code you have the following:
x{k} = readtable(fullFileName);
All fine, you are reading the tables and storing the contents in x. Later in your code you continue with:
T = readtable(x{idx});
You already read the table, what you wrote is basically T = readtable(readtable(fullFileName)). Just use T=x{idx}
In short, I'm having a headache in multiple languages to read a txt file (linked below). My most familiar language is MATLAB so for that reason I'm using that in this example. I've found a way to read this file in ~ 5 minutes, but given I'll have tons and tons of data from my instrument shortly as it measures all day every 30 seconds this just isn't feasible.
I'm looking for a way to quickly read these irregular text files so that going forward I can knock these out with less of a time burden.
You can find my exact data at this link:
http://lb3.pandonia.net/BostonMA/Pandora107s1/L0/Pandora107s1_BostonMA_20190814_L0.txt.bz2
I've been using the "readtable" function in matlab and I have achieved a final product I want but I'm looking to increase the speed
Below is my code!
clearvars -except pan day1; % Clearing all variables except for the day and instrument variables.
close all;
clc;
pan_mat = [107 139 155 153]; % Matrix of pandora numbers for file-choosing
reasons.
pan = pan_mat(pan); % pandora number I'm choosing
pan = num2str(pan); % Turning Pandora number into a string.
%pan = '107'
pandora = strcat('C:\Users\tadams15\Desktop\Folders\Counts\Pandora_Dta\',pan)
% string that designates file location
%date = '90919'
month = '09'; % Month
day2 = strcat('0',num2str(day1)) % Creating a day name for the figure I ultimately produce
cd(pandora)
d2 = strcat('2019',num2str(month),num2str(day2)); % The final date variable
for the figure I produce
%file_pan = 'Pandora107s1_BostonMA_20190909_L0';
file_pan = strcat('Pandora',pan,'s1_BostonMA_',d2,'_L0'); % File name string
%Try reading it in line by line?
% Load in as a string and then convert the lines you want as numbers into
% number.
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); %Reading the
file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%%
A= array2table(A); % Converting Structure matrix back to table
row_num = 0;
pan_mat_2 = zeros(2359,4126);
datetime_mat = zeros(2359,2);
blank = 0;
%% Converting data to proper matrices
[length width] = size(A);
% The matrix below is going through "A" and writing from it to a new
% matrix, "pan_mat_2" which is my final product as well as singling out the
% rows that contain non-number variables I'd like to keep and adding them
% later.
tic
%flag1
for i = 1:length; % Make second number the length of the table, A
blank = 0;
b = table2array(A{i,1});
[rows, columns] = size(b);
if columns > 4120 && columns < 4140
row_num = row_num + 1;
blank = regexp(b(2), 'T', 'split');
blank2 = regexp(blank{1,1}(2), 'Z', 'split');
datetime_mat(row_num,1) = str2double(blank{1,1}(1));
datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
for j = 1:4126;
pan_mat_2(row_num,j) = str2double(b(j));
end
end
end
toc
%flag2
In short, I'm already getting the result I want but the part of the code where I'm writing to a new array "flag 1" to "flag 2" is taking roughly 222 seconds while the entire code only takes about 248 seconds. I'd like to find a better way to create the data there than to write it to a new array and take a whole bunch of time.
Any suggestions?
Note:
There are a quite a few improvments you can make for speed but there are also corrections. You preallocate you final output variable with hard coded values:
pan_mat_2 = zeros(2359,4126);
But later you populate it in a loop which run for i = 1:length.
length is the full number of lines picked from the file. In your example file there are only 784 lines. So even if all your line were valid (ok to be parsed), you would only ever fill the first 784 lines of the total 2359 lines you allocated in your pan_mat_2. In practice, this file has only 400 valid data lines, so your pan_mat_2 could definitely be smaller.
I know you couldn't know you had only 400 line parsed before you parsed them, but you knew from the beginning that you had only 784 line to parse (you had the info in the variable length). So in case like these pre-allocate to 784 and only later discard the empty lines.
Fortunately, the solution I propose does not need to pre-allocate larger then discard. The matrices will end up the right size from the start.
The code:
%%
file_pan = 'Pandora107s1_BostonMA_20190814_L0.txt' ;
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); %Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%% Remove lines which won't be parsed
% Count the number of elements in each line
nelem = cell2mat( cellfun( #size , A ,'UniformOutput',0) ) ;
nelem(:,1) = [] ;
% find which lines does not have enough elements to be parsed
idxLine2Remove = ~(nelem > 4120 & nelem < 4140) ;
% remove them from the data set
A(idxLine2Remove) = [] ;
%% Remove nesting in cell array
nLinesToParse = size(A,1) ;
A = reshape( [A{:}] , [], nLinesToParse ).' ;
% now you have a cell array of size [400x4126] cells
%% Now separate the columns with different data type
% Column 1 => [String] identifier
% Column 2 => Timestamp
% Column 3 to 4125 => Numeric values
% Column 4126 => empty cell created during the 'split' operation above
% because of a trailing space character.
LineIDs = A(:,1) ;
TimeStamps = A(:,2) ;
Data = A(:,3:end-1) ; % fetch to "end-1" to discard last empty column
%% now extract the values
% You could do that directly:
% pan_mat = str2double(Data) ;
% but this takes a long time. A much computationnaly faster way (even if it
% uses more complex code) would be:
dat = strjoin(Data) ; % create a single long string made of all the strings in all the cells
nums = textscan( dat , '%f' , Inf ) ; % call textscan on it (way faster than str2double() )
pan_mat = reshape( cell2mat( nums ) , nLinesToParse ,[] ) ; % reshape to original dimensions
%% timestamps
% convert to character array
strTimeStamps = char(TimeStamps) ;
% convert to matlab own datetime numbering. This will be a lot faster if
% you have operations to do on the time stamps later
ts = datenum(strTimeStamps,'yyyymmddTHHMMSSZ') ;
%% If you really want them the way you had it in your example
strTimeStamps(:,9) = ' ' ; % replace 'T' with ' '
strTimeStamps(:,end) = ' ' ; % replace 'Z' characters with ' '
%then same again, merge into a long string, parse then reshape accordingly
strdate = reshape(strTimeStamps.',1,[]) ;
tmp = textscan( strdate , '%d' , Inf ) ;
datetime_mat = reshape( double(cell2mat(tmp)),2,[]).' ;
The performance:
As you can see on my machine your original code takes ~102 seconds to execute, with 80% of that (81s) spent on calling the function str2double() 3,302,400 times!
My solution, run on the same input file, takes ~5.5 seconds, with half of the time spent on calling strjoin() 3 times.
When you read the code above, try to understand how I limited the repetition of function call in lengthy loops by trying to keep everything as vectorised as possible.
Using the profiler, you can see that you call str2double 3302400 times in a run which takes about 80% of the total time on my pc. Now thats suboptimal, as each time you only translate 1 value and as far as your code goes you dont need the values as string again. I added this under you original code:
row_num = 0;
pan_mat_2_b = cell(2359,4126);
datetime_mat_b = cell(2359,2);%not zeros
blank = 0;
tic
%flag1
for i = 1:length % Make second number the length of the table, A
blank = 0;
b = table2array(A{i,1});
[rows, columns] = size(b);
if columns > 4120 && columns < 4140
row_num = row_num + 1;
blank = regexp(b(2), 'T', 'split');
blank2 = regexp(blank{1,1}(2), 'Z', 'split');
%datetime_mat(row_num,1) = str2double(blank{1,1}(1));
%datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
datetime_mat_b(row_num,1) = blank{1,1}(1);
datetime_mat_b(row_num,2) = blank2{1,1}(1);
pan_mat_2_b(row_num,:) = b;
% for j = 1:4126
% pan_mat_2(row_num,j) = str2double(b(j));
% end
end
end
datetime_mat_b = datetime_mat_b(~all(cellfun('isempty',datetime_mat_b),2),:);
pan_mat_2_b=pan_mat_2_b(~all(cellfun('isempty',pan_mat_2_b),2),:);
datetime_mat_b=str2double(string(datetime_mat_b));
pan_mat_2_b=str2double(pan_mat_2_b);
toc
Still not great, but better. If you want to speed this up further i recommend you take a closer look at the readtable part. As you can save up quite some time if you start with reading in the format as doubles right from the beginning
I'm trying to make some changes in a list of precipitation data.
The data are in a *.txt file in this format:
50810,200301010600,0.0
50810,200301010601,0.0
50810,200301010800,0.0
50810,200301010938,0.1
50810,200301010947,0.1
50810,200301010957,0.1
Each file I have contains precipitation data per minute from 1 year.
The first number is just the station number, and this is equal for each line in the file.
In my original file, I have one line with data for every minute.
I want to make a new file that contains:
Only lines with precipitation that is not zero, so I want to remove
all the lines where the last number is zero. I've figured this out.
If there is one whole hour with no precipitation, I want to remove all
of the zero-lines and create a new line which says:
50810,200301010600,0.0. If there was no precipitation between 6 and 7
am. at the 01.01.
clear all
data = load('jan-31des_2003.txt'); %opens file with data
fid=fopen('50810_2003','w'); %opens empty file to write
[nrow, ncol] = size(data); %size of data
fprintf(fid,'%5s %12s %5s \r\n','Snr','Dato - Tid','RR_01') %Header
for row = 1:nrow
y = data(row,2); %year
m = data(row,3); % month
d = data(row,4); % date
h = data(row,5); % hour
M = data(row,6); % minute
p = data(row,8); % precipitation
if p > 0
fprintf(fid,'50810,%04d%02d%02d%02d%02d,%.1f \r\n',[y,m,d,h,M,p]);
end
if p==0
HERE I NEED SOME HELP
end
end
fclose(fid);
What is the code for my desired formatting within the if p==0 condition?
Assuming you've already imported your data as an M*3 array named infile and M is divisible by 60 and your data begins at the first minute of an hour, and you have an array outfile that will be written to a new file:
outfile = []
while ~isempty(infile)
block = infile(1:60,:); % Take one hour's worth of data
if sum (block(:,3)==0) == 60 % Check if the last column are all zero
outfile = cat(1, outfile, [block(1,1:2), 0]); % If so, put one line to new array
else
filter = block(:,3)~=0; % Otherwise, pick out only rows that the last column is not zeros
outfile = cat(1, outfile, block(filter,:)); % Put these rows into the new array
end
infile(1:60,:) = []; % Remove already processed data from original array
end
Then write the entire outfile array to file.
Can someone help me to understand how I can save in matlab a group of .csv files, select only the columns in which I am interested and get as output a final file in which I have the average value of the y columns and standard deviation of y axes? I am not so good in matlab and so I kindly ask if someone to help me to solve this question.
Here what I tried to do till now:
clear all;
clc;
which_column = 5;
dirstats = dir('*.csv');
col3Complete=0;
col4Complete=0;
for K = 1:length(dirstats)
[num,txt,raw] = xlsread(dirstats(K).name);
col3=num(:,3);
col4=num(:,4);
col3Complete=[col3Complete;col3];
col4Complete=[col4Complete;col4];
avgVal(K)=mean(col4(:));
end
col3Complete(1)=[];
col4Complete(1)=[];
%columnavg = mean(col4Complete);
%columnstd = std(col4Complete);
% xvals = 1 : size(columnavg,1);
% plot(xvals, columnavg, 'b-', xvals, columnavg-columnstd, 'r--', xvals, columnavg+columstd, 'r--');
B = reshape(col4Complete,[5000,K]);
m=mean(B,2);
C = reshape (col4Complete,[5000,K]);
S=std(C,0,2);
Now I know that I should compute mean and stdeviation inside for loop, using mean()function, but I am not sure how I can use it.
which_column = 5;
dirstats = dir('*.csv');
col3Complete=[]; % Initialise as empty matrix
col4Complete=[];
avgVal = zeros(length(dirstats),2); % initialise as columnvector
for K = 1:length(dirstats)
[num,txt,raw] = xlsread(dirstats(K).name);
col3=num(:,3);
col4=num(:,4);
col3Complete=[col3Complete;col3];
col4Complete=[col4Complete;col4];
avgVal(K,1)=mean(col4(:)); % 1st column contains mean
avgVal(K,2)=std(col4(:)); % 2nd column contains standard deviation
end
%columnavg = mean(col4Complete);
%columnstd = std(col4Complete);
% xvals = 1 : size(columnavg,1);
% plot(xvals, columnavg, 'b-', xvals, columnavg-columnstd, 'r--', xvals, columnavg+columstd, 'r--');
B = reshape(col4Complete,[5000,K]);
meanVals=mean(B,2);
I didn't change much, just initialised your arrays as empty arrays so you do not have to delete the first entry later on and made avgVal a column vector with the mean in column 1 and the standard deviation in column 1. You can of course add two columns if you want to collect those statistics for your 3rd column in the csv as well.
As a side note: xlsread is rather heavy for reading files, since Excel is horribly inefficient. If you want to read a structured file such as a csv, it's faster to use importdata.
Create some random matrix to store in a file with header:
A = rand(1e3,5);
out = fopen('output.csv','w');
fprintf(out,['ColumnA', '\t', 'ColumnB', '\t', 'ColumnC', '\t', 'ColumnD', '\t', 'ColumnE','\n']);
fclose(out);
dlmwrite('output.csv', A, 'delimiter','\t','-append');
Load it using csvread:
data = csvread('output.csv',1);
data now contains your five columns, without any headers.
I'm trying to load the following ascii file into MATLAB using load()
% some comment
1 0xc661
2 0xd661
3 0xe661
(This is actually a simplified file. The actual file I'm trying to load contains an undefined number of columns and an undefined number of comment lines at the beginning, which is why the load function was attractive)
For some strange reason, I obtain the following:
K>> data = load('testMixed.txt')
data =
1 50785
2 58977
3 58977
I've observed that the problem occurs anytime there's a "d" in the hexadecimal number.
Direct hex2dec conversion works properly:
K>> hex2dec('d661')
ans =
54881
importdata seems to have the same conversion issue, and so does the ImportWizard:
K>> importdata('testMixed.txt')
ans =
1 50785
2 58977
3 58977
Is that a bug, am I using the load function in some prohibited way, or is there something obvious I'm overlooking?
Are there workarounds around the problem, save from reimplementing the file parsing on my own?
Edited my input file to better reflect my actual file format. I had a bit oversimplified in my original question.
"GOLF" ANSWER:
This starts with the answer from mtrw and shortens it further:
fid = fopen('testMixed.txt','rt');
data = textscan(fid,'%s','Delimiter','\n','MultipleDelimsAsOne','1',...
'CommentStyle','%');
fclose(fid);
data = strcat(data{1},{' '});
data = sscanf([data{:}],'%i',[sum(isspace(data{1})) inf]).';
PREVIOUS ANSWER:
My first thought was to use TEXTSCAN, since it has an option that allows you to ignore certain lines as comments when they start with a given character (like %). However, TEXTSCAN doesn't appear to handle numbers in hexadecimal format well. Here's another option:
fid = fopen('testMixed.txt','r'); % Open file
% First, read all the comment lines (lines that start with '%'):
comments = {};
position = 0;
nextLine = fgetl(fid); % Read the first line
while strcmp(nextLine(1),'%')
comments = [comments; {nextLine}]; % Collect the comments
position = ftell(fid); % Get the file pointer position
nextLine = fgetl(fid); % Read the next line
end
fseek(fid,position,-1); % Rewind to beginning of last line read
% Read numerical data:
nCol = sum(isspace(nextLine))+1; % Get the number of columns
data = fscanf(fid,'%i',[nCol inf]).'; % Note '%i' works for all integer formats
fclose(fid); % Close file
This will work for an arbitrary number of comments at the beginning of the file. The computation to get the number of columns was inspired by Jacob's answer.
New:
This is the best I could come up with. It should work for any number of comment lines and columns. You'll have to do the rest yourself if there are strings, etc.
% Define the characters representing the start of the commented line
% and the delimiter
COMMENT_START = '%%';
DELIMITER = ' ';
% Open the file
fid = fopen('testMixed.txt');
% Read each line till we reach the data
l = COMMENT_START;
while(l(1)==COMMENT_START)
l = fgetl(fid);
end
% Compute the number of columns
cols = sum(l==DELIMITER)+1;
% Split the first line
split_l = regexp(l,' ','split');
% Read all the data
A = textscan(fid,'%s');
% Compute the number of rows
rows = numel(A{:})/cols;
% Close the file
fclose(fid);
% Assemble all the data into a matrix of cell strings
DATA = [split_l ; reshape(A{:},[cols rows])']; %' adding this to make it pretty in SO
% Recognize each column and process accordingly
% by analyzing each element in the first row
numeric_data = zeros(size(DATA));
for i=1:cols
str = DATA(1,i);
% If there is no '0x' present
if isempty(findstr(str{1},'0x')) == true
% This is a number
numeric_data(:,i) = str2num(char(DATA(:,i)));
else
% This is a hexadecimal number
col = char(DATA(:,i));
numeric_data(:,i) = hex2dec(col(:,3:end));
end
end
% Display the data
format short g;
disp(numeric_data)
This works for data like this:
% Comment 1
% Comment 2
1.2 0xc661 10 0xa661
2 0xd661 20 0xb661
3 0xe661 30 0xc661
Output:
1.2 50785 10 42593
2 54881 20 46689
3 58977 30 50785
OLD:
Yeah, I don't think LOAD is the way to go. You could try:
a = char(importdata('testHexa.txt'));
a = hex2dec(a(:,3:end));
This is based on both gnovice's and Jacob's answers, and is a "best of breed"
For files like:
% this is my comment
% this is my other comment
1 0xc661 123
2 0xd661 456
% surprise comment
3 0xe661 789
4 0xb661 1234567
(where the number of columns within the file MUST be the same, but not known ahead of time, and all comments denoted by a '%' character), the following code is fast and easy to read:
f = fopen('hexdata.txt', 'rt');
A = textscan(f, '%s', 'Delimiter', '\n', 'MultipleDelimsAsOne', '1', 'CollectOutput', '1', 'CommentStyle', '%');
fclose(f);
A = A{1};
data = sscanf(A{1}, '%i')';
data = repmat(data, length(A), 1);
for ctr = 2:length(A)
data(ctr,:) = sscanf(A{ctr}, '%i')';
end