I'm trying to make some changes in a list of precipitation data.
The data are in a *.txt file in this format:
50810,200301010600,0.0
50810,200301010601,0.0
50810,200301010800,0.0
50810,200301010938,0.1
50810,200301010947,0.1
50810,200301010957,0.1
Each file I have contains precipitation data per minute from 1 year.
The first number is just the station number, and this is equal for each line in the file.
In my original file, I have one line with data for every minute.
I want to make a new file that contains:
Only lines with non-zero precipitation, so I want to remove all the lines where the last number is zero. I've figured this out.
If there is a whole hour with no precipitation, I want to remove all of the zero lines for that hour and create a single new line instead, for example 50810,200301010600,0.0 if there was no precipitation between 6 and 7 am on 01.01.
clear all
data = load('jan-31des_2003.txt'); %opens file with data
fid=fopen('50810_2003','w'); %opens empty file to write
[nrow, ncol] = size(data); %size of data
fprintf(fid,'%5s %12s %5s \r\n','Snr','Dato - Tid','RR_01') %Header
for row = 1:nrow
y = data(row,2); %year
m = data(row,3); % month
d = data(row,4); % date
h = data(row,5); % hour
M = data(row,6); % minute
p = data(row,8); % precipitation
if p > 0
fprintf(fid,'50810,%04d%02d%02d%02d%02d,%.1f \r\n',[y,m,d,h,M,p]);
end
if p==0
% HERE I NEED SOME HELP
end
end
fclose(fid);
What is the code for my desired formatting within the if p==0 condition?
Assuming you've already imported your data as an M-by-3 array named infile, that M is divisible by 60, that the data begin at the first minute of an hour, and that outfile is the array that will be written to the new file:
outfile = [];
while ~isempty(infile)
block = infile(1:60,:); % Take one hour's worth of data
if sum (block(:,3)==0) == 60 % Check whether the last column is all zeros
outfile = cat(1, outfile, [block(1,1:2), 0]); % If so, add a single line to the new array
else
filter = block(:,3)~=0; % Otherwise, pick out only the rows whose last column is nonzero
outfile = cat(1, outfile, block(filter,:)); % Put these rows into the new array
end
infile(1:60,:) = []; % Remove already processed data from original array
end
Then write the entire outfile array to file.
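A minimal sketch of that last step, assuming outfile has the three columns station, timestamp and precipitation. The sample values and the output file name are made up for illustration.

```matlab
% Hypothetical stand-in for the "outfile" array built by the loop above
outfile = [50810 200301010600 0.0;
           50810 200301010938 0.1;
           50810 200301010947 0.1];

outName = fullfile(tempdir, '50810_2003_demo.txt');   % assumed output name
fid = fopen(outName, 'w');
% fprintf consumes its data column by column, so transpose the array
% to make each row of outfile fill one "station,time,value" line.
fprintf(fid, '%d,%d,%.1f\r\n', outfile.');
fclose(fid);
```

The transpose is the important detail here: without it, fprintf would pair up values from different rows.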
In short, I'm having a headache in multiple languages to read a txt file (linked below). My most familiar language is MATLAB so for that reason I'm using that in this example. I've found a way to read this file in ~ 5 minutes, but given I'll have tons and tons of data from my instrument shortly as it measures all day every 30 seconds this just isn't feasible.
I'm looking for a way to quickly read these irregular text files so that going forward I can knock these out with less of a time burden.
You can find my exact data at this link:
http://lb3.pandonia.net/BostonMA/Pandora107s1/L0/Pandora107s1_BostonMA_20190814_L0.txt.bz2
I've been using the "readtable" function in MATLAB and I get the final product I want, but I'm looking to increase the speed.
Below is my code:
clearvars -except pan day1; % Clearing all variables except for the day and instrument variables.
close all;
clc;
pan_mat = [107 139 155 153]; % Matrix of pandora numbers for file-choosing reasons.
pan = pan_mat(pan); % pandora number I'm choosing
pan = num2str(pan); % Turning Pandora number into a string.
%pan = '107'
pandora = strcat('C:\Users\tadams15\Desktop\Folders\Counts\Pandora_Dta\',pan)
% string that designates file location
%date = '90919'
month = '09'; % Month
day2 = strcat('0',num2str(day1)) % Creating a day name for the figure I ultimately produce
cd(pandora)
d2 = strcat('2019',num2str(month),num2str(day2)); % The final date variable for the figure I produce
%file_pan = 'Pandora107s1_BostonMA_20190909_L0';
file_pan = strcat('Pandora',pan,'s1_BostonMA_',d2,'_L0'); % File name string
%Try reading it in line by line?
% Load in as a string and then convert the lines you want as numbers into
% number.
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); %Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%%
A= array2table(A); % Converting Structure matrix back to table
row_num = 0;
pan_mat_2 = zeros(2359,4126);
datetime_mat = zeros(2359,2);
blank = 0;
%% Converting data to proper matrices
[length width] = size(A);
% The matrix below is going through "A" and writing from it to a new
% matrix, "pan_mat_2" which is my final product as well as singling out the
% rows that contain non-number variables I'd like to keep and adding them
% later.
tic
%flag1
for i = 1:length; % Make second number the length of the table, A
blank = 0;
b = table2array(A{i,1});
[rows, columns] = size(b);
if columns > 4120 && columns < 4140
row_num = row_num + 1;
blank = regexp(b(2), 'T', 'split');
blank2 = regexp(blank{1,1}(2), 'Z', 'split');
datetime_mat(row_num,1) = str2double(blank{1,1}(1));
datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
for j = 1:4126;
pan_mat_2(row_num,j) = str2double(b(j));
end
end
end
toc
%flag2
In short, I'm already getting the result I want but the part of the code where I'm writing to a new array "flag 1" to "flag 2" is taking roughly 222 seconds while the entire code only takes about 248 seconds. I'd like to find a better way to create the data there than to write it to a new array and take a whole bunch of time.
Any suggestions?
Note:
There are quite a few improvements you can make for speed, but there are also corrections. You preallocate your final output variable with hard-coded values:
pan_mat_2 = zeros(2359,4126);
But later you populate it in a loop which runs for i = 1:length.
length is the full number of lines picked from the file. In your example file there are only 784 lines. So even if all your lines were valid (ok to be parsed), you would only ever fill the first 784 lines of the 2359 lines you allocated in pan_mat_2. In practice, this file has only 400 valid data lines, so your pan_mat_2 could definitely be smaller.
I know you couldn't know that only 400 lines would be parsed before you parsed them, but you knew from the beginning that you had only 784 lines to parse (you had that information in the variable length). So in cases like this, preallocate to 784 and discard the empty lines afterwards.
Fortunately, the solution I propose does not need to preallocate a larger array and then discard rows. The matrices will end up the right size from the start.
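The "preallocate to the known upper bound, then trim" pattern can be sketched as follows. The validity test and the row contents are hypothetical placeholders, not the real parsing logic.

```matlab
nLines = 784;                       % number of lines read, known up front
out = zeros(nLines, 3);             % preallocate for the worst case
nValid = 0;
for i = 1:nLines
    if mod(i, 2) == 0               % hypothetical stand-in for "line is parseable"
        nValid = nValid + 1;
        out(nValid, :) = [i, i^2, i^3];
    end
end
out = out(1:nValid, :);             % discard the unused rows
```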
The code:
%%
file_pan = 'Pandora107s1_BostonMA_20190814_L0.txt' ;
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); %Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%% Remove lines which won't be parsed
% Count the number of elements in each line
nelem = cell2mat( cellfun( @size , A ,'UniformOutput',0) ) ;
nelem(:,1) = [] ;
% find which lines does not have enough elements to be parsed
idxLine2Remove = ~(nelem > 4120 & nelem < 4140) ;
% remove them from the data set
A(idxLine2Remove) = [] ;
%% Remove nesting in cell array
nLinesToParse = size(A,1) ;
A = reshape( [A{:}] , [], nLinesToParse ).' ;
% now you have a cell array of size [400x4126] cells
%% Now separate the columns with different data type
% Column 1 => [String] identifier
% Column 2 => Timestamp
% Column 3 to 4125 => Numeric values
% Column 4126 => empty cell created during the 'split' operation above
% because of a trailing space character.
LineIDs = A(:,1) ;
TimeStamps = A(:,2) ;
Data = A(:,3:end-1) ; % fetch to "end-1" to discard last empty column
%% now extract the values
% You could do that directly:
% pan_mat = str2double(Data) ;
% but this takes a long time. A computationally much faster way (even if it
% uses more complex code) would be:
dat = strjoin(Data) ; % create a single long string made of all the strings in all the cells
nums = textscan( dat , '%f' , Inf ) ; % call textscan on it (way faster than str2double() )
pan_mat = reshape( cell2mat( nums ) , nLinesToParse ,[] ) ; % reshape to original dimensions
%% timestamps
% convert to character array
strTimeStamps = char(TimeStamps) ;
% convert to matlab own datetime numbering. This will be a lot faster if
% you have operations to do on the time stamps later
ts = datenum(strTimeStamps,'yyyymmddTHHMMSSZ') ;
%% If you really want them the way you had it in your example
strTimeStamps(:,9) = ' ' ; % replace 'T' with ' '
strTimeStamps(:,end) = ' ' ; % replace 'Z' characters with ' '
%then same again, merge into a long string, parse then reshape accordingly
strdate = reshape(strTimeStamps.',1,[]) ;
tmp = textscan( strdate , '%d' , Inf ) ;
datetime_mat = reshape( double(cell2mat(tmp)),2,[]).' ;
The performance:
As you can see on my machine your original code takes ~102 seconds to execute, with 80% of that (81s) spent on calling the function str2double() 3,302,400 times!
My solution, run on the same input file, takes ~5.5 seconds, with half of the time spent on calling strjoin() 3 times.
When you read the code above, note how I limited repeated function calls inside long loops by keeping everything as vectorised as possible.
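To illustrate the idea on a small scale, here is a sketch that converts a cell array of numeric text with a single textscan call and checks it against str2double. The 2-by-3 sample cell is made up; the real data are far larger.

```matlab
Data = {'1.5', '2', '3'; '4', '5.25', '6'};   % hypothetical cell of numeric text

% Flatten in column-major order and join into one long string
joined = strjoin(Data(:).', ' ');
% One textscan call parses every number at once
nums = textscan(joined, '%f');
% Column-major order is preserved, so reshape restores the layout
fast = reshape(nums{1}, size(Data));

% Reference result, one conversion per cell
slow = str2double(Data);
```

Both approaches should give the same matrix; the difference only shows up in run time on large inputs.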
Using the profiler, you can see that you call str2double 3,302,400 times in a run, which takes about 80% of the total time on my PC. That's suboptimal: each call translates only one value, and as far as your code goes you don't need the values as strings again. I added this below your original code:
row_num = 0;
pan_mat_2_b = cell(2359,4126);
datetime_mat_b = cell(2359,2);%not zeros
blank = 0;
tic
%flag1
for i = 1:length % Make second number the length of the table, A
blank = 0;
b = table2array(A{i,1});
[rows, columns] = size(b);
if columns > 4120 && columns < 4140
row_num = row_num + 1;
blank = regexp(b(2), 'T', 'split');
blank2 = regexp(blank{1,1}(2), 'Z', 'split');
%datetime_mat(row_num,1) = str2double(blank{1,1}(1));
%datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
datetime_mat_b(row_num,1) = blank{1,1}(1);
datetime_mat_b(row_num,2) = blank2{1,1}(1);
pan_mat_2_b(row_num,:) = b;
% for j = 1:4126
% pan_mat_2(row_num,j) = str2double(b(j));
% end
end
end
datetime_mat_b = datetime_mat_b(~all(cellfun('isempty',datetime_mat_b),2),:);
pan_mat_2_b=pan_mat_2_b(~all(cellfun('isempty',pan_mat_2_b),2),:);
datetime_mat_b=str2double(string(datetime_mat_b));
pan_mat_2_b=str2double(pan_mat_2_b);
toc
Still not great, but better. If you want to speed this up further, I recommend you take a closer look at the readtable part: you can save quite some time if you read the values in as doubles right from the beginning.
I have a function which generates yout_new(5000,1) at every iteration. I want to store this data in a netCDF file and append the new data generated at every iteration to the existing file, so that after the 2nd iteration the stored variable has size (5000,2). Here is my attempt, which doesn't work. Is there a nice way to do it?
neq=5000;
filename='thrust.nc';
if ~exist(filename, 'file')
%% create file
ncid=netcdf.create(filename,'NC_WRITE');
%%define dimension
tdimID = netcdf.defDim(ncid,'t',...
netcdf.getConstant('NC_UNLIMITED'));
ydimID = netcdf.defDim(ncid,'y',neq);
%%define variable
varid = netcdf.defVar(ncid,'yout','NC_DOUBLE',[ydimID tdimID]);
netcdf.endDef(ncid);
%%put variables from workspace ( i is the iteration)
netcdf.putVar(ncid,varid,[ 0 0 ],[ neq 0],yout_new);
%%close the file
netcdf.close(ncid);
else
%% open the existing file
ncid=netcdf.open(filename,'NC_WRITE');
%Inquire variables
[varname,xtype,dimids,natts] = netcdf.inqVar(ncid,0);
varid = netcdf.inqVarID(ncid,varname);
%Enquire current dimension length
[dimname, dimlen] = netcdf.inqDim(ncid,0);
% Append new data to existing variable.
netcdf.putVar(ncid,varid,dimlen,numel(yout_new),yout_new);
netcdf.close(ncid);
end
MATLAB has easier high-level functions for working with netCDF; read about ncdisp, ncinfo, nccreate, ncread and ncwrite. As for the question: rather than fixing the number of columns at two, I make the column dimension unlimited (Inf), so you can append a new column on every iteration. Check the code below:
N = 3 ; % number of columns
rows = 5000 ; % number of rows
ncfile = 'myfile.nc' ; % my ncfile name
nccreate(ncfile,'yout_new','Dimensions',{'row',rows,'col',Inf},'DeflateLevel',5) ; % create nc file
% generate your data in loop and write to nc file
for i = 1:N
yout_new = rand(rows,1) ;
ncwrite(ncfile,'yout_new',yout_new,[1,i]) ;
end
Please note that it is not mandatory to make the number of columns unlimited; you can fix it to your desired number instead of Inf.
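To check that the variable really grows along the unlimited dimension, something like the following sketch could be used. The file name, the variable name yout and the small 4-row size are assumptions chosen for a quick test, not part of the original setup.

```matlab
ncfile = fullfile(tempdir, 'append_demo.nc');   % hypothetical file name
if exist(ncfile, 'file')
    delete(ncfile);                              % nccreate fails if the variable already exists
end
nccreate(ncfile, 'yout', 'Dimensions', {'row', 4, 'col', Inf});

for k = 1:3
    ncwrite(ncfile, 'yout', k * ones(4, 1), [1 k]);   % write column k
end

info = ncinfo(ncfile, 'yout');
disp(info.Size)                                  % current size of the variable
data = ncread(ncfile, 'yout');                   % read everything back
```

Each ncwrite call starts at row 1 of column k, so the unlimited dimension is extended by one column per iteration.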
I have a large number of text files that I have to read, find the max value for a certain column, and the corresponding time. The for loop for finding these values works fine, but my problem is writing a text file that shows the three variables I need (thisfilename, M, and wavetime) for each iteration of the for loop.
Output_FileName_MaxWaveHeights = ['C:\Users\jl44459\Desktop\QGIS_and_Basement\BASEMENT\Mesh_5_2045\Run_A\','MaxWaveHeights.txt'];
writefile = fopen(Output_FileName_MaxWaveHeights,'a');
dinfo = dir('*.dat');
for K = 1 : length(dinfo)
thisfilename = dinfo(K).name; %just the name of the file
fileID = fopen(thisfilename); %creates numerical ID for the file name
thisdata = textscan(fileID,'%f64%f64%f64%f64%f64%f64%f64',500,'HeaderLines',1); %load just this file
thisdataM = cell2mat(thisdata); %transforms file from cell array to matrix
[M,I] = max(thisdataM(:,5)); %finds max WSE and row it's in
wavetime = 2*(I-1); %converts column of max WSE to time
fprintf(writefile,'%s %8.4f %4.0f \r\n',thisfilename,M,wavetime);
fclose(fileID); %closes file to make space for next one
end
The text file ends up just giving me the values for one iteration instead of all of them. I was able to use displaytable as a workaround, but then I have problems writing "thisfilename", which includes non-numerical characters.
Although I am not able to reproduce the issue with the code provided, a possible solution might be to write to the file outside of the loop and to close the file afterwards:
Output_FileName_MaxWaveHeights = ['C:\Users\jl44459\Desktop\QGIS_and_Basement\BASEMENT\Mesh_5_2045\Run_A\','MaxWaveHeights.txt'];
writefile = fopen(Output_FileName_MaxWaveHeights,'a');
s = [];
dinfo = dir('*.dat');
for K = 1 : length(dinfo)
thisfilename = dinfo(K).name; %just the name of the file
fileID = fopen(thisfilename); %creates numerical ID for the file name
thisdata = textscan(fileID,'%f64%f64%f64%f64%f64%f64%f64',500,'HeaderLines',1); %load just this file
thisdataM = cell2mat(thisdata); %transforms file from cell array to matrix
[M,I] = max(thisdataM(:,5)); %finds max WSE and row it's in
wavetime = 2*(I-1); %converts column of max WSE to time
s = [s, sprintf('%s %8.4f %4.0f \r\n',thisfilename,M,wavetime)]; %append the formatted line to the buffer
fclose(fileID); %closes file to make space for next one
end
fprintf(writefile,'%s',s); %write all collected lines at once
fclose(writefile);
Solved--it was simply me forgetting to close the output file after the loop. Thanks for the help!
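One way to make a forgotten fclose less likely is to tie the close to an onCleanup object. This is only a sketch: the output path and the sample names and values are hypothetical, and the loop stands in for the real per-file processing.

```matlab
outName = fullfile(tempdir, 'MaxWaveHeights_demo.txt');   % hypothetical output path
writefile = fopen(outName, 'w');
% fclose runs when "closer" is cleared or goes out of scope, even after an error
closer = onCleanup(@() fclose(writefile));

names  = {'a.dat', 'b.dat'};     % made-up file names
maxima = [1.25 3.5];             % made-up maxima
times  = [4 10];                 % made-up times
for K = 1:numel(names)
    fprintf(writefile, '%s %8.4f %4.0f \r\n', names{K}, maxima(K), times(K));
end

clear closer                     % triggers fclose and flushes the output
```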
I have a problem making some changes to a *.txt file.
The data have this format:
11,2003,1,1,9,38,40.38,1
11,2003,1,1,9,47,2.5,1
11,2003,1,1,10,34,43.88,1
11,2003,1,1,10,38,14.5,1
11,2003,1,1,12,47,13.2,1
The columns are station number, year, month, day, hour, minute, seconds and precipitation (1 = 0.1 mm).
Times with precipitation = 0 are not included in the list, so hours without rainfall are missing entirely. For each such hour I want to add one entry for the first minute of the hour to the new file, to show that measurements were made. Like this:
50810,200301010938,0.1
50810,200301010947,0.1
50810,200301011034,0.1
50810,200301011038,0.1
50810,200301011100,0.0 <---- This is what I need to get in the New file
50810,200301011247,0.1
(New station number, date/time, precipitation)
For now I've come up with this:
clear all
data = load('jan-31des_2003.txt'); %opens file with data
fid=fopen('50810_2003','w'); %opens empty file to write
[nrow, ncol] = size(data); %size of data
fprintf(fid,'%5s %12s %5s \r\n','Snr','Dato - Tid','RR_01') %Header
for row = 1:nrow
y = data(row,2); %year
m = data(row,3); % month
d = data(row,4); % date
h = data(row,5); % hour
M = data(row,6); % minute
p = data(row,8); % precipitation
p = p*0.1
end
fclose(fid);
You can use an if command to check if the next hour you are looking at is more than a single hour ahead from the last one. If that's the case you can create a new entry at this point:
if data(row,5) > (data(row-1,5)+1)
y = data(row,2); %year
m = data(row,3); % month
d = data(row,4); % date
h = data(row-1,5)+1; % first skipped hour (the hour after the previous entry)
M = 00; % minute
p = 0; % precipitation
end
After this part you will need to check again whether there is another 'skipped' hour, and so on. You should also replace your for-loop with a while-loop that runs until you reach the last entry in your dataset.
Try to implement this idea and come back to us with your code in case it didn't work out.
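For reference, here is one possible sketch of the whole idea. It swaps the row-by-row hour comparison for an absolute hour index, so gaps that span midnight are handled too. The sample rows come from the question; the variable names and the choice to collect lines in a cell array are assumptions. Empty hours before the first or after the last record are not handled.

```matlab
% station, year, month, day, hour, minute, second, precipitation (1 = 0.1 mm)
data = [11 2003 1 1  9 38 40.38 1;
        11 2003 1 1  9 47  2.5  1;
        11 2003 1 1 10 34 43.88 1;
        11 2003 1 1 10 38 14.5  1;
        11 2003 1 1 12 47 13.2  1];
newStation = 50810;              % new station number from the question

% Absolute hour index of every row: whole days * 24 + hour of day
hourIdx = datenum(data(:,2), data(:,3), data(:,4)) * 24 + data(:,5);

lines = {};
for row = 1:size(data, 1)
    if row > 1
        % One zero line for every full hour skipped since the previous row
        for hh = hourIdx(row-1)+1 : hourIdx(row)-1
            dv = datevec(floor(hh / 24));        % year, month, day of that hour
            lines{end+1} = sprintf('%d,%04d%02d%02d%02d00,0.0', ...
                newStation, dv(1), dv(2), dv(3), mod(hh, 24)); %#ok<SAGROW>
        end
    end
    stamp = sprintf('%04d%02d%02d%02d%02d', data(row, 2:6));
    lines{end+1} = sprintf('%d,%s,%.1f', newStation, stamp, data(row, 8) * 0.1); %#ok<SAGROW>
end
fprintf('%s\n', lines{:});       % print; use fprintf(fid, ...) to write a file
```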
I'm trying to load the following ascii file into MATLAB using load()
% some comment
1 0xc661
2 0xd661
3 0xe661
(This is actually a simplified file. The actual file I'm trying to load contains an undefined number of columns and an undefined number of comment lines at the beginning, which is why the load function was attractive)
For some strange reason, I obtain the following:
K>> data = load('testMixed.txt')
data =
1 50785
2 58977
3 58977
I've observed that the problem occurs anytime there's a "d" in the hexadecimal number.
Direct hex2dec conversion works properly:
K>> hex2dec('d661')
ans =
54881
importdata seems to have the same conversion issue, and so does the ImportWizard:
K>> importdata('testMixed.txt')
ans =
1 50785
2 58977
3 58977
Is that a bug, am I using the load function in some prohibited way, or is there something obvious I'm overlooking?
Are there workarounds around the problem, save from reimplementing the file parsing on my own?
Edited my input file to better reflect my actual file format. I had oversimplified it a bit in my original question.
"GOLF" ANSWER:
This starts with the answer from mtrw and shortens it further:
fid = fopen('testMixed.txt','rt');
data = textscan(fid,'%s','Delimiter','\n','MultipleDelimsAsOne','1',...
'CommentStyle','%');
fclose(fid);
data = strcat(data{1},{' '});
data = sscanf([data{:}],'%i',[sum(isspace(data{1})) inf]).';
PREVIOUS ANSWER:
My first thought was to use TEXTSCAN, since it has an option that allows you to ignore certain lines as comments when they start with a given character (like %). However, TEXTSCAN doesn't appear to handle numbers in hexadecimal format well. Here's another option:
fid = fopen('testMixed.txt','r'); % Open file
% First, read all the comment lines (lines that start with '%'):
comments = {};
position = 0;
nextLine = fgetl(fid); % Read the first line
while strcmp(nextLine(1),'%')
comments = [comments; {nextLine}]; % Collect the comments
position = ftell(fid); % Get the file pointer position
nextLine = fgetl(fid); % Read the next line
end
fseek(fid,position,-1); % Rewind to beginning of last line read
% Read numerical data:
nCol = sum(isspace(nextLine))+1; % Get the number of columns
data = fscanf(fid,'%i',[nCol inf]).'; % Note '%i' works for all integer formats
fclose(fid); % Close file
This will work for an arbitrary number of comments at the beginning of the file. The computation to get the number of columns was inspired by Jacob's answer.
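A quick check of the '%i' behaviour on a single line, using the value from the question. The sample string is made up.

```matlab
% '%i' picks the base from the prefix, so 0x... is read as hexadecimal
sampleLine = '2 0xd661 456';                 % hypothetical data line
vals = sscanf(sampleLine, '%i').';           % row vector of parsed integers
disp(vals)
```

One caveat, as far as I know: '%i' also treats a leading 0 as an octal prefix, so zero-padded decimal fields would need '%d' instead.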
New:
This is the best I could come up with. It should work for any number of comment lines and columns. You'll have to do the rest yourself if there are strings, etc.
% Define the characters representing the start of the commented line
% and the delimiter
COMMENT_START = '%';
DELIMITER = ' ';
% Open the file
fid = fopen('testMixed.txt');
% Read each line till we reach the data
l = COMMENT_START;
while(l(1)==COMMENT_START)
l = fgetl(fid);
end
% Compute the number of columns
cols = sum(l==DELIMITER)+1;
% Split the first line
split_l = regexp(l,' ','split');
% Read all the data
A = textscan(fid,'%s');
% Compute the number of rows
rows = numel(A{:})/cols;
% Close the file
fclose(fid);
% Assemble all the data into a matrix of cell strings
DATA = [split_l ; reshape(A{:},[cols rows])'];
% Recognize each column and process accordingly
% by analyzing each element in the first row
numeric_data = zeros(size(DATA));
for i=1:cols
str = DATA(1,i);
% If there is no '0x' present
if isempty(strfind(str{1},'0x'))
% This is a number
numeric_data(:,i) = str2num(char(DATA(:,i)));
else
% This is a hexadecimal number
col = char(DATA(:,i));
numeric_data(:,i) = hex2dec(col(:,3:end));
end
end
% Display the data
format short g;
disp(numeric_data)
This works for data like this:
% Comment 1
% Comment 2
1.2 0xc661 10 0xa661
2 0xd661 20 0xb661
3 0xe661 30 0xc661
Output:
1.2 50785 10 42593
2 54881 20 46689
3 58977 30 50785
OLD:
Yeah, I don't think LOAD is the way to go. You could try:
a = char(importdata('testHexa.txt'));
a = hex2dec(a(:,3:end));
This is based on both gnovice's and Jacob's answers, and is a "best of breed".
For files like:
% this is my comment
% this is my other comment
1 0xc661 123
2 0xd661 456
% surprise comment
3 0xe661 789
4 0xb661 1234567
(where the number of columns within the file MUST be the same, but not known ahead of time, and all comments denoted by a '%' character), the following code is fast and easy to read:
f = fopen('hexdata.txt', 'rt');
A = textscan(f, '%s', 'Delimiter', '\n', 'MultipleDelimsAsOne', '1', 'CollectOutput', '1', 'CommentStyle', '%');
fclose(f);
A = A{1};
data = sscanf(A{1}, '%i')';
data = repmat(data, length(A), 1);
for ctr = 2:length(A)
data(ctr,:) = sscanf(A{ctr}, '%i')';
end