Scan a document, find a line, extract numbers below


I have the following .csv file:
Marker,"loop_-105"
Id,1
"Time (Seconds)","Sig (ts)"
"1.920576","3.98"
"1.957907","31.58"
"1.9912","34.422"
...
Marker,"loop_-102.1"
Id,1
"Time (Seconds)","Sig (ts)"
"4.920576","3.98"
"2.07","31.58"
"1.9912","34.422"
...
I want to open the file, extract the values -105 and -102.1, and store them, along with the quoted values, in two matrices. I have tried textscan and regexp, but neither quite worked:
str2double(regexp(filetext, '(?<=loop_[^0-9]*)[0-9]*\.?[0-9]+', 'match'));
find(~cellfun(@isempty,strfind(new_measurement,'Time (Seconds)","Sig (ts)')))

Here is one attempt; I hope I understood correctly what you'd like to achieve.
I used the following file, data.csv:
Marker,"loop_-105"
Id,1
"Time (Seconds)","Sig (ts)"
"1.920576","3.98"
"1.957907","31.58"
"1.9912","34.422"
"1.920576","3.98"
"1.957907","31.58"
"1.9912","34.422"
Marker,"loop_-102.1"
Id,1
"Time (Seconds)","Sig (ts)"
"4.920576","3.98"
"2.07","31.58"
"1.9912","34.422"
and the following script to read it:
filename = './data.csv';
% read the file line by line (the newline delimiter keeps lines with spaces intact)
fid = fopen(filename, 'r');
rawdata = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
% get the single lines
lines = rawdata{1};
% two regexps, one for the Marker, one for the values
% (the decimal point is escaped so that it only matches a literal '.')
startregexp = 'Marker,"loop_(?<startnum>-?\d+\.?\d*)"';
valueregexp = '"(?<v1>-?\d+\.?\d*)","(?<v2>-?\d+\.?\d*)"';
% Data is stored in this cell
result = {};
% Loop through the lines
for lcount = 1:numel(lines)
    start = regexp(lines{lcount}, startregexp, 'names');
    if ~isempty(start)
        % The Marker regexp matched, start a new cell entry
        result{end+1} = struct(...
            'Loop', str2double(start.startnum), ...
            'values', []);
        continue
    end
    values = regexp(lines{lcount}, valueregexp, 'names');
    if ~isempty(values)
        % The value regexp matched, add the data as a new row
        result{end}.values(end+1, :) = [str2double(values.v1) str2double(values.v2)];
    end
end
result is:
>> result
result =
1×2 cell array
{1×1 struct} {1×1 struct}
>> result{1}
ans =
struct with fields:
Loop: -105
values: [6×2 double]
>> result{1}.values
ans =
1.9206 3.9800
1.9579 31.5800
1.9912 34.4220
1.9206 3.9800
1.9579 31.5800
1.9912 34.4220
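The question asked for plain matrices rather than a cell of structs. A minimal sketch of that last step, assuming a result cell shaped like the one above (the sample data here is shortened, and the names loopValues, allValues and blockIndex are only illustrative):

```matlab
% Example data shaped like the result cell above (values shortened for illustration)
result = { ...
    struct('Loop', -105,   'values', [1.920576 3.98; 1.957907 31.58]), ...
    struct('Loop', -102.1, 'values', [4.920576 3.98; 2.07 31.58; 1.9912 34.422])};

% One loop value per block, as a 1-by-2 row vector
loopValues = cellfun(@(s) s.Loop, result);

% All value rows stacked into one N-by-2 matrix
allValues = cell2mat(cellfun(@(s) s.values, result(:), 'UniformOutput', false));

% Optional: the block each row came from, so the two matrices can be related
blockIndex = cell2mat(arrayfun(@(k) k * ones(size(result{k}.values, 1), 1), ...
    (1:numel(result))', 'UniformOutput', false));
```

Keeping blockIndex next to allValues is one possible way to preserve which marker each row belongs to after the blocks are stacked.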

Related

MATLAB reads UNICODE CSV with spaces between characters

I am using the fgetl command to read a .csv file but instead of returning the results I wanted as:
"HIST",1,1,27,PWH,"1"
it returned with additional space between each character:
" H I S T " , 1 , 1 , 2 7 , P W H , " 1 "
I know that I can replace the space with regexprep, but my file contains billions of lines so the added expression might consume considerably more time. I had a feeling that this is a unicode issue and someone pointed out the same issue when he used Java and it was related to unicode. I wonder if anyone knows a better way to deal with the problem in MATLAB?
Update:
It must be a Unicode issue, because the .csv file is the output of another program, and when I read it using fgetl the spaces are added. However, if I save the .csv file again using Excel and then read it with fgetl, it returns the results I want.
I am not able to provide an example because the .csv file is very large and I cannot make a small sample because when I open and save it from Excel, this problem is gone.

For the purpose of demonstration, let's consider a demo file - demo.csv:
"GIST",1,6,17,PWH,"1"
"FIST",0,4,72,WPH,"2"
"MIST",3,2,27,WHP,"3"
You have some options:
textscan (for any text file with a known structure):
fID = fopen('demo.csv');
C = textscan(fID,'%s%d%d%d%s%s','Delimiter',{',','"'},'MultipleDelimsAsOne',1);
fclose(fID);
Which results in:
C =
{3x1 cell} [3x1 int32] [3x1 int32] [3x1 int32] {3x1 cell} {3x1 cell}
Import helper + generate script (AKA overkill is an understatement):
Which results in:
%% Import data from text file.
% Script for importing data from the following text file:
%
% F:\demo.csv
%
% To extend the code to different selected data or a different text file, generate a
% function instead of a script.
% Auto-generated by MATLAB on 2016/04/20 19:51:32
%% Initialize variables.
filename = 'F:\demo.csv';
delimiter = ',';
%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%q%q%q%q%q%q%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this code. If an error
% occurs for a different file, try regenerating the code from the Import Tool.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'ReturnOnError', false);
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[2,3,4,6]
% Converts strings in the input cell array to numbers. Replaced non-numeric strings with
% NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1);
% Create a regular expression to detect and remove non-numeric prefixes and suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData{row}, regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if any(numbers==',');
thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
if isempty(regexp(numbers, thousandsRegExp, 'once'));
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric strings to numbers.
if ~invalidThousandsSeparator;
numbers = textscan(strrep(numbers, ',', ''), '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch me
end
end
end
%% Split data into numeric and cell columns.
rawNumericColumns = raw(:, [2,3,4,6]);
rawCellColumns = raw(:, [1,5]);
%% Allocate imported array to column variable names
GIST = rawCellColumns(:, 1);
VarName2 = cell2mat(rawNumericColumns(:, 1));
VarName3 = cell2mat(rawNumericColumns(:, 2));
VarName4 = cell2mat(rawNumericColumns(:, 3));
PWH = rawCellColumns(:, 2);
VarName6 = cell2mat(rawNumericColumns(:, 4));
%% Clear temporary variables
clearvars filename delimiter formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp me rawNumericColumns rawCellColumns;
csvread (for numeric values only; which means it is not applicable here).

I happened to have the same issue. I opened a .csv file using textscan and it added one whitespace on both sides of every character. I also noticed that, when opening the variable storing the read data, the font was different from the usual one in Matlab.
We managed to solve this issue by opening the .csv file in Notepad++ and changing the encoding to UTF-8. That solved the problem.
Hope it helps!
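If converting the file by hand is not practical, the same decoding can be sketched in MATLAB itself. This is only a sketch under the assumption that the file is UTF-16LE with a byte order mark, which would explain the "space" (a zero byte) after every character; the demo file and all variable names are made up for illustration:

```matlab
% Demo setup: write a small UTF-16LE file with a byte order mark (FF FE)
fname = [tempname '.csv'];
sample = sprintf('"HIST",1,1,27,PWH,"1"\n"GIST",1,6,17,PWH,"1"\n');
bytes = [255 254 reshape([double(sample); zeros(1, numel(sample))], 1, [])];
fid = fopen(fname, 'w');
fwrite(fid, bytes, 'uint8');
fclose(fid);

% Read the raw bytes and combine each low/high byte pair into one code unit
fid = fopen(fname, 'r');
raw = fread(fid, [1 Inf], 'uint8=>double');
fclose(fid);
codes = raw(1:2:end) + 256 * raw(2:2:end);
if ~isempty(codes) && codes(1) == 65279   % drop the byte order mark (U+FEFF)
    codes(1) = [];
end
txt = char(codes);

% Split into lines; the extra spaces are gone
fileLines = strsplit(strtrim(txt), sprintf('\n'));
```

For a file as large as the one in the question, reading everything at once is probably not feasible; the same byte-pair arithmetic could be applied to chunks read with fread, as long as each chunk has an even number of bytes.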

Edit generated 'importdata' function to import all files in directory in Matlab

Seeking help from skillful Matlab users!
I'm kind of new to Matlab and hope somebody has the time to help me. I need to import some .txt files from a directory. I have found a way to do this through the Import Tool. Some of the data use commas instead of dots as the decimal separator, so importdata will not work, but the Import Tool does.
So I'm wondering (and hoping) whether it is possible to edit the generated function so that it imports all the files in the directory the same way the single file is imported. I want each file to be imported as a matrix variable (double), and I want to import all the files in one process (loop). There are many files, and each has some 100 000 lines or so.
If someone sees an easy way to do this, I would appreciate the help. Please keep the explanation at a low level, as I'm quite a novice. I get the following function using the Import Tool:
function Streaming0x00x00158D00000E04621709201405 = importfile1(filename, startRow, endRow)
%IMPORTFILE1 Import numeric data from a text file as a matrix.
% STREAMING0X00X00158D00000E04621709201405 = IMPORTFILE1(FILENAME) Reads
% data from text file FILENAME for the default selection.
%
% STREAMING0X00X00158D00000E04621709201405 = IMPORTFILE1(FILENAME,
% STARTROW, ENDROW) Reads data from rows STARTROW through ENDROW of text
% file FILENAME.
%
% Example:
% Streaming0x00x00158D00000E04621709201405 =
% importfile1('Streaming_0_x_0_0_x_00158D00000E0462_17-09-2014_05.32.24_part000.txt',
% 17, 137834);
%
% See also TEXTSCAN.
% Auto-generated by MATLAB on 2015/02/04 09:28:07
%% Initialize variables.
delimiter = ';';
if nargin<=2
startRow = 17;
endRow = inf;
end
%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this
% code. If an error occurs for a different file, try regenerating the code
% from the Import Tool.
textscan(fileID, '%[^\n\r]', startRow(1)-1, 'ReturnOnError', false);
dataArray = textscan(fileID, formatSpec, endRow(1)-startRow(1)+1, 'Delimiter', delimiter, 'ReturnOnError', false);
for block=2:length(startRow)
frewind(fileID);
textscan(fileID, '%[^\n\r]', startRow(block)-1, 'ReturnOnError', false);
dataArrayBlock = textscan(fileID, formatSpec, endRow(block)-startRow(block)+1, 'Delimiter', delimiter, 'ReturnOnError', false);
for col=1:length(dataArray)
dataArray{col} = [dataArray{col};dataArrayBlock{col}];
end
end
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,2]
% Converts strings in the input cell array to numbers. Replaced non-numeric
% strings with NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1);
% Create a regular expression to detect and remove non-numeric prefixes and
% suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\.]*)+[\,]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\.]*)*[\,]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData{row}, regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if any(numbers=='.');
thousandsRegExp = '^\d+?(\.\d{3})*\,{0,1}\d*$';
if isempty(regexp(thousandsRegExp, '.', 'once'));
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric strings to numbers.
if ~invalidThousandsSeparator;
numbers = strrep(numbers, '.', '');
numbers = strrep(numbers, ',', '.');
numbers = textscan(numbers, '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch me
end
end
end
%% Replace non-numeric cells with NaN
R = cellfun(@(x) ~isnumeric(x) && ~islogical(x),raw); % Find non-numeric cells
raw(R) = {NaN}; % Replace non-numeric cells
%% Create output variable
Streaming0x00x00158D00000E04621709201405 = cell2mat(raw);
If something is unclear, please comment.
All help is useful, thanks :)

If all files have the same format, you can put the filenames in a cell array (NOT a plain char array, which would concatenate the names into one long string). Then you can loop over the cell array. For instance:
fname_arr = {'file1.txt','file2.txt'}; % your filenames go here
allDataArray = cell(size(fname_arr));
for k = 1:numel(fname_arr)
    filename = fname_arr{k};
    %% Open the text file.
    fileID = fopen(filename,'r'); % start of the relevant part of your codeblock
    <...> % omitting the stuff in the middle of the code
    fclose(fileID); % end of the relevant part of your codeblock
    allDataArray{k} = dataArray;
end
Then allDataArray is a cell array whose kth element contains the dataArray obtained from file fname_arr{k}.

Ok, tried something here. Implemented this in my function:
output=(dir_output);
for k=1:length(output);
filename = output{k}.name;
%% Open the text file.
fileID = fopen(filename,'r');
where 'dir_output' is a struct, containing all the file names in the Directory. Also put in:
%% Close the text file.
fclose(fileID);
allDataArray{k} = DataArray;
end
Get this as error:
>> function1
Undefined function or variable 'dir_output'.
Error in function1 (line 30)
output=(dir_output);
Why???
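The error says that dir_output was never assigned inside the function: variables from the base workspace are not visible there, so the listing has to be created in the function or passed in as an argument. Note also that dir returns a struct array, which is indexed with (k) rather than {k}. A minimal sketch of the whole loop, assuming a temporary demo folder and a made-up importFcn that merely stands in for the generated importfile1:

```matlab
% Demo setup: a temporary folder with two small files that use decimal commas
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'a.txt'), 'w'); fprintf(fid, '1,5;2,5\n3,5;4,5\n'); fclose(fid);
fid = fopen(fullfile(folder, 'b.txt'), 'w'); fprintf(fid, '10,25;20,75\n'); fclose(fid);

% Hypothetical stand-in for the generated importfile1: swap decimal commas
% for dots and read two semicolon-separated columns
importFcn = @(f) sscanf(strrep(fileread(f), ',', '.'), '%f;%f', [2 Inf])';

% dir returns a struct array: index it with (k), not {k}
listing = dir(fullfile(folder, '*.txt'));
[~, order] = sort({listing.name});   % make the file order explicit
listing = listing(order);

allData = cell(numel(listing), 1);
for k = 1:numel(listing)
    allData{k} = importFcn(fullfile(folder, listing(k).name));
end
```

In the real case, the call to importFcn would be replaced by a call to importfile1 with the full file path, leaving the generated function itself unchanged.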

Text Scanning to read in unknown number of variables and unknown number of runs

I am trying to read in a csv file which will have the format
Var1 Val1A Val1B ... Val1Q
Var2 Val2A Val2B ... Val2Q
...
And I will not know ahead of time how many variables (rows) or how many runs (columns) will be in the file.
I have been trying to get text scan to work but no matter what I try I cannot get either all the variable names isolated or a rows by columns cell array. This is what I've been trying.
fID = fopen(strcat(pwd,'/',inputFile),'rt');
if fID == -1
disp('Could not find file')
return
end
vars = textscan(fID, '%s,%*s','delimiter','\n');
fclose(fID);
Does anyone have a suggestion?

If the file has the same number of columns in each row (you just don't know how many to begin with), try the following.
First, find the number of columns by parsing just the first row, then parse the full file:
% Open the file, get the first line
fid = fopen('myfile.txt');
line = fgetl(fid);
fclose(fid);
tmp = textscan(line, '%s');
% The number of tokens on the first line tells you how many columns there are
n = numel(tmp{1});
% Now scan the file
fid = fopen('myfile.txt');
tmp = textscan(fid, repmat('%s ', [1, n]));
fclose(fid);

For any given file, are all the lines equal length? If they are, you could start by reading in the first line and use that to count the number of fields and then use textscan to read in the file.
fID = fopen(strcat(pwd,'/',inputFile),'rt');
firstLine = fgetl(fID);
numFields = length(strfind(firstLine,' ')) + 1;
fclose(fID);
formatString = repmat('%s',1,numFields);
fID = fopen(strcat(pwd,'/',inputFile),'rt');
vars = textscan(fID, formatString, 'Delimiter', ' ');
fclose(fID);
Now you will have a cell array whose first entry holds the variable names and whose other entries hold the observations.
In this case I assumed the delimiter was space even though you said it was a csv file. If it is really commas, you can change the code accordingly.
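For the comma-separated case, the same idea can be sketched as follows. This assumes that every row has the same number of fields and that everything after the first field is numeric; the demo file and the names varNames and values are illustrative only:

```matlab
% Demo setup: a small comma-separated file with one variable per row
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Var1,1.5,2.5,3.5\nVar2,4,5,6\n');
fclose(fid);

fid = fopen(fname, 'rt');
firstLine = fgetl(fid);                          % peek at the first row
numFields = numel(strsplit(firstLine, ','));     % name + number of runs
frewind(fid);                                    % go back to the start of the file
fmt = ['%s' repmat('%f', 1, numFields - 1)];     % one name, then numeric runs
cols = textscan(fid, fmt, 'Delimiter', ',');
fclose(fid);

varNames = cols{1};      % cell array of variable names, one per row
values = [cols{2:end}];  % rows-by-runs numeric matrix
```

Using frewind avoids closing and reopening the file between the two passes.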

Reading CSV with mixed type data

I need to read the following csv file in MATLAB:
2009-04-29 01:01:42.000;16271.1;16271.1
2009-04-29 02:01:42.000;2.5;16273.6
2009-04-29 03:01:42.000;2.599609;16276.2
2009-04-29 04:01:42.000;2.5;16278.7
...
I'd like to have three columns:
timestamp;value1;value2
I tried the approaches described here:
Reading date and time from CSV file in MATLAB
modified as:
filename = 'prova.csv';
fid = fopen(filename, 'rt');
a = textscan(fid, '%s %f %f', ...
'Delimiter',';', 'CollectOutput',1);
fclose(fid);
But it returns a 1x2 cell whose first element is a{1}='ÿþ2'; the others are empty.
I had also tried to adapt to my case the answers to these questions:
importing data with time in MATLAB
Read data files with specific format in matlab and convert date to matal serial time
but I didn't succeed.
How can I import that csv file?
EDIT: After the answer of @macduff, I tried copy-pasting the data reported above into a new file and using:
a = textscan(fid, '%s %f %f','Delimiter',';');
and it works.
Unfortunately that didn't solve the problem because I have to process csv files generated automatically, which seems to be the cause of the strange MATLAB behavior.

What about trying:
a = textscan(fid, '%s %f %f','Delimiter',';');
For me I get:
a =
{4x1 cell} [4x1 double] [4x1 double]
So each element of a corresponds to a column in your csv file. Is this what you need?
Thanks!

Seems you're going about it the right way. The example you provide poses no problems here, I get the output you desire. What's in the 1x2 cell?
If I were you I'd try again with a smaller subset of the file, say 10 lines, and see if the output changes. If yes, then try 100 lines, etc., until you find where the 4x1 cell + 4x2 array breaks down into the 1x2 cell. It might be that there's an empty line or a single empty field or whatever, which forces textscan to collect data in an additional level of cells.
Note that 'CollectOutput',1 will collect the last two columns into a single array, so you'll end up with 1 cell array of 4x1 containing strings, and 1 array of 4x2 containing doubles. Is that indeed what you want? Otherwise, see @macduff's post.
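One thing worth checking: the characters 'ÿþ' reported in the question are what the bytes FF FE look like when shown as Latin-1 text, and FF FE is the byte order mark of a UTF-16 little-endian file. A minimal sketch of such a check, using a made-up demo file in place of the automatically generated CSV:

```matlab
% Demo setup: a file that starts with the UTF-16LE byte order mark
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fwrite(fid, [255 254 double('2') 0], 'uint8');
fclose(fid);

% Inspect the first two bytes of the file
fid = fopen(fname, 'r');
head = fread(fid, 2, 'uint8')';
fclose(fid);

isUtf16le = isequal(head, [255 254]);   % FF FE
isUtf16be = isequal(head, [254 255]);   % FE FF
```

If isUtf16le is true for the real file, textscan is being handed two-byte characters, which would be consistent with the output described in the question; converting the file to a single-byte encoding or UTF-8 first would then be a reasonable next step.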

I've had to parse large files like this, and I found I didn't like textscan for this job. I just use a basic while loop to parse the file, and I use datevec to extract the timestamp components into a 6-element time vector.
%% Optional: preallocate for speed if you have large files
n = 1000; %% <# of rows in file - if known>
timestamp = zeros(n,6);
value1 = zeros(n,1);
value2 = zeros(n,1);
fid = fopen(fname, 'rt');
if fid < 0
    error('Error opening file %s\n', fname); % exit point
end
cntr = 0;
while true
    tline = fgetl(fid); %% get one line
    if ~ischar(tline), break; end % break out of loop at end of file
    cntr = cntr + 1;
    splitLine = strsplit(tline, ';'); %% split the line on ; delimiters
    timestamp(cntr,:) = datevec(splitLine{1}, 'yyyy-mm-dd HH:MM:SS.FFF'); %% using datevec to parse time gives you a standard timestamp vector
    value1(cntr) = str2double(splitLine{2}); %% convert the text fields to numbers
    value2(cntr) = str2double(splitLine{3});
end
fclose(fid);
%% Concatenate at the end if you like (keep only the rows actually read)
result = [timestamp(1:cntr,:) value1(1:cntr) value2(1:cntr)];

Outputting cell array to CSV file ( MATLAB )

I've created a m x n cell array using cell(m,n), and filled each of the cells with arbitrary strings.
How do I output the cell array as a CSV file, where each cell in the array is a cell in the CSV 'spreadsheet'.
I've tried using cell2CSV, but I get errors ...
Error in ==> cell2csv at 71
fprintf(datei, '%s', var);
Caused by:
Error using ==> dlmwrite at 114
The input cell array cannot be converted to a matrix.
Any guidance will be well received :)

Here is a somewhat vectorized solution:
%# lets create a cellarray filled with random strings
C = cell(10,5);
chars = char(97:122);
for i=1:numel(C)
C{i} = chars(ceil(numel(chars).*rand(1,randi(10))));
end
%# build cellarray of lines, values are comma-separated
[m n] = size(C);
CC = cell(m,n+n-1);
CC(:,1:2:end) = C;
CC(:,2:2:end,:) = {','};
CC = arrayfun(@(i) [CC{i,:}], 1:m, 'UniformOutput',false)'; %'
%# write lines to file
fid = fopen('output.csv','wt');
fprintf(fid, '%s\n',CC{:});
fclose(fid);
The strings:
C =
'rdkhshx' 'egxpnpvnfl' 'qnwcxcndo' 'gubkafae' 'yvsejeaisq'
'kmsvpoils' 'zqssj' 't' 'ge' 'lhntto'
'sarlldvig' 'oeoslv' 'xznhcnptc' 'px' 'qdnjcdfr'
'jook' 'jlkutlsy' 'neyplyr' 'fmjngbleay' 'sganh'
'nrys' 'sckplbfv' 'vnorj' 'ztars' 'xkarvzblpr'
'vdbce' 'w' 'pwk' 'ofufjxw' 'qsjpdbzh'
'haoc' 'r' 'lh' 'ipxxp' 'zefiyk'
'qw' 'fodrpb' 'vkkjd' 'wlxa' 'dkj'
'ozonilmbxb' 'd' 'clg' 'seieik' 'lc'
'vkpvx' 'l' 'ldm' 'bohgge' 'aouglob'
The resulting CSV file:
rdkhshx,egxpnpvnfl,qnwcxcndo,gubkafae,yvsejeaisq
kmsvpoils,zqssj,t,ge,lhntto
sarlldvig,oeoslv,xznhcnptc,px,qdnjcdfr
jook,jlkutlsy,neyplyr,fmjngbleay,sganh
nrys,sckplbfv,vnorj,ztars,xkarvzblpr
vdbce,w,pwk,ofufjxw,qsjpdbzh
haoc,r,lh,ipxxp,zefiyk
qw,fodrpb,vkkjd,wlxa,dkj
ozonilmbxb,d,clg,seieik,lc
vkpvx,l,ldm,bohgge,aouglob
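The interleaving of commas above can also be sketched with strjoin, which joins the cells of one row directly. This is only a variation under the same assumption that every cell holds a string; the small example array and the file name are made up, and fields that themselves contain commas or quotes are not escaped:

```matlab
C = {'a', 'b'; 'c', 'd'};   % small example cell array of strings

% Join the cells of each row with commas, one output line per row
rowLines = arrayfun(@(i) strjoin(C(i, :), ','), (1:size(C, 1))', ...
    'UniformOutput', false);

% Write the lines to a temporary file
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', rowLines{:});
fclose(fid);

% Read the file back to check what was written
written = fileread(fname);
```

Opening the file with 'w' rather than 'wt' keeps the line endings as plain newlines on every platform.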

The last comment was written in "pure" C, so it doesn't work in Matlab.
Here is a working solution:
function [ ] = writecellmatrixtocsvfile( filename, matrix )
%WRITECELLMATRIXTOCSVFILE Summary of this function goes here
% Detailed explanation goes here
fid = fopen(filename,'w');
for i = 1:size(matrix,1)
for j = 1:size(matrix,2)
fprintf(fid,'%s',matrix{i,j});
if j~=size(matrix,2)
fprintf(fid,'%s',',');
else
fprintf(fid,'\n');
end
end
end
fclose(fid);
end

It is easy enough to write your own CSV writer.
-- edited to reflect comments --
fid = fopen('myfilename.csv','w');
for i = 1:size(A,1)
    for j = 1:size(A,2)
        fprintf(fid,'%s',A{i,j});
        if j ~= size(A,2)
            fprintf(fid,','); % separator between cells
        else
            fprintf(fid,'\n'); % end of row
        end
    end
end
fclose(fid);
