MATLAB: how to read a .txt file with many separators

This is my first question here on Stack Overflow. I have a problem reading a .txt file with MATLAB using textread. The .txt file is really messy and has the structure shown below.
"ALMEMO";"BEREICH:";"L420";"DIGI";"DIGI";"DIGI";"DIGI";;;;;;;"DIGI";"DIGI";"DIGI";"DIGI";;;;;;;"DIGI";"DIGI";"DIGI";"DIGI";;;;;;;"DIGI";"DIGI";"DIGI";"DIGI";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"CoCo";"CoCo";"CoCo";"CoCo";"CuCo";"CoCo";"CoCo";"CoCo";"CoCo";"CoCo";;;;;;;;;;;"CoCo";"CoCo";"CoCo";"CoCo";"CoCo";"CoCo";"CoCo";"CoCo";"CoCo";"CoCo"
"5690-2";"KOMMENTAR:";"";"T,t ";"T,t ";"Temperatur";"T,t ";;;;;;;"RH,Uw ";"RH,Uw ";"Feuchte ";"RH,Uw ";;;;;;;"DT,td ";"DT,td ";"Taupunkt ";"DT,td ";;;;;;;"MH,r g/kg ";"MH,r g/kg ";"Mischung ";"MH,r g/kg ";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"";"";"";"";"";"";"";"";"";"";;;;;;;;;;;"";"";"";"";"";"";"";"";"";""
"SD3.10";"GW-MAX:";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
"ALMEMO.001";"GW-MIN:";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
"DATUM:";"ZEIT:";"M00: ms";"M01: øC";"M02: øC";"M03: øC";"M04: øC";;;;;;;"M11: %H";"M12: %H";"M13: %H";"M14: %H";;;;;;;"M21: øC";"M22: øC";"M23: øC";"M24: øC";;;;;;;"M31: gk";"M32: gk";"M33: gk";"M34: gk";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"M70: øC";"M71: øC";"M72: øC";"M73: øC";"M74: øC";"M75: øC";"M76: øC";"M77: øC";"M78: øC";"M79: øC";;;;;;;;;;;"M90: øC";"M91: øC";"M92: øC";"M93: øC";"M94: øC";"M95: øC";"M96: øC";"M97: øC";"M98: øC";"M99: øC"
07.03.21;11:29:24;0,;22,91;23,15;23,68;22,75;;;;;;;38,3;74,1;70,;38,8;;;;;;;8,;18,3;17,8;8,1;;;;;;;6,6;13,2;12,8;6,6;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-
;11:30:24;0,;22,9;23,14;23,69;22,82;;;;;;;38,4;72,6;71,9;38,5;;;;;;;8,;18,;18,3;8,;;;;;;;6,6;12,9;13,2;6,6;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-
;11:31:24;0,;22,94;23,14;23,68;22,88;;;;;;;38,3;75,4;71,5;38,5;;;;;;;8,;18,6;18,2;8,1;;;;;;;6,6;13,4;13,1;6,6;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-
;11:32:24;0,;23,;23,13;23,68;22,95;;;;;;;38,2;73,;72,3;38,5;;;;;;;8,;18,1;18,4;8,1;;;;;;;6,6;13,;13,3;6,7;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-;;;;;;;;;;;-;-;-;-;-;-;-;-;-;-
Six lines of header are followed by the actual data, which are separated by ';' and have the floating point numbers formatted with commas instead of dots. The data I need is not the whole line but only the first nine elements (date, hour, 9 floating point numbers).
The code I wrote to read the file, somewhat naively and based on other examples, is the following:
[date1, hour1, V0, Temp1, Temp2, Temp3, Temp4, RH1, RH2, RH3, RH4] = textread('file.txt', '%c %c %f %f %f %f %f %c* %c* %c* %c* %c* %c* %f %f %f %f', 'headerlines', 7, 'delimiter', ';');
Obviously, it does not work. I think the headers should already be skipped in my version of the code, so, to summarize, the following questions remain:
- How can I treat many separators as one? (or ignore them, as I tried to do in my code)
- How can I make the date, which appears only in the first line after the header, appear in every row? (I think I can fill the first column of the output matrix afterwards with a for loop)
- How can I cut the lines of the text file, ignoring everything that comes after the ninth floating point number?
- How can I read floating point numbers that use a comma as the decimal separator? (I tried to convert the commas to dots with the Notepad "Replace" function; this is a valid solution in my case, but it still does not solve the problem)
Thanks in advance to everyone who will answer, take care,
Giuseppe

You could take advantage of the built-in arguments of textscan to skip the header lines and to treat repeated delimiters as one. Then replace the decimal commas with dots using strrep. Finally, you can convert your cell array of strings to an array of numbers with str2double.
fid = fopen('foo.txt');
C = textscan(fid, repmat('%s',1,9), 'Headerlines', 6, 'Delimiter', ';', 'MultipleDelimsAsOne', 1);
fclose(fid);
col1 = str2double( strrep(C{1}, ',', '.') );
This is a roundabout way of accomplishing your task, but text handling is not exactly MATLAB's strong point.
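Collapsing repeated delimiters discards the information about which columns are empty, so an alternative is to split each line while keeping the empty fields and then pick the columns by position. The following is only a sketch, not the method from the answer above: the two sample rows are shortened copies of the question's data, and the field positions in numIdx (3 to 7 and 14 to 17) are an assumption read off that sample layout. It also carries the date forward into rows where it is missing.

```matlab
% Shortened sample rows in the same layout as the question's data (assumption).
rows = { ...
    '07.03.21;11:29:24;0,;22,91;23,15;23,68;22,75;;;;;;;38,3;74,1;70,;38,8;;;-;-', ...
    ';11:30:24;0,;22,9;23,14;23,69;22,82;;;;;;;38,4;72,6;71,9;38,5;;;-;-'};

numIdx = [3:7, 14:17];            % assumed positions of the nine numeric fields
nRows  = numel(rows);
dates  = cell(nRows, 1);
times  = cell(nRows, 1);
values = nan(nRows, numel(numIdx));
lastDate = '';

for k = 1:nRows
    % Keep empty fields so that every column stays at a fixed position.
    parts = strsplit(rows{k}, ';', 'CollapseDelimiters', false);
    if ~isempty(parts{1})
        lastDate = parts{1};      % remember the most recent date seen
    end
    dates{k} = lastDate;          % fill the date into rows that lack it
    times{k} = parts{2};
    % Decimal comma -> decimal point, then convert text to numbers.
    values(k, :) = str2double(strrep(parts(numIdx), ',', '.'));
end
```

For a real file, the rows could be read one at a time with fgetl after skipping the header lines; everything after the last index in numIdx is simply ignored.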

Related

How to skip first few rows when reading from a file in matlab [duplicate]

When I try to use headerlines with textscan to skip the first line of the text file, all of my data cells are stored as empty.
fid = fopen('RYGB.txt');
A = textscan(fid, '%s %s %s %f', 'HeaderLines', '1');
fclose(fid);
This code gives
1x4 Cell
[] [] [] []
Without the headerlines part and without a first line that needs to be skipped in the text file, the data is read in with no problem. It creates a 1x4 cell with data cells containing all of the information from the text file in columns.
What can I do to skip the first line of the text file and read my data in normally?
Thanks
I think your problem is that you have specified a string instead of an integer value for HeaderLines. The character '1' is interpreted as its ASCII value, 0x31 (49 decimal), so the first 49 lines are skipped. Your file probably contains 49 lines or fewer, so everything ends up being discarded. This is why you're getting empty cells.
The solution is to replace '1' with 1 (i.e. remove the quotes), like so:
A = textscan(fid, '%s %s %s %f', 'HeaderLines', 1);
and this should do the trick.
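A small sketch of the difference, using made-up text instead of RYGB.txt (the column names and values below are hypothetical). It shows the character code of '1' and then a textscan call with a numeric HeaderLines value on in-memory text.

```matlab
% The character '1' has code 49, which is not the number 1.
codeOfOne = double('1');

% Hypothetical file contents: one header line followed by two data rows.
txt = sprintf('%s\n', 'name group colour value', 'a b c 1.5', 'd e f 2.5');

% Numeric HeaderLines value: exactly one line is skipped.
A = textscan(txt, '%s %s %s %f', 'HeaderLines', 1);
firstColumn = A{1};   % cell array with the first text column
lastColumn  = A{4};   % numeric column
```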

Why can't the matlab textscan function read + 22.24 as a float?

I'm currently having a problem with the matlab function textscan.
I got a data file which looks like this:
1,2018/08/14 17:06:15, 0,+ 22.24,+ 22.46,+ 18.18,+0.0000,+0.0005,LLLLLLLLLL,LLLLLLLLLL,LLLL
or sometimes when a sensor isn't working properly it looks like this:
1,2018/07/11 17:02:53, 0,+ 23.88,+ 24.78,+ 23.65,+++++++,+ 23.94,+ 23.01,+ 24.33,LLLLLLLLLL,LLLLLLLLLL,LLLL
Since the data varies from file to file I am creating a matching formatSpec from the headerline.
In the 1st case it would look like
formatSpec = '%*u %s %*u%f%f%f%f%f%*[^\n]'
and in the 2nd case like
formatSpec = '%*u %s %*u%f%f%f%f%f%f%f%*[^\n]'
I am using the textscan function like this:
textscan(fileID, formatSpec_data, data_rows, 'Delimiter', ',', 'TreatAsEmpty', {'+++++++'},'EmptyValue', NaN, 'ReturnOnError', 0 );
but it keeps throwing an error with the message
Error using textscan
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 1, field number 4) ==> + 23.88,+ 24.78,+ 23.65,+++++++,+ 23.94,+ 23.01,+ 24.33,LLLLLLLLLL,LLLLLLLLLL,LLLL\n
Error in data_logger (line 31)
dataArray = textscan(fileID, formatSpec_data, data_rows, 'Delimiter', delimiter, 'HeaderLines' ,startRow, 'TreatAsEmpty', {'+++++++'},'EmptyValue', NaN, 'ReturnOnError', 0 );
When I deactivate 'ReturnOnError', textscan reads only the first row, and everything except the date/time string is empty. I also tried to use textscan without TreatAsEmpty and/or EmptyValue, but I get the same result.
I really don't understand why textscan has problems reading e.g. ,+ 22.24 as a float.
When I specify formatSpec to read all the data as strings it works, but then I have to use str2num afterwards, which I don't really want to do.
I'm thankful for any help and look forward to understanding this behaviour.
Short answer: Matlab doesn't like the space between the + and the number in those fields. I think the simplest solution may be to just tell Matlab to ignore the + by calling it white space. Add the arguments 'WhiteSpace','+' when you call textscan, like this:
textscan(fileID, formatSpec_data, data_rows, 'Delimiter', ',', 'EmptyValue', NaN, 'ReturnOnError', 0 , 'WhiteSpace', '+');
Note that I also removed the 'TreatAsEmpty' argument, because once you consider all the + as white space, it is empty anyway.
Another option would be to pre-parse the file and remove the space between the + and the number. You could read the file using fileread, do a replacement using strrep or regexprep, then run textscan on the result.
datain = fileread('mydatafile.csv');
datain = strrep(datain,'+ ','+');
textscan(datain, formatSpec_data, data_rows, 'Delimiter', ',', 'TreatAsEmpty', {'+++++++'},'EmptyValue', NaN, 'ReturnOnError', 0 );
Finally, if you get stuck where you absolutely have to read as text then convert to numeric values, try str2doubleq, available on the Matlab File Exchange. It is much faster than str2double or str2num.
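To illustrate the clean-up step on its own, here is a sketch applied to the second sample line from the question. To keep it self-contained it uses strsplit and str2double rather than textscan, so it is not the exact call from the answer; the field positions 4 to 10 are an assumption based on that sample line.

```matlab
% Second sample line from the question (one sensor reports '+++++++').
rawLine = ['1,2018/07/11 17:02:53, 0,+ 23.88,+ 24.78,+ 23.65,+++++++,' ...
           '+ 23.94,+ 23.01,+ 24.33,LLLLLLLLLL,LLLLLLLLLL,LLLL'];

% Remove the space between the sign and the digits.
cleanLine = strrep(rawLine, '+ ', '+');

% Split on commas and convert the assumed numeric fields (4 to 10).
parts = strsplit(cleanLine, ',', 'CollapseDelimiters', false);
vals  = str2double(parts(4:10));   % non-numeric text such as '+++++++' becomes NaN
```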

Blank cells while reading substrings and numbers from within a string with textscan

I have a text file that consists of line after line of data in an xml-like format like this:
<item type="newpoint1" orient_zx="0.8658983248810842" orient_zy="0.4371062806139187" orient_zz="0.2432245678709263" electrostatic_force_x="0" electrostatic_force_y="0" electrostatic_force_z="0" cust_attr_HMTorque_0="0" cust_attr_HMTorque_1="0" cust_attr_HMTorque_2="0" vel_x="0" vel_y="0" vel_z="0" orient_xx="-0.2638371745169712" orient_xy="-0.01401379799313232" orient_xz="0.9644654264455047" pos_x="0" cust_attr_BondForce_0="0" pos_y="0" cust_attr_BondForce_1="0" pos_z="0.16" angvel_x="0" cust_attr_BondForce_2="0" angvel_y="0" id="1" angvel_z="0" charge="0" scaling_factor="1" cust_attr_BondTorque_0="0" cust_attr_BondTorque_1="0" cust_attr_BondTorque_2="0" cust_attr_Damage_0="0" orient_yx="0.4249823952954215" cust_attr_HMForce_0="0" cust_attr_Damage_1="0" orient_yy="-0.8993006799250595" cust_attr_HMForce_1="0" orient_yz="0.1031903618333235" cust_attr_HMForce_2="0" />
I'm only interested in the values within the " " so I'm trying to read this with textscan. To do this I take the first line and do a regex find/replace to swap all numbers for %f and all strings for %s, like this:
expression = '"[-+]?\d*\.?\d*"';
expression2 = '"\w*?"';
newStr = regexprep(firstline,expression,'"%f"');
FormatString = sprintf('%s',regexprep(newStr,expression2,'"%s"'));
Then I re-open the file and read it with this format string using the following call:
while ~feof(InputFile) % Read all lines in file
data = textscan(InputFile,FormatString,'delimiter','\n');
end
But all I get is an array of empty cells. I can't see what my mistake is. Can someone point me in the right direction?
Clarification:
MathWorks provides the following example for textscan to remove literal text, which is what I'm trying to do.
"Remove the literal text 'Level' from each field in the second column of the data from the previous example."
filename = fullfile(matlabroot,'examples','matlab','scan1.dat');
fileID = fopen(filename);
C = textscan(fileID,'%s Level%d %f32 %d8 %u %f %f %s %f');
fclose(fileID);
C{2}
Ok, after looking at this with some fresh eyes today I spotted my problem.
newStr = regexprep(firstline,expression,'"%f"');
FormatString = sprintf('%s',regexprep(newStr,expression2,'%q'));
data = textscan(InputFile,FormatString,'delimiter',' ');
The replacement for the strings needs to be switched to the %q option, which allows a string within double quotes to be read, and the delimiter for textscan needs to be reverted to a single space. The code is working fine now.
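As an alternative that does not rely on building a textscan format string, the attribute names and quoted values can be pulled out of a line with regexp tokens. This is only a sketch and not the approach used above: the sample line is a shortened version of the one in the question, and the pattern assumes that attribute values never contain a double quote.

```matlab
% Shortened sample line in the question's format.
xmlLine = '<item type="newpoint1" orient_zx="0.8658983248810842" pos_z="0.16" id="1" />';

% Each token is a 1x2 cell: {attribute name, text between the quotes}.
tok = regexp(xmlLine, '(\w+)="([^"]*)"', 'tokens');

names = cellfun(@(t) t{1}, tok, 'UniformOutput', false);
raw   = cellfun(@(t) t{2}, tok, 'UniformOutput', false);

% Numeric attributes convert to numbers; text such as 'newpoint1' becomes NaN.
nums = str2double(raw);
```

Keeping the names alongside the values also makes it possible to look up an attribute by name rather than by position.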

MATLAB: Reading space separated float values from text file

I am reading a text file using the textscan function of MATLAB. The problem is that nothing is read into values, because the floating point numbers are separated by three spaces, and I am too new to MATLAB programming to know an efficient syntax for this. My current code is given below:
Code:
values = textscan(input_file, '%f %f %f %f %f\n %*[^\n]');
The input file follows the following format:
File:
0.781844 952.962130 2251.430836 3412.734125 4456.016362
0.788094 983.834855 2228.432996 3196.415590 4378.885466
0.794344 967.653718 2200.798973 3119.844502 4374.097695
If the floating point values are # separated then the below statement works fine:
values = textscan(input_file, '%f#%f#%f#%f#%f\n %*[^\n]');
Is there any solution except for tokenization ?
You need to specify a delimiter, and you should also enable the MultipleDelimsAsOne option in order to treat the repeated spaces as a single delimiter:
value = textscan(input_file, '%f %f %f %f %f \n ','Delimiter',' ','MultipleDelimsAsOne',1);
If needed you can also specify several delimiters at the same time:
del = {';',' '};
If you don't have to use textscan, you could probably use importdata. There you can specify the delimiter as a parameter.
Documentation http://se.mathworks.com/help/matlab/ref/importdata.html
Code example
filename = 'myfile01.txt';
delimiterIn = ' ';
A = importdata(filename,delimiterIn);
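For purely whitespace-separated columns, a plain format with one %f per column may already be enough, because textscan by default treats runs of whitespace as a single delimiter. A minimal sketch on in-memory text, using two rows from the question with three spaces between the values:

```matlab
% Two rows from the question, three spaces between values.
txt = sprintf('%s\n', ...
    '0.781844   952.962130   2251.430836   3412.734125   4456.016362', ...
    '0.788094   983.834855   2228.432996   3196.415590   4378.885466');

% One %f per column; no trailing '\n %*[^\n]' so no rows are skipped.
C = textscan(txt, '%f %f %f %f %f');

% Combine the five column vectors into one numeric matrix.
M = [C{:}];
```

With a file, the same format string can be used with the file identifier returned by fopen in place of txt.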

Matlab: Parsing strings

How can I turn strings like 1-14 into 01_014? (and for 2-2 into 02_002)?
I can do something like this:
testpoint_number = '5-16';
temp = textscan(testpoint_number, '%s', 'delimiter', '-');
temp = temp{1};
first_part = temp{1};
second_part = temp{2};
output_prefix = strcat('0',first_part);
parsed_testpoint_number = strcat(output_prefix, '_',second_part);
parsed_testpoint_number
But I feel this is very tedious, and I don't know how to handle the second part (16 to 016)
As you are handling integer numbers, I would suggest changing the textscan format to %d (integer numbers). With that, you can use the formatting power of the *printf commands (e.g. sprintf).
*printf allows you to specify the width of the integer. With %02d, a 2 chars wide integer, which will be filled up with zeros, is printed.
The textscan call returns a {1x1} cell, which contains a 2x1 array of integers. *printf can handle this itself, so you just have to supply the argument temp{1}:
temp = textscan(testpoint_number, '%d', 'delimiter', '-');
parsed_testpoint_number = sprintf('%02d_%03d',temp{1});
Your textscanning is probably the most intuitive way to do this, but from then on what I would recommend doing is instead converting the scanned first_part and second_part into numerical format, giving you integers.
You can then sprintf these into your target string using the correct 'c'-style formatters to indicate your zero-padding prefix width, e.g.:
temp = textscan(testpoint_number, '%d', 'delimiter', '-');
parsed_testpoint_number = sprintf('%02d_%03d', temp{1});
Take a look at the C sprintf() documentation for an explanation of the string formatting options.
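The same idea can be wrapped in a one-line helper and checked against the inputs from the question. This sketch swaps textscan for sscanf to read the two integers; the name formatTestpoint is made up for illustration.

```matlab
% Hypothetical helper: read two integers around '-' and zero-pad them.
formatTestpoint = @(s) sprintf('%02d_%03d', sscanf(s, '%d-%d'));

a = formatTestpoint('1-14');
b = formatTestpoint('2-2');
c = formatTestpoint('5-16');
```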