Trouble loading in mixed data from txt file MATLAB - matlab

I'm trying to load in data from a text file. The first two rows are headers, following the headers the first two columns are date and time. The rest of the columns are floats.
data should have 11 columns, however, whos returns that size is only 1x3
Data txt file:
fid = fopen('allunderway.txt', 'rt');
data = textscan(fid, '%{M/dd/yyyy}D %{HH:mm:ss}D %4.2f %2.4f %2.5f %2.4f %2.4f %2.2f %4.2f %3.1f %1.4f', 'HeaderLines', 2, 'CollectOutput', true);
fclose(fid);
whos data
date = data{1};
time = data{2};
wnd_td = data{10};
wnd_ts = data{11};

You could try using a delimiter instead, seems like this is a tab separated file.
you might have to try both 'rt' and 'r' in the fopen command.
As for the textscan part try adding this
'Delimiter','\t','EmptyValue',NaN
It adds tabs as a delimiter and replaces empty values with NaN.
or uses spaces as delimiters and set it so that it doesn't matter if there's 1 or multiple spaces
'Delimiter',' ','MultipleDelimsAsOne',1
Or use 'Whitespace' as the delimiter (uses both tabs and spaces).

Related

String vector to array

I am trying to make a script in Matlab that pulls data from a file and generates an array of data. Since the data is a string I've tried to split it into columns, take the transpose, and split it into columns again to populate an array.
When I run the script I don't get any errors, but I also don't get any useful data. I tell it to display the final vector (Full_Array) and I get {1×4 cell} 8 times. When I try to use strsplit I get the error:
'Error using strsplit (line 80) First input must be either a character vector or a string scalar.'
I'm pretty new to Matlab and I honestly have no clue how to fix it after reading through similar threads and the documentation I'm out of ideas. I've attached the code and the data to read in below. Thank you.
clear
File_Name = uigetfile; %Brings up windows file browser to locate .xyz file
Open_File = fopen(File_Name); %Opens the file given by File_Name
File2Vector = fscanf(Open_File,'%s'); %Prints the contents of the file to a 1xN vector
Vector2ColumnArray = strsplit(File2Vector,';'); %Splits the string vector from
%File2Vector into columns, forming an array
Transpose = transpose(Vector2ColumnArray); %Takes the transpose of Vector2ColumnArray
%making a column array into a row array
FullArray = regexp(Transpose, ',', 'split');
The data I am trying to read in comes from a .xyz file that I have titled methylformate.xyz, here is the data:
O2,-0.23799,0.65588,-0.69492;
O1,0.50665,0.83915,1.47685;
C2,-0.32101,2.08033,-0.75096;
C1,0.19676,0.17984,0.49796;
H4,0.66596,2.52843,-0.59862;
H3,-0.67826,2.36025,-1.74587;
H2,-1.03479,2.45249,-0.00927;
H1,0.23043,-0.91981,0.45346;
When I started using Matlab I also had problems with the data structure. The last line
FullArray = regexp(Transpose, ',', 'split');
splits each line and stores it in a cell array. In order to access the individual strings you have to index with curly brackets into FullArray:
FullArray{1}{1} % -> 'O2'
FullArray{1}{2} % -> '-0.23799'
FullArray{2}{1} % -> 'O1'
FullArray{2}{2} % -> '0.50665'
Thereby the first number corresponds to the row and the second to the particular element in the row.
However, there are easier functions in Matlab which load text files based on regular expressions.
Usually, the easiest function for reading mixed data is readtable.
data = readtable('methylformate.txt');
However, in your case this is more complex because
readtable can't cope with .xyz files, so you'd have to copy to .txt
The semi-colons confuse the read and make the last column characters
You can loop through each row and use textscan like so:
fid = fopen('methylformate.xyz');
tline = fgetl(fid);
myoutput = cell(0,4);
while ischar(tline)
myoutput(end+1,:) = textscan(tline, '%s %f %f %f %*[^\n]', 'delim', ',');
tline = fgetl(fid);
end
fclose(fid);
Output is a cell array of strings or doubles (as appropriate).

MATLAB reads UNICODE CSV with spaces between characters

I am using the fgetl command to read a .csv file but instead of returning the results I wanted as:
"HIST",1,1,27,PWH,"1"
it returned with additional space between each character:
" H I S T " , 1 , 1 , 2 7 , P W H , " 1 "
I know that I can replace the space with regexprep, but my file contains billions of lines so the added expression might consume considerably more time. I had a feeling that this is a unicode issue and someone pointed out the same issue when he used Java and it was related to unicode. I wonder if anyone knows a better way to deal with the problem in MATLAB?
Update:
It should be the unicode issue because the .csv file is an output from another program, and when I read it using fgetl the spaces are added. However, if I save the .csv file again using Excel and read the .csv file using fgetl again, it returns the results I want.
I am not able to provide an example because the .csv file is very large and I cannot make a small sample because when I open and save it from Excel, this problem is gone.
For the purpose of demonstration, let's consider a demo file - demo.csv:
"GIST",1,6,17,PWH,"1"
"FIST",0,4,72,WPH,"2"
"MIST",3,2,27,WHP,"3"
You have some options:
textscan (for any text file with a known structure):
fID = fopen('demo.csv');
C = textscan(fID,'%s%d%d%d%s%s','Delimiter',{',','"'},'MultipleDelimsAsOne',1);
fclose(fID);
Which results in:
C =
{3x1 cell} [3x1 int32] [3x1 int32] [3x1 int32] {3x1 cell} {3x1 cell}
Import helper + generate script (AKA overkill is an understatement):
Which results in:
%% Import data from text file.
% Script for importing data from the following text file:
%
% F:\demo.csv
%
% To extend the code to different selected data or a different text file, generate a
% function instead of a script.
% Auto-generated by MATLAB on 2016/04/20 19:51:32
%% Initialize variables.
filename = 'F:\demo.csv';
delimiter = ',';
%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%q%q%q%q%q%q%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this code. If an error
% occurs for a different file, try regenerating the code from the Import Tool.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'ReturnOnError', false);
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[2,3,4,6]
% Converts strings in the input cell array to numbers. Replaced non-numeric strings with
% NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1);
% Create a regular expression to detect and remove non-numeric prefixes and suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData{row}, regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if any(numbers==',');
thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
if isempty(regexp(numbers, thousandsRegExp, 'once'));
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric strings to numbers.
if ~invalidThousandsSeparator;
numbers = textscan(strrep(numbers, ',', ''), '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch me
end
end
end
%% Split data into numeric and cell columns.
rawNumericColumns = raw(:, [2,3,4,6]);
rawCellColumns = raw(:, [1,5]);
%% Allocate imported array to column variable names
GIST = rawCellColumns(:, 1);
VarName2 = cell2mat(rawNumericColumns(:, 1));
VarName3 = cell2mat(rawNumericColumns(:, 2));
VarName4 = cell2mat(rawNumericColumns(:, 3));
PWH = rawCellColumns(:, 2);
VarName6 = cell2mat(rawNumericColumns(:, 4));
%% Clear temporary variables
clearvars filename delimiter formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp me rawNumericColumns rawCellColumns;
csvread (for numeric values only; which means it is not applicable here).
I happened to have the same issue. I opened a .csv file using textscan and it added 1 whitespace on both side of any character and I also noticed that when opening the variable storing the read data, the font was different than the usual in Matlab.
We managed to solve this issue by opening the '.csv' file into Notepad++ and changed the encoding to UTF-8. It solved the problem.
Hope it helps!

Textscan until end of line

I'm trying to textscan a file and read a single line until the end of it, undependently of the number of elements in that line.
My file is a .txt file formatted like this :
602,598,302,456,1023,523,....
293,291,566,331,987,56,....
589,202,429,2911,294,567,...
And so on. I have the number of the line, and all lines have the same number of elements, but it can vary from one file to another.
I wrote something like:
fid = fopen('somefile.txt');
C = textscan(fid, formatSpec,'HeaderLines',Row-1);
TheLine = C{1};
fclose(fid);
X = numel(TheLine);
plot(1:X,TheLine);
I really don't know what to type in the formatSpec field. I've tried a few things in the way of %[^\n] but I didn't get much sucess.
Try this -
C = textscan(fid, '%d,','HeaderLines',Row-1);
Row will specify the row of data that you want to extract from the text file.

Text Scanning to read in unknown number of variables and unknown number of runs

I am trying to read in a csv file which will have the format
Var1 Val1A Val1B ... Val1Q
Var2 Val2A Val2B ... Val2Q
...
And I will not know ahead of time how many variables (rows) or how many runs (columns) will be in the file.
I have been trying to get text scan to work but no matter what I try I cannot get either all the variable names isolated or a rows by columns cell array. This is what I've been trying.
fID = fopen(strcat(pwd,'/',inputFile),'rt');
if fID == -1
disp('Could not find file')
return
end
vars = textscan(fID, '%s,%*s','delimiter','\n');
fclose(fID);
Does anyone have a suggestion?
If the file has the same number of columns in each row (you just don't know how many to begin with), try the following.
First, figure out how many columns by parsing just the first row and find the number of columns, then parse the full file:
% Open the file, get the first line
fid = fopen('myfile.txt');
line = fgetl(fid);
fclose(fid);
tmp = textscan(line, '%s');
% The length of tmp will tell you how many lines
n = length(tmp);
% Now scan the file
fid = fopen('myfile.txt');
tmp = textscan(fid, repmat('%s ', [1, n]));
fclose(fid);
For any given file, are all the lines equal length? If they are, you could start by reading in the first line and use that to count the number of fields and then use textscan to read in the file.
fID = fopen(strcat(pwd,'/',inputFile),'rt');
firstLine = fgetl(fID);
numFields = length(strfind(firstLine,' ')) + 1;
fclose(fID);
formatString = repmat('%s',1,numFields);
fID = fopen(strcat(pwd,'/',inputFile),'rt');
vars = textscan(fID, formatString,' ');
fclose(fID);
Now you will have a cell array where first entry are the var names and all the other entries are the observations.
In this case I assumed the delimiter was space even though you said it was a csv file. If it is really commas, you can change the code accordingly.

Matlab: How to read commented portion of ascii file

I have an ascii file whose first couple hundred lines are commented (followed by the data) that give some information about the data. For example these are couple of lines I snipped out from large number of lines which are commented:
Right now I am only reading the data without comments by using load as:
filename = uigetfile('*.dat', 'Select Input data');
Data = load(filename, '-ascii');
How can I read the commented lines (which end just before the data starts) and pick some comments out of all comments based on some identifications such as Program name and version, Creation date etc. ?
Use textscan to read the lines into a cell array:
fid = fopen(filename, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
C = C{:}; %// Flatten cell array
fclose(fid);
Now you can use regexp to manipulate the textual data. For instance, to find the comment lines that contain the string "Creation date", you can do this:
idx = ~cellfun('isempty', regexp(C, "^\s*%.*Creation date"));
where "^\s*% matches the percent sign (%) at the beginning of the line along with any leading whitespace, and the .* matches any number of characters until the occurrence of "Creation date". Needless to say, you can adjust the regular expression pattern to your liking.
The resulting variable idx stores a logical (i.e boolean) vector with "1"s at the positions of the lines matching the pattern (you can obtain their explicit numerical indices with find(idx)). Next you can filter those lines with C(idx) or iterate over them with a for loop.
fid = fopen(filename);
nHeaderRows = 412;
headerCell = cell(nHeaderRows, 1);
for i=1:nHeaderRows
headerCell{i} = fgets(fid);
end
headerText = char(headerCell);