How to process a non-numeric text file and convert it into a struct most efficiently? - matlab

I have a bunch of .txt files that have the following format:
|file | time | color | tags |
|1 | 1:10 | red | ok, correct|
|2 | 2:20 | blue | bad |
|3 | 1:20 | yellow | sometag |
The first row specifies the column names.
The subsequent rows are 'database' entries.
I want to read in this file, and put all the information into a Matlab structure. I'm wondering what the most efficient way of doing this is.
textread with 'delimiter', '\n' and process each line individually?
textread with 'delimiter', '|' and having to determine which entries belong together?
fread line by line?
I love the convenience of textread with 'delimiter', '\n' but then it's quite a pain to get out the individual entries for each column (with a for loop). Alternatively I can split each row up using regexp:
regexp(file{1}, '\|', 'split')
But this will only split up each row, and won't take care of the whitespace (which I would need another regexp call to get rid of).
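For instance, the splitting and trimming could be combined in one go (just a sketch, using strtrim instead of a second regexp call, with file being the cell array returned by textread):
fields = strtrim(regexp(file{1}, '\|', 'split'));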
So what's the most straightforward (and maybe even most efficient) way of doing this?
EDIT1: I want to build a struct like db = struct('file', [], 'time', [], 'color', [], 'tags', []), which is easy to create once I've read in the first line.

Let's use textscan:
fid = fopen('asdf.txt');
header = textscan(fid, '%s', 5, 'delimiter', '|');
data = textscan(fid, '|%d %s %s %s', 'delimiter', '|');
fclose(fid);
which gives:
data =
    [3x1 int32]    {3x1 cell}    {3x1 cell}    {3x1 cell}
From here it's easy to go to the struct you wanted. The strings also have some extra spaces, which need trimming:
data{2} = strtrim(data{2});
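From there, one possible sketch of the struct assembly (field names hard-coded from the header row shown in the question):
db.file  = data{1};            % numeric file ids
db.time  = strtrim(data{2});   % strtrim removes the padding spaces
db.color = strtrim(data{3});
db.tags  = strtrim(data{4});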

Related

How to read comma-delimited data with some values using commas between quotes

I have a data file that includes comma-delimited data that I am trying to read into Octave. Most of the data is fine, but some of it includes numbers wrapped in double quotes that themselves contain commas. Here's a sample section of data:
.123,4.2,"4,123",700,12pie
.34,4.23,602,701,23dj
.4345,4.6,"3,623,234",700,134nfg
.951,68.5,45,699,4lkj
I've been using textscan to read the data (since there's a mix of numbers and strings), specifying comma delimiters, and that works most of the time, but occasionally the file contains these bigger integers in quotes scattered through that column. I was able to get around one of these quoted numbers earlier in the data file because I knew where it would be, but it wasn't pretty:
sclose = textscan(fid, '%n %n', 1, 'delimiter', ',');
junk = fgetl(fid, 1);
junk = textscan(fid, '%s', 1, 'delimiter', '"');
junk = fgetl(fid, 1);
sopen = textscan(fid, '%n %s', 1, 'delimiter', ',');
I don't care about the data in that column, but because it changes size and sometimes contains these quoted numbers with extra commas that I want to ignore, I'm struggling with how to read/skip it. Any suggestions on how to handle it?
Here's my current (ugly) approach that reads the column as a string, then uses strfind to check for a " within the string. If it's present then it reads another comma-delimited string and repeats the check until the closing " is found and then resumes reading the data.
fid = fopen('sample.txt', 'r');
for k=1:4
  expdata1(k, :) = textscan(fid, '%n %n %s', 1, 'delimiter', ','); # read first 3 data pts
  qcheck = char(expdata1(k,3));
  idx = strfind(qcheck, '"'); # look for "
  dloc = ftell(fid);
  for l=1:4
    if isempty(idx) # if no " present, continue reading data
      break
    endif
    dloc = ftell(fid); # save location so can return to next data point
    expdata1(k, 3) = textscan(fid, '%s', 1, 'delimiter', ','); # if " present, read next comma segment and check for "
    qcheck = char(expdata1(k,3));
    idx = strfind(qcheck, '"');
  endfor
  fseek(fid, dloc);
  expdata2(k, :) = textscan(fid, '%n %s', 1, 'delimiter', ',');
endfor
fclose(fid);
There's gotta be a better way...
I see this has a matlab tag on it; are you using MATLAB's textscan or Octave's?
If in MATLAB, I would suggest using either readmatrix or readtable.
Also note, the format specifier for a quoted string is %q. This should be applicable to both languages, even for textscan.
Putting your sample data in data.csv, the following is possible:
>> readtable("data.csv", 'Format', '%f%f%q%d%s')
ans =
4×5 table
     Var1     Var2         Var3         Var4      Var5
    ______    ____    _____________    ____    __________
    0.123     4.2     {'4,123'    }    700     {'12pie' }
    0.34      4.23    {'602'      }    701     {'23dj'  }
    0.4345    4.6     {'3,623,234'}    700     {'134nfg'}
    0.951     68.5    {'45'       }    699     {'4lkj'  }
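If you prefer to stay with textscan (for instance in Octave), a rough sketch using %q for the sometimes-quoted third column, again assuming the sample lines are saved in data.csv:
fid = fopen('data.csv', 'r');
% %q reads a (possibly double-quoted) field, so the commas inside the quotes
% are kept as part of the value rather than treated as delimiters
C = textscan(fid, '%f %f %q %d %s', 'Delimiter', ',');
fclose(fid);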

matlab: delimit .csv file where no specific delimiter is available

I wonder if there is a possibility to read a .csv file looking like:
0,0530,0560,0730,....
90,15090,15290,157....
I should get:
0,053 0,056 0,073 0,...
90,150 90,152 90,157 90,...
When using dlmread(path, ''), MATLAB spits out an error saying
Mismatch between file and Format character vector.
Trouble reading 'Numeric' field from file (row 1, field number 2) ==> ,053 0,056 0,073 ...
I also tried using "0," as the delimiter but MATLAB prohibits this.
Thanks,
jonnyx
str = importdata('file.csv', ''); % importing the data as a cell array of char
for k = 1:length(str)             % looping till the last line
    str{k} = myfunc(str{k});      % applying the required operation
end
where
function new = myfunc(str)
    old = str(1:regexp(str, ',', 'once')); % finding the characters till the first comma
    % old is the pattern of the current line
    new = strrep(str, old, [' ', old]);    % adding a space before that pattern
    new = new(2:end);                      % removing the space at the start
end
and file.csv :
0,0530,0560,073
90,15090,15290,157
Output:
>> str
str =
    '0,053 0,056 0,073'
    '90,150 90,152 90,157'
You can actually do this using textscan without any loops and using a few basic string manipulation functions:
fid = fopen('no_delim.csv', 'r');
C = textscan(fid, ['%[0123456789' 10 13 ']%[,]%3c'], 'EndOfLine', '');
fclose(fid);
C = strcat(C{:});
output = strtrim(strsplit(sprintf('%s ', C{:}), {'\n' '\r'})).';
And the output using your sample input file:
output =
2×1 cell array
'0,053 0,056 0,073'
'90,150 90,152 90,157'
How it works...
The format string specifies 3 items to read repeatedly from the file:
A string containing any number of characters from 0 through 9, newlines (ASCII code 10), or carriage returns (ASCII code 13).
A comma.
Three individual characters.
Each set of 3 items is concatenated, then all sets are printed to a string separated by spaces. The string is split at any newlines or carriage returns to create a cell array of strings, and any spaces on the ends are removed.
If you have access to a GNU/*NIX command line, I would suggest using sed to preprocess your data before feeding it into MATLAB. In this case the command would be: sed 's/,[0-9]\{3\}/& /g'
$ echo "90,15090,15290,157" | sed 's/,[0-9]\{3\}/& /g'
90,150 90,152 90,157
$ echo "0,0530,0560,0730,356" | sed 's/,[0-9]\{3\}/& /g'
0,053 0,056 0,073 0,356
You can also easily change the decimal commas (,) to decimal points (.):
$ echo "0,053 0,056 0,073 0,356" | sed 's/,/./g'
0.053 0.056 0.073 0.356
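Roughly the same preprocessing can also be done inside MATLAB with regexprep and strrep; a sketch, reusing importdata as in the other answer and assuming the raw lines sit in file.csv:
str = importdata('file.csv', '');       % one line of text per cell
str = regexprep(str, ',\d{3}', '$0 ');  % insert a space after each ",NNN" group
str = strtrim(str);                     % drop the trailing space on each line
str = strrep(str, ',', '.');            % optional: decimal commas -> decimal points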

MATLAB writing csv with mixed alphanumeric strings, scalar arrays, nans

Ok, coming from Python and never having used MATLAB before, it seems like it is unnecessarily hard to write data to a csv using MATLAB...
So my data looks like this:
col1     A2A     B2    CC3      D5
asd189   123     33    71119    18291
as33d    1311    31    NaN      1011
asd189   NaN     44    79       191
It has N header columns that are made of alphanumeric strings.
It has a leftmost column of length M which is made of alphanumeric strings.
It has an (M-1) x (N-1) array of NUMERIC data, with possible NaNs.
Can you please provide code to write this to a csv? I cannot use the xlswrite function because I'm on a cluster without Excel installed. Really just want to get on with the actual data analysis. Thanks
You can only write matrices (not cell arrays) directly using csvwrite, and as you say you need Excel installed for xlswrite, so that leaves you with low level operations. You can see a walkthrough for writing to text files here, and code for your example below:
% Initialise example cell array
M = {'col1',   'A2A', 'B2', 'CC3',  'D5'
     'asd189', 123,   33,   71119,  18291
     'as33d',  1311,  31,   NaN,    1011
     'asd189', NaN,   44,   79,     191};
% Open a file for writing to (doesn't have to already exist, can specify full directory)
fID = fopen('test.csv', 'w');
% Write header line, formatted as strings with comma delimiter. Note \r\n for new line
fprintf(fID, [repmat('%s, ', 1, size(M,2)-1), '%s\r\n'], M{1,:});
% Loop through other rows
for row = 2:size(M,1)
    % Write each line of cell array, with first column formatted as string
    % and other columns formatted as floats
    fprintf(fID, ['%s, ', repmat('%f, ', 1, size(M,2)-2), '%f\r\n'], M{row,:});
end
% Close file after writing
fclose(fID);
Result: a test.csv file containing the header row followed by the three comma-separated data rows (NaN entries are written out as NaN).
Use writetable. It makes writing to CSV (or to an Excel file, or to other text-delimited file formats) much easier than using csvwrite, or xlswrite, or low-level commands such as fprintf.
>> t = table({'asd189';'as33d';'asd189'},[123;1311;NaN],[33;31;44],[71119;NaN;79],[18291;1011;191]);
>> t.Properties.VariableNames = {'col1','A2A','B2','CC3','D5'}
t =
col1 A2A B2 CC3 D5
________ ____ __ _____ _____
'asd189' 123 33 71119 18291
'as33d' 1311 31 NaN 1011
'asd189' NaN 44 79 191
>> writetable(t,'myfile.csv')
If your data is currently not stored as a table (maybe it's in an array or cell array), it's pretty easy to convert to a table using utility functions such as array2table or cell2table. You will only pay a small time penalty for doing this.
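For instance, starting from a cell array like M in the previous answer, a possible sketch:
% First row supplies the variable names, the remaining rows become the table body
T = cell2table(M(2:end,:), 'VariableNames', M(1,:));
writetable(T, 'myfile.csv');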
PS - you don't need Excel to be installed in order to write to an Excel file. You may not be able to read them afterwards, but MATLAB can still write them. But it sounds like you'd prefer .csv anyway.

How to load a cell array that has both strings and numbers?

I have a cell array that has both strings and numbers. I want to load all the elements of the cell array. For the same I used the following method:
load(filename);
This command is loading only the strings and excluding the columns that have numbers. Basically, since my file does not have a .mat extension, it is treated as an ASCII file and only the text is loaded.
I tried importdata(filename), but that gives me a 1x1 struct. I need the elements to be imported into another cell array of the same dimensions.
Is there a way to load all the values?
load is used to import .mat-files with workspace variables. Since your data is not an actual .mat-file, you need to use a different method.
Let's assume you have the file filename with tab-delimited data:
str1 1
str2 2
str3 3
str4 4
To get a cell-array where the first column is a string (using %s) and the second a double (using %f), you can use textscan. Check out the result, maybe it's already what you're searching for.
filename = 'data';
F = fopen(filename, 'r');
data = textscan(F, '%s %f', 'Delimiter', '\t');
fclose(F);
If not, you can create a cell-array CA where the first column is a string (using cellstr) and the second one is a double (using num2cell).
CA = cell(size(data{1},1),2);
CA(:,1) = cellstr(data{1});
CA(:,2) = num2cell(data{2});
Result:
CA =
    'str1'    [1]
    'str2'    [2]
    'str3'    [3]
    'str4'    [4]
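Equivalently, since the %s column already comes back as a cell array of strings, the same cell array can be built in a single step (a minimal sketch):
CA = [data{1}, num2cell(data{2})];   % N-by-2 cell: strings in column 1, numbers in column 2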

Produce table with text and number with fprintf in Matlab

I need to produce a table whose first 2 columns have text, and the remaining 2 have numbers. Something like this:
| Ford | Mustang | 1975 | 35 |
| Chev | Camaro | 1976 | 38 |
I have the strings in a cell array, and the numeric variables in a matrix. I've tried with fprintf but can't make it work. I have no problem doing it with xlswrite, but I don't want to go that way. Any ideas, please?
Thanks!
You could use fprintf in a loop like this:
fprintf(1, '| %8s | %8s | %4d | %2d |\n', ...
company{i}, model{i}, year(i), otherNumber(i));
to write to stdout. You can also modify the field widths in the %s and %d specifiers if you want different spacing in your table, or pass a different file identifier as the first argument.
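For completeness, a self-contained sketch using the rows from the question; the variable names company, model, year and otherNumber are just placeholders matching the snippet above:
% Hypothetical data matching the table in the question
company     = {'Ford'; 'Chev'};
model       = {'Mustang'; 'Camaro'};
year        = [1975; 1976];
otherNumber = [35; 38];
% Print one formatted row per entry
for i = 1:numel(year)
    fprintf(1, '| %8s | %8s | %4d | %2d |\n', ...
        company{i}, model{i}, year(i), otherNumber(i));
end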