Loading text data in Octave with specific format - matlab

I have a data set that I would like to store and be able to load in Octave
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
14.0 8 440.0 215.0 4312. 8.5 70 1 "plymouth fury iii"
14.0 8 455.0 225.0 4425. 10.0 70 1 "pontiac catalina"
15.0 8 390.0 190.0 3850. 8.5 70 1 "amc ambassador dpl"
It does not work immediately when I try to use:
data = load('auto.txt')
Is there a way to load from a text files with the given format or do I need to convert it to e.g
18.0,8,307.0,130.0,3504.0,12.0,70,1
...
EDIT:
Deleting the last row and fixing the 'half' number e.g. 3504. -> 3504.0
and then used:
data = load('-ascii','autocleaned.txt');
Loaded the data as wanted in to a matrix in Octave.

load is usually meant for loading octave and matlab binary files but can be used for loading textual data like yours. You can load your data using the "-ascii" option but you'd have to reformat your file slightly before putting it into load even with the "-ascii" option enabled. Use a consistent column separator ie. just a tab or a comma, use full numbers not 3850. and don't use strings.
Then you can do something like this to get it to work
DATA = load("-ascii", "auto.txt");

If the final string field is removed from each line, the file can be read with:
filename='stack25148040_1.txt'
fid = fopen(filename, 'r');
[x, count] = fscanf(fid, '%f', [10, Inf])
endif
fclose(fid);
Alternatively the whole file could read in as one column and reshaped.
I haven't figured out how to read both the numeric fields and the string field. For that I've had to fall back on Python with more general purpose file reading tools.
Here is a Python script that reads the file, creates a numpy structured array, writes that to a .mat file, which Octave can then read:
import csv
import numpy as np
data=[]
with open('stack25148040.txt','rb') as f:
r = csv.reader(f, delimiter=' ')
# csv handles quoted strings with white space
for l in r:
# remove empty strings from the split on ' '
data.append([x for x in l if x])
print data[0]
for dd in data:
# convert 8 of the strings (per line) to float
dd[:]=[float(d) for d in dd[:8]]+dd[-1:]
data=data[:-1] # remove empty last line
print data[0]
print
# make a structured array, with numbers and a string
dt=np.dtype("f8,i4,f8,f8,f8,f8,i4,i4,|S25")
A=np.array([tuple(d) for d in data],dtype=dt)
print A
from scipy.io import savemat
savemat('stack25148040.mat',{'A':A})
In Octave this could read with
load stack25148040.mat
A
# A = 1x10 struct array containing the fields:
# f0 f1 ... f8
A.f8 # string field
A(1) # 1st row
# scalar structure containing the fields:
# f0 = 18
# f1 = 8
...
# f8 = chevrolet chevelle malibu
Newer Octave (3.8) has an importdata function. It handles the original data file without any extra arguments. It returns a structure with 2 fields
x.data is a (10,11) matrix. x.data(:,1:8) is the desire numerical data. x.data(:,9:11) is a mix of NA and random numbers. The NA stand in for the words at the end of the lines. x.textdata is a (24,1) cell with those words. The quoted string s could be reassembled from those words, using the NA and quotes to determine how many words belong to which line.
To read the numeric data it uses dlmread. Since the rest of importdata is written in Octave, it could be used as the starting point for a custom function that handles the string data properly.
dlmread ('stack25148040.txt')(:,1:8)
importread ('stack25148040.txt').data(:,1:8)
textread ('stack25148040.txt','')(:,1:8)

https://octave.org/doc/v4.0.0/Simple-File-I_002fO.html
Try this,
data = importdata('Auto.data')

Related

MATLAB / Octave - how to parse CSV file with numbers and strings that contain commas

I have a CSV file that has 20 columns. Some of the columns have number values, others have text values, and the text ones may or may not contain commas.
CSV content example:
column1, column2, column3, column4
"text value 1", 123, "text, with a comma", 25
"another, comma", 456, "other text", 78
I'm using textscan function, but I'm getting the most buggy and weird behavior. With some arguments, it reads all the values in only one column, sometimgs it repeats columns, and most of the things I've tried lead to the commas being incorrectly interpreted as column separators (despite text being enclosed in double quotes). That is, I've tried specifying 'delimiter' argument, and also including literals in the format specification, to no avail.
What's the correct way of invoking textscan to deal with a CSV file as the example above? I'm looking for a solution that runs both on MATLAB and on Octave (or, if that's not possible, the equivalent solution in each one).
For GNU Octave, using io package
pkg load io
c = csv2cell ("jota.csv")
gives
c =
{
[1,1] = column1
[2,1] = text value 1
[3,1] = another, comma
[1,2] = column2
[2,2] = 123
[3,2] = 456
[1,3] = column3
[2,3] = text, with a comma
[3,3] = other text
[1,4] = column4
[2,4] = 25
[3,4] = 78
}
btw, you should explicitly mention if the solution should run on GNU Octave, Matlab or both
First, read the column headers using the format '%s' four times:
fileID = fopen(filename);
C_text = textscan(fileID,'%s', 4,'Delimiter',',');
Then use the conversion specifier, %q to read the text enclosed by double quotation marks ("):
C = textscan(fileID,'%q %d %q %d','Delimiter',',');
fclose(fileID);
(This works for reading your sample data on Octave. It should work on MATLAB, too.)
Edit: removed redundant fopen.

MATLAB writing csv with mixed alphanumeric strings, scalar arrays, nans

Ok, coming from Python and never having used MATLAB before, it seems like it is unnecessarily hard to write data to a csv using MATLAB...
So my data looks like this:
col1 A2A B2 CC3 D5
asd189 123 33 71119 18291
as33d 1311 31 NaN 1011
asd189 NaN 44 79 191
It has N header columns that are made of alphanumeric strings.
It has a leftmost column of length M which is made of alphanumeric strings.
It has an (M-1) x (N-1) array of NUMERIC data, with possible NaNs.
Can you please provide code to write this to a csv? I cannot use the xlswrite function because I'm on a cluster without Excel installed. Really just want to get on with the actual data analysis. Thanks
You can only write matrices (not cell arrays) directly using csvwrite, and as you say you need Excel installed for xlswrite, so that leaves you with low level operations. You can see a walkthrough for writing to text files here, and code for your example below:
% Initialise example cell array
M = {'col1', 'A2A', 'B2', 'CC3', 'D5'
'asd189', 123, 33, 71119, 18291
'as33d', 1311, 31, NaN, 1011
'asd189', NaN, 44, 79, 191};
% Open a file for writing to (doesn't have to already exist, can specify full directory)
fID = fopen('test.csv','w');
% Write header line, formatted as strings with comma delimiter. Note \r\n for new line
fprintf(fID, [repmat('%s, ', 1, size(M,2)-1),'%s\r\n'], M{1,:});
% Loop through other rows
for row = 2:size(M,1)
% Write each line of cell array, with first column formatted as string
% and other columns formatted as floats
fprintf(fID, ['%s, ', repmat('%f, ', 1, size(M,2)-2),'%f\r\n'], M{row,:});
end
% Close file after writing
fclose(fID);
Result:
Use writetable. It makes writing to CSV (or to an Excel file, or to other text-delimited file formats) much easier than using csvwrite, or xlswrite, or low-level commands such as fprintf.
>> t = table({'asd189';'as33d';'asd189'},[123;1311;NaN],[33;31;44],[71119;NaN;79],[18291;1011;191]);
>> t.Properties.VariableNames = {'col1','A2A','B2','CC3','D5'}
t =
col1 A2A B2 CC3 D5
________ ____ __ _____ _____
'asd189' 123 33 71119 18291
'as33d' 1311 31 NaN 1011
'asd189' NaN 44 79 191
>> writetable(t,'myfile.csv')
If your data is currently not stored as a table (maybe it's in an array or cell array), it's pretty easy to convert to a table using utility functions such as array2table or cell2table. You will only pay a small time penalty for doing this.
PS - you don't need Excel to be installed in order to write to an Excel file. You may not be able to read them afterwards, but MATLAB can still write them. But it sounds like you'd prefer .csv anyway.

skip header in non-rectangular matrix

I have consecutive .dat files which I want to read and input into a single matrix by concatenating the files vertically. The code I have so far works fine for simple numeric files with only tabs as delimiter.
import=[];
data=[];
for i = 1:32
data1=[import dlmread(sprintf('%d.dat',i))];
data=vertcat(data, data1);
clear data1;
end
and I take the correct output into the data matrix. But my file format is as follows:
first second third
0 11/15 08:57:42.000 54 67 82
1 11/15 09:48:47.010 49 32 31
...
As you can see I have three delimiters (: \t /) and headers only in the last three columns which are essentially the ones I want to read, that is I want a matrix:
54 67 82
49 32 31
...
I tried specifying the delimiters into the dlmwrite and how many rows/columns to skip but an error occurs in sprintf ('delimiter = sprintf(delimiter); % Interpret \t (if necessary)'). Does anyone have any idea how to go about this?
UPDATE:
I managed to get a little further
data=[];
for i = 1:32
filename = sprintf( '%d.dat',i );
data1=importdata(filename);%creates a cell array
data2=cell2mat(data1(3:end,:));%converts it to char
%The data, without the header, start from the 3rd row.
data=vertcat(data, data2); %concatenate vertically all the files
clear data1; clear data2;
end
%the data
a1=str2num(data(1:end,20:25));%the first data column is in char 20-25
a2=str2num(data(1:end,30:35));%the second data column is in char 30-35
The thing is that the last part takes too much time, over an hour has passed until I manually stopped it. Does anyone know a simpler and faster way to do this?
I managed to solve this myself so I post it here for future reference:
for i = 1:32
filename = sprintf( '%d.dat',i );
data1 = dlmread(filename,'',2,3);%start from row 2, headercolumn 3
data=vertcat(data, data1);
clear data1;
end
Now the data matrix contains only my data columns and it runs in a few seconds.

Matlab: output format with fprintf

I want to have a list of data in a text file, and for that I use:
fprintf(fid, '%d %s %d\n',ii, names{ii},vals(ii));
the problem in my data, there are names that are longer than other. so I get results in this form:
1 XXY 5
2 NHDMUCY 44
3 LL 96
...
How i can change the fprintf line of code to make the results in this form:
1 XXY 5
2 NHDMUCY 44
3 LL 96
...
Something like this before the start of the loop -
%// extents of each vals string and the corresponding whitespace padding
lens0 = cellfun('length',cellfun(#(x) num2str(x),num2cell(1:numel(names)),'Uni',0))
pad_ws_col1 = max(lens0) - lens0
%// extents of each names string and the corresponding whitespace padding
lens1 = cellfun('length',names)
pad_ws_col2 = max(lens1) - lens1
Then, inside the loop -
fprintf(fid, '%d %s %s %s %d\n',col1(ii), repmat(' ',1,pad_ws_col1(ii)), ...
names{ii},repmat(' ',1,pad_ws_col2(ii)),vals(ii));
Output would be -
1 XXY 5
2 NHDMUCY 44
3 LL 96
For a range 99 - 101, it would be -
99 XXY 5
100 NHDMUCY 44
100 LL 96
Please note that the third column numerals start at a fixed distance instead of ending at a fixed distance from the start of each row as asked in the question. But, assuming that the whole idea of the question was to present the data in a more readable way, this could work for you.
You can use the function char to convert a cell array of string into a character array where all rows will be padded to be the length of the longest one.
So for you:
charNames = char( names ) ;
then you can use fprintf :
fprintf(fid, '%d %s %d\n',ii, charNames(ii,:) , vals(ii)) ;
Just make sure your cell array is a colum before you convert it to char.

Matlab - how to read 2 bytes at a time

I have a number like this - 778310098 - and I want to read 2 bytes at a time. So, I am expecting my output to be 77; 83; 10; 09; 8. I tried using the below:
uint16(fread(fileID,inf, 'ubit8')) and the output I get is the ASCII value of the individual numbers:
55
55
56
51
49
48
48
57
56
What do I need to do to get the desired output?
To read pairs of ASCII digits from a text file (we tend not to describe text files in byets, but in characters), use:
[10 1] * (fread(fileID,[2 inf], 'char') - 48)
To read bytes pairwise from a binary file, try
fread(fileID,inf, '*uint16')
One method is to convert it to a string, then process the string, then convert it back to an integer. While this may not be particularly elegant or perfect, will this do the trick?
a = 778310098;
b = num2str(a);
for i = 1:2:length(b)
if i == length(b) % to handle the case for odd input
split = str2num(b(i))
else
split = str2num(b(i:i+1)) % handle all others
end
end