MATLAB writing csv with mixed alphanumeric strings, scalar arrays, nans - matlab

Ok, coming from Python and never having used MATLAB before, it seems like it is unnecessarily hard to write data to a csv using MATLAB...
So my data looks like this:
col1 A2A B2 CC3 D5
asd189 123 33 71119 18291
as33d 1311 31 NaN 1011
asd189 NaN 44 79 191
It has N header columns that are made of alphanumeric strings.
It has a leftmost column of length M which is made of alphanumeric strings.
It has an (M-1) x (N-1) array of NUMERIC data, with possible NaNs.
Can you please provide code to write this to a csv? I cannot use the xlswrite function because I'm on a cluster without Excel installed. Really just want to get on with the actual data analysis. Thanks

You can only write matrices (not cell arrays) directly using csvwrite, and as you say you need Excel installed for xlswrite, so that leaves you with low level operations. You can see a walkthrough for writing to text files here, and code for your example below:
% Initialise example cell array
M = {'col1', 'A2A', 'B2', 'CC3', 'D5'
'asd189', 123, 33, 71119, 18291
'as33d', 1311, 31, NaN, 1011
'asd189', NaN, 44, 79, 191};
% Open a file for writing to (doesn't have to already exist, can specify full directory)
fID = fopen('test.csv','w');
% Write header line, formatted as strings with comma delimiter. Note \r\n for new line
fprintf(fID, [repmat('%s, ', 1, size(M,2)-1),'%s\r\n'], M{1,:});
% Loop through other rows
for row = 2:size(M,1)
% Write each line of cell array, with first column formatted as string
% and other columns formatted as floats
fprintf(fID, ['%s, ', repmat('%f, ', 1, size(M,2)-2),'%f\r\n'], M{row,:});
end
% Close file after writing
fclose(fID);

Use writetable. It makes writing to CSV (or to an Excel file, or to other text-delimited file formats) much easier than using csvwrite, or xlswrite, or low-level commands such as fprintf.
>> t = table({'asd189';'as33d';'asd189'},[123;1311;NaN],[33;31;44],[71119;NaN;79],[18291;1011;191]);
>> t.Properties.VariableNames = {'col1','A2A','B2','CC3','D5'}
t =
col1 A2A B2 CC3 D5
________ ____ __ _____ _____
'asd189' 123 33 71119 18291
'as33d' 1311 31 NaN 1011
'asd189' NaN 44 79 191
>> writetable(t,'myfile.csv')
If your data is currently not stored as a table (maybe it's in an array or cell array), it's pretty easy to convert to a table using utility functions such as array2table or cell2table. You will only pay a small time penalty for doing this.
PS - you don't need Excel to be installed in order to write to an Excel file. You may not be able to read them afterwards, but MATLAB can still write them. But it sounds like you'd prefer .csv anyway.
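As a rough sketch of the cell2table route mentioned above, assuming the data starts out as the cell array M from the fprintf answer (header row on top), the conversion could look something like this. The output name myfile.csv is simply the one used above, and as far as I know table, cell2table and writetable are MATLAB-only, so this is not expected to run in core Octave.

```matlab
% Cell array with a header row, as in the fprintf answer
M = {'col1',   'A2A', 'B2', 'CC3', 'D5'
     'asd189', 123,   33,   71119, 18291
     'as33d',  1311,  31,   NaN,   1011
     'asd189', NaN,   44,   79,    191};

% Rows 2..end are the data; row 1 supplies the variable (column) names
t = cell2table(M(2:end,:), 'VariableNames', M(1,:));

% Write the header line plus one comma-separated line per row
writetable(t, 'myfile.csv');
```

This only works when the header strings are valid MATLAB identifiers, which is the case for the names in the question.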

Related

MATLAB / Octave - how to parse CSV file with numbers and strings that contain commas

I have a CSV file that has 20 columns. Some of the columns have number values, others have text values, and the text ones may or may not contain commas.
CSV content example:
column1, column2, column3, column4
"text value 1", 123, "text, with a comma", 25
"another, comma", 456, "other text", 78
I'm using the textscan function, but I'm getting the most buggy and weird behavior. With some arguments, it reads all the values into only one column, sometimes it repeats columns, and most of the things I've tried lead to the commas being incorrectly interpreted as column separators (despite the text being enclosed in double quotes). That is, I've tried specifying the 'delimiter' argument, and also including literals in the format specification, to no avail.
What's the correct way of invoking textscan to deal with a CSV file as the example above? I'm looking for a solution that runs both on MATLAB and on Octave (or, if that's not possible, the equivalent solution in each one).
For GNU Octave, using io package
pkg load io
c = csv2cell ("jota.csv")
gives
c =
{
[1,1] = column1
[2,1] = text value 1
[3,1] = another, comma
[1,2] = column2
[2,2] = 123
[3,2] = 456
[1,3] = column3
[2,3] = text, with a comma
[3,3] = other text
[1,4] = column4
[2,4] = 25
[3,4] = 78
}
btw, you should explicitly mention if the solution should run on GNU Octave, Matlab or both
First, read the column headers using the format '%s' four times:
fileID = fopen(filename);
C_text = textscan(fileID,'%s', 4,'Delimiter',',');
Then use the conversion specifier, %q to read the text enclosed by double quotation marks ("):
C = textscan(fileID,'%q %d %q %d','Delimiter',',');
fclose(fileID);
(This works for reading your sample data on Octave. It should work on MATLAB, too.)
Edit: removed redundant fopen.

skip header in non-rectangular matrix

I have consecutive .dat files which I want to read and input into a single matrix by concatenating the files vertically. The code I have so far works fine for simple numeric files with only tabs as delimiter.
import=[];
data=[];
for i = 1:32
data1=[import dlmread(sprintf('%d.dat',i))];
data=vertcat(data, data1);
clear data1;
end
and I take the correct output into the data matrix. But my file format is as follows:
first second third
0 11/15 08:57:42.000 54 67 82
1 11/15 09:48:47.010 49 32 31
...
As you can see I have three delimiters (: \t /) and headers only in the last three columns which are essentially the ones I want to read, that is I want a matrix:
54 67 82
49 32 31
...
I tried passing the delimiters to dlmread, along with how many rows/columns to skip, but an error occurs in sprintf ('delimiter = sprintf(delimiter); % Interpret \t (if necessary)'). Does anyone have any idea how to go about this?
UPDATE:
I managed to get a little further
data=[];
for i = 1:32
filename = sprintf( '%d.dat',i );
data1=importdata(filename);%creates a cell array
data2=cell2mat(data1(3:end,:));%converts it to char
%The data, without the header, start from the 3rd row.
data=vertcat(data, data2); %concatenate vertically all the files
clear data1; clear data2;
end
%the data
a1=str2num(data(1:end,20:25));%the first data column is in char 20-25
a2=str2num(data(1:end,30:35));%the second data column is in char 30-35
The thing is that the last part takes too much time, over an hour has passed until I manually stopped it. Does anyone know a simpler and faster way to do this?
I managed to solve this myself, so I'm posting it here for future reference:
data = [];
for i = 1:32
    filename = sprintf('%d.dat', i);
    % dlmread offsets are zero-based: skip the first 2 rows and the first 3 columns
    data1 = dlmread(filename, '', 2, 3);
    data = vertcat(data, data1);
    clear data1;
end
Now the data matrix contains only my data columns and it runs in a few seconds.
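To illustrate the zero-based row and column offsets that dlmread takes, here is a small self-contained sketch. The temporary file and its contents are made up for the example and are not the asker's data.

```matlab
% Write a tiny space-delimited file with one header row (hypothetical data)
fname = [tempname() '.dat'];
fid = fopen(fname, 'w');
fprintf(fid, 'h1 h2 h3 h4\n');
fprintf(fid, '1 2 3 4\n');
fprintf(fid, '5 6 7 8\n');
fclose(fid);

% Offsets are zero-based: R = 1 skips the header row, C = 2 skips two columns
A = dlmread(fname, ' ', 1, 2);

delete(fname);
disp(A);
```

With these offsets only the last two columns of the two data rows should remain in A.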

Loading text data in Octave with specific format

I have a data set that I would like to store and be able to load in Octave
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
14.0 8 440.0 215.0 4312. 8.5 70 1 "plymouth fury iii"
14.0 8 455.0 225.0 4425. 10.0 70 1 "pontiac catalina"
15.0 8 390.0 190.0 3850. 8.5 70 1 "amc ambassador dpl"
It does not work immediately when I try to use:
data = load('auto.txt')
Is there a way to load from a text files with the given format or do I need to convert it to e.g
18.0,8,307.0,130.0,3504.0,12.0,70,1
...
EDIT:
Deleting the last row and fixing the 'half' number e.g. 3504. -> 3504.0
and then used:
data = load('-ascii','autocleaned.txt');
Loaded the data as wanted in to a matrix in Octave.
load is usually meant for loading Octave and MATLAB binary files, but it can also be used for textual data like yours. You can load your data with the "-ascii" option, but even then you have to reformat the file slightly first: use a consistent column separator (i.e. just a tab or a comma), write full numbers rather than 3850., and don't include strings.
Then you can do something like this to get it to work
DATA = load("-ascii", "auto.txt");
If the final string field is removed from each line, the file can be read with:
filename = 'stack25148040_1.txt';
fid = fopen(filename, 'r');
% fscanf fills its output column by column, so read 8 values per column and transpose
[x, count] = fscanf(fid, '%f', [8, Inf]);
x = x';
fclose(fid);
Alternatively, the whole file could be read in as one column and reshaped.
I haven't figured out how to read both the numeric fields and the string field. For that I've had to fall back on Python with more general purpose file reading tools.
Here is a Python 3 script that reads the file, creates a numpy structured array, and writes that to a .mat file, which Octave can then read:
import csv
import numpy as np
from scipy.io import savemat

data = []
with open('stack25148040.txt', newline='') as f:
    # csv handles quoted strings with white space
    r = csv.reader(f, delimiter=' ')
    for l in r:
        # remove empty strings from the split on ' '
        data.append([x for x in l if x])
data = [dd for dd in data if dd]  # drop empty lines (e.g. a blank last line)
print(data[0])
for dd in data:
    # convert 8 of the strings (per line) to float
    dd[:] = [float(d) for d in dd[:8]] + dd[-1:]
print(data[0])
print()
# make a structured array, with numbers and a string
dt = np.dtype("f8,i4,f8,f8,f8,f8,i4,i4,S25")
A = np.array([tuple(d) for d in data], dtype=dt)
print(A)
savemat('stack25148040.mat', {'A': A})
In Octave this can be read with
load stack25148040.mat
A
# A = 1x10 struct array containing the fields:
# f0 f1 ... f8
A.f8 # string field
A(1) # 1st row
# scalar structure containing the fields:
# f0 = 18
# f1 = 8
...
# f8 = chevrolet chevelle malibu
Newer Octave (3.8) has an importdata function. It handles the original data file without any extra arguments. It returns a structure with 2 fields
x.data is a (10,11) matrix. x.data(:,1:8) is the desired numerical data. x.data(:,9:11) is a mix of NA and random numbers. The NA values stand in for the words at the end of the lines. x.textdata is a (24,1) cell with those words. The quoted strings could be reassembled from those words, using the NA values and quotes to determine how many words belong to which line.
To read the numeric data it uses dlmread. Since the rest of importdata is written in Octave, it could be used as the starting point for a custom function that handles the string data properly.
dlmread ('stack25148040.txt')(:,1:8)
importdata ('stack25148040.txt').data(:,1:8)
textread ('stack25148040.txt','')(:,1:8)
https://octave.org/doc/v4.0.0/Simple-File-I_002fO.html
Try this,
data = importdata('Auto.data')
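For reading the numeric fields and the quoted string field together, one possible sketch uses textscan with the %q specifier for the double-quoted text. To stay self-contained it parses two sample lines from a string rather than a file; with a file you would pass the identifier returned by fopen instead. This assumes a textscan that supports %q, which recent MATLAB releases do and which I believe newer Octave versions do as well.

```matlab
% Two sample lines from the question, joined into one string
txt = sprintf('%s\n%s\n', ...
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"', ...
    '15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"');

% Eight numeric fields, then one double-quoted text field
C = textscan(txt, '%f %f %f %f %f %f %f %f %q');

nums  = [C{1:8}];   % numeric matrix, one row per input line
names = C{9};       % cell array holding the quoted text of each line

disp(nums);
disp(names);
```

The numeric part ends up in an ordinary matrix, while the names stay in a cell array because they differ in length.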

Parse a PC-Axis (.px) file in Matlab

Background: PC-Axis is a file format used for dissemination of statistical information. The format is used by a number of national statistical organisations to disseminate official statistics.
A PC-Axis file looks a little like this, although they're usually a lot longer:
CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
More details about PC-Axis files can be found at the Statistics Sweden website, but the basic gist is that the metadata is positioned at the top of the file and after "DATA=" is the actual data itself. It's also worth noting that the data is organized more like a data-table rather than in columns.
The Problem: I'd like to parse a PC-Axis file using Matlab, but I'm a little stumped as to how to go about doing it. Does anyone know how to parse one of these files in Matlab? Would it be easier to parse this type of file using some other language, like Perl, and then import the data into Matlab, or, would Matlab be a suitable enough tool for the job? Note that the plan would be to analyze the data in Matlab after the text processing stage.
I've tried using Matlab's text processing tools such as fgetl, textscan, fscanf, and a few others, but it's terribly tricky. Does anyone have any pointers on how to go about doing it?
Essentially, I'd like to store each of the keywords (CHARSET, MATRIX, etc.) and their corresponding values (ANSI, BE001, etc.) as metadata in Matlab - as a structure, perhaps. I'd like to have the data stored in Matlab also - as a matrix, for example.
Note: I'm aware of the pxR package (CRAN) in R, which works a treat for reading .px files into the workspace as a data.frame object. There's also a Perl module called Data::PcAxis (CPAN) that is also very good, but I'm specifically wanting to know how to parse a .px file using Matlab.
UPDATE: I should have mentioned that in addition to metadata and data, there are also variables. This is best explained by an example. The example PC-Axis file below is the same as the one above except I've added two variables. They're named VALUES("Month") and VALUES("region") and are positioned after the metadata and before the data.
CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
VALUES("Month")="1976M01","1976M02","1976M03","1976M04",
"1976M05","1976M06","1976M07","1976M08",
"1976M09","1976M10","1976M11","1976M12";
VALUES("region")="Sweden","Germany","France",
"Ireland","Finland";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
Textscan works a treat when reading in each line of the text file as a string (in a cell array). However, the elements after the "=" sign for both of the variables (i.e. VALUES("Month") and VALUES("region")) span more than one line. It seems that using textscan in this case means that some strings would have to be concatenated, say, for example, in order to collect the list of months (1976M01 to 1976M12).
Question: What's the best way to collect the variables data? Read the text file as a single string and then use strtok twice to extract the substring of dates? Perhaps, there's a better (more systematic) way?
Usually textscan and regexp is the way to go when parsing string fields (as shown here):
Read the input lines as strings with textscan:
fid = fopen('input.px', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Parse the header field names and values using regexp. Picking the right regular expression should do the trick!
X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
X = [X{:}]; %// Flatten the cell array
X = reshape([X{:}], 2, []); %// Reshape into name-value pairs
The "VALUES" fields may span multiple lines, so they need to be concatenated first:
idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
Y = arrayfun(@(m, n){[C{:}{m:m + n - 1}]}, ...
idx_values(idx_values < idx_data), diff([idx_values; idx_data]));
... and then tokenized:
Y = regexp(Y, '"([^,"]+)"', 'tokens'); %// Tokenize values
Y = cellfun(@(x){{x{1}{1}, {[x{2:end}]}}}, Y); %// Group values in one array
Y = reshape([Y{:}], 2, []); %// Reshape into name-value pairs
Make sure the field names are legal (I've decided to convert everything to lowercase and replace hyphens and any whitespace with underscores), and plug them into a struct:
X = [X, Y]; %// Store all fields in one array
X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_'));
S = struct(X{:});
Here's what I get for your input file (only the header fields):
S =
charset: 'ANSI'
matrix: 'BE001'
subject_code: 'BE'
subject_area: 'Population'
title: 'Population by region, time, marital status and sex.'
month: {1x12 cell}
region: {1x5 cell}
As for the data itself, it needs to be handled separately:
Extract data lines after the "Data" field and replace all ".." values with default values (say, NaN):
D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
Obviously this assumes that there is only numerical data after the "Data" field. However, this can easily be modified if that is not the case.
Convert the data to a numerical matrix and add it to the structure:
D = cellfun(@str2num, D, 'UniformOutput', false);
S.data = vertcat(D{:})
And here's S.data for your input file:
S.data =
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN 24.80000 34.20000 52.00000 23.00000
NaN 32.10000 40.30000 50.70000 1.00000
NaN 31.60000 35.00000 49.10000 2.30000
41.20000 43.00000 50.80000 60.10000 0.00000
50.90000 52.00000 53.90000 65.90000 0.00000
Hope this helps!
I'm not personally familiar with PC-Axis files, but here are my thoughts.
Parse the header first. If the header is of fixed size, you can read in that many lines and parse out the values you want. The regexp method may be useful for this.
The data appear to be both string and numeric. I would change the ".." values to NaN (make an original backup first, of course), and then scan the matrix using textscan. Textscan can be tricky, so make sure the file parses completely. If textscan encounters a line that does not match the format string, it will stop parsing. You can check the position of the file handle (using ftell) to see if it matches the end of the file (you can fseek to the end of the file to find what that value should be). The length of the cell arrays returned by textscan should all be the same. If not, the length will tell you what line they failed on - you can check this line with a text editor to see what violated the format.
You can assign and access fields in Matlab structs using string arguments. For example:
foo.('a') = 1;
foo.a
ans =
1
So, the workflow I suggest is to parse the header lines, assigning each attribute/value pair as field/value pairs in struct. Then parse the matrix (after some brief text preprocessing to make sure all the data are numeric).
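The header half of that workflow can be sketched roughly as follows, using dynamic field names as shown above. The two header lines are taken from the question; the regular expression and the rule for cleaning up field names are my own assumptions and only cover simple single-line KEY="value"; entries.

```matlab
% Two single-line header entries from the example file
lines = {'MATRIX="BE001";', 'SUBJECT-CODE="BE";'};

S = struct();
for k = 1:numel(lines)
    % Capture the keyword before '=' and the quoted value after it
    tok = regexp(lines{k}, '^\s*([^=]+?)\s*=\s*"([^"]*)"\s*;', 'tokens', 'once');
    if isempty(tok)
        continue;   % not a simple KEY="value"; line, skip it
    end
    % Make a legal field name: lowercase, other characters become underscores
    name = lower(regexprep(tok{1}, '[^A-Za-z0-9]+', '_'));
    S.(name) = tok{2};
end

disp(S);
```

Multi-line entries such as VALUES("Month") would need to be joined into one string before a pattern like this could handle them.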

writing binary values into file in MATLAB

I am converting an image file to binary so that it can be processed in VHDL. When I write the resulting binary matrix to a text file, unwanted commas appear between the digits. I want a column vector with one binary number per line; my input is a square matrix. Everything works except for the commas. I am using dummy values here. Can anyone give any suggestions?
a=[1 3;6 9];
b=dec2bin(a');
fName = 'output.txt';
fid = fopen('output.txt','w');
dlmwrite(fName, b);
but the output file was like this
0,0,0,1
0,0,1,1
0,1,1,0
1,0,0,1
i was expecting
0001
0011
0110
1001
You have to pass '' as the delimiter. Ref: dlmwrite usage
dlmwrite('output.txt', b, '')
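If you would rather not rely on dlmwrite, a possible alternative is to write each row of the char matrix yourself with fprintf. This is only a sketch using the dummy values and file name from the question.

```matlab
a = [1 3; 6 9];
b = dec2bin(a');          % char matrix, one binary string per row

rows = cellstr(b);        % one cell per row of b

fid = fopen('output.txt', 'w');
% The format is reused for every cell, so each string gets its own line
fprintf(fid, '%s\n', rows{:});
fclose(fid);
```

Closing the file with fclose matters here; the question's code opens output.txt with fopen but never closes it.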