Converting a comma-separated file to a MATLAB matrix - matlab

I have a comma-separated file in the format:
Col1Name,Col1Val1,Col1Val2,Col1Val3,...Col1ValN,Col2Name,Col2Val1,...Col2ValN,...,ColMName,ColMVal1,...,ColMValN
My question is: how can I convert this file into something MATLAB can treat as a matrix, and how would I go about using this matrix in a file? I suppose I could use some scripting language to format the file into MATLAB matrix format and copy it, but the file is rather large (~7 MB).
Thanks!
Sorry for the edit:
The file format is:
Col1Name;Col2Name;Col3Name;...;ColNName
Col1Val1;Col2Val2;Col3Val3;...;ColNVal1
...
Col1ValM;Col2ValM;Col3ValM;...;ColNValM
Here is some actual data:
Press;Temp.;CondF;Cond20;O2%;O2ppm;pH;NO3;Chl(a);PhycoEr;PhycoCy;PAR;DATE;TIME;excel.date;date.time
0.96;20.011;432.1;431.9;125.1;11.34;8.999;134;9.2;2.53;1.85;16.302;08.06.2011;12:01:52;40702;40702.0.5
1;20.011;433;432.8;125;11.34;9;133.7;8.19;3.32;2.02;17.06;08.06.2011;12:01:54;40702;40702.0.5
1.1;20.012;432.7;432.4;125.1;11.34;9;133.8;8.35;2.13;2.2;19.007;08.06.2011;12:01:55;40702;40702.0.5
1.2;20.012;432.8;432.5;125.2;11.35;9.001;133.8;8.45;2.95;1.95;21.054;08.06.2011;12:01:56;40702;40702.0.5
1.3;20.012;432.7;432.4;125.4;11.37;9.002;133.7;8.62;3.17;1.87;22.934;08.06.2011;12:01:57;40702;40702.0.5
1.4;20.007;432.1;431.9;125.2;11.35;9.003;133.7;9.48;4.17;1.6;24.828;08.06.2011;12:01:58;40702;40702.0.5
1.5;19.997;432.3;432.2;124.9;11.33;9.003;133.8;8.5;3.84;1.79;27.327;08.06.2011;12:01:59;40702;40702.0.5
1.6;20;432.8;432.6;124.5;11.29;9.003;133.6;8.57;3.22;1.86;30.259;08.06.2011;12:02:00;40702;40702.0.5
1.7;19.99;431.9;431.9;124.4;11.28;9.002;133.6;8.79;3.7;1.81;35.152;08.06.2011;12:02:02;40702;40702.0.5
1.8;19.994;432.1;432.1;124.4;11.28;9.002;133.6;8.58;3.41;1.84;39.098;08.06.2011;12:02:03;40702;40702.0.5
1.9;19.993;433;432.9;124.6;11.3;9.002;133.6;8.59;3.45;5.53;45.488;08.06.2011;12:02:04;40702;40702.0.5
2;19.994;433;432.9;124.8;11.32;9.002;133.5;8.6;2.76;1.99;50.646;08.06.2011;12:02:05;40702;40702.0.5

If you don't know the number of rows and columns up front, you can't use the previous solution. Use this instead.
7 MB is not large, it is small. This is the 21st century.
To read it into a MATLAB matrix:
text = fileread('file.name'); % a string with the entire file contents in it. 7 MB is no big deal.
NAMES = {}; % we'll record column names here
VALUES = []; % this will be the matrix of values
while text(end) == ','
    text(end) = []; % eliminate any trailing commas
end
commas = find(text==','); % index all the commas
commas = [0; commas(:); length(text)+1]; % put fake commas before and after the text to simplify the loop
col = 0; % which column we are in
row = 1; % which row we are in
I = 1;
while I < length(commas)
    txt = text(commas(I)+1:commas(I+1)-1);
    I = I + 1;
    num = str2double(txt);
    if isnan(num) % this means it must be a column name
        NAMES{end+1,1} = txt;
        col = col + 1; % can you believe Matlab doesn't support col++ ???
        row = 1; % back to the top at each new column
        continue % we have dealt with this txt; it's not a num, so ... next
    end
    % if we made it here we have a number
    VALUES(row,col) = num;
    row = row + 1; % move down one row in the current column
end
Then you can save your matrix VALUES (and, if you want them, the header names NAMES) to a MATLAB-format file:
save('mymatrix.mat','VALUES','NAMES'); % saves matrix and column names to .mat file
You get them back into MATLAB whenever you want with:
load mymatrix.mat; % loads VALUES and NAMES from .mat file
Some limitations:
You can't use commas in your column header names.
You cannot "name" a column something like "898.2" or anything which can be read as a double number, it will be read in as a number.
If your columns have different lengths, the shorter ones will be padded with zeros to the length of the longest column.
That's all I can think of.
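For the semicolon-delimited layout shown in the edited question (one header row followed by numeric, date and time columns), a simpler route is readtable. This is only a hedged sketch, not part of the answer above: the file name 'probe.csv' and the assumption that the first twelve columns (Press through PAR) are the numeric ones are mine.
% Minimal sketch, assuming a semicolon-delimited file with one header row.
T = readtable('probe.csv', 'Delimiter', ';', 'ReadVariableNames', true);
NAMES  = T.Properties.VariableNames; % column headers (special characters get sanitized)
VALUES = table2array(T(:, 1:12));    % numeric columns, Press ... PAR
% DATE and TIME come in as text; convert them separately if needed, e.g.:
% dt = datetime(strcat(T.DATE, {' '}, T.TIME), 'InputFormat', 'dd.MM.yyyy HH:mm:ss');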

Related

Optimizing reading the data in Matlab

I have a large data file with text formatted as a single column of n rows. Each row is either a real number or the string "No Data". I have imported this text as an n-by-1 cell array named Data. Now I want to filter the data and create an n-by-1 array out of it, with NaN values instead of "No Data". I have managed to do it using a simple loop (see below); the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
    if Data{i}(1) ~= 'N'
        z(i) = str2double(Data{i});
    else
        z(i) = NaN;
    end
end
Is there a way to optimize it?
Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable call (no iterations, no sanitization, no conversion, etc.):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break since you are working with a single column... this prevents No data from producing two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
If you run this, you can find out which approach is faster. It creates an 11 MB text file and reads it with the various approaches.
filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val < 0.01
        fwrite(fid,sprintf('%s\n','No Data'));
    else
        fwrite(fid,sprintf('%f\n',val*1000));
    end
end
fclose(fid);
%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc
%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc
%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z=z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.
txt2mat is slow; even without converting the resulting string matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some of the internal sanity checks and does not convert the resulting data to a table.
Depending on how big your files are and how often you read such files, you might want to go beyond readtable, which can be quite slow.
EDIT: After testing, with a file this simple the method below provides no advantage. The method was developed to read RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.
The most efficient way I've found is to read the whole file as a char matrix; then you can easily find your "No data" lines. And if your real numbers are formatted with fixed width, you can transform them from char into numbers much more efficiently than with str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
% Read all the content of a text file to a matrix as wide as the longest
% line on the file. Shorter lines are padded with blank spaces. New lines
% are not included in the output.
% New lines are identified by new line \n characters.
% Reading the whole file in a string
fid=fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);
% Finding new lines positions
newLines= fileData==sprintf('\n');
linesEndPos=find(newLines)-1;
% Calculating number of lines
nLines=length(linesEndPos);
% Calculating the width (number of characters) of each line
linesWidth=diff([-1; linesEndPos])-1;
% Number of characters per row including new lines
charsPerRow=max(linesWidth)+1;
% Initializing output var with blank spaces
txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');
% Computing a logical index to all characters of the input string to
% their final positions
charIdx=false(charsPerRow,nLines);
% Indexes of all new lines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd)=true;
charIdx=cumsum(charIdx)==0;
% Filling output matrix
txtMat(charIdx)=fileData(~newLines);
% Cropping the last row corresponding to newline characters and transposing
txtMat=txtMat(1:end-1,:)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData=txtMat(:,1)=='N';
And if your number fields have fixed width, you can transform them to numbers far more efficiently than with str2num, with something like
values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);
where I've assumed the numbers have 7 digits and two decimal places, but you can easily adapt it to your case.
And to finish you need to set the NaN values with:
values(NoData)=NaN;
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, this is WAY faster. And if you don't have fixed-width numbers you can still do it this way by adding a couple of lines to count the number of digits and find the position of the decimal point before doing the conversion, but that will slow things down a little. However, I think it will still be faster.
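As a hedged illustration of that last point (my own sketch, not from the answer above): the weight vector used in the conversion can be built programmatically for any fixed-width layout, where nInt and nDec are assumed counts of integer and decimal digits.
% Minimal sketch, assuming fields of the form IIIIIII.DD starting at column 1 of txtMat.
nInt = 7; % assumed number of integer digits
nDec = 2; % assumed number of decimal digits
weights = [10.^(nInt-1:-1:0), 0, 10.^(-1:-1:-nDec)]'; % the 0 skips the '.' character
cols = 1:(nInt + 1 + nDec); % character columns holding the number
values = (txtMat(:,cols) - '0') * weights;
values(NoData) = NaN; % as above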

Read a text file and separate the data into different columns and different tables in MATLAB

I have a huge text file that needs to be read and processed in MATLAB. At some points this file contains text to indicate that a new data series has started.
I have searched here but can't find any simple solution.
So what I want to do is read the data in the file, put the data into a table with three columns, and when it finds text it should create a new table. It should repeat this process until the entire document is scanned.
This is what the document looks like:
time V(A,B) I(R1)
Step Information: X=1 (Run: 1/11)
0.000000000000000e+000 -2.680148e-016 0.000000e+00
9.843925313007988e-012 -4.753470e-006 2.216314e-011
1.000052605772457e-011 -4.835427e-006 2.552497e-011
1.031372754715773e-011 -4.999340e-006 -3.042096e-012
1.094013052602406e-011 -5.327165e-006 -1.206968e-011
Step Information: X=1 (Run: 2/11)
0.000000000000000e+000 -2.680148e-016 0.000000e+000
9.843925313007988e-012 -4.753470e-006 2.216314e-011
1.000052605772457e-011 -4.835427e-006 2.552497e-011
1.031372754715773e-011 -4.999340e-006 -3.042096e-012
1.094013052602406e-011 -5.327165e-006 -1.206968e-011
A rather crude approach is to read the file line by line and check if the line consists of three numbers. If it does, then append this to a temporary matrix. When you finally get to a line that doesn't contain three numbers, append this matrix as an element in a cell array, clear the temporary matrix and continue.
Something like this would work, assuming that the file is stored in 'file.txt':
%// Open the file
f = fopen('file.txt', 'r');
%// Initialize empty cell array
data = {};
%// Initialize temporary matrix
temp = [];
%// Loop over the file...
while true
    %// Get a line from the file
    line = fgetl(f);
    %// If we reach the end of the file, get out
    if line == -1
        %// Last check before we break
        %// Check if the temporary matrix isn't empty and add
        if ~isempty(temp)
            data = [data; temp];
        end
        break;
    end
    %// Else, check to see if this line contains three numbers
    numbers = textscan(line, '%f %f %f');
    %// If this line doesn't consist of three numbers...
    if all(cellfun(@isempty, numbers))
        %// If the temporary matrix is empty, skip
        if isempty(temp)
            continue;
        end
        %// Concatenate to cell array
        data = [data; temp];
        %// Reset temporary matrix
        temp = [];
    %// If it does, then create a row vector and concatenate
    else
        temp = [temp; numbers{:}];
    end
end
%// Close the file
fclose(f);
The code is pretty self-explanatory, but let's go through it to be sure you know what's going on. First, open the file with fopen to get a "pointer" to the file, then initialize the cell array that will contain our matrices, as well as the temporary matrix used when reading in matrices between header lines. After that, we simply loop over each line of the file, grabbing a line with fgetl using the file pointer we created. We then check whether we have reached the end of the file and, if we have, check whether the temporary matrix has any numerical data in it. If it does, we add it to our cell array and finally get out of the loop. We use fclose to close the file and clean things up.
Now the heart of the operation is what follows after this check. We use textscan and search for three numbers separated by spaces. That's done with the '%f %f %f' format specifier. This should give you a cell array of three elements if you are successful with numbers. If this is correct, then convert this cell array of elements into a row of numbers and concatenate this into the temporary matrix. Doing temp = [temp; numbers{:}]; facilitates this concatenation. Simply put I piece together each number and concatenate them horizontally to create a single row of numbers. I then take this row and concatenate this as another row in the temporary matrix.
Should we finally get to a line where it's all text, this will give you all three elements in the cell array found by textscan to be empty. That's the purpose of the all and cellfun call. We search each element in the cell and see if it's empty. If every element is empty, this is a line that is text. If this situation arises, simply take the temporary matrix and add this as a new entry into your cell array. You'd then reset the temporary matrix and start the logic over again.
However, we also have to take into account that there may be multiple lines that consist of text. That's what the additional if statement is for inside the first if block using all. If we have an additional line of text that precedes a previous line of text, the temporary matrix of values should still be empty and so you should check to see if that is empty before you try and concatenate the temporary matrix. If it's empty, don't bother and just continue.
After running this code, I get the following for my data matrix:
>> format long g
>> celldisp(data)
data{1} =
0 -2.680148e-16 0
9.84392531300799e-12 -4.75347e-06 2.216314e-11
1.00005260577246e-11 -4.835427e-06 2.552497e-11
1.03137275471577e-11 -4.99934e-06 -3.042096e-12
1.09401305260241e-11 -5.327165e-06 -1.206968e-11
data{2} =
0 -2.680148e-16 0
9.84392531300799e-12 -4.75347e-06 2.216314e-11
1.00005260577246e-11 -4.835427e-06 2.552497e-11
1.03137275471577e-11 -4.99934e-06 -3.042096e-12
1.09401305260241e-11 -5.327165e-06 -1.206968e-11
To access a particular "table", do data{ii} where ii is the table you want to access that was read in from top to bottom in your text file.
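For instance (a usage sketch of my own, not from the answer above), to pull out the second run and plot it:
run2 = data{2};             % 5-by-3 matrix for "Run: 2/11"
plot(run2(:,1), run2(:,2)); % V(A,B) versus time for that run
xlabel('time'); ylabel('V(A,B)');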
The most versatile way is to read line by line using textscan. If you want to speed this process up, you can do a dummy read first:
i.e. you loop through all the lines without storing the data and decide which lines are text lines and which are numbers, recording a quick count of data lines for each block.
You then have enough information about the data to run through the arrays quickly. This will massively speed up the time it takes to store the data in your new arrays.
Your second loop is the one that actually reads the data into the array(s). You should now know which lines to skip. You can also pre-allocate the arrays within the data cell if you wish.
fid = fopen('file.txt','r');
data = {};
nlines = [];
% now start the loop
k = 0; % counter for data sets
while ~feof(fid)
    line = fgetl(fid);
    % check if it is data or text (data lines contain only digits, signs,
    % decimal points, exponents and spaces)
    if all(ismember(line,' 0123456789+-.eE')) % is it data
        nlines(k) = nlines(k)+1;
    else % is it text
        k = k+1;
        nlines(k) = 0;
    end
end
frewind(fid); % go back to start of file
% You could preallocate the data array here if you wished
% now get the data
for aa = 1 : length(nlines)
    textscan(fid,'%s',1,'Delimiter','\n'); % skip the text line for this block
    if nlines(aa)==0
        continue
    end
    data{aa} = textscan(fid,'%f%f%f',nlines(aa));
end
fclose(fid);

Matlab Clipboard Precision - Format long

I need to copy and paste several matrices from MATLAB to Excel, so I did my research and found a really nice script called num2clip that puts the selected array on the clipboard.
The only problem is that the number format is short, when I would like it to be long.
I suspect the "double" type used in the script, but I'm still new to MATLAB so I have some significant gaps.
Here is the script I found; what do I have to do, in your opinion, in order to keep the long format?
function arraystring = num2clip(array)
%NUM2CLIP copies a numerical-array to the clipboard
%
% ARRAYSTRING = NUM2CLIP(ARRAY)
%
% Copies the numerical array ARRAY to the clipboard as a tab-separated
% string. This format is suitable for direct pasting to Excel and other
% programs.
%
% The tab-separated result is returned as ARRAYSTRING. This
% functionality has been included for completeness.
%
%Author: Grigor Browning
%Last update: 02-Sept-2005
%convert the numerical array to a string array
%note that num2str pads the output array with space characters to account
%for differing numbers of digits in each index entry
arraystring = num2str(array);
%add a newline (char(10)) to the end of each row
arraystring(:,end+1) = char(10);
%reshape the array to a single line
%note that reshape is column-based, so to reshape by rows one must use
%the transpose of the matrix
arraystring = reshape(arraystring',1,prod(size(arraystring)));
%create a copy of arraystring shifted right by one space character
arraystringshift = [' ',arraystring];
%add a space to the end of arraystring to make it the same length as
%arraystringshift
arraystring = [arraystring,' '];
%now remove the additional space characters - keeping a single space
%character after each 'numerical' entry
arraystring = arraystring((double(arraystring)~=32 |...
double(arraystringshift)~=32) &...
~(double(arraystringshift==10) &...
double(arraystring)==32) );
%convert the space characters to tab characters
arraystring(double(arraystring)==32) = char(9);
format long e
%copy the result to the clipboard ready for pasting
clipboard('copy',arraystring);
Best regards.
Just replace the line:
arraystring = num2str(array) ;
with a line like this:
arraystring = num2str(array,'%15.15f') ;
This will give you the maximum precision you can reach with the double type (15 digits).
Look at the num2str documentation for more custom format.
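As a quick illustration (my own example; the exact printed digits may differ slightly by release):
num2str(pi)           % '3.1416'            default short-style output
num2str(pi,'%15.15f') % '3.141592653589793' full double precision
num2str(pi,15)        % '3.14159265358979'  15 significant digits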
Thank you Hoki for your help.
I didn't have the time to go through all the documentation, which is great by the way.
When I tried your solution, the copied data was all inserted into one cell; I just had to change:
arraystring = num2str(array,'%15.15f') ;
to
arraystring = num2str(array,15) ;
Have a nice day !

Writing .mat files to .nc

I have code that creates a bunch of .mat files, but I want to save them as netCDF files (csv or txt would be fine as well) so people who can't use MATLAB can access them. This is what I have so far:
%% Use function to read in
data = read_mixed_csv(filename,'"'); % Creates cell array of data
data = regexprep(data, '^"|"$',''); % Gets rid of double quotes at the start and end of the string
data = data(:,2:2:41); % Keep only the even cells because the odd ones are just commas
%% Sort data based on date (Column 1)
[Y,I] = sort(data(:,1)); % Create 1st column sorted
site_sorted = data(I,:); % Sort the entire array
%% Find unique value in the site data (Column 2)
% Format of site code is state-county-site
u_id = unique(site_sorted(:,2)); % get unique id
for i = 1:length(u_id)
    idx = ismember(site_sorted(:,2),u_id{i}); % extract index where the second column matches the current id value
    site_data = site_sorted(idx,:);
    save([u_id{i} '.mat'],'site_data');
    cdfwrite([u_id{i} '.nc'], 'site_data');
end
Everything works until the second-to-last line. I want to write each site_data as a netCDF file with the same name used in save([u_id{i} '.mat'],'site_data'), i.e. the string from the second column.
Try
cdfwrite([u_id{i}],{'site_data',site_data})
The extension will be '.cdf'. I am not sure if this can be changed while using cdfwrite.
Edit: Corrected Typo
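Since the question says csv or txt would be fine as well, here is a hedged alternative sketch (not part of the answer above) that writes each group to a plain CSV file. It assumes site_data is a cell array of strings, as in the question, and uses writecell, which needs R2019a or later; on older releases you would loop with fprintf instead.
for i = 1:length(u_id)
    idx = ismember(site_sorted(:,2), u_id{i});
    site_data = site_sorted(idx,:);
    writecell(site_data, [u_id{i} '.csv']); % readable without MATLAB
end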

How to import a sequence of Excel Files in matlab as a column vectors or as a cell array?

I want to import a sequence of Excel files with a large amount of data in them. The problem I have is that I want to process the data in each file one at a time and store the output in a variable, but each time I process a different file the variable gets overwritten in the workspace. Is there any way I could store these files and process each one at a time?
numFiles = 1;
range = 'A2:Q21';
sheet = 1;
myData = cell(1,numFiles); % Importing data from Excel
for fileNum = 1:numFiles
    fileName = sprintf('myfile%02d.xlsx',fileNum);
    myData{fileNum} = importfile3(fileName,sheet,range);
end
data = cell2mat(myData);
The actual data import is performed by importfile3 which is, for the most part, a wrapper for the xlsread function that returns a matrix corresponding to the specified range of excel data.
function data = importfile3(workbookFile, sheetName, range)
% If no sheet is specified, read first sheet
if nargin == 1 || isempty(sheetName)
    sheetName = 1;
end
% If no range is specified, read all data
if nargin <= 2 || isempty(range)
    range = '';
end
%% Import the data
[~, ~, raw] = xlsread(workbookFile, sheetName, range);
%% Replace non-numeric cells with 0.0
R = cellfun(@(x) ~isnumeric(x) || isnan(x),raw); % Find non-numeric cells
raw(R) = {0.0}; % Replace non-numeric cells
%% Create output variable
data = cell2mat(raw);
The issue that you are running into is a result of cell2mat concatenating all of the data in your cells into one large 2-dimensional matrix. If you were to import two Excel files with 20 rows and 17 columns each, this would result in a 2-dimensional matrix of size [20 x 34]. The doc for cell2mat has a nice visual describing this.
I see that your importfile3 function returns a matrix, and based on your use of cell2mat in your final line of code, it looks like you would like to have your final result be in the form of a matrix. So I think the easiest way to go about this is to just bypass the intermediate myData cell array.
In the example code below, the resulting data is a 3-dimensional matrix. The 1st dimension indicates row number, 2nd dimension is column number, and 3rd dimension is file number. Cell arrays are very useful for "jagged" data, but based on the code you provided, each excel data set that you import will have the same number of rows and columns.
numFiles = 2;
range = 'A2:Q21';
sheet = 1;
% Number of rows and cols known before data import
numRows = 20;
numCols = 17;
data = zeros(numRows,numCols,numFiles);
for fileNum = 1:numFiles
    fileName = sprintf('myfile%02d.xlsx',fileNum);
    data(:,:,fileNum) = importfile3(fileName,sheet,range);
end
Accessing this data is now very straightforward.
data(:,:,1) returns the data imported from your first excel file.
data(:,:,2) returns the data imported from your second excel file.
etc.
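A nice side effect of the 3-D layout (my own note, not from the answer) is that slicing and aggregating across files become one-liners:
firstFile = data(:,:,1);         % everything imported from the first file
meanAcrossFiles = mean(data, 3); % element-wise mean over all imported files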