MATLAB reads UNICODE CSV with spaces between characters - matlab

I am using the fgetl command to read a .csv file but instead of returning the results I wanted as:
"HIST",1,1,27,PWH,"1"
it returned with additional space between each character:
" H I S T " , 1 , 1 , 2 7 , P W H , " 1 "
I know that I can replace the space with regexprep, but my file contains billions of lines so the added expression might consume considerably more time. I had a feeling that this is a unicode issue and someone pointed out the same issue when he used Java and it was related to unicode. I wonder if anyone knows a better way to deal with the problem in MATLAB?
Update:
It should be the unicode issue because the .csv file is an output from another program, and when I read it using fgetl the spaces are added. However, if I save the .csv file again using Excel and read the .csv file using fgetl again, it returns the results I want.
I am not able to provide an example because the .csv file is very large and I cannot make a small sample because when I open and save it from Excel, this problem is gone.

For the purpose of demonstration, let's consider a demo file - demo.csv:
"GIST",1,6,17,PWH,"1"
"FIST",0,4,72,WPH,"2"
"MIST",3,2,27,WHP,"3"
You have some options:
textscan (for any text file with a known structure):
fID = fopen('demo.csv');
C = textscan(fID,'%s%d%d%d%s%s','Delimiter',{',','"'},'MultipleDelimsAsOne',1);
fclose(fID);
Which results in:
C =
{3x1 cell} [3x1 int32] [3x1 int32] [3x1 int32] {3x1 cell} {3x1 cell}
Import helper + generate script (AKA overkill is an understatement):
Which results in:
%% Import data from text file.
% Script for importing data from the following text file:
%
% F:\demo.csv
%
% To extend the code to different selected data or a different text file, generate a
% function instead of a script.
% Auto-generated by MATLAB on 2016/04/20 19:51:32
%% Initialize variables.
filename = 'F:\demo.csv';
delimiter = ',';
%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%q%q%q%q%q%q%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this code. If an error
% occurs for a different file, try regenerating the code from the Import Tool.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'ReturnOnError', false);
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[2,3,4,6]
% Converts strings in the input cell array to numbers. Replaced non-numeric strings with
% NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1);
% Create a regular expression to detect and remove non-numeric prefixes and suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData{row}, regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if any(numbers==',');
thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
if isempty(regexp(numbers, thousandsRegExp, 'once'));
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric strings to numbers.
if ~invalidThousandsSeparator;
numbers = textscan(strrep(numbers, ',', ''), '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch me
end
end
end
%% Split data into numeric and cell columns.
rawNumericColumns = raw(:, [2,3,4,6]);
rawCellColumns = raw(:, [1,5]);
%% Allocate imported array to column variable names
GIST = rawCellColumns(:, 1);
VarName2 = cell2mat(rawNumericColumns(:, 1));
VarName3 = cell2mat(rawNumericColumns(:, 2));
VarName4 = cell2mat(rawNumericColumns(:, 3));
PWH = rawCellColumns(:, 2);
VarName6 = cell2mat(rawNumericColumns(:, 4));
%% Clear temporary variables
clearvars filename delimiter formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp me rawNumericColumns rawCellColumns;
csvread (for numeric values only; which means it is not applicable here).

I happened to have the same issue. I opened a .csv file using textscan and it added 1 whitespace on both side of any character and I also noticed that when opening the variable storing the read data, the font was different than the usual in Matlab.
We managed to solve this issue by opening the '.csv' file into Notepad++ and changed the encoding to UTF-8. It solved the problem.
Hope it helps!

Related

String vector to array

I am trying to make a script in Matlab that pulls data from a file and generates an array of data. Since the data is a string I've tried to split it into columns, take the transpose, and split it into columns again to populate an array.
When I run the script I don't get any errors, but I also don't get any useful data. I tell it to display the final vector (Full_Array) and I get {1×4 cell} 8 times. When I try to use strsplit I get the error:
'Error using strsplit (line 80) First input must be either a character vector or a string scalar.'
I'm pretty new to Matlab and I honestly have no clue how to fix it after reading through similar threads and the documentation I'm out of ideas. I've attached the code and the data to read in below. Thank you.
clear
File_Name = uigetfile; %Brings up windows file browser to locate .xyz file
Open_File = fopen(File_Name); %Opens the file given by File_Name
File2Vector = fscanf(Open_File,'%s'); %Prints the contents of the file to a 1xN vector
Vector2ColumnArray = strsplit(File2Vector,';'); %Splits the string vector from
%File2Vector into columns, forming an array
Transpose = transpose(Vector2ColumnArray); %Takes the transpose of Vector2ColumnArray
%making a column array into a row array
FullArray = regexp(Transpose, ',', 'split');
The data I am trying to read in comes from a .xyz file that I have titled methylformate.xyz, here is the data:
O2,-0.23799,0.65588,-0.69492;
O1,0.50665,0.83915,1.47685;
C2,-0.32101,2.08033,-0.75096;
C1,0.19676,0.17984,0.49796;
H4,0.66596,2.52843,-0.59862;
H3,-0.67826,2.36025,-1.74587;
H2,-1.03479,2.45249,-0.00927;
H1,0.23043,-0.91981,0.45346;
When I started using Matlab I also had problems with the data structure. The last line
FullArray = regexp(Transpose, ',', 'split');
splits each line and stores it in a cell array. In order to access the individual strings you have to index with curly brackets into FullArray:
FullArray{1}{1} % -> 'O2'
FullArray{1}{2} % -> '-0.23799'
FullArray{2}{1} % -> 'O1'
FullArray{2}{2} % -> '0.50665'
Thereby the first number corresponds to the row and the second to the particular element in the row.
However, there are easier functions in Matlab which load text files based on regular expressions.
Usually, the easiest function for reading mixed data is readtable.
data = readtable('methylformate.txt');
However, in your case this is more complex because
readtable can't cope with .xyz files, so you'd have to copy to .txt
The semi-colons confuse the read and make the last column characters
You can loop through each row and use textscan like so:
fid = fopen('methylformate.xyz');
tline = fgetl(fid);
myoutput = cell(0,4);
while ischar(tline)
myoutput(end+1,:) = textscan(tline, '%s %f %f %f %*[^\n]', 'delim', ',');
tline = fgetl(fid);
end
fclose(fid);
Output is a cell array of strings or doubles (as appropriate).

Read CSV with semicolon separated data with comma as decimal mark in matlab

My problem is, that I've got CSV-data of the following format:
1,000333e+003;6,620171e+001
1,001297e+003;6,519699e+001
1,002261e+003;6,444984e+001
I want to read the data into matlab, but csvread requires it to be comma separated, and I have not been able to find a solution to the comma-decimal mark. I guess I can use textscan in some way?
I'm sorry to ask such an (I think) easy question, but I hope someone can help. None of the other questions/answers in here seems to be dealing with this combination of comma and semicolon.
EDIT3 (ACCEPTED ANSWER): Using the import data button in the variable section of the home toolbar it is possible to customise how the data is imported. once that is done you can click import selection beneath the arrow and generate a script or function that will follow the same rules defined in the import data window.
--------------------------------------------------kept for reference--------------------------------------------------
You can use dlmread it works in the following format
M = dlmread(filename,';')
the filename is a string with the full path of the file unless the file is in the current working directory in which case you can just type the filename.
EDIT1: to use textscan instead, the following code should do the trick or at least most of it.
%rt is permission r for read t for open in text mode
csv_file = fopen('D:\Dev\MATLAB\stackoverflow_tests\1.csv','rt');
%the formatspec represents what the scan is 'looking'for.
formatSpec = '%s%s';
%textscan inputs work in pairs so your scanning the file using the format
%defined above and with a semicolon delimeter
C = textscan(csv_file, formatSpec, 'Delimiter', ';');
fclose(csv_file);
the result is shown.
C{1}{1} =
1,000333e+003
C{1}{2} =
1,001297e+003
C{1}{3} =
1,002261e+003
C{2}{1} =
6,620171e+001
C{2}{2} =
6,519699e+001
C{2}{3} =
6,444984e+001
EDIT2: to replace the comma with a dot and convert to a integer of type double:
[row, col] = size(C);
for kk = 1 : col
A = C{1,kk};
converted_data{1,kk} = str2double(strrep(A, ',', '.'));
end
celldisp(converted_data)
result:
converted_data{1} =
1.0e+03 *
1.0003
1.0013
1.0023
converted_data{2} =
66.2017
65.1970
64.4498
% Data is in C:\temp\myfile.csv
fid = fopen('C:\temp\myfile.csv');
data = textscan(fid, '%s%s', 'delimiter', ';');
fclose(fid);
% Convert ',' to '.'
data = cellfun( #(x) str2double(strrep(x, ',', '.')), data, 'uniformoutput', false);
data =
[3x1 double] [3x1 double]
data{1}
ans =
1.0e+03 *
1.000333000000000
1.001297000000000
1.002261000000000
data{2}
ans =
66.201710000000006
65.196990000000000
64.449839999999995

Edit generated 'importdata' function to import all files in directory in Matlab

Seeking help from skillful Matlab users!
I'm kind of new to Matlab and hope somebody has the time to help me. I need to import some .txt-files from a directory. I have found a way to do this trough the import tool. There are some data using comma insted of dots, so importdata will not work, but the 'import data' tool does.
So i'm wondering (and hoping) if it is possible to edit the generated function to import all the files in the directory, in such a way as the single file is imported? I want each file to be imported as matrix variable (double). I want to import all the files in one process (loop). Also there are many files and they all have some 100 000 lines or so.
If someone see an easy way to do this i would appreciate the help. Please keep the explanation on a low level, as i'm quite novice. I get the following function using the 'import data' tool:
function Streaming0x00x00158D00000E04621709201405 = importfile1(filename, startRow, endRow)
%IMPORTFILE1 Import numeric data from a text file as a matrix.
% STREAMING0X00X00158D00000E04621709201405 = IMPORTFILE1(FILENAME) Reads
% data from text file FILENAME for the default selection.
%
% STREAMING0X00X00158D00000E04621709201405 = IMPORTFILE1(FILENAME,
% STARTROW, ENDROW) Reads data from rows STARTROW through ENDROW of text
% file FILENAME.
%
% Example:
% Streaming0x00x00158D00000E04621709201405 =
% importfile1('Streaming_0_x_0_0_x_00158D00000E0462_17-09-2014_05.32.24_part000.txt',
% 17, 137834);
%
% See also TEXTSCAN.
% Auto-generated by MATLAB on 2015/02/04 09:28:07
%% Initialize variables.
delimiter = ';';
if nargin<=2
startRow = 17;
endRow = inf;
end
%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this
% code. If an error occurs for a different file, try regenerating the code
% from the Import Tool.
textscan(fileID, '%[^\n\r]', startRow(1)-1, 'ReturnOnError', false);
dataArray = textscan(fileID, formatSpec, endRow(1)-startRow(1)+1, 'Delimiter', delimiter, 'ReturnOnError', false);
for block=2:length(startRow)
frewind(fileID);
textscan(fileID, '%[^\n\r]', startRow(block)-1, 'ReturnOnError', false);
dataArrayBlock = textscan(fileID, formatSpec, endRow(block)-startRow(block)+1, 'Delimiter', delimiter, 'ReturnOnError', false);
for col=1:length(dataArray)
dataArray{col} = [dataArray{col};dataArrayBlock{col}];
end
end
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,2]
% Converts strings in the input cell array to numbers. Replaced non-numeric
% strings with NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1);
% Create a regular expression to detect and remove non-numeric prefixes and
% suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\.]*)+[\,]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\.]*)*[\,]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData{row}, regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if any(numbers=='.');
thousandsRegExp = '^\d+?(\.\d{3})*\,{0,1}\d*$';
if isempty(regexp(thousandsRegExp, '.', 'once'));
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric strings to numbers.
if ~invalidThousandsSeparator;
numbers = strrep(numbers, '.', '');
numbers = strrep(numbers, ',', '.');
numbers = textscan(numbers, '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch me
end
end
end
%% Replace non-numeric cells with NaN
R = cellfun(#(x) ~isnumeric(x) && ~islogical(x),raw); % Find non-numeric cells
raw(R) = {NaN}; % Replace non-numeric cells
%% Create output variable
Streaming0x00x00158D00000E04621709201405 = cell2mat(raw);
If something is unclear, please comment.
All help is useful, thanks :)
If all files are the same, you can make a cell array of the filenames (NOT a standard array, they do not behave correctly on strings). Then you can loop over the cell array. For instance:
fname_arr = {'file1.txt','file2.txt'}; % your filenames go here
for k in length(fname_arr):
filename = fname_arr{k};
%% Open the text file.
fileID = fopen(filename,'r'); % start of the relevant part of your codeblock
<...> % omitting the stuff in the middle of the code
fclose(fileID) % end of the relevant part of your codeblock
allDataArray{k} = DataArray
end
Then allDataArray is a cell array whose kth element contains the DataArray obtained from file fname_arr{k}.
Ok, tried something here. Implemented this in my function:
output=(dir_output);
for k=1:length(output);
filename = output{k}.name;
%% Open the text file. fileID = fopen(filename,'r');
where 'dir_output' is a struct, containing all the file names in the Directory. Also put in:
%% Close the text file.
fclose(fileID);
allDataArray{k} = DataArray;
end
Get this as error:
>> function1 Undefined function or variable 'dir_output'. Error in function1 (line 30) output=(dir_output);
Why???

Reading CSV with mixed type data

I need to read the following csv file in MATLAB:
2009-04-29 01:01:42.000;16271.1;16271.1
2009-04-29 02:01:42.000;2.5;16273.6
2009-04-29 03:01:42.000;2.599609;16276.2
2009-04-29 04:01:42.000;2.5;16278.7
...
I'd like to have three columns:
timestamp;value1;value2
I tried the approaches described here:
Reading date and time from CSV file in MATLAB
modified as:
filename = 'prova.csv';
fid = fopen(filename, 'rt');
a = textscan(fid, '%s %f %f', ...
'Delimiter',';', 'CollectOutput',1);
fclose(fid);
But it returs a 1x2 cell, whose first element is a{1}='ÿþ2', the other are empty.
I had also tried to adapt to my case the answers to these questions:
importing data with time in MATLAB
Read data files with specific format in matlab and convert date to matal serial time
but I didn't succeed.
How can I import that csv file?
EDIT After the answer of #macduff i try to copy-paste in a new file the data reported above and use:
a = textscan(fid, '%s %f %f','Delimiter',';');
and it works.
Unfortunately that didn't solve the problem because I have to process csv files generated automatically, which seems to be the cause of the strange MATLAB behavior.
What about trying:
a = textscan(fid, '%s %f %f','Delimiter',';');
For me I get:
a =
{4x1 cell} [4x1 double] [4x1 double]
So each element of a corresponds to a column in your csv file. Is this what you need?
Thanks!
Seems you're going about it the right way. The example you provide poses no problems here, I get the output you desire. What's in the 1x2 cell?
If I were you I'd try again with a smaller subset of the file, say 10 lines, and see if the output changes. If yes, then try 100 lines, etc., until you find where the 4x1 cell + 4x2 array breaks down into the 1x2 cell. It might be that there's an empty line or a single empty field or whatever, which forces textscan to collect data in an additional level of cells.
Note that 'CollectOutput',1 will collect the last two columns into a single array, so you'll end up with 1 cell array of 4x1 containing strings, and 1 array of 4x2 containing doubles. Is that indeed what you want? Otherwise, see #macduff's post.
I've had to parse large files like this, and I found I didn't like textscan for this job. I just use a basic while loop to parse the file, and I use datevec to extract the timestamp components into a 6-element time vector.
%% Optional: initialize for speed if you have large files
n = 1000 %% <# of rows in file - if known>
timestamp = zeros(n,6);
value1 = zeros(n,1);
value2 = zeros(n,1);
fid = fopen(fname, 'rt');
if fid < 0
error('Error opening file %s\n', fname); % exit point
end
cntr = 0
while true
tline = fgetl(fid); %% get one line
if ~ischar(tline), break; end; % break out of loop at end of file
cntr = cntr + 1;
splitLine = strsplit(tline, ';'); %% split the line on ; delimiters
timestamp(cntr,:) = datevec(splitLine{1}, 'yyyy-mm-dd HH:MM:SS.FFF'); %% using datevec to parse time gives you a standard timestamp vector
value1(cntr) = splitLine{2};
value2(cntr) = splitLine{3};
end
%% Concatenate at the end if you like
result = [timestamp value1 value2];

Problem (bug?) loading hexadecimal data into MATLAB

I'm trying to load the following ascii file into MATLAB using load()
% some comment
1 0xc661
2 0xd661
3 0xe661
(This is actually a simplified file. The actual file I'm trying to load contains an undefined number of columns and an undefined number of comment lines at the beginning, which is why the load function was attractive)
For some strange reason, I obtain the following:
K>> data = load('testMixed.txt')
data =
1 50785
2 58977
3 58977
I've observed that the problem occurs anytime there's a "d" in the hexadecimal number.
Direct hex2dec conversion works properly:
K>> hex2dec('d661')
ans =
54881
importdata seems to have the same conversion issue, and so does the ImportWizard:
K>> importdata('testMixed.txt')
ans =
1 50785
2 58977
3 58977
Is that a bug, am I using the load function in some prohibited way, or is there something obvious I'm overlooking?
Are there workarounds around the problem, save from reimplementing the file parsing on my own?
Edited my input file to better reflect my actual file format. I had a bit oversimplified in my original question.
"GOLF" ANSWER:
This starts with the answer from mtrw and shortens it further:
fid = fopen('testMixed.txt','rt');
data = textscan(fid,'%s','Delimiter','\n','MultipleDelimsAsOne','1',...
'CommentStyle','%');
fclose(fid);
data = strcat(data{1},{' '});
data = sscanf([data{:}],'%i',[sum(isspace(data{1})) inf]).';
PREVIOUS ANSWER:
My first thought was to use TEXTSCAN, since it has an option that allows you to ignore certain lines as comments when they start with a given character (like %). However, TEXTSCAN doesn't appear to handle numbers in hexadecimal format well. Here's another option:
fid = fopen('testMixed.txt','r'); % Open file
% First, read all the comment lines (lines that start with '%'):
comments = {};
position = 0;
nextLine = fgetl(fid); % Read the first line
while strcmp(nextLine(1),'%')
comments = [comments; {nextLine}]; % Collect the comments
position = ftell(fid); % Get the file pointer position
nextLine = fgetl(fid); % Read the next line
end
fseek(fid,position,-1); % Rewind to beginning of last line read
% Read numerical data:
nCol = sum(isspace(nextLine))+1; % Get the number of columns
data = fscanf(fid,'%i',[nCol inf]).'; % Note '%i' works for all integer formats
fclose(fid); % Close file
This will work for an arbitrary number of comments at the beginning of the file. The computation to get the number of columns was inspired by Jacob's answer.
New:
This is the best I could come up with. It should work for any number of comment lines and columns. You'll have to do the rest yourself if there are strings, etc.
% Define the characters representing the start of the commented line
% and the delimiter
COMMENT_START = '%%';
DELIMITER = ' ';
% Open the file
fid = fopen('testMixed.txt');
% Read each line till we reach the data
l = COMMENT_START;
while(l(1)==COMMENT_START)
l = fgetl(fid);
end
% Compute the number of columns
cols = sum(l==DELIMITER)+1;
% Split the first line
split_l = regexp(l,' ','split');
% Read all the data
A = textscan(fid,'%s');
% Compute the number of rows
rows = numel(A{:})/cols;
% Close the file
fclose(fid);
% Assemble all the data into a matrix of cell strings
DATA = [split_l ; reshape(A{:},[cols rows])']; %' adding this to make it pretty in SO
% Recognize each column and process accordingly
% by analyzing each element in the first row
numeric_data = zeros(size(DATA));
for i=1:cols
str = DATA(1,i);
% If there is no '0x' present
if isempty(findstr(str{1},'0x')) == true
% This is a number
numeric_data(:,i) = str2num(char(DATA(:,i)));
else
% This is a hexadecimal number
col = char(DATA(:,i));
numeric_data(:,i) = hex2dec(col(:,3:end));
end
end
% Display the data
format short g;
disp(numeric_data)
This works for data like this:
% Comment 1
% Comment 2
1.2 0xc661 10 0xa661
2 0xd661 20 0xb661
3 0xe661 30 0xc661
Output:
1.2 50785 10 42593
2 54881 20 46689
3 58977 30 50785
OLD:
Yeah, I don't think LOAD is the way to go. You could try:
a = char(importdata('testHexa.txt'));
a = hex2dec(a(:,3:end));
This is based on both gnovice's and Jacob's answers, and is a "best of breed"
For files like:
% this is my comment
% this is my other comment
1 0xc661 123
2 0xd661 456
% surprise comment
3 0xe661 789
4 0xb661 1234567
(where the number of columns within the file MUST be the same, but not known ahead of time, and all comments denoted by a '%' character), the following code is fast and easy to read:
f = fopen('hexdata.txt', 'rt');
A = textscan(f, '%s', 'Delimiter', '\n', 'MultipleDelimsAsOne', '1', 'CollectOutput', '1', 'CommentStyle', '%');
fclose(f);
A = A{1};
data = sscanf(A{1}, '%i')';
data = repmat(data, length(A), 1);
for ctr = 2:length(A)
data(ctr,:) = sscanf(A{ctr}, '%i')';
end