Effective way to convert/create matrix from mixed cell/string - matlab

Sometimes there might be more that one string located somewhere else, so I need a way to find everyone in the cell array. I have a cell array like the one below and I need a fast and effective way to 1) remove the empty columns, 2) convert the cells containing a string with "#" to the number after the "#" (6.504), and finally 3) create or convert the whole cell array to a data matrix like "data" below. Is there a smart way to do all this? Any suggestions are highly appreciated.
array ={
[47.4500] '' [23.9530] '' [12.4590]
[34.1540] '' [15.1730] '' [ 9.6840]
[45.2510] '' [23.3770] '' [13.0670]
[29.9350] '' [14.8680] '' '# 6.504'}
data =[
47.4500 23.9530 12.4590
34.1540 15.1730 9.6840
45.2510 23.3770 13.0670
29.9350 14.8680 6.5040]

Columns with mixed types are tricky to handle, but if the format always follows the regex pattern # \d+(?:\.\d+) you can proceed as follows:
C = {
47.4500 '' 23.9530 '' 12.4590
34.1540 '' 15.1730 '' 9.6840
45.2510 '' 23.3770 '' 13.0670
29.9350 '' 14.8680 '' '# 6.504'
};
% Get rid of empty columns...
C(:,all(cellfun(#ischar,C))) = [];
% Convert numeric strings into numeric values...
C = cellfun(#(x)convert(x),C,'UniformOutput',false);
% Convert the cell matrix into a numeric matrix...
C = cell2mat(C);
Where the convert function is defined as follows:
function x = convert(x)
if (~ischar(x))
return;
end
x = str2double(strrep(x,'# ',''));
end

Related

Replace every string containing '# '

I have a cell array containing strings like dataT1 below. How do I replace all strings containing a '#' with the letter 'O' and nothing else (no numbers in the string)?
dataT1 = {
[275.7770] [169.6630] [89.5380] [48.2740] [24.2400] [12.7510]
[284.3560] [160.4500] [87.3740] [47.4500] [23.9530] [12.4590]
'# 12.304' [129.7730] [66.2630] [34.1540] [15.1730] [ 9.6840]
[267.5270] [152.3700] '# 17.504' [45.2510] [23.3770] [13.0670]
[206.9110] [115.3030] [56.4770] [29.9350] [14.8680] '# 6.504' }
You can use a simple loop.
If you want to replace the # with O then use this:
for c = 1:numel(data)
% check for character array type in case cell also has numeric values
if ischar(data{c})
% replace hashes with 'O'
data{c} = strrep(data{c}, '#', 'O');
end
end
If you want to replace entire strings containing # with just the string 'O' then use this:
for c = 1:numel(data)
% check for character array type in case cell also has numeric values
if ischar(data{c})
% Search for # within string
if strfind( data{c}, '#' ) > 0
% replace string with 'O'
data{c} = 'O';
end
end
end

Matlab - Read An Unknown CSV [duplicate]

I'm working with MATLAB for few days and I'm having difficulties to import a CSV-file to a matrix.
My problem is that my CSV-file contains almost only Strings and some integer values, so that csvread() doesn't work. csvread() only gets along with integer values.
How can I store my strings in some kind of a 2-dimensional array to have free access to each element?
Here's a sample CSV for my needs:
04;abc;def;ghj;klm;;;;;
;;;;;Test;text;0xFF;;
;;;;;asdfhsdf;dsafdsag;0x0F0F;;
The main thing are the empty cells and the texts within the cells.
As you see, the structure may vary.
For the case when you know how many columns of data there will be in your CSV file, one simple call to textscan like Amro suggests will be your best solution.
However, if you don't know a priori how many columns are in your file, you can use a more general approach like I did in the following function. I first used the function fgetl to read each line of the file into a cell array. Then I used the function textscan to parse each line into separate strings using a predefined field delimiter and treating the integer fields as strings for now (they can be converted to numeric values later). Here is the resulting code, placed in a function read_mixed_csv:
function lineArray = read_mixed_csv(fileName, delimiter)
fid = fopen(fileName, 'r'); % Open the file
lineArray = cell(100, 1); % Preallocate a cell array (ideally slightly
% larger than is needed)
lineIndex = 1; % Index of cell to place the next line in
nextLine = fgetl(fid); % Read the first line from the file
while ~isequal(nextLine, -1) % Loop while not at the end of the file
lineArray{lineIndex} = nextLine; % Add the line to the cell array
lineIndex = lineIndex+1; % Increment the line index
nextLine = fgetl(fid); % Read the next line from the file
end
fclose(fid); % Close the file
lineArray = lineArray(1:lineIndex-1); % Remove empty cells, if needed
for iLine = 1:lineIndex-1 % Loop over lines
lineData = textscan(lineArray{iLine}, '%s', ... % Read strings
'Delimiter', delimiter);
lineData = lineData{1}; % Remove cell encapsulation
if strcmp(lineArray{iLine}(end), delimiter) % Account for when the line
lineData{end+1} = ''; % ends with a delimiter
end
lineArray(iLine, 1:numel(lineData)) = lineData; % Overwrite line data
end
end
Running this function on the sample file content from the question gives this result:
>> data = read_mixed_csv('myfile.csv', ';')
data =
Columns 1 through 7
'04' 'abc' 'def' 'ghj' 'klm' '' ''
'' '' '' '' '' 'Test' 'text'
'' '' '' '' '' 'asdfhsdf' 'dsafdsag'
Columns 8 through 10
'' '' ''
'0xFF' '' ''
'0x0F0F' '' ''
The result is a 3-by-10 cell array with one field per cell where missing fields are represented by the empty string ''. Now you can access each cell or a combination of cells to format them as you like. For example, if you wanted to change the fields in the first column from strings to integer values, you could use the function str2double as follows:
>> data(:, 1) = cellfun(#(s) {str2double(s)}, data(:, 1))
data =
Columns 1 through 7
[ 4] 'abc' 'def' 'ghj' 'klm' '' ''
[NaN] '' '' '' '' 'Test' 'text'
[NaN] '' '' '' '' 'asdfhsdf' 'dsafdsag'
Columns 8 through 10
'' '' ''
'0xFF' '' ''
'0x0F0F' '' ''
Note that the empty fields results in NaN values.
Given the sample you posted, this simple code should do the job:
fid = fopen('file.csv','r');
C = textscan(fid, repmat('%s',1,10), 'delimiter',';', 'CollectOutput',true);
C = C{1};
fclose(fid);
Then you could format the columns according to their type. For example if the first column is all integers, we can format it as such:
C(:,1) = num2cell( str2double(C(:,1)) )
Similarly, if you wish to convert the 8th column from hex to decimals, you can use HEX2DEC:
C(:,8) = cellfun(#hex2dec, strrep(C(:,8),'0x',''), 'UniformOutput',false);
The resulting cell array looks as follows:
C =
[ 4] 'abc' 'def' 'ghj' 'klm' '' '' [] '' ''
[NaN] '' '' '' '' 'Test' 'text' [ 255] '' ''
[NaN] '' '' '' '' 'asdfhsdf' 'dsafdsag' [3855] '' ''
In R2013b or later you can use a table:
>> table = readtable('myfile.txt','Delimiter',';','ReadVariableNames',false)
>> table =
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
____ _____ _____ _____ _____ __________ __________ ________ ____ _____
4 'abc' 'def' 'ghj' 'klm' '' '' '' NaN NaN
NaN '' '' '' '' 'Test' 'text' '0xFF' NaN NaN
NaN '' '' '' '' 'asdfhsdf' 'dsafdsag' '0x0F0F' NaN NaN
Here is more info.
Use xlsread, it works just as well on .csv files as it does on .xls files. Specify that you want three outputs:
[num char raw] = xlsread('your_filename.csv')
and it will give you an array containing only the numeric data (num), an array containing only the character data (char) and an array that contains all data types in the same format as the .csv layout (raw).
Have you tried to use the "CSVIMPORT" function found in the file exchange? I haven't tried it myself, but it claims to handle all combinations of text and numbers.
http://www.mathworks.com/matlabcentral/fileexchange/23573-csvimport
Depending on the format of your file, importdata might work.
You can store Strings in a cell array. Type "doc cell" for more information.
I recommend looking at the dataset array.
The dataset array is a data type that ships with Statistics Toolbox.
It is specifically designed to store hetrogeneous data in a single container.
The Statistics Toolbox demo page contains a couple vidoes that show some of the dataset array features. The first is titled "An Introduction to Dataset Arrays". The second is titled "An Introduction to Joins".
http://www.mathworks.com/products/statistics/demos.html
If your input file has a fixed amount of columns separated by commas and you know in which columns are the strings it might be best to use the function
textscan()
Note that you can specify a format where you read up to a maximum number of characters in the string or until a delimiter (comma) is found.
% Assuming that the dataset is ";"-delimited and each line ends with ";"
fid = fopen('sampledata.csv');
tline = fgetl(fid);
u=sprintf('%c',tline); c=length(u);
id=findstr(u,';'); n=length(id);
data=cell(1,n);
for I=1:n
if I==1
data{1,I}=u(1:id(I)-1);
else
data{1,I}=u(id(I-1)+1:id(I)-1);
end
end
ct=1;
while ischar(tline)
ct=ct+1;
tline = fgetl(fid);
u=sprintf('%c',tline);
id=findstr(u,';');
if~isempty(id)
for I=1:n
if I==1
data{ct,I}=u(1:id(I)-1);
else
data{ct,I}=u(id(I-1)+1:id(I)-1);
end
end
end
end
fclose(fid);

How to convert an string array into a character array in Matlab?

Suppose that we have an string array in Matlab like bellow:
a='This is a book'
How can we convert the above string array into a character array by a function in Matlab like bellow?
b={'T' 'h' 'i' 's' ' ' 'i' 's' ' ' 'a' ' ' 'b' 'o' 'o' 'k'}
Your a is not a string array; it's a character array (which also used to be called a string, but starting from R2016b that term has a different meaning). Your b is not a character array, it's a cell array that contains characters.
Anyway, to convert from a to b, use num2cell:
a = 'This is a book';
b = num2cell(a);
If you really really want to convert string (introduced since R2016b) to char array, this is how you do.
s = "My String"; % Create a string with ""
c = char(s); % This is how you convert string to char.
isstring(c)
ans =
logical
0
ischar(c)
ans =
logical
1

Updating N-gram 2 dimension cell array in Matlab

I am trying to extract bi-grams from a set of words and store them in a matrix. what I want is to insert the word in the first raw and all the bi-grams related to that word
for example: if I have the following string 'database file there' my output should be:
database file there
da fi th
at il he
ta le er
ab re
..
I have tried this but it gives me only the bigram without the original word
collection = fileread('e:\m.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W',' ');
collection = strtrim(regexprep(collection,'\s*',' '));
temp = regexprep(collection,' ',''',''');
eval(['words = {''',temp,'''};']);
word = char(words(1));
word2 = regexp(word, sprintf('\\w{1,%d}', 1), 'match');
bi = cellfun(#(x,y) [x '' y], word2(1:end-1)', word2(2:end)','un',0);
this is only for the first word however, i want to do that for every word in the "words" matrix 1X1000
is there an efficient way to accomplish this as I will deal with around 1 million words?
I am new to Matlab and if there any resource to explain how to deal with matrix (update elements, delete, ...) will be helpful
regards,
Ashraf
If you were looking to get a cell array as the output, this might work for you -
input_str = 'database file there' %// input
str1_split = regexp(input_str,'\s','Split'); %// split words into cells
NW = numel(str1_split); %// number of words
char_arr1 = char(str1_split'); %//' convert split cells into a char array
ind1 = bsxfun(#plus,[1:NW*2]',[0:size(char_arr1,2)-2]*NW); %//' get indices
%// to be used for indexing into char array
t1 = reshape(char_arr1(ind1),NW,2,[]);
t2 = reshape(permute(t1,[2 1 3]),2,[])'; %//' char array with rows for each pair
out = reshape(mat2cell(t2,ones(1,size(t2,1)),2),NW,[])'; %//'
out(reshape(any(t2==' ',2),NW,[])')={''}; %//' Use only paired-elements cells
out = [str1_split ; out] %// output
Code Output -
input_str =
database file there
out =
'database' 'file' 'there'
'da' 'fi' 'th'
'at' 'il' 'he'
'ta' 'le' 'er'
'ab' '' 're'
'ba' '' ''
'as' '' ''
'se' '' ''

Working string in MATLAB

I have the following string in MATLAB, for example
##%%F1_USA(40)_u
and I want
F1_USA_40__u
Does it has any function for this?
Your best bet is probably regexprep which allows you to replace parts of a string using regular expressions:
s_new = regexprep(regexprep(s, '[()]', '_'), '[^A-Za-z0-9_]', '')
Update: based on your updated comment, this is probably what you want:
s_new = regexprep(regexprep(s, '^[^A-Za-z0-9_]*', ''), '[^A-Za-z0-9_]', '')
or:
s_new = regexprep(regexprep(s, '[^A-Za-z0-9_]', '_'), '^_*', '')
One way to do this is to use the function ISSTRPROP to find the indices of alphanumeric characters and replace or remove the others accordingly:
>> str = '##%%F1_USA(40)_u'; %# Sample string
>> index = isstrprop(str,'alphanum'); %# Find indices of alphanumeric characters
>> str(~index) = '_'; %# Set non-alphanumeric characters to '_'
>> str = str(find(index,1):end) %# Remove any leading '_'
str =
F1_USA_40__u %# Result
If you want to use regular expressions (which can get a little more complicated) then the last suggestion from Tamas will work. However, it can be greatly simplified to the following:
str = regexprep(str,{'\W','^_*'},{'_',''});