Updating an N-gram 2-dimension cell array in MATLAB

I am trying to extract bi-grams from a set of words and store them in a matrix. What I want is to put the word in the first row and all the bi-grams related to that word below it.
For example, if I have the string 'database file there', my output should be:
database  file  there
da        fi    th
at        il    he
ta        le    er
ab              re
..
I have tried the following, but it gives me only the bigrams without the original word:
collection = fileread('e:\m.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W',' ');
collection = strtrim(regexprep(collection,'\s*',' '));
temp = regexprep(collection,' ',''',''');
eval(['words = {''',temp,'''};']);
word = char(words(1));
word2 = regexp(word, sprintf('\\w{1,%d}', 1), 'match');
bi = cellfun(@(x,y) [x '' y], word2(1:end-1)', word2(2:end)','un',0);
This is only for the first word, however; I want to do that for every word in the 1x1000 "words" cell array.
Is there an efficient way to accomplish this, as I will be dealing with around 1 million words?
I am new to MATLAB, and any resource explaining how to deal with matrices (updating elements, deleting, ...) would be helpful.
regards,
Ashraf

If you were looking to get a cell array as the output, this might work for you -
input_str = 'database file there' %// input
str1_split = regexp(input_str,'\s','Split'); %// split words into cells
NW = numel(str1_split); %// number of words
char_arr1 = char(str1_split'); %//' convert split cells into a char array
ind1 = bsxfun(@plus,[1:NW*2]',[0:size(char_arr1,2)-2]*NW); %//' get indices
%// to be used for indexing into char array
t1 = reshape(char_arr1(ind1),NW,2,[]);
t2 = reshape(permute(t1,[2 1 3]),2,[])'; %//' char array with rows for each pair
out = reshape(mat2cell(t2,ones(1,size(t2,1)),2),NW,[])'; %//'
out(reshape(any(t2==' ',2),NW,[])')={''}; %//' Use only paired-elements cells
out = [str1_split ; out] %// output
Code Output -
input_str =
database file there
out =
    'database'    'file'    'there'
    'da'          'fi'      'th'
    'at'          'il'      'he'
    'ta'          'le'      'er'
    'ab'          ''        're'
    'ba'          ''        ''
    'as'          ''        ''
    'se'          ''        ''
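If you need to do this for every word in the 1x1000 (or 1-million-word) words cell array from the question, a simpler, more compact sketch is below; it is slower than the indexing approach above and returns one nested cell of bigrams per word instead of a padded matrix (the variable names here are mine):
words = {'database', 'file', 'there'};   % stands in for the full words cell array
bigrams = cellfun(@(w) arrayfun(@(k) w(k:k+1), 1:numel(w)-1, 'Uni', 0), ...
                  words, 'Uni', 0);      % one cell of bigrams per word
out = [words; bigrams]                   % row 1: the words, row 2: their bigrams
And since the question also asks about basic cell array manipulation, a few one-liners as a quick reference:
c = {'database', 'file', 'there'};
c{2} = 'files';        % update an element
c(end+1) = {'word'};   % append an element
c(1) = [];             % delete an element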

Related

MATLAB string cell array without loop

I am using a loop to create my cell array. It contains the strings 'A1' to 'A10'.
Is there a way to do this without using a loop?
a = cell( 10, 1 );
for i = 1 : length( a )
    a{i} = [ 'A', num2str( i ) ];
end
a =
'A1'
'A2'
'A3'
'A4'
'A5'
'A6'
'A7'
'A8'
'A9'
'A10'
I assume you want to build a without a loop. Let N = 10 as per your example.
Approach 1
a = sprintf('A%i ', 1:N);
a = a(1:end-1);
a = strsplit(a).';
This builds a char vector with a space after each number, removes the final space, splits on spaces, and transposes.
Approach 2
Another approach:
a = deblank(cellstr(strcat('A', strjust(num2str((1:10).'), 'left'))));
This concatenates 'A' with the numbers to form a 2D char array with some spaces; moves the spaces in each row to the right; converts each row into a cell; and removes trailing spaces on each cell.
If you have R2017a or later consider using string arrays instead of cell array of char vectors. You can create your string array using
"A"+(1:10)'

Effective way to convert/create matrix from mixed cell/string

Sometimes there might be more than one string, located somewhere else in the array, so I need a way to find every one of them. I have a cell array like the one below and I need a fast and effective way to 1) remove the empty columns, 2) convert the cells containing a string with "#" to the number after the "#" (6.504), and finally 3) create or convert the whole cell array to a data matrix like "data" below. Is there a smart way to do all this? Any suggestions are highly appreciated.
array ={
[47.4500] '' [23.9530] '' [12.4590]
[34.1540] '' [15.1730] '' [ 9.6840]
[45.2510] '' [23.3770] '' [13.0670]
[29.9350] '' [14.8680] '' '# 6.504'}
data =[
47.4500 23.9530 12.4590
34.1540 15.1730 9.6840
45.2510 23.3770 13.0670
29.9350 14.8680 6.5040]
Columns with mixed types are tricky to handle, but if the format always follows the regex pattern # \d+(?:\.\d+) you can proceed as follows:
C = {
47.4500 '' 23.9530 '' 12.4590
34.1540 '' 15.1730 '' 9.6840
45.2510 '' 23.3770 '' 13.0670
29.9350 '' 14.8680 '' '# 6.504'
};
% Get rid of empty columns...
C(:,all(cellfun(@ischar,C))) = [];
% Convert numeric strings into numeric values...
C = cellfun(@(x)convert(x),C,'UniformOutput',false);
% Convert the cell matrix into a numeric matrix...
C = cell2mat(C);
Where the convert function is defined as follows:
function x = convert(x)
    if (~ischar(x))
        return;
    end
    x = str2double(strrep(x,'# ',''));
end
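As a quick sanity check, reusing the variables above (the expected matrix is just the target data from the question):
expected = [47.4500 23.9530 12.4590
            34.1540 15.1730  9.6840
            45.2510 23.3770 13.0670
            29.9350 14.8680  6.5040];
assert(max(abs(C(:) - expected(:))) < 1e-6)  % C now equals the desired data matrix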

Matlab - Read An Unknown CSV [duplicate]

I've been working with MATLAB for a few days and I'm having difficulties importing a CSV file into a matrix.
My problem is that my CSV file contains almost only strings and some integer values, so csvread() doesn't work; csvread() only handles numeric values.
How can I store my strings in some kind of 2-dimensional array so that I have free access to each element?
Here's a sample CSV for my needs:
04;abc;def;ghj;klm;;;;;
;;;;;Test;text;0xFF;;
;;;;;asdfhsdf;dsafdsag;0x0F0F;;
The main things are the empty cells and the text within the cells.
As you can see, the structure may vary.
For the case when you know how many columns of data there will be in your CSV file, one simple call to textscan like Amro suggests will be your best solution.
However, if you don't know a priori how many columns are in your file, you can use a more general approach like I did in the following function. I first used the function fgetl to read each line of the file into a cell array. Then I used the function textscan to parse each line into separate strings using a predefined field delimiter and treating the integer fields as strings for now (they can be converted to numeric values later). Here is the resulting code, placed in a function read_mixed_csv:
function lineArray = read_mixed_csv(fileName, delimiter)
  fid = fopen(fileName, 'r');          % Open the file
  lineArray = cell(100, 1);            % Preallocate a cell array (ideally slightly
                                       %   larger than is needed)
  lineIndex = 1;                       % Index of cell to place the next line in
  nextLine = fgetl(fid);               % Read the first line from the file
  while ~isequal(nextLine, -1)         % Loop while not at the end of the file
    lineArray{lineIndex} = nextLine;   % Add the line to the cell array
    lineIndex = lineIndex+1;           % Increment the line index
    nextLine = fgetl(fid);             % Read the next line from the file
  end
  fclose(fid);                         % Close the file
  lineArray = lineArray(1:lineIndex-1);              % Remove empty cells, if needed
  for iLine = 1:lineIndex-1                          % Loop over lines
    lineData = textscan(lineArray{iLine}, '%s', ...  % Read strings
                        'Delimiter', delimiter);
    lineData = lineData{1};                          % Remove cell encapsulation
    if strcmp(lineArray{iLine}(end), delimiter)      % Account for when the line
      lineData{end+1} = '';                          %   ends with a delimiter
    end
    lineArray(iLine, 1:numel(lineData)) = lineData;  % Overwrite line data
  end
end
Running this function on the sample file content from the question gives this result:
>> data = read_mixed_csv('myfile.csv', ';')
data =
Columns 1 through 7
'04' 'abc' 'def' 'ghj' 'klm' '' ''
'' '' '' '' '' 'Test' 'text'
'' '' '' '' '' 'asdfhsdf' 'dsafdsag'
Columns 8 through 10
'' '' ''
'0xFF' '' ''
'0x0F0F' '' ''
The result is a 3-by-10 cell array with one field per cell where missing fields are represented by the empty string ''. Now you can access each cell or a combination of cells to format them as you like. For example, if you wanted to change the fields in the first column from strings to integer values, you could use the function str2double as follows:
>> data(:, 1) = cellfun(@(s) {str2double(s)}, data(:, 1))
data =
Columns 1 through 7
[ 4] 'abc' 'def' 'ghj' 'klm' '' ''
[NaN] '' '' '' '' 'Test' 'text'
[NaN] '' '' '' '' 'asdfhsdf' 'dsafdsag'
Columns 8 through 10
'' '' ''
'0xFF' '' ''
'0x0F0F' '' ''
Note that the empty fields result in NaN values.
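The remaining columns can be handled the same way; for example, a sketch for the hex strings in column 8 of this data cell array (rows with empty cells are skipped so hex2dec never sees an empty string):
isHex = ~cellfun('isempty', data(:, 8));  % rows that actually hold a hex string
data(isHex, 8) = cellfun(@(s) {hex2dec(strrep(s, '0x', ''))}, data(isHex, 8));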
Given the sample you posted, this simple code should do the job:
fid = fopen('file.csv','r');
C = textscan(fid, repmat('%s',1,10), 'delimiter',';', 'CollectOutput',true);
C = C{1};
fclose(fid);
Then you could format the columns according to their type. For example if the first column is all integers, we can format it as such:
C(:,1) = num2cell( str2double(C(:,1)) )
Similarly, if you wish to convert the 8th column from hex to decimals, you can use HEX2DEC:
C(:,8) = cellfun(@hex2dec, strrep(C(:,8),'0x',''), 'UniformOutput',false);
The resulting cell array looks as follows:
C =
[ 4] 'abc' 'def' 'ghj' 'klm' '' '' [] '' ''
[NaN] '' '' '' '' 'Test' 'text' [ 255] '' ''
[NaN] '' '' '' '' 'asdfhsdf' 'dsafdsag' [3855] '' ''
In R2013b or later you can use a table:
>> table = readtable('myfile.txt','Delimiter',';','ReadVariableNames',false)
table =
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
____ _____ _____ _____ _____ __________ __________ ________ ____ _____
4 'abc' 'def' 'ghj' 'klm' '' '' '' NaN NaN
NaN '' '' '' '' 'Test' 'text' '0xFF' NaN NaN
NaN '' '' '' '' 'asdfhsdf' 'dsafdsag' '0x0F0F' NaN NaN
See the readtable documentation for more info.
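Once the data is in a table, individual columns can be pulled out by their auto-generated names; a small usage sketch (the variable name t is mine):
t = readtable('myfile.txt', 'Delimiter', ';', 'ReadVariableNames', false);
firstCol = t.Var1;     % numeric column as a double vector: [4; NaN; NaN]
textCols = t{:, 6:7};  % brace indexing returns the cell contents of Var6 and Var7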
Use xlsread; it works just as well on .csv files as it does on .xls files. Specify that you want three outputs:
[num char raw] = xlsread('your_filename.csv')
and it will give you an array containing only the numeric data (num), an array containing only the character data (char) and an array that contains all data types in the same format as the .csv layout (raw).
Have you tried to use the "CSVIMPORT" function found in the file exchange? I haven't tried it myself, but it claims to handle all combinations of text and numbers.
http://www.mathworks.com/matlabcentral/fileexchange/23573-csvimport
Depending on the format of your file, importdata might work.
You can store Strings in a cell array. Type "doc cell" for more information.
I recommend looking at the dataset array.
The dataset array is a data type that ships with Statistics Toolbox.
It is specifically designed to store heterogeneous data in a single container.
The Statistics Toolbox demo page contains a couple of videos that show some of the dataset array features. The first is titled "An Introduction to Dataset Arrays". The second is titled "An Introduction to Joins".
http://www.mathworks.com/products/statistics/demos.html
If your input file has a fixed number of columns separated by commas and you know which columns contain the strings, it might be best to use the function
textscan()
Note that you can specify a format where you read up to a maximum number of characters into the string, or until a delimiter (comma) is found.
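A minimal sketch of that idea (the file name and column layout here are hypothetical):
fid = fopen('data.csv', 'r');
C = textscan(fid, '%s %f %s', 'Delimiter', ',');  % two string fields and one number,
fclose(fid);                                      % each field read up to the next comma
% a field width such as %10s would instead cap the number of characters read into a field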
% Assuming that the dataset is ";"-delimited and each line ends with ";"
fid = fopen('sampledata.csv');
tline = fgetl(fid);
u = sprintf('%c',tline); c = length(u);
id = findstr(u,';'); n = length(id);
data = cell(1,n);
for I = 1:n
    if I==1
        data{1,I} = u(1:id(I)-1);
    else
        data{1,I} = u(id(I-1)+1:id(I)-1);
    end
end
ct = 1;
while ischar(tline)
    ct = ct+1;
    tline = fgetl(fid);
    u = sprintf('%c',tline);
    id = findstr(u,';');
    if ~isempty(id)
        for I = 1:n
            if I==1
                data{ct,I} = u(1:id(I)-1);
            else
                data{ct,I} = u(id(I-1)+1:id(I)-1);
            end
        end
    end
end
fclose(fid);

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed text from HTML pages; they were given to me. I want to extract only the words from them. I do not want any numbers, common words, or single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to reduce the cell array to only the entries containing words, like this:
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert that into this cell array:
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this?
Updated Requirement
Thanks for all of your answers. However, I am facing a problem! Let me illustrate (please note that the cell values are extracted text from HTML documents and hence may contain non-ASCII values):
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My questions:
Why are bin/bash and take-a-long being separated into multiple cells? It's not a problem for me, but still, why? Can this be avoided, i.e. all words coming from a single cell being combined into one?
Notice that in '!–photo' there is a non-ASCII character â, which essentially means a. Can a step be incorporated so that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
Also, the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and 'gypsy_wagon' to be 'a' 'rinny' 'boo' and 'gypsy' 'wagon'. Can this be done?
Update 1
Following all the suggestions, I have written a function that does most of this, except for two of the newly asked questions above.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documents into raw text.
% It will remove all commas, empty cells and other special characters.
% It will also convert all the words of the text documents into lowercase.
T = textread(filename, '%s');
% find all the important indices
ind1 = find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = []; count = 1;
for j = 1:length(T1)
    x = T1{j};
    ind = find(ismember(not_words,x), 1);
    if isempty(ind)
        B = regexp(x, '\w*', 'match');
        B(cellfun('isempty', B)) = []; % Clean out empty cells
        B = [B{:}];                    % Flatten cell array
        % convert the string into lowercase so that while generating the
        % features the case sensitivity is handled well
        x = lower(B);
        T2{count,1} = x;
        count = count+1;
    end
end
T2 = T2(~cellfun('isempty',T2));
% The most common words in the English language (found on Wikipedia)
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j = 1:length(T2)
    x = T2{j};
    % if a particular cell contains only numbers then make it empty
    if sum(isstrprop(x, 'digit'))~=0
        T2{j} = [];
    end
    % also remove single character cells
    if length(x)==1
        T2{j} = [];
    end
    % also remove the most common words, taken from the English
    % dictionary (source: Wikipedia)
    ind = find(ismember(not_words2,x), 1);
    if isempty(ind)==0
        T2{j} = [];
    end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found this code here, which tells me how to check for non-ASCII characters. Incorporating this snippet in MATLAB as
% remove the non-ascii characters
if all(x < 128)
else
    T2{j} = [];
end
and then removing the empty cells, it seems my second requirement is fulfilled, though text containing any non-ASCII characters disappears completely.
Can my final requirements be completed? Most of them concern the characters '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
This matches any alphabetic, numeric, or underscore character. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why are bin/bash and take-a-long being separated into multiple cells? It's not a problem for me, but still, why? Can this be avoided, i.e. all words coming from a single cell being combined into one?
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the corresponding input cell. You can combine these however you want afterwards.
Notice that in '!–photo' there is a non-ASCII character â, which essentially means a. Can a step be incorporated so that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. It will also exclude the non-ASCII characters, which the \w pattern appears to match.
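For the second bullet, a rough sketch of the "character codes" idea (the specific code point and replacement below are only an example, not a general transliteration):
w = ['caf' char(233)];     % hypothetical word containing 'e' with an acute accent (code 233)
w(w == char(233)) = 'e';   % map that code point to a plain 'e'
w(w > 127) = [];           % drop any non-ASCII characters that were not mapped
Run something like this on each cell before the regexp call so the mapped letters are kept instead of dropped.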
I think @excaza's solution would be the go-to approach, but here's an alternative one with isstrprop, using its optional input argument 'alpha' to look for alphabetic characters -
A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(@(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'
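To also apply the question's single-letter and common-word filters after the extraction, a small follow-up sketch (not_words2 is the stop-word list defined in the question's function):
B = regexp(A, '[a-zA-Z0-9]*', 'match');  % underscores and non-ASCII characters are excluded
B = lower([B{:}]);                       % flatten the nested cells and lowercase everything
B(cellfun(@numel, B) < 2) = [];          % drop single letters and empty matches
% B(ismember(B, not_words2)) = [];       % optionally drop the common-word list too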

Reading text from multiple text files at the same time and splitting them into array of words

I want help reading all the text files at the same time and splitting the text so that it is stored in an array. I have tried this but was not able to do it. The main problem is that even when using a for loop to read the text files, strsplit splits only one text file. How can I split all of them at once into different arrays, meaning one array per text file?
Below is the code so far:
for i = 1:10
    file = [num2str(i) '.eng'];
    % load string from a file
    STR = importdata(file);
    % extract string between tags
    B = regexprep(STR, '<.*?>','');
    % split each string by delimiters and add to C
    C = [];
    for j=1:length(B)
        if ~isempty(B{j})
            C = [C strsplit(B{j}, {'/', ' '})];
        end
    end
end
Below is a sample text file:
<DOC>
<DOCNO>annotations/01/1515.eng</DOCNO>
<TITLE>Yacare Ibera</TITLE>
<DESCRIPTION>an alligator in the water;</DESCRIPTION>
<NOTES></NOTES>
<LOCATION>Corrientes, Argentina</LOCATION>
<DATE>August 2002</DATE>
<IMAGE>images/01/1515.jpg</IMAGE>
<THUMBNAIL>thumbnails/01/1515.jpg</THUMBNAIL>
</DOC>
Suppose you are looking for the word "alligator". Then you could do the following:
clc
word = 'alligator';
num_of_files = 10;
C = cell(num_of_files, 1);
for i = 1:10
    file = [num2str(i) '.eng'];
    %// load string from a file
    STR = importdata(file);
    %// extract string between tags
    %// assuming you want to remove the angle brackets
    B = regexprep(STR, '<.*?>','');
    B(strcmp(B, '')) = [];
    %// split each string by delimiters and add to C
    tmp = regexp(B, '/| ', 'split');
    C{i} = [tmp{:}];
end
where = [];
for j = 1:length(C)
    if find(strcmp(C{j}, word))
        where = [where num2str(j) '.eng, '];
    end
end
if length(where) == 0
    disp(['No file contains the word ' word '.'])
else
    where(end-1:end) = [];
    disp(['The word ' word ' is contained in: ' where])
end
Because I used 10 copies of your file, the word "alligator" is in each one of them so I get
The word alligator is contained in: 1.eng, 2.eng, 3.eng, 4.eng, 5.eng,
6.eng, 7.eng, 8.eng, 9.eng, 10.eng
Whereas, if I set word = 'cohomology', the output is
No file contains the word cohomology.
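Since each C{i} holds the split words of file i, per-file lookups afterwards are plain cell indexing; a short usage sketch with the same variables:
words_in_3 = C{3};                      % every token extracted from 3.eng
hits = sum(strcmp(C{3}, 'alligator'));  % how many times the word occurs in that file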