Match a numeric string from a plain text fileusing regexpi - matlab

I have. Txt file and I am looking for specific string
Example
Row1
Row2
Pressure 95.8
There could be more white-space characters between string and numbers and in every cycle there is different value: 1. Cycle 95.8>>next cycle 1009.6543>>..., but its always in this format
I tried this
Fid = fileread ('search.txt')
Value = regexpi (fid, '(? <=Pressure\s*)\d*', 'match')
It saved just '95' instead of '95.8'

To be as complete as possible, some quite generic pattern to match a double (excluding for NaN and +/-Inf cases) is the following:
[+\-]?(?:(?:\d+\.\d*)|(?:\.\d+)|(?:\d+))(?:[eE][+\-]?\d+)?
Quickly reviewing the pattern it says:
Optionally the sign + or -
Followed by (either and in order):
Some digits (at least one), a dot, some digits (eventually zero)
A dot, some digits (at least one)
Some digits (at least one)
Optionally followed by:
The letter e or E
Optionally the sign + or -
Some digits (at least one)
The pattern to match many whitespaces (including none) is the following:
\s*
So to match the word Pressure followed by many spaces and then a double in the string str, you can use the following code:
function [pressures] = ParsePressures(str)
%[
% Some default string
if (nargin < 1)
str = cell(0,1);
str{end+1} = 'Row1';
str{end+1} = 'Row2';
str{end+1} = 'Pressure 95.8';
str{end+1} = 'Row1';
str{end+1} = 'Row2';
str{end+1} = 'Pressure 1009.6543';
str = sprintf('%s\n', str{1:end});
end
% Obtain 'tokens' for all macthes in 'str'
% NB1: 'tokens' is a cell vector whose length is the number of match
% NB2: For each match 'tokens{i}' is a cell vector containing captured groups
tokens = regexpi(str, sprintf('%s', 'Pressure', mspaces, cdouble), 'tokens');
% Transform maches in a array of double
pressures = arrayfun(#(x)str2double(x{1}), tokens);
%]
end
%% --- Regular expression helpers
function [s] = mspaces()
%[
s = '\s*'; %% This means 'as many spaces as you want'
%]
end
function [s] = cdouble()
%[
s = '([+\-]?(?:(?:\d+\.\d*)|(?:\.\d+)|(?:\d*))(?:[eE][+\-]?\d+)?)'; %% This means 'capture a double'
%]
end
It will return all pressure values in str as an array of double:
>> ParsePressures
ans =
95.8 1009.6543
NB: In the code i'm using mspaces and cdouble helper functions to build the complete pattern (i.e. sprintf('%s', 'Pressure', mspaces, cdouble), that is the word Pressure, followed by many spaces and followed by a double that I want to capture).

Related

How to calculate the number of appearance of each letter(A-Z ,a-z as well as '.' , ',' and ' ' ) in a text file in matlab?

How can I go about doing this? So far I've opened the file like this
fileID = fopen('hamlet.txt'.'r');
[A,count] = fscanf(fileID, '%s');
fclose(fileID);
Getting spaces from the file
First, if you want to capture spaces, you'll need to change your format specifier. %s reads only non-whitespace characters.
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%s');
>> fclose(fileID);
>> A
A = Thistexthasspacesinit.
Instead, we can use %c:
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%c');
>> fclose(fileID);
>> A
A = This text has spaces in it.
Mapping between characters and values (array indices)
We could create a character array that contains all of the target characters to look for:
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
That would work, but to map the character to a position in the array you'd have to do something like:
>> char_pos = find(search_chars == 'q')
char_pos = 43
You could also use containters.Map, but that seems like overkill.
Instead, let's use the ASCII value of each character. For convenience, we'll use only values 1:126 (0 is NUL, and 127 is DEL. We should never encounter either of those.) Converting from characters to their ASCII code is easy:
>> c = 'q'
c = s
>> a = uint8(c) % MATLAB actually does this using double(). Seems wasteful to me.
a = 115
>> c2 = char(a)
c2 = s
Note that by doing this, you're counting characters that are not in your desired list like ! and *. If that's a problem, then use search_chars and figure out how you want to map from characters to indices.
Looping solution
The most intuitive way to count each character is a loop. For each character in A, find its ASCII code and increment the counter array at that index.
char_count = zeros(1, 126);
for current_char = A
c = uint8(current_char);
char_count(c) = char_count(c) + 1;
end
Now you've got an array of counts for each character with ASCII codes from 1 to 126. To find out how many instances of 's' there are, we can just use its ASCII code as an index:
>> char_count(115)
ans = 4
We can even use the character itself as an index:
>> char_count('s')
ans = 4
Vectorized solution
As you can see with that last example, MATLAB's weak typing makes characters and their ASCII codes pretty much equivalent. In fact:
>> 's' == 115
ans = 1
That means that we can use implicit broadcasting and == to create a logical 2D array where L(c,a) == 1 if character c in our string A has an ASCII code of a. Then we can get the count for each ASCII code by summing along the columns.
L = (A.' == [1:126]);
char_count = sum(L, 1);
A one-liner
Just for fun, I'll show one more way to do this: histcounts. This is meant to put values into bins, but as we said before, characters can be treated like values.
char_count = histcounts(uint8(A), 1:126);
There are dozens of other possibilities, for instance you could use the search_chars array and ismember(), but this should be a good starting point.
With [A,count] = fscanf(fileID, '%s'); you'll only count all string letters, doesn't matter which one. You can use regexp here which search for each letter you specify and will put it in a cell array. It consists of fields which contains the indices of your occuring letters. In the end you only sum the number of indices and you have the count for each letter:
fileID = fopen('hamlet.txt'.'r');
A = fscanf(fileID, '%s');
indexCellArray = regexp(A,{'A','B','C','D',... %I'm too lazy to add the other letters now^^
'a','b','c','d',...
','.' '};
letterCount = cellfun(#(x) numel(x),indexCellArray);
fclose(fileID);
Maybe you put the cell array in a struct where you can give fieldnames for the letters, otherwise you might loose track which count belongs to which number.
Maybe there's much easier solution, cause this one is kind of exhausting to put all the letters in the regexp but it works.

Regexp equivalent for floating number in matlab

I'd like to know how regexp is used for floating numbers or is there any other function to do this.
For example, the following returns {'2', '5'} rather than {'2.5'}.
nums= regexp('2.5','\d+','match')
Regular expressions are a tool for low-level text parsing and they have no concept of numeric datatypes. If you will want to parse decimals, you need to consider what characters compose a decimal number and design a regexp to explicitly match all of those characters.
Your example only returns the '2' and '5' because your pattern only matches characters that are digits (\d). To handle the decimal numbers, you need to explicitly include the . in your pattern. The following will match any number of digits followed by 0 or 1 radix points and 0 or more numbers after the radix point.
regexp('2.5', '\d+\.?\d*', 'match')
This assumes that you'll always have a leading digit (i.e. not '.5')
Alternately, you may consider using something like textscan or sscanf instead to parse your string which is going to be more robust than a custom regex since they are aware of different numeric datatypes.
C = textscan('2.5', '%f');
C = sscanf('2.5', '%f');
If your string only contains this floating point number, you can just use str2double
val = str2double('2.5');
#Suever answer has already been accepted, anyway here is some more complete one that should accept all sorts of floating points syntaxes (including NaN and +/-Inf by default):
% Regular expression for capturing a double value
function [s] = cdouble(supportPositiveInfinity, supportNegativeInfinity,
supportNotANumber)
%[
if (nargin < 3), supportNotANumber = true; end
if (nargin < 2), supportNegativeInfinity = true; end
if (nargin < 1), supportPositiveInfinity = true; end
% A double
s = '[+\-]?(?:(?:\d+\.\d*)|(?:\.\d+)|(?:\d+))(?:[eE][+\-]?\d+)?'; %% This means a numeric double
% Extra for nan or [+/-]inf
extra = '';
if (supportNotANumber), extra = ['nan|' extra]; end
if (supportNegativeInfinity), extra = ['-inf|' extra]; end
if (supportPositiveInfinity), extra = ['inf|\+inf|' extra]; end
% Adding capture
if (~isempty(extra))
s = ['((?i)(?:', extra, s, '))']; % (?i) => Locally case-insensitive for captured group
else
s = ['(', s, ')'];
end
%]
end
Basically above pattern says:
Eventually start with '+' or '-' sign
Then followed by either
One or more digits followed by a dot and eventually zero to many digits
A dot followed by one or more digits
One or more digits only
Then followed by exponant pattern, that is:
'e' or 'E' (eventually followed with '+' or '-') and one or more digits
Pattern is later completed with support of Inf, +Inf, -Inf and NaN in case insensitive way. Finally everthing is enclosed between ( and ) for capturing purpose.
Here is some online test example: https://regex101.com/r/K87z6e/1
Or you can test in matlab:
>> regexp('Hello 1.235 how .2e-7 are INF you +.7 doing ?', cdouble(), 'match')
ans =
'1.235' '.2e-7' 'INF' '+.7'

Matlab strsplit by space and quotes

I have a string of variable names as shown below:
{'"NORM TIME SEC, SEC, 9999997" "ROD FORCE, LBS, 3000118" "ROD POS, DEG, 3000216" P_ext_chamber_press P_ret_chamber_press "GEAR#1 POS INCH" 388821 Q_valve_gpm P_return 3882992 "COMMAND VOLTAGE VOLT"'}
the double quotes are for the variable names with spaces or special characters between the words" and a single word variable doesn't have any quotes around them. The variables are separated by one space. Some variable names are just numbers.
At the end, I want to create a cell with strings as follows:
{'NORM_TIME_SEC_SEC_9999997','ROD_FORCE_LBF_3000118','ROD_POS_DEG_3000216','P_ext_chamber_press','P_ret_chamber_press','GEAR#1_POS_INCH','3388821','Q_valve_gpm','P_return','3882992','COMMAND_VOLTAGE_VOLT'}
You can use regexp to first split it into groups and then replace all space with _
data = {'"abc def ghi" "jkl mno pqr" "stu vwx" yz"'};
% Get each piece within the " "
pieces = regexp(data{1}, '(?<=\"\s*)([A-Za-z0-9]+\s*)*(?=\"\s*)', 'match');
% 'abc def ghi' 'jkl mno pqr'
% Replace any space with _
names = regexprep(pieces, '\s+', '_');
% 'abc_def_ghi' 'jkl_mno_pqr'
Update
Since your last variable isn't surrounded by quotes, you could do something like the following
pieces = strtrim(regexp(data, '[^\"]+(?=\")', 'match'));
pieces = pieces{1};
pieces = pieces(~cellfun(#isempty, pieces));
% Replace spaces with _
regexprep(pieces, '\s+', '_')
I ended up forcing myself to study little bit on regular expression and the following seems to be working well for what I'm trying to do:
regexp(str,'(\"[\w\s\,\.\#]+\"|\w+)','match')
Probably not as robust as I want since I'm specifically singling out a certain set of special characters only, but so far, I haven't seen other special characters other than those in the data sets I have.
str = {'"abc def ghi" "jkl mno pqr" "stu vwx" yz"'};
Then
str_u = strrep(str,' ','_');
[str_q rest] = strtok(str_u,'"');
str_u = rest;
while ~strcmp(rest,'')
[token rest] = strtok(str_u,'"');
if ~(strcmp(token,'_')||strcmp(token,''))
if strcmp(token{1,1}(1),'_')
token{1,1} = strrep(token{1,1},'_','');
end
str_q = [str_q, token];
end
str_u = rest;
end
The resultant cell array is str_q, which will give the names of the variables
str_q = 'abc_def_ghi' 'jkl_mno_pqr' 'stu_vwx' 'yz'

Capitalize the first and last letter of three letter words in a string

I am trying to capitalize the first and last letter of only the three letter words in a string. So far, I have tried
spaces = strfind(str, ' ');
spaces = [0 spaces];
lw = diff(spaces);
lw3 = find(lw ==4);
a3 = lw-1;
b3 = spaces(a3+1);
b4 = b3 + 2 ;
str(b3) = upper(str(b3));
str(b4) = upper(str(b4);
we had to find where the 3 letter words were first so that is what the first 4 lines of code are and then the others are trying to get it so that it will find where the first and last letters are and then capitalize them?
I would use regular expressions to identity the 3-letter words and then use regexprep combined with an anonymous function to perform the case-conversion.
str = 'abcd efg hijk lmn';
% Custom function to capitalize the first and last letter of a word
f = #(x)[upper(x(1)), x(2:end-1), upper(x(end))];
% This will match 3-letter words and apply function f to them
out = regexprep(str, '\<\w{3}\>', '${f($0)}')
% abcd EfG hijk LmN
Regular expressions are definitely the way to go. I am going to suggest a slightly different route, and that is to return the indices using the tokenExtents flag for regexpi:
str = 'abcd efg hijk lmn';
% Tokenize the words and return the first and last index of each
idx = regexpi(str, '(\<w{3}\>)', 'tokenExtents');
% Convert those indices to upper case
str([idx{:}]) = upper(str([idx{:}]));
Using the matlab ipusum function from the File Exchange, I generated a 1000 paragraph random text string with mean word length 4 +/- 2.
str = lower(matlab_ipsum('WordLength', 4, 'Paragraphs', 1000));
The result was a 177,575 character string with 5,531 3-letter words. I used timeit to check the execution time of using regexprep and regexpi with tokenExtents. Using regexpi is an order of magnitude faster:
regexpi = 0.013979s
regexprep = 0.14401s

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed texts from html pages. They are already given to me. I want to extract only the words from it. I do not want any numbers or common words or any single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to make the cell array having only the words - like this.
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert this to this cell array
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this ?
Updated Requirement : Thanks to all of your answers. However I am facing a problem ! Let me illustrate this (Please note that the cell values are extracted text from HTML documents and hence may contain non ASCII values) -
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My Questions:
Why is bin/bash and take-a-long being separated into three cells ? Its not a problem for me but still why? Can this be avoided. I mean all words coming from a single cell being combined into one.
Notice that in '!–photo' there exists an non-ascii character â which esentially means a. Can a step be incorporated such that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
Also the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns a word as 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and ''gypsy_wagon to be 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done ?
Update 1 Following all the suggestions I have written down a function which does most of the things except the above two newly asked questions.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documnets into raw text
% It will remove all commas empty cells and other special characters
% It will also convert all the words of the text documents into lowercase
T = textread(filename, '%s');
% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = []; count = 1;
for j=1:length(T1)
x = T1{j};
ind=find(ismember(not_words,x), 1);
if isempty(ind)
B = regexp(x, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
% convert the string into lowecase
% so that while generating the features the case sensitivity is
% handled well
x = lower(B);
T2{count,1} = x;
count = count+1;
end
end
T2 = T2(~cellfun('isempty',T2));
% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j=1:length(T2)
x = T2{j};
% if a particular cell contains only numbers then make it empty
if sum(isstrprop(x, 'digit'))~=0
T2{j} = [];
end
% also remove single character cells
if length(x)==1
T2{j} = [];
end
% also remove the most common words from the dictionary
% the common words are taken from the english dicitonary (source
% wikipedia)
ind=find(ismember(not_words2,x), 1);
if isempty(ind)==0
T2{j} = [];
end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found this code in here which tells me how to check for non-ascii characters. Incorporating this code snippet in Matlab as
% remove the non-ascii characters
if all(x < 128)
else
T2{j} = [];
end
and then removing the empty cells it seems my second requirement is fulfilled though the text containing a part of non-ascii characters completely disappears.
Can my final requirements be completed ? Most of them concerns the character '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
Which matches any alphabetic, numeric, or underscore character. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why is bin/bash and take-a-long being separated into three cells ? Its not a problem for me but still why? Can this be avoided. I mean all words coming from a single cell being combined into one.
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the input cell array. You can combine these however you want after.
Notice that in '!–photo' there exists an non-ascii character â which esentially means a. Can a step be incorporated such that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. This will also exclude the non-ASCII characters, which it seems like the \w* flag matches.
I think #excaza's solution would be the go-to approach, but here's an alternative one with isstrprop using its optional input argument 'alpha' to look for alphabets -
A(cellfun(#(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(#(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(#(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'