Perl Regex to match words with more than 2 characters - perl

I am new to Perl and working on a regex to match only words with three or more letters. Here is the program I am trying. I tried adding \w{3,} since it should match 3 or more characters, but it still matches words with fewer than 3 characters. For example, if I give "This is a Pattern", I want my $field to match only "This" and "Pattern" and skip "is" and "a".
#!/usr/bin/perl
while (<STDIN>) {
foreach my $reg_part (split(/\s+/, $_)) {
if ($reg_part =~ /([^\w\#\.]*)?([\w{3,}\#\(\)\+\$\.]+)(?::(.+))?/) {
print "reg_part = $reg_part \n";
my ($mod, $field, $pat) = ($1, $2, $3);
print "#$mod#$field#$pat#$negate#\n";
}
}
}
exit(0);
What am I missing?

You have
[\w{3,}...]+
which is the same as
[{},3\w...]+
I think you want
(?:\w{3,}|[\$\#()+.])+
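To see the difference, both forms can be run against a few tokens. This is only a sketch: the helper names in_class and in_alternation and the sample tokens are made up for illustration, and the patterns are anchored with ^ and $ so each token has to match in full.

```perl
use strict;
use warnings;

# Inside [...], "{3,}" is not a quantifier: the class simply gains the
# literal characters "{", "3", "," and "}".
sub in_class { return $_[0] =~ /^[\w{3,}]+$/ ? 1 : 0 }

# Outside a class, {3,} quantifies \w, so runs of fewer than three
# word characters are rejected.
sub in_alternation { return $_[0] =~ /^(?:\w{3,}|[\$\#()+.])+$/ ? 1 : 0 }

for my $token ('This', 'is', 'a', 'Pattern', '{3,}') {
    printf "%-8s class=%d alternation=%d\n",
        $token, in_class($token), in_alternation($token);
}
```

The class form accepts "is", "a" and even "{3,}", while the alternation form accepts only "This" and "Pattern".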

Break your regular expression up.
You know you want three word characters, so specify :-
# Match three word characters.
\w{3}
After that, you don't really care if the word has more characters, but you won't block it either.
# Match 0 or more word characters
\w*
Finally, you want to ensure that you have boundaries to catch the end of words. So, putting it all together. To match a word with at least three word characters, possibly more, use:-
# Word boundaries at start and end
\b\w{3}\w*\b
Note - \w matches alphanumerics and the underscore - if you want letters only you need:-
# Alpha only
\b[A-Za-z]{3}[A-Za-z]*\b
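Putting the pieces together, a small sketch of pulling the long words out of the sample sentence could look like this. The sub name long_words is hypothetical; it relies on a /g match in list context returning every match.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of three or more word characters.
sub long_words {
    my ($text) = @_;
    return $text =~ /\b\w{3}\w*\b/g;
}

my @words = long_words("This is a Pattern");
print "@words\n";    # This Pattern
```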

Related

Matching special character (###!~`%^&()[]}{;') and replace it with _ (underscore) in perl

I want to remove all special characters except these two: . and -
$name=~s/[^\w\d\.-]/_/g ;
But the line above removes not only the special characters but also non-Latin characters, e.g. Arabic letters.
How can I remove only these characters (###!~`%^&()[]}{;',)?
There are a few things to consider here.
First, do \d and \w really do what you think they do? Recent perls are Unicode aware (and in some cases locale aware), and those character classes aren't the same in every situation.
Since you know what you want to exclude, you can just put that directly into the character class. You need to escape only the ] so it doesn't end the character class:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/[(###!~`%^&()[\]}{;',)]/_/g;
say $name;
Mark Jason Dominus has written about the "American" and "Prussian" approaches to cleansing data. You can specify what to exclude, or what to include.
If you specify the things to exclude, you potentially pass through some things that you should have excluded but did not. This may be because you forgot or didn't even know you should exclude it. These unintended situations may bite you.
If you specify only the things that are safe, you potentially miss out on things you should pass through, but bad things don't get through by mistakes of omission.
You then might try this, where you don't use the character class shortcuts:
$name =~ s/[^0-9A-Za-z.-]/_/g;
But the output is a bit weird because this also replaces whitespace. You might add the \s shortcut in there:
$name =~ s/[^0-9A-Za-z\s.-]/_/g;
But the meaning of \s has also changed over time too (vertical tab!) and is also Unicode aware. You could list the whitespace you would accept:
$name =~ s/[^0-9A-Za-z\x20.-]/_/g;
But now this is getting a bit weird. There's another way. You can go back to the ASCII versions of the character class shortcuts with the /a flag:
$name =~ s/[^\d\w\s.-]/_/ga;
The regex operator flags are in perlop since they apply to an operator. But, for as long as I've been using Perl and telling that to people in classes, somehow I still go to perlre first.
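A short sketch of what /a changes, assuming Perl 5.14 or later (where the flag is available). The sample character is my own choice, written as an escape so the source stays plain ASCII. Note that under /a an Arabic letter no longer counts as \w, so the negated class above would replace it with an underscore.

```perl
use strict;
use warnings;

# U+0633 is ARABIC LETTER SEEN.
my $arabic = "\x{0633}";

my $unicode_match = $arabic =~ /\w/  ? 1 : 0;    # Unicode rules
my $ascii_match   = $arabic =~ /\w/a ? 1 : 0;    # ASCII-only rules

print "without /a: $unicode_match, with /a: $ascii_match\n";
```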
Transliterate
Second, the substitution operator may be more than you need though. If you want to change single characters into other single characters, the transliteration operator may be what you need. It changes the character on the left with the corresponding character on the right:
$name =~ tr/abc/XYZ/; # a -> X, b -> Y, c -> Z
If you don't have enough characters to match up on the right, it reuses the last character:
$name =~ tr/abc/XY/; # a -> X, b -> Y, c -> Y
So, in your case with one underscore:
$name =~ tr/##!~`%^&()[]}{;',/_/;
Since the sequence of characters in tr/// isn't a regular expression, you don't need to worry about regex metacharacters.
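A minimal sketch with a made-up input string shows the brackets and parentheses going in unescaped. The only characters that need care in a tr/// list are - (which forms a range between two characters), the backslash, and the delimiter itself.

```perl
use strict;
use warnings;

my $name = 'a(b)c[d]';

# tr returns the number of characters it translated.
my $count = ($name =~ tr/()[]/_/);

print "$name ($count replaced)\n";    # a_b_c_d_ (4 replaced)
```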
Just for giggles
If this pattern is something you want to use in multiple places, you might want to give it a name with a user-defined Unicode property. Once it has a name, you use that everywhere and can update for everyone at the same time:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/\p{IsForbidden}/_/g;
say $name;
sub IsForbidden {
# see https://perldoc.perl.org/perlunicode#User-Defined-Character-Properties
state $exclude = q|##!~`%^&()[]}{;',|;
state $string =
join '',
map { sprintf "%X\n", ord }
split( //, $exclude );
return $string;
}
Building on Gene's comment, specify what you want to replace but I'd escape each special character. Note, to replace #, use \#\# in character array as shown in line 2:
$name = "# # R ! ~## ` % ^ & ( O ){{();,'`## { } ;!!! ' N , ";
$name =~ s/[\#\!\~\`\%\&\^\(\)\{\}\;\'\,\#\#]//g;
$name =~ s/ *//g;
print $name;
### Outputs RON

Finding index of white space in Perl

I'm trying to find the index of white space in a string in Perl.
For example, if I have the string
stuff/more stuffhere
I'd like to select the word "more" with a substring method. I can find the index of "/" but haven't figured out how to find the index of white space. The length of the substring I'm trying to select will vary, so I can't hard code the index. There will only be one white space in the string (other than those after the end of the string).
Also, if anybody has any better ideas of how to do this, I'd appreciate hearing them. I'm fairly new to programming so I'm open to advice. Thanks.
Just use index:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'stuff/more stuffhere';
my $index_of_slash = index $string, '/';
my $index_of_space = index $string, ' ';
say "Between $index_of_slash and $index_of_space.";
The output is
Between 5 and 10.
Which is correct:
0         1
01234567890123456789
stuff/more stuffhere
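With both offsets known, the word can be cut out with substr, which is what the question asks for. This is a sketch using the same sample string. The third argument to index (the position to start searching from) and the -1 check are additions of mine.

```perl
use strict;
use warnings;

my $string = 'stuff/more stuffhere';

my $slash = index $string, '/';
# Start looking for the space only after the slash.
my $space = index $string, ' ', $slash + 1;

# index returns -1 when nothing is found, so check before slicing.
my $word = '';
if ($slash >= 0 && $space > $slash) {
    $word = substr $string, $slash + 1, $space - $slash - 1;
}
print "$word\n";    # more
```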
If by "whitespace" you also mean tabs or whatever, you can use a regular expression match and the special variables @- and @+:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = "stuff/more\tstuffhere";
if ($string =~ m{/.*(?=\s)}) {
say "Between $-[0] and $+[0]";
}
The (?=\s) means is followed by a whitespace character, but the character itself is not part of the match, so you don't need to do any maths on the returned values.
As you stated, you want to select the word between the first /
and the first space following it.
If this is the case, you maybe don't need any index (you need just
the word).
A perfect tool to find something in a text is regex.
Look at the following code:
$txt = 'stuff/more stuffxx here';
if ($txt =~ /\/(.+?) /) {
print "Match: $1.\n";
}
The regex used tries to match:
a slash,
a non-empty sequence of any chars (note ? - reluctant
version), enclosed in a capturing group,
a space.
So after the match $1 contains what was captured by the first
capturing group, i.e. "your" word.
But if for any reason you were interested in starting and ending
offsets to this word, you can read them from $-[1]
and $+[1] (starting / ending indices of the first capturing group).
The arrays @- (@LAST_MATCH_START) and @+ (@LAST_MATCH_END) give offsets of the start and end of last successful submatches. See Regex related variables in perlvar.
You can capture your real target, and then read off the offset right after it with $+[0]
@+
This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against.
Example
my $str = 'target and target with spaces';
while ($str =~ /(target)\s/g)
{
say "Position after match: $+[0]"
}
prints
Position after match: 7
Position after match: 18
These are the offsets just past the whitespace that follows each 'target', because \s is part of the match; the spaces themselves are at $+[1], i.e. offsets 6 and 17.
Or you can capture \s instead and use $-[1] + 1 (first position of the match, the space).
You can use
my $str = "stuff/more stuffhere";
if ($str =~ m{/\K\S+}) {
... substr($str, $-[0], $+[0] - $-[0]) ...
}
But why substr? That's very weird there. Maybe if you told us what you actually wanted to do, we could provide better alternatives. Here are three cases:
Data extraction:
my $str = "stuff/more stuffhere";
if ( my ($word) = $str =~ m{/(\S+)} ) {
say $word; # more
}
Data replacement:
my $str = "stuff/more stuffhere";
$str =~ s{/\K\S+}{REPLACED};
say $str; # stuff/REPLACED stuffhere
Data replacement (dynamic):
my $str = "stuff/more stuffhere";
$str =~ s{/\K(\S+)}{ uc($1) }e;
say $str; # stuff/MORE stuffhere

How can I preserve the uppercase/lower case of a string in search using perl?

I want to search for "Frequencies" (its first letter in uppercase) in my text files. And my code will print to the output file some columns including "Frequencies". But there are also occurrences of "frequencies" (its first letter in lowercase) in the text files. I am using this part $search_word = qr/Frequencies/; in the code. How can I make the first letter of the word "Frequencies" upper case in the $search_word = qr/Frequencies/; part to eliminate the occurrences of "frequencies" in the search?
In Perl, you have ucfirst to capitalize the first letter. For example:
$a = "freQuEncY";
$a = ucfirst(lc($a)); # $a <-- "Frequency";
Why don't you use a regex match to check, like this:
if($string_to_be_searched =~ /Frequencies/){
print "Found it\n"; # or do whatever you need with the match
}
Try this one:
if ( $$test_string[$i] =~ /\b(?i)f(?-i)requencies/ ) {
my $captured = ucfirst($&);
# process $captured
}
Explanation:
The regex matches will be case-insensitive for the first letter of the word frequencies only. (?i) turns on case-insensitive matching at the position it occurs for the remainder of the pattern or until it is revoked by (?-i). This works for other flags too, cf. perldoc section on re.
$& contains the full match
\b denotes a word boundary (perhaps you don't need that but your problem description suggests you do).
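The scoped modifier can be tried in isolation. This sketch uses the equivalent (?i:...) form, which limits the flag to the group instead of switching it on and off again. The three test strings are my own.

```perl
use strict;
use warnings;

# Only the first letter is matched case-insensitively.
my $re = qr/\b(?i:f)requencies\b/;

for my $text ('Frequencies', 'frequencies', 'FREQUENCIES') {
    my $result = $text =~ $re ? 'match' : 'no match';
    print "$text: $result\n";
}
```

The first two strings match and the all-uppercase one does not. A plain qr/Frequencies/ with no modifier at all is already case-sensitive and matches only the capitalized form.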

Perl - partial pattern matching in a sequence of letters

I am trying to find a pattern using perl. But I am only interested with the beginning and the end of the pattern. To be more specific I have a sequence of letters and I would like to see if the following pattern exists. There are 23 characters. And I'm only interested in the beginning and the end of the sequence.
For example I would like to extract anything that starts with ab and ends with zt. There is always
So it can be
abaaaaaaaaaaaaaaaaaaazt
So that it detects this match
but not
abaaaaaaaaaaaaaaaaaaazz
So far I tried
if ($line =~ /ab[*]zt/) {
print "found pattern ";
}
thanks
* is a quantifier and meta character. Inside a character class bracket [ .. ] it just means a literal asterisk. You are probably thinking of .* which is a wildcard followed by the quantifier.
Matching entire string, e.g. "abaazt".
/^ab.*zt$/
Note the anchors ^ and $, and the wildcard character . followed by the zero or more * quantifier.
Match substrings inside another string, e.g. "a b abaazt c d"
/\bab\S*zt\b/
Using word boundary \b to denote beginning and end instead of anchors. You can also be more specific:
/(?<!\S)ab\S*zt(?!\S)/
Using a double negation to assert that no non-whitespace characters follow or precede the target text.
It is also possible to use the substr function
if (substr($string, 0, 2) eq "ab" and substr($string, -2) eq "zt")
You mention that the string is 23 characters, and if that is a fixed length, you can get even more specific, for example
/^ab.{19}zt$/
Which matches exactly 19 arbitrary characters between ab and zt (2 + 19 + 2 = 23). The syntax for the {} quantifier is {min,max}; if max is left blank there is no upper limit, i.e. {1,} is the same as + and {0,} is the same as *, meaning one/zero or more matches (respectively).
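As a quick check of the fixed-length pattern, here is a sketch that builds the two sample strings from the question. The sub name is_target is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true for a 23-character string that starts
# with "ab" and ends with "zt".
sub is_target {
    my ($seq) = @_;
    return $seq =~ /^ab.{19}zt$/ ? 1 : 0;
}

my $good = 'ab' . ('a' x 19) . 'zt';
my $bad  = 'ab' . ('a' x 19) . 'zz';
print is_target($good), is_target($bad), "\n";    # 10
```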
Just a * by itself won't match anything (except a literal * when it is inside a character class, as in your pattern); if you want to match anything you need to use .*.
if ($line =~ /^ab.*zt$/) {
print "found pattern ";
}
If you really want to capture the match, wrap the whole pattern in a capture group:
if (my ($string) = $line =~ /^(ab.*zt)$/) {
print "found pattern $string";
}

How can i make substitutions of the same word in Perl on the same xml line?

I'm working on an XML Document, I need to open it and transform to uppercase some specific tag values on the same line. If I have the same word it only does the substitution for one of them although I'm using two different if loops:
This is my XML:
<pageID="1" width="827" height="1169" Sender_Company="société" Sender_Address="société" Sender_Fax="" Category="C2" Language_2="" Document_Object="" Language_1="french" Language_3="" NumPage="1" Script_1="typed">
This is my code:
while (<FILEIN>) {
if ($_ =~ /pageID="1"/) {
$haschanged = 1;
if ($_ !~ /Sender_Address=""/) {
if ($_ =~ /(Sender_Address="(.*?)")/){
my $SenderAddress = $2;
$SenderAddress = uc($SenderAddress);
$_ =~ s/$1/Sender_Address="$SenderAddress"/;
}
}
if ($_ !~ /Sender_Company=""/) {
if ($_ =~ /(Sender_Company="(.*?)")/) {
my $SenderCompany = $2;
$SenderCompany = uc($SenderCompany);
$_ =~ s/$1/Sender_Company="$SenderCompany"/;
#print "$_\n";
}
}
}
}
When I use two different values for Sender_Company="bla" and Sender_Address="société" the transformation to uppercase works but when I use in this case the same word Sender_Company="société" and Sender_Address="société" it doesn't do the transformation to uppercase.
Does anyone have any ideas? I can't find the logic behind it not wanting to transform the same word when I'm using two distinct if loops at a time. Thank you!
Your understanding of XML is a bit debatable:
That isn't XML. It is an XML fragment at most (Element not closed, tag name can't double as attribute like <pageID="1">, no <?xml ...?> declaration, no root element, …)
Don't parse XML with regexes ;-)
XML doesn't have a concept of “lines”.
Besides that, the code should work fine. Do note that you can make your life easier and your code shorter:
$_ =~ /foo/ is the same as /foo/, $_ !~ /foo/ is the same as !/foo/.
Instead of extracting two captures, and substituting the result in a second regex, you can do it all in just one step:
s{ (?<=Sender_Address=") ([^"]+) (?=") }{ uc $1 }ex
Wait, what? I extract one or more non-"-characters that are preceded by the string Sender_Address=" and are followed by " (look-around assertions). The thing in between I capture, and substitute it with an uppercased version. Because I match at least one character, I don't have to test for the empty tag case. The /e flag allows code in the substitution (not really necessary here), and the /x allows us to include nonmatching whitespace for better formatting.
You can easily extend this for both attributes you want to uppercase:
# This subsumes your whole logic inside `if (/pageID="1"/)`
$haschanged = 1;
for my $attr (qw/Sender_Address Sender_Company/) {
s{ (?<=\Q$attr\E=") ([^"]+) (?=") }{ uc $1 }ex;
}
The \Q...\E causes the interpolated stuff to match literally, even if it contains characters that would be regex metacharacters otherwise.
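To try the loop on its own, it can be wrapped in a small sub and fed a line where both attributes hold the same word, which is the case from the question. This is a sketch: the sub name upcase_attrs and the ASCII sample line are assumptions of mine, chosen so the example does not depend on the file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase the values of the named attributes
# in one line of text.
sub upcase_attrs {
    my ($line, @attrs) = @_;
    for my $attr (@attrs) {
        $line =~ s{ (?<=\Q$attr\E=") ([^"]+) (?=") }{ uc $1 }ex;
    }
    return $line;
}

my $line = '<page Sender_Company="acme" Sender_Address="acme" Category="C2">';
print upcase_attrs($line, qw(Sender_Address Sender_Company)), "\n";
```

Both values come back as ACME and the other attributes are left alone. Empty values such as Sender_Fax="" are skipped because [^"]+ needs at least one character.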
There are a few remaining bugs:
You fail to uppercase characters that are given as entities.
XML allows single quotes '...' to be used as attribute value delimiters. You don't handle them.
See the points under Your understanding of XML…
All of these can be solved by using an XML parser, and then transforming the attributes in the DOM.