I have an array like this
my @stopWords = ("and","this",....)
My text is in this variable
my $wholeText = "....and so this is...."
I want to match every occurrence of every element of my stopWords array in the scalar wholeText and replace it with spaces.
One way of doing this is as follows :
foreach my $stopW (@stopWords)
{
$wholeText =~ s/$stopW/ /;
}
This works and replaces every occurrence of all the stop words. I was just wondering, if there is a shorter way of doing it.
Like this:
$wholeText =~ s/@stopWords/ /;
The above does not seem to work though.
While the various map/for-based solutions will work, they'll also do regex processing of your string separately for each and every stopword. While this is no big deal in the example given, it can cause major performance issues as the target text and stopword list grow.
Jonathan Leffler and Robert P are on the right track with their suggestions of mashing all the stopwords together into a single regex, but a simple join of all the stopwords into a single alternation is a crude approach and, again, becomes inefficient if the stopword list is long.
Enter Regexp::Assemble, which will build you a much 'smarter' regex to handle all the matches at once - I've used it to good effect with lists of up to 1700 or so words to be checked against:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use Regexp::Assemble;
my @stopwords = qw( and the this that a an in to );
my $whole_text = <<EOT;
Fourscore and seven years ago our fathers brought forth
on this continent a new nation, conceived in liberty, and
dedicated to the proposition that all men are created equal.
EOT
my $ra = Regexp::Assemble->new(anchor_word_begin => 1, anchor_word_end => 1);
$ra->add(@stopwords);
say $ra->as_string;
say '---';
my $re = $ra->re;
$whole_text =~ s/$re//g;
say $whole_text;
Which outputs:
\b(?:t(?:h(?:at|is|e)|o)|a(?:nd?)?|in)\b
---
Fourscore seven years ago our fathers brought forth
on continent new nation, conceived liberty,
dedicated proposition all men are created equal.
My best solution:
$wholeText =~ s/$_//g for @stopWords;
You might want to sharpen the regexp using some \b and whitespace.
What about:
my $qrstring = '\b(' . (join '|', @stopWords) . ')\b';
my $qr = qr/$qrstring/;
$wholeText =~ s/$qr/ /g;
Concatenate all the words to form '\b(and|the|it|...)\b' (the parentheses around the join are necessary to give it a list context; without them, you end up with the count of the number of words). The '\b' metacharacters mark word boundaries, and therefore prevent you changing 'thousand' into 'thous'. Convert that into a quoted regular expression; apply it globally to your subject string (so that all occurrences of all stop words are removed in a single operation).
You can also do without the variable '$qr':
my $qrstring = '\b(' . (join '|', @stopWords) . ')\b';
$wholeText =~ s/$qrstring/ /g;
I don't think I'd care to maintain the code of anyone who managed to do without the variable '$qrstring'; it probably can be done, but I don't think it would be very readable.
My paranoid version:
$wholeText =~ s/\b\Q$_\E\b/ /gi for @stopWords;
Use \b to match word boundaries, and \Q..\E just in case any of your stopwords contains characters which may be interpreted as "special" by the regex engine.
You could consider using a regex join to create a single regex.
my $regex_str = join '|', map { quotemeta } @stopwords;
$string =~ s/$regex_str/ /g;
Note that the quotemeta part just makes sure that any regex characters are properly escaped.
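The pieces suggested so far (one joined alternation, quotemeta, and \b word boundaries) can be combined. Below is a minimal sketch, not a definitive implementation: the helper name stopword_regex and the sample text are made up for illustration, and the \b anchors assume every stopword starts and ends with a word character.

```perl
use strict;
use warnings;

# Hypothetical helper: build one compiled, case-insensitive regex that
# matches any of the given stopwords as a whole word.
sub stopword_regex {
    my @words = @_;
    # quotemeta escapes any regex metacharacters in the words
    my $alternation = join '|', map { quotemeta } @words;
    return qr/\b(?:$alternation)\b/i;
}

my @stopWords = qw(and this is so);
my $wholeText = 'Thousands and thousands: so this is it.';

my $re = stopword_regex(@stopWords);

# Copy the text, then replace every stopword in the copy in a single pass
(my $cleaned = $wholeText) =~ s/$re/ /g;
print "$cleaned\n";
```

Because the regex is compiled once with qr//, the text is scanned a single time no matter how many stopwords there are.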
grep{$wholeText =~ s/\b$_\b/ /g}@stopWords;
I tried to sort upper-case and lower-case words in Perl. A bunch of text is saved as "electricity.txt"
in the .txt file:
Today's scientific question is: What in the world is electricity and
where does it go after it leaves the toaster?
Here is a simple experiment that will teach you an important
electrical lesson: On a cool dry day, scuff your feet along a carpet,
then reach your hand into a friend's mouth and touch one of his dental
fillings. Did you notice how your friend twitched violently and cried
out in pain? This teaches one that electricity can be a very powerful
force, but we must never use it to hurt others unless we need to learn
an important lesson about electricity.
Somehow, I can't get any uppercase word
and my code is
my %count;
my $openFileile = "electricity.txt";
open my $openFile, '<', $openFileile;
while (my $list = <$openFile>) {
chomp $list;
foreach my $word (split /\s+/, $list) {
$count{lc($word)}++;
}
}
printf "\n\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
foreach my $word (sort keys %count){
printf "%-31s \n", sort {"\$a" cmp uc"\$b"} lc($word);
}
Issue 1
The first problem is that the statement below stores only the lower-case version of each word
$count{lc($word)}++;
After the initial while loop %count has only lower-case words. That means your foreach loop can never retrieve the upper-case words.
Issue 2
Second issue is this statement
printf "%-31s \n", sort {"\$a" cmp uc"\$b"} lc($word);
I have no idea what you think that the sort will achieve -- it is sorting a list with only one element, lc($word), so doesn't actually do anything.
A working example
Taking the comments above into account, here is a version that outputs both upper & lower-case words (abbreviated)
use strict;
use warnings;
my %count;
#my $openFileile = "electricity.txt";
#open my $openFile, '<', $openFileile;
while (my $list = <DATA>) {
chomp $list;
foreach my $word (split /\s+/, $list) {
$count{$word}++;
}
}
printf "\n\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
foreach my $word (sort keys %count){
printf "%-31s \n", $word;
}
__DATA__
Today's scientific question is: What in the world is electricity and where does it go after it leaves the toaster?
Here is a simple experiment that will teach you an important electrical lesson: On a cool dry day, scuff your feet along a carpet, then reach your hand into a friend's mouth and touch one of his dental fillings. Did you notice how your friend twitched violently and cried out in pain? This teaches one that electricity can be a very powerful force, but we must never use it to hurt others unless we need to learn an important lesson about electricity.
That prints this
Sorting Alphabetically with upper case words in front of lower-case words with the same initial characters
Did
Here
On
This
Today's
What
a
about
after
along
...
use
very
violently
we
where
will
world
you
As Hunter McMillen's comment says, you are using lc on the words when creating the hash, therefore all of your original capitalization will be lost. Let's go through your code, as I spot some other mistakes.
First off, always use use strict; use warnings. Especially if you have a preference for long and complicated variable names. It will save you from typos and weird bugs.
open my $openFile, '<', $openFileile;
With open statements, it is idiomatic to check the return value of the open, to see if anything went wrong. And if it did, to report the error. I.e. add ..., or die "Cannot open '$openFileile': $!".
foreach my $word (split /\s+/, $list) {
Typically, if you split on whitespace you usually want to split on ' ' -- a single space. This is a special case for split, also the default split mode, it will split on \s+, but also remove leading whitespace.
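To make the difference between the two split modes concrete, here is a small sketch. The sample line (with leading whitespace) is made up for illustration.

```perl
use strict;
use warnings;

my $line = "  scuff your   feet ";

# With a /\s+/ pattern, the leading whitespace produces an empty first field
my @by_regex = split /\s+/, $line;

# With the special ' ' pattern, leading whitespace is skipped first
my @by_space = split ' ', $line;

print scalar(@by_regex), " fields with /\\s+/\n";
print scalar(@by_space), " fields with ' '\n";
```

In both cases the trailing whitespace adds nothing, because split drops trailing empty fields by default.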
$count{lc($word)}++;
Here is your problem. All the words lose their original case.
printf "\n\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
printf is a special formatting print. If you do not intend to use that formatting, use the regular print to avoid problems.
printf "%-31s \n", sort {"\$a" cmp uc"\$b"} lc($word);
You cannot sort just one (1) word. You need at least 2 words to be able to sort.
Why are you using double quotes, and then escaping the variable sigil? I am guessing this is you testing different things to see what works. This looks very unlikely to do what you want.
"\$a" will just become $a -- a dollar sign plus an "a". This is what you do when you want to print the variable name, e.g. print "\$a is $a" (prints $a is 12, for example).
lc will have no effect, since all your words are already in lower case.
Even if lc and uc worked here, you could not use uc like that in the sort block. Only $b is upper-cased, and which of the two words being compared ends up in $b is decided by the sort algorithm, so the comparison is inconsistent and effectively destroys your sort.
Also uc will change all the letters to upper case (cat => CAT). You want ucfirst (cat => Cat).
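For the ordering the heading asks for (capitalized words ahead of their lower-case twins), one possible comparator is sketched below. The helper name sort_words is hypothetical, and the sketch assumes plain ASCII words, where upper-case letters compare as smaller than lower-case ones.

```perl
use strict;
use warnings;

# Hypothetical helper: compare case-insensitively first; when two words
# differ only in case, fall back to a plain string comparison, which puts
# the capitalized spelling first for ASCII text.
sub sort_words {
    my @words = @_;
    return sort { lc($a) cmp lc($b) or $a cmp $b } @words;
}

my @sorted = sort_words(qw(this What a This));
print "$_\n" for @sorted;
```

Both $a and $b are treated the same way on each side of every comparison, which is what keeps the sort consistent.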
When I clean up your code, and also make the variable names somewhat more reasonable, I get this below. Also, I removed your file open, since I use the internal DATA file handle to facilitate testing. You can just put back your own open, with the additions I described above.
use strict;
use warnings;
my %words;
while (my $line = <DATA>) {
for my $word (split ' ', $line) { # split on ' ' a single space removes leading and trailing whitespace
my $key = lc $word; # save lowercase word as key
$words{$key}{count}++; # separate count
$words{$key}{value} = $word; # word original formatting as value
}
}
# printf is used for special formatting, if you are not using that formatting, use regular print to avoid unnecessary interpolation of %
print "\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
for my $word (sort keys %words) {
printf "%-31s : %s\n", $words{$word}{value}, $words{$word}{count};
}
__DATA__
Today's scientific question is: What in the world is electricity and where does it go after it leaves the toaster?
Here is a simple experiment that will teach you an important electrical lesson: On a cool dry day, scuff your feet along a carpet, then reach your hand into a friend's mouth and touch one of his dental fillings. Did you notice how your friend twitched violently and cried out in pain? This teaches one that electricity can be a very powerful force, but we must never use it to hurt others unless we need to learn an important lesson about electricity.
And it prints
a : 5
about : 1
after : 1
along : 1
an : 2
and : 3
be : 1
but : 1
can : 1
carpet, : 1
cool : 1
...etc
As can be noticed, this differentiates between carpet and carpet, since you are only splitting on whitespace. It keeps the non-word characters and includes them in the hash. There are different ways to find words in a text. For example, instead of split you could use a regex:
my @words = $line =~ /\w+/g; # \w is word characters, plus numbers, and underscore _
Even this is simplistic, but will work better than your split. You can add characters to the regex as your needs require, for example: /[\w\-]+/ -- include dash for hyphenated words, e.g. mega-carpet. (Note that dash - has to be escaped when placed between other characters inside a character class bracket, otherwise it will be interpreted as a range, e.g. a-z.)
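As a rough illustration of the regex-based extraction with the dash included, assuming a made-up sample line:

```perl
use strict;
use warnings;

my $line = "scuff your feet along a mega-carpet, then along a carpet.";

my %count;
# Pull out runs of word characters and dashes; punctuation is left behind
$count{lc $_}++ for $line =~ /[\w\-]+/g;

printf "%-15s : %d\n", $_, $count{$_} for sort keys %count;
```

With this pattern the trailing comma and full stop no longer end up in the hash keys, so "carpet," and "carpet" are counted as the same word.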
I'm new to Perl, though not to programming, and am working through Learning Perl. The book has exercises to match successive lines of a small text file.
I had the idea of supplying match strings from STDIN, and going through the file for each one:
while(<STDIN>) {
chomp;
$regex = $_;
seek JUNK, 0, 0;
while(<JUNK>) {
chomp();
if(/$regex/) {
say;
}
}
say '';
}
This works fine, but I can't find a way to interpolate an entire match string, e.g.
/fred/i
into the predicate. I tried
if($$matcher) # with $matcher = '/fred/'
but Perl complained.
I imagine this is my ignorance, and should welcome enlightenment.
Pattern modifiers, such as /i, are a part of the code telling Perl how to perform the match, not a part of the pattern to be matched. This is why that doesn't work for you.
You have three ways to work around this (well, probably more, since this is Perl we're talking about, but three ways that I can think of straight off):
1) Use extended regex syntax and, when you want a case-insensitive match, enter (?i:fred), as suggested in comments on the question.
2) Use string eval to allow the use of the regular pattern modifiers: if (eval "\$_ =~ $regex") { say } (the backslash keeps $_ from being interpolated into the code string, so the evaluated code tests the current line). Note that this method will require you to also type the surrounding slashes, e.g., you'd have to enter /fred/i; just typing in fred would not work. Note also that it's a huge security hole to do this without validating your input first, since the user's entered text is executed as Perl code, just as if it were part of the original program. (Imagine if the user entered //, system("rm -rf /") - it would test against an empty regex, then delete all the files on your computer.) So probably not a recommended approach unless you really know what you're doing and/or you're the only one who will ever run the program.
3) The most complex, but also most correct, solution is to write a parser which inspects the user's entered string to see whether any special flags are present and then responds accordingly. A very simple example which allows the user to append /i for a case-insensitive search:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
while(<STDIN>) {
chomp;
my @parts = split '/', $_;
# If the user input starts with a /, the first part will be empty, so throw
# it away.
shift @parts unless $parts[0];
my $re = shift @parts;
my %flags;
for (@parts) {
for (split '') {
$flags{i} = 1 if $_ eq 'i';
}
}
my $f = join '', keys %flags;
say "Matched" if eval qq('foo' =~ /$re/$f);
}
This also uses string eval, so it is potentially vulnerable to the same kind of security issues as #2, but $re cannot contain any / characters (the split '/' would have ended $re immediately prior to the first /), which prevents code from being inserted there and $f can contain only the letter i (or any other flags you might choose to recognize if you expand on this). So it should be safe. (But, if anyone can demonstrate an exploit I missed, please tell me about it in comments!)
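If you would rather avoid string eval altogether, approaches 1 and 3 can be combined: parse off the flags, then put them into the pattern as an inline (?flags) group. The following is only a sketch under assumptions: the helper name parse_user_regex is made up, and the accepted flag set (i, m, s, x) is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper: turn input such as "/fred/i" or plain "fred" into a
# compiled regex without string eval. Returns undef for unknown flags or
# for a pattern that does not compile.
sub parse_user_regex {
    my ($input) = @_;
    my ($pattern, $flags) = ($input, '');

    # Input of the form /pattern/flags: separate the two parts
    if ($input =~ m{\A/(.*)/([a-z]*)\z}s) {
        ($pattern, $flags) = ($1, $2);
    }
    return undef if $flags =~ /[^imsx]/;

    # Flags become an inline modifier group at the front of the pattern
    my $prefix = length $flags ? "(?$flags)" : '';

    # The block form of eval only traps compile errors in the pattern
    return eval { qr/$prefix$pattern/ };
}

for my $input ('/fred/i', 'fred', '/fred/e') {
    my $re = parse_user_regex($input);
    print "$input: ", (defined $re ? "compiled" : "rejected"), "\n";
}
```

The user's text is only ever used as a regex here, never as Perl code, so the block-form eval is there to catch malformed patterns rather than to run anything.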
Problem
What you are trying to do can be summarized by:
my $regex = '/fred/i';
my @lines = (
'A line containing some words and Fred said Hello.',
'Another line. Here is a regex embedded in the line: /fred/i',
);
for ( @lines ) {
say if /$regex/;
}
Output:
Another line. Here is a regex embedded in the line: /fred/i
We see that the second line matches $regex, whereas we wanted the first line containing Fred to match the string fred with the (case insensitive) i flag added to the regex. The problem is that the characters / and i in $regex are taken as characters to be matched literally, i.e., they are not interpreted as special characters surrounding a Regex (as part of a Perl expression).
Note:
The character / is special as part of a Perl expression for a regular expression, but it is not special inside the Regex pattern. There are however characters that are special inside the pattern, the so-called meta characters:
\ | ( ) [ { ^ $ * + ? .
see perldoc quotemeta for more information.
A solution using extended patterns
Simply change the first line to:
my $regex = '(?i)fred'; # or alternatively: (?i:fred)
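A minimal sketch of that change applied to the two sample lines follows. The helper matching_lines is a name I made up; it also guards against a pattern string that does not compile.

```perl
use strict;
use warnings;

my @lines = (
    'A line containing some words and Fred said Hello.',
    'Another line. Here is a regex embedded in the line: /fred/i',
);

# Hypothetical helper: return the candidates matching a pattern given as a string
sub matching_lines {
    my ($pattern, @candidates) = @_;
    # An invalid pattern makes qr// die; treat that as "no matches"
    my $re = eval { qr/$pattern/ } or return;
    return grep { $_ =~ $re } @candidates;
}

# The inline (?i) modifier makes the match case-insensitive
print "$_\n" for matching_lines('(?i)fred', @lines);
```

With '(?i)fred' both lines match, since the second one also contains the letters fred; with the original '/fred/i' string only the second line matches.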
Regex flags can be added to a regex pattern using "Extended patterns" described in the manual perldoc perlre :
Extended Patterns
The syntax for most of these is a pair of parentheses with a question
mark as the first thing within the parentheses. The character after
the question mark indicates the extension.
[...]
(?adlupimnsx-imnsx)
(?^alupimnsx)
One or more embedded pattern-match modifiers, to be turned on (or
turned off if preceded by "-" ) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamically-generated patterns, such as those
read in from a configuration file, taken from an argument, or
specified in a table somewhere.
[...]
These modifiers are restored at the end of the enclosing group.
Alternatively the non-capturing form can be used:
(?:pattern)
(?adluimnsx-imnsx:pattern)
(?^aluimnsx:pattern)
This is for clustering, not capturing; it groups subexpressions like
"()" , but doesn't make backreferences as "()" does.
The question has been answered in the following comment:
Try (?i:fred), see Extended patterns in perldoc perlre for more information – Håkon Hægland 7 hours ago.
What is the meaning of below statement in perl?
($script = $0) =~ s#^.*/##g;
I am trying to understand the operator =~ along with the statement on the right side s#^.*/##g.
Thanks
=~ applies the thing on the right (a pattern match or search and replace) to the thing on the left. There's lots of documentation about =~ out there, so I'm just going to point you at a pretty good one.
There's a couple of idioms going on there which are not obvious nor well documented which might be tripping you up. Let's cover them.
First is this...
($copy = $original) =~ s/foo/bar/;
This is a way of copying a variable and performing a search and replace on it in a single step. It is equivalent to:
$copy = $original;
$copy =~ s/foo/bar/;
The =~ operates on whatever is on the left after the left hand code has been run. ($copy = $original) evaluates to $copy so the =~ acts on the copy.
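Here is a short sketch of the idiom next to an alternative. The sample string is made up, and the /r flag shown second requires Perl 5.14 or later.

```perl
use strict;
use warnings;

my $original = 'foo and more foo';

# The idiom: assign first, then bind the substitution to the copy
(my $copy = $original) =~ s/foo/bar/;

# Since Perl 5.14, /r returns the modified string and leaves the source alone
my $copy_r = $original =~ s/foo/bar/r;

print "$original\n$copy\n$copy_r\n";
```

Both forms leave $original untouched; which one reads better is largely a matter of taste and of the Perl version you have to support.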
s#^.*/##g is the same as s/^.*\///g but using alternative delimiters to avoid Leaning Toothpick Syndrome. You can use just about anything as a regex delimiter. # is common, though I think it's ugly and hard to read. I prefer {} because they balance. s{^.*/}{}g is equivalent code.
Unrolling the idioms, you have this:
$script = $0;
$script =~ s{^.*/}{}g;
$0 is the name of the script. So this is code to copy the name of the script and strip everything up to the last slash (.* is greedy and will match as much as possible) off it. It is getting just the filename of the script.
The /g indicates to perform the match on the string as many times as possible. Since this can only ever match once (the ^ anchors it to the beginning of the string) it serves no purpose.
There's a better and safer way to do this.
use File::Basename;
$script = basename($0);
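To compare the two approaches side by side, here is a small sketch. The path is a hypothetical stand-in for $0, and a Unix-style path is assumed.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $path = '/usr/local/bin/myscript.pl';   # hypothetical stand-in for $0

# The regex idiom: strip everything up to and including the last slash
(my $by_regex = $path) =~ s{^.*/}{};

# The module does the same job and knows about other path conventions
my $by_module = basename($path);

print "$by_regex\n$by_module\n";
```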
It's very, very simple:
Perl quote-like expressions can take many different characters as part separators. The separator right after the command (in this case, the s) is the separator for the rest of the operation. For example:
# Out with the "Old" and "In" with the new
$string =~ s/old/new/;
$string =~ s#old#new#;
$string =~ s(old)(new);
$string =~ s@old@new@;
All four of those expressions are the same thing. They replace the string old with new in my $string. Whatever comes after the s is the separator. Note that parentheses, curly braces, and square brackets are used in matching pairs. This works out rather nicely for the q and qq which can be used instead of single quotes and double quotes:
print "The value of \$foo is \"$foo\"\n"; # A bit hard to read
print qq/The value of \$foo is "$foo"\n/; # Maybe slashes weren't a great choice...
print qq(The value of \$foo is "$foo"\n); # Very nice and clean!
print qq(The value of \$foo is (believe it or not) "$foo"\n); #Still works!
The last still works because the quote like operators count opening and closing parentheses. Of course, with regular expressions, parentheses and square brackets are part of the regular expression syntax, so you won't see them so much in substitutions.
Most of the time, it is highly recommended that you stick with the s/.../.../ form just for readability. It's what people are used to and it's easy to digest. However, what if you have this?
$bin_dir =~ s/\/home\/([^\/]+)\/bin/\/Users\/$1\/bin/;
Those backslashes can make it hard to read, so the tradition has been to replace the forward-slash separators to avoid the hills and valleys effect.
$bin_dir =~ s#/home/([^/]+)/bin#/Users/$1/bin#;
This is still a bit hard to read, but at least I don't have to escape each forward slash, so it's easier to see what I'm substituting. Choosing a delimiter for regular expressions is hard because good quote characters are hard to find. Various special symbols such as ^, *, |, and + are magical regular expression characters and are likely to appear in the pattern itself, so # is a common choice: it's not common in strings, and it doesn't have any special meaning in a regular expression, so it is unlikely to clash with the pattern.
Getting back to your original question:
($script = $0) =~ s#^.*/##g;
is the equivalent of:
($script = $0) =~ s/^.*\///g;
But because the original programmer didn't want to backslash-escape that slash, they changed the separator character.
As for the:
($script = $0) =~ s#^.*/##g;
It's the same as saying:
$script = $0;
$script =~ s#^.*/##g;
You're assigning the $script variable and doing the substitution in a single step. It's very common in Perl, but it is a bit hard to understand at first.
By the way, if I understand that basic expression correctly (removing all characters up to the last forward slash), this would have been way cleaner:
use File::Basename;
...
$script = basename($0);
Much easier to read and understand -- even for an old Perl hand.
In Perl, you can use many kinds of characters as quoting characters (string, regular expression, list). Let's break it down:
Assign the $script variable the contents of $0 (the string that contains the name of the calling script.)
The =~ operator is the binding operator. It applies a regular expression match or a regex search and replace to its left-hand side. In this case, it acts on the new variable, $script.
the s character indicates a search and replace regex.
The # character is being used as the delimiter for the regex. The regex pattern quote character is usually the / character, but you can use others, including # in this case.
The regex, ^.*/, means "starting at the beginning of the string, match zero or more characters followed by a slash". Because .* is greedy, the match extends to the last slash it can reach; . does not match newline characters by default.
The # indicating the start of the 'replace' value. Usually you have a replacement here, often one that uses a captured part of the pattern.
The # again. This ends the replacement. Since there is nothing between the start and end of the replacement, everything that was matched is replaced with nothing.
g, or global match. The search and replace will keep happening as many times as it matches in the value.
Effectively, this removes everything up to and including the last / in the name of the script (newlines aside, since . does not match them). It's a really lazy way of getting the script name when it is invoked with a long path, and it only works with a unix-like path.
If you have a chance, consider replacing with File::Basename, a core module in Perl:
use File::Basename;
# later ...
my $script = fileparse($0);
$string = 'a=1;b=2';
use Data::Dumper;
@array = split("; ?", $string);
print Dumper(\@array);
output:
$VAR1 = [
'a=1',
'b=2'
];
Does anyone know how "; ?" works here? It's not a regex, but works quite like a regex, so I don't understand.
I think it means "semicolon followed by optional space (just one or zero)".
It's not regex, but works quite like regex, so I don't understand.
The pattern parameter to split is always treated as a regular expression (would be better to not use a string, though). The only exception is the "single space", which is taken to mean "split on whitespace"
The first parameter of split is a regex. So I'd rather write split /; ?/, $string;.
When you use a string for the first parameter, it just means the regex can vary and has to be compiled anew each time the split is run. See perldoc -f split for details.
The regex can be read as: the character ";" optionally followed by a space. See perlretut and perlreref for details.
A semicolon (the ;) followed by an optional (the ?) space (the " ").
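Written with an explicit regex, the same split might look like the sketch below. The input string and the %settings hash are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'a=1;b=2; c=3';

# A semicolon followed by at most one space separates the pairs
my @pairs = split /; ?/, $string;

# Split each pair once more, on the first "=", to build a lookup table
my %settings = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) } @pairs;

print "$_ => $settings{$_}\n" for sort keys %settings;
```

The pattern behaves the same whether it is written as "; ?" or /; ?/; the slashes just make it obvious to the reader that it is a regex.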
I know this might be very easy for some.
I have a simple string like this #¨0+639172523299 (with characters before a mobile number). My question is, how do I remove all the characters before the plus (+)? What I know is how to remove a known character, as follows:
$number =~ tr/://d; (if i want to remove a colon)
But here, I want all characters before '+' to be removed.
To remove everything up to and including the +, you can do:
$number =~ s/.*\+//;
If you want to keep the +, you can put that into the replacement:
$number =~ s/.*\+/+/;
The above says: Match "anything" (the .*) followed by a + (+ is a special character in regular expressions, which is why it needs the backslash escape) and replace it with nothing (or in the above example, replace it with a single +).
Note that the above will strip out everything up to the LAST + in the string, which may not be what you want. If you want to strip out everything up to the FIRST + in a string, you can do:
$number =~ s/[^+]*\+//;
or
$number =~ s/[^+]*\+/+/; # Keep the +
The difference from the first regular expression being the [^+]* instead of .*, which means "match any character except a +".
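A quick sketch of the two behaviours on a string with several plus signs (the sample value is made up):

```perl
use strict;
use warnings;

my $number = '#+0+0+639172523299';

# Greedy .* runs as far as it can, so this strips through the LAST +
(my $after_last = $number) =~ s/.*\+//;

# [^+]* cannot cross a +, so this strips through the FIRST + only
(my $after_first = $number) =~ s/[^+]*\+//;

print "$after_last\n$after_first\n";
```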
For more information on Perl's regular expressions, the perldoc perlre manual page is pretty good, as is O'Reilly's Mastering Regular Expressions book.
in the simplest case
$string =~ s/^.*\+//;
if you have more than one "+" before the mobile number
$string="#+0+0+639172523299";
@s=split /\+/,$string;
print $s[-1];
In fact, you can just use split() instead of regex. It's easier.
my $string = '#¨0+639172523299';
$string =~ s/(.*)(?=\+)//;
print $string;
$number =~ s/^.*\+//;
s/(.*?\+)(.*)/$2/;
If you want the plus to remain
s/(.*?)(\+)(.*)/$2$3/;
my $str="#¨0+639172523299";
if($str=~/(\+[0-9]+)/)
{
print $1;
}