Perl Hashes and regex - perl

I am working on code that splits a sentence into individual words and then looks each word up among the keys of a hash. My code finds terms that are 100% identical to a key; after a match I tag the word from the sentence with the value that corresponds to the matching key. The problem is that the code tags terms, but with seemingly random values rather than the ones I expect. There are also cases where the term and the hash key are similar but not identical. How can I write a regular expression to match my terms against the keys?
Note: I have stemmed the hash keys to their root forms.
For example: if the term from the sentence is Synergistic or anti-synergistic, and my hash key is Synerg, how can I match the term with Synerg?
My code is as follows:
open IN, "C:\\Users\\Desktop\\TM\\clean_cells.txt" or die "import file absent";
my %hash=();
use Tie::IxHash;
tie %hash => "Tie::IxHash";
while(<IN>)
{
chomp $_;
$line=lc $_;
@Organs=split/\t/, $line;
$hash{$Organs[0]}=$Organs[1];
}
$Sentence="Lymphoma is Lymph Heart and Lung";
@list=split/ /,$Sentence;
@array=();
foreach $term(@list)
{
chomp $term;
for $keys(keys %hash)
{
if($hash{$term})
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
elsif($term=~m/\b\Q$keys(\w+)\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
elsif($term=~m/\b\Q(\w+)$keys\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
elsif($term=~m/\b\Q(\w+)$keys(\w+)\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
}
}
print @array;
for example: hash looks like this: %hash={
TF1 => Lymph
Thoracic_duct => Lymph
SK-MEL-1 => Lymph
Brain => Brain
Cerebellum => Brain
};
So if the term TF1 is found, it should be replaced with <Lymph>TF1</Lymph>.

I found two big problems that were preventing your code from working:
You are making the keys to your hash lowercase, but you are not doing the same for the terms in $Sentence. Thus, uppercase words from $Sentence will never match.
The \Q...\E modifier disables regex meta-characters. While it is often good to do this when interpolating a variable, you cannot use expressions like (\w+) in there--that will look for the literal characters (\w+). Those regexes need to be rewritten like this: m/\b\Q$keys\E(\w+)\b/.
There are other design issues with your code, as well:
You are using undeclared global variables all over the place. You should declare all variables with my. Always turn on use strict; use warnings;, which will force you to do this correctly.
There doesn't appear to be any reason for Tie::IxHash, which causes your hash to be ordered. You don't use this ordering in any way in your code. The output is ordered by @list. I would do away with this unnecessary module.
Your if/elsif statements are redundant. if($term=~m/\b(\w*)\Q$keys\E(\w*)\b/) will accomplish the same thing as all of them combined. Note that I replaced \w+ with \w*. This allows the groups before and after to match zero or more characters instead of one or more characters.
Note: I didn't bother testing with Tie::IxHash, since I don't have that module and it appears unnecessary. It's possible that using this module is also introducing other problems in your code.
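Putting the points above together, here is a minimal sketch of the tagging loop. It assumes the hash is built inline rather than read from clean_cells.txt, that the keys are already lowercased, and that a key found anywhere inside a term counts as a match. The sample data and the tag_terms name are hypothetical. Keys are tried longest-first so the result does not depend on hash order.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for clean_cells.txt (keys lowercased).
my %hash = (
    'synerg' => 'Interaction',
    'lymph'  => 'Lymph',
    'tf1'    => 'Lymph',
);

# Tag each word of a sentence whose lowercased form contains a hash key.
sub tag_terms {
    my ($sentence, $map) = @_;
    # Try longer keys first so the most specific key wins.
    my @keys = sort { length $b <=> length $a || $a cmp $b } keys %$map;
    my @out;
    for my $term (split ' ', $sentence) {
        my $tagged = $term;
        for my $key (@keys) {
            # \Q...\E quotes only the key; the \w* parts stay outside it.
            if (lc($term) =~ /\b\w*\Q$key\E\w*\b/) {
                $tagged = "<$map->{$key}>$term</$map->{$key}>";
                last;
            }
        }
        push @out, $tagged;
    }
    return join ' ', @out;
}

print tag_terms('Lymphoma is anti-synergistic with TF1', \%hash), "\n";
```

Stopping at the first matching key (last) is a design choice for this sketch; the original loop pushed one tagged copy per matching key.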

Related

Global symbol "%formsequence" requires explicit package name at line 37

I am trying to execute a Perl CGI script, but I am getting an error:
Global symbol "%formsequence" requires explicit package name at line 37.
I did some research and found that use strict forces me to declare the variables before I use them or store any data, but in my program I have declared them and that is why I don't understand the error. Here is my script:
#!/usr/bin/perl -w
use strict;
my %errors;
my %form;
my @formsequence;
my %fields = (
"lname" => "Last Name",
"phone" => "Phone",
"fname" => "Fist Name"
);
my %patterns = (
"lname" => '[A-Z][a-z]{2,50}',
"phone" => '\d{3}-\d{3}-\d{4}',
"fname" => '[A-Z][A-Za-z]{2,60}'
);
@formsequence = ("lname", "phone", "phone");
print "content-type/html\n\n";
if ($ENV{REQUEST_METHOD} eq "POST") {
&readformdata;
if (&checkrequiredfields) {
print "Form Data validated successfully!";
}
else {
foreach (keys (%fields)) {
if ($fields{$_} != $formsequence{$_}) { <-- line 37
$errors{$_}="Not in correct sequence\n";
}
}
}
I suspect you may be viewing the concept of an 'array' from the perspective of a PHP developer. In Perl a hash and an array are separate data structures.
Arrays are declared using the @ prefix and you refer to elements using square brackets around an integer index:
my @names = ('Tom', 'Dick', 'Larry');
say $names[0]; # 'Tom'
say $names[-1]; # 'Larry'
my $count = @names; # $count now equals 3
foreach my $i (0..$#names) {
say $names[$i];
}
Hashes are declared using the % prefix and you refer to elements using curly braces around a string key:
my %rgb = (
red => '#ff0000',
white => '#ffffff',
blue => '#0000ff',
);
say $rgb{'red'}; # '#ff0000'
say $rgb{blue}; # '#0000ff' quotes optional around bareword keys
foreach my $k (keys %rgb) {
say $rgb{$k};
}
You wouldn't normally use the keys function on an array. In fact, older versions of Perl don't even support it; newer versions will return a range of integers (e.g. 0..2).
When you call keys on a hash the keys have no inherent order, and the order may change.
Other things worth knowing:
Using & to call a function is really old style (i.e. early 90s), these days we'd use readformdata() instead of &readformdata.
The != operator is a numeric comparison operator so only use it when the values you're comparing are actually numbers. If you want to check two strings are 'not equal' then use ne instead (e.g.: if($thing1 ne $thing2) { ... }).
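To make the difference between ne and != concrete, here is a small sketch. The differs helper and the sample strings are made up for illustration; the numeric warning is switched off only so the demonstration runs quietly.

```perl
use strict;
use warnings;

# Hypothetical helper: true when two strings differ (string comparison).
sub differs {
    my ($left, $right) = @_;
    return $left ne $right ? 1 : 0;
}

# String comparison sees that the two labels are different.
print differs('Last Name', 'Phone'), "\n";

# Numeric comparison converts non-numeric strings to 0, so they look equal.
{
    no warnings 'numeric';   # silence "isn't numeric" for the demonstration
    print( ('Last Name' != 'Phone') ? 1 : 0, "\n" );
}
```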
This seems to be some rather old Perl.
You use -w on the shebang line rather than use warnings (which has been available since Perl 5.6.0 was released in 2000).
You use a (presumably) custom readformdata() function instead of CGI.pm's param() method. CGI.pm was added to the Perl core in 1997 (it was recently removed - but that's not a reason to not use it).
You use ampersands on function calls. These haven't been necessary since Perl 5 was released in 1994.
Your problem is caused by declaring an array, @formsequence, which you then try to access as a hash - $formsequence{$_} means, "look up the key $_ in the hash %formsequence". In Perl, arrays and hashes are two completely different data types and it is possible (although not recommended for, hopefully, obvious reasons) to have an array and a hash with the same name.
You declare arrays like this - using @:
my @array = ('foo', 'bar', 'baz');
And access individual elements like this - using [...]:
print $array[0]; # prints 'foo'
You declare hashes like this - using %:
my %hash = (foo => 'Foo', bar => 'Bar', baz => 'Baz');
And access individual elements like this - using {...}:
print $hash{foo}; # prints 'Foo'
Arrays are indexed using integers and are ordered. Hashes are indexed using strings and are unordered.
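As a small illustration of the point that an array and a hash can share a name, the sketch below declares both. The values are invented for the example; only the bracket style decides which variable is read.

```perl
use strict;
use warnings;

# An array and a hash may share a name; the brackets pick which one is used.
my @formsequence = ('lname', 'phone', 'fname');
my %formsequence = (lname => 0, phone => 1, fname => 2);

my $from_array = $formsequence[1];       # square brackets: array element
my $from_hash  = $formsequence{phone};   # curly braces: hash element

print "$from_array $from_hash\n";
```

With only the array declared, use strict rejects $formsequence{...} at compile time, which is the error in the question.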
I can't really suggest a fix for your code as it's not really clear what you are trying to do. It appears that you want to check that parameters appear in a certain order, but this is doomed to failure as a) you can't guarantee the order in which CGI parameters are transmitted from the browser to your web server and b) you can't guarantee the order in which keys(%fields) will return the keys from your %fields hash.
If you explain in a little more detail what you are trying to do, then we might be able to help you more.

Perl replace multiple strings simultaneously (case insensitive)

Consider the following perl code which works perfectly:
%replacements = ("what" => "its", "lovely" => "bad");
($val = $sentence) =~ s/(@{[join "|", keys %replacements]})/$replacements{$1}/g;
stackoverflow user sresevoir brilliantly came up with that replacement code that involved using a hash, allowing you to find and replace multiple terms without iterating through a loop.
I've been throwing other various search and replace terms at it programmatically and I've started using it to highlight words that are the result of a search.
The problem (refer to problem code shown below):
Make it case insensitive by adding an "i" before the "g" at the end.
If the search term $thisterm and the search term word contained in $sentence has no difference in case, there are no problems. If the search term $thisterm (i.e. Stackoverflow) and the search term word contained in $sentence is a different case (i.e. stackoverflow), then the result returned is nothing for that term. It's as if I told it to
$sentence =~ s/$thisterm//g;
Here's the problem code:
foreach $thisterm (@searchtermarray) {
# The variable $thisterm has already gone through a filter to remove special characters.
$thistermtochange = $thisterm;
$replacements{$thistermtochange} = "<span style=\"background-color:#FFFFCC;\">$thistermtochange<\/span>";
}
$sentence =~ s/(@{[join "|", keys %replacements]})/$replacements{$1}/ig;
I also went back and duplicated the problem with the above original code. It seems the combination of adding the i modifier, using a hash reference, and different case is something Perl doesn't like.
What am I missing?
Thanks,
DB
P. S. I've benefited from stackoverflow for years; but I just signed up for this question and the site wouldn't let me directly comment to sresevoir. As a "brand new" user I don't have enough reputation points.
Keep all the keys of the hash in lower case, and do this:
s/(@{[join "|", keys %replacements]})/$replacements{ lc $1 }/ig
(note the addition of lc)
There are a few other things you ought to consider.
First, as is, if you are trying to replace both lovely and love with different replacements, lovely may or may not ever be found, depending on which key is returned by keys first. To prevent this, it's a good idea to sort by descending length:
s/(@{[join "|", sort { length $b <=> length $a } keys %replacements]})/$replacements{ lc $1 }/ig
Second, this technique only works with fixed strings; if your keys contain any regex metacharacters, for instance replacing how? with why?, it will fail, because $1 will never be how?. To allow metacharacters (interpreted as literal characters), quote them:
s/(@{[join "|", map quotemeta, sort { length $b <=> length $a } keys %replacements]})/$replacements{ lc $1 }/ig
From your comment, it seems to me that you want to find certain strings, all in one pass, and add stuff around them (that doesn't vary by which string). If so, you are going about it the hard way and shouldn't be using a hash at all. Have an array of the strings you want to search for and replace them:
s/(@{[join "|", map quotemeta, sort { length $b <=> length $a } @search_strings]})/<span style="background-color:#FFFFCC;">$1<\/span>/ig;
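The pieces of this answer (lowercase keys, lc on the capture, longest-first ordering, and quotemeta) can be combined into one runnable sketch. The replacement table, the sample sentence, and the replace_terms name are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical replacement table; keys are stored in lower case.
my %replacements = (
    'love'   => 'hate',
    'lovely' => 'bad',
    'how?'   => 'why?',
);

# Longest keys first, each quoted so metacharacters match literally.
my $pattern = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a }
    keys %replacements;

sub replace_terms {
    my ($text) = @_;
    # lc $1 maps whatever case was matched back to the lowercase key.
    (my $result = $text) =~ s/($pattern)/$replacements{ lc $1 }/ig;
    return $result;
}

print replace_terms('What a LOVELY day. How? I Love it.'), "\n";
```

Building $pattern once outside the substitution keeps the regex readable and avoids re-joining the keys on every call.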
The problem is that, if you have a hash like this
my %replacements = (
word => '<span style="background-color:#FFFFCC;">word</span>'
)
then the substitution will look like
s/(word)/$replacements{$1}/ig;
But a case-independent regex pattern will match WORD as well, so the replacement expression $replacements{$1} will be $replacements{'WORD'} which doesn't exist.
While you may be pleased with his solution, sresevoir uses an ugly way of embedding a string expression within a regex. This
($val = $sentence) =~ s/(@{[join "|", keys %replacements]})/$replacements{$1}/g;
would be much better as
my $pattern = join '|', keys %replacements;
($val = $sentence) =~ s/($pattern)/$replacements{$1}/g;
But you have generalised this hash idea too far and it is the wrong way to make the changes that you need. If your replacement string is a simple function of the original string, as in this case, then it is best written directly as a replacement string using captures from the pattern. I would write it like this
my $pattern = join '|', @searchtermarray;
$sentence =~ s{($pattern)}{<span style="background-color:#FFFFCC;">$1</span>\n}ig;
But note that, as it stands, the search will find any words that are substrings of anything in the text, and will also go awry if @searchtermarray has any strings that contain regex metacharacters. You don't say anything about your actual data so I can't really help you to resolve this.
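One possible way to address both caveats, sketched under the assumption that every search term begins and ends with a word character (otherwise \b will not behave as intended): quote each term with quotemeta and wrap the alternation in word boundaries. The sample terms and the highlight name are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical search terms standing in for @searchtermarray.
my @searchtermarray = ('stackoverflow', 'perl');

# Quote each term, then require word boundaries around the alternation.
my $pattern = join '|', map { quotemeta($_) } @searchtermarray;

sub highlight {
    my ($sentence) = @_;
    $sentence =~ s{\b($pattern)\b}{<span style="background-color:#FFFFCC;">$1</span>}ig;
    return $sentence;
}

print highlight('Stackoverflow has Perl answers, not superperl ones.'), "\n";
```

Because the replacement reuses $1, the highlighted word keeps whatever case it had in the text.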

Perl: Greedy nature refuses to work

I am trying to replace a string with another string, but the greedy nature doesn't seem to be working for me. Below is my code where "PERFORM GET-APLCY" is identified and replaced properly, but string "PERFORM GET-APLCY-SOI-CVG-WVR" and many other such strings are being replaced by the replacement string for "PERFORM GET-APLCY".
s/PERFORM $func[$i]\.*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
where the full stop is optional during string match and replacement. I have also tried giving the pattern to be matched as $func[$i]\b
Please help me understand what the issue could be.
Thanks in advance,
Faez
Why should GET-APLCY- not match GET-APLCY., if the dot is optional?
Easy solution: sort your array by length in descending order.
@func = sort { length $b <=> length $a } @func;
Testing script:
#!/usr/bin/perl
use warnings;
use strict;
use feature 'say';
my %hash = ('GET-APLCY' => 'REP1',
'GET-APLCY-SOI-CVG-WVR' => 'REP2',
'GET-APLCY-SOI-MNG-CVRW' => 'REP3',
);
my @func = sort { length $b <=> length $a } keys %hash;
while (<DATA>) {
chomp;
print;
print "\t -> \t";
for my $i (0 .. $#func) {
s/$func[$i]/$hash{$func[$i]}/;
}
say;
}
__DATA__
GET-APLCY param
GET-APLCY- param
GET-APLCY. param
GET-APLCY-SOI. param
GET-APLCY-SOI-CVG-WVR param
GET-APLCY-SOI-MNG-CVRW param
You appear to be looping over function names, and calling s/// for each one. An alternative is to use the e option, and do them all in one go (without a loop):
my %hash = (
'GET-APLCY' => 'replacement 1',
'GET-APLCY-SOI-CVG-WVR' => 'replacement 2',
);
s{
PERFORM \s+ # 'PERFORM' keyword
([A-Z-]+) # the original function name
\.? # an optional period
}{
"# PERFORM $1.\n" . $hash{$1};
}xmsge;
The e causes the replacement part to be evaluated as an expression. Basically, the first part finds all PERFORM calls (I'm assuming that the function names are all upper case with '-' between them – adjust otherwise). The second part replaces that line with the text you want to appear.
I've also used the x, m, and s options, which is what allows the comments in the regular expression, among other things. You can find more about these under perldoc perlop.
A plain version of the s-line should be:
s/PERFORM ([A-Z-]+)\.?/"# PERFORM $1.\n" . $hash{$1}/eg;
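A variation on the single-pass idea, as a sketch rather than a drop-in fix: it assumes function names are uppercase words joined by single hyphens (slightly stricter than [A-Z-]+) and leaves any PERFORM whose name is not in the hash untouched. The comment_out_performs name is hypothetical.

```perl
use strict;
use warnings;

my %hash = (
    'GET-APLCY'             => 'replacement 1',
    'GET-APLCY-SOI-CVG-WVR' => 'replacement 2',
);

sub comment_out_performs {
    my ($source) = @_;
    $source =~ s{PERFORM \s+ ([A-Z]+(?:-[A-Z]+)*) \.?}{
        # Rewrite only known names; otherwise put the matched text back.
        exists $hash{$1} ? "# PERFORM $1.\n" . $hash{$1} : $&;
    }xge;
    return $source;
}

print comment_out_performs("PERFORM GET-APLCY-SOI-CVG-WVR.\n");
```

Because the capture group consumes the whole name before the hash lookup, the shorter GET-APLCY entry can no longer claim the longer names.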
I guess that $func[$i] contains "GET-APLCY". If so, this is because the star only applies to the dot, an actual dot, not "any character". Try
s/PERFORM $func[$i].*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
I'm pretty sure you are running some kind of loop over $i. In that case, GET-APLCY is most likely located in the @func array before GET-APLCY-SOI-CVG-WVR, so I recommend reverse-sorting @func before entering the loop.

Simple multi-dimensional array with loop in perl

I'm trying to use an array and a loop to print out the following (basically for each letter of the alphabet, print each letter of the alphabet after it and then move on to the next letter). I'm new to perl, anyone have any quick words of :
aa
ab
ac
ad
...
ba
bb
bc
bd
...
ca
cb
...
Currently I have this, but it only prints a single character alphabet...
@arr = ("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z");
$i = @arr;
while ($i)
{
print $arr[$i];
$i--;
}
Using the range operator and the ranges you want to target:
use strict;
use warnings;
my @elements = ("aa" .. "zz");
for my $combo (@elements)
{
print "$combo\n";
}
Give the range operator the first and last two-letter strings you want, and the for loop will take care of everything in between.
This really isn't multi-dimensional array work, if it were you'd be working with stuff like:
my @foo = (
[1,2,3],
[4,7,8,1,2,3],
[2,3],
);
This is really a very basic how do I make a nested loop that iterates over the same array. I'll bet this is homework.
So, I'll let you figure out the nesting bits, but give some help with Perl's loop operators.
!! for/foreach
for (the each is optional) is the real heavy hitter for looping in perl. Use it like so:
for my $var ( @array ) {
#do stuff with $var
}
Each element in @array will be aliased to the $var variable, and the block of code will be executed. The fact that we are aliasing, rather than copying, means that if you alter the value of $var, @array will be changed as well. The stuff between the parentheses may be any expression. The expression will be evaluated in list context. So if you put a file handle in the parens, the entire file will be read into memory and processed.
You can also leave off naming the loop variable, and $_ will be used instead. In general, DO NOT DO THIS.
!! C-Style for
Every once in a while you need to keep track of indexes as you loop over an array. This is when a C style for loop comes in handy.
for( my $i=0; $i<@array; $i++ ) {
# do stuff with $array[$i]
}
!! While/Until
While and until operate with boolean loop conditions. That means that the loop will repeat as long as the appropriate boolean value is found for the condition (TRUE for while, and FALSE for until). In addition to the obvious cases where you are looking for a particular condition, while is great for processing a file one line at a time.
while ( my $line = <$fh> ) {
# Do stuff with $line.
}
!! map
map is an amazingly useful bit of functional programming kung-fu. It is used to turn one list into another. You pass an anonymous code reference that is used to enact the transformation.
# Multiply all elements of @old by two and store them in @new.
my @new = map { $_ * 2 } @old;
So how do you solve your particular problem? There are many ways. Which is best depends on how you want to use the results. If you want to create a new array of the letter pairs, use map. If you are interested primarily in a side effect (say printing a variable) use for. If you need to work with really big lists that come from some sort of iterator (like lines from a filehandle) use while.
Here's a solution. I wouldn't turn it in to your professor until you understand how it works.
print map { my $letter=$_; map "$letter$_\n", "a".."z" } "a".."z";
Look at perldoc articles, perlsyn for info on the looping constructs, perlfunc for info on map and look at perlop for info on the range operator (..).
Good luck.
Use the range operator (..) for your initialization. The range operator basically grabs a range of values such as numbers or characters.
Then use a nested loop to go through the array one time per character for a total of 26^2 iterations.
Rather than a while loop I've used a foreach loop to go through each item in the array. You could also put 'a' .. 'z' instead of the declared @arr as the argument to the foreach loop. The foreach loops below set $char or $char2 to each value in @arr in turn.
my @arr = ('a' .. 'z');
for my $char (@arr) {
for my $char2 (@arr) {
print "$char$char2\n";
}
}
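If the pairs are needed later rather than just printed, the same nesting can collect them into an array. This is a sketch; @letters and @pairs are names chosen for the example.

```perl
use strict;
use warnings;

my @letters = ('a' .. 'z');

# Outer map picks the first letter, inner map appends each second letter.
my @pairs = map { my $first = $_; map { $first . $_ } @letters } @letters;

print scalar(@pairs), " pairs, from $pairs[0] to $pairs[-1]\n";
```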
If all you really want to do is print the 676 strings you describe, then:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'aa';
while (length $str < 3) {
print $str++, "\n";
}
But I smell an "XY problem"...

What's the safest way to iterate through the keys of a Perl hash?

If I have a Perl hash with a bunch of (key, value) pairs, what is the preferred method of iterating through all the keys? I have heard that using each may in some way have unintended side effects. So, is that true, and is one of the two following methods best, or is there a better way?
# Method 1
while (my ($key, $value) = each(%hash)) {
# Something
}
# Method 2
foreach my $key (keys(%hash)) {
# Something
}
The rule of thumb is to use the function most suited to your needs.
If you just want the keys and do not plan to ever read any of the values, use keys():
foreach my $key (keys %hash) { ... }
If you just want the values, use values():
foreach my $val (values %hash) { ... }
If you need the keys and the values, use each():
keys %hash; # reset the internal iterator so a prior each() doesn't affect the loop
while(my($k, $v) = each %hash) { ... }
If you plan to change the keys of the hash in any way except for deleting the current key during the iteration, then you must not use each(). For example, this code to create a new set of uppercase keys with doubled values works fine using keys():
%h = (a => 1, b => 2);
foreach my $k (keys %h)
{
$h{uc $k} = $h{$k} * 2;
}
producing the expected resulting hash:
(a => 1, A => 2, b => 2, B => 4)
But using each() to do the same thing:
%h = (a => 1, b => 2);
keys %h;
while(my($k, $v) = each %h)
{
$h{uc $k} = $h{$k} * 2; # BAD IDEA!
}
produces incorrect results in hard-to-predict ways. For example:
(a => 1, A => 2, b => 2, B => 8)
This, however, is safe:
keys %h;
while(my($k, $v) = each %h)
{
if(...)
{
delete $h{$k}; # This is safe
}
}
All of this is described in the perl documentation:
% perldoc -f keys
% perldoc -f each
One thing you should be aware of when using each is that it has the side effect of adding "state" to your hash (the hash has to remember what the "next" key is). When using code like the snippets posted above, which iterate over the whole hash in one go, this is usually not a problem. However, you will run into hard to track down problems (I speak from experience ;), when using each together with statements like last or return to exit from the while ... each loop before you have processed all keys.
In this case, the hash will remember which keys it has already returned, and when you use each on it the next time (maybe in a totally unrelated piece of code), it will continue at this position.
Example:
my %hash = ( foo => 1, bar => 2, baz => 3, quux => 4 );
# find key 'baz'
while ( my ($k, $v) = each %hash ) {
print "found key $k\n";
last if $k eq 'baz'; # found it!
}
# later ...
print "the hash contains:\n";
# iterate over all keys:
while ( my ($k, $v) = each %hash ) {
print "$k => $v\n";
}
This prints:
found key bar
found key baz
the hash contains:
quux => 4
foo => 1
What happened to keys "bar" and "baz"? They're still there, but the second each starts where the first one left off, and stops when it reaches the end of the hash, so we never see them in the second loop.
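A minimal sketch of the usual remedy, assuming you control the code that runs the second loop: call keys on the hash in void context to reset the iterator before using each again. The @seen array is only there to make the result easy to inspect.

```perl
use strict;
use warnings;

my %hash = ( foo => 1, bar => 2, baz => 3, quux => 4 );

# Leave the first loop early, which strands the iterator mid-hash.
while ( my ($k, $v) = each %hash ) {
    last if $k eq 'baz';
}

keys %hash;   # calling keys in void context resets the iterator

# The second loop now starts from the beginning and sees every pair.
my @seen;
while ( my ($k, $v) = each %hash ) {
    push @seen, $k;
}

print join(', ', sort @seen), "\n";
```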
The place where each can cause you problems is that it's a true, non-scoped iterator. By way of example:
while ( my ($key,$val) = each %a_hash ) {
print "$key => $val\n";
last if $val; #exits loop when $val is true
}
# but "each" hasn't reset!!
while ( my ($key,$val) = each %a_hash ) {
# continues where the last loop left off
print "$key => $val\n";
}
If you need to be sure that each gets all the keys and values, you need to make sure you use keys or values first (as that resets the iterator). See the documentation for each.
Using the each syntax will prevent the entire set of keys from being generated at once. This can be important if you're using a tie-ed hash to a database with millions of rows. You don't want to generate the entire list of keys all at once and exhaust your physical memory. In this case each serves as an iterator whereas keys actually generates the entire array before the loop starts.
So, the only place "each" is of real use is when the hash is very large (compared to the memory available). That is only likely to happen when the hash itself doesn't live in memory itself unless you're programming a handheld data collection device or something with small memory.
If memory is not an issue, usually the map or keys paradigm is the more prevalent and easier to read paradigm.
A few miscellaneous thoughts on this topic:
There is nothing unsafe about any of the hash iterators themselves. What is unsafe is modifying the keys of a hash while you're iterating over it. (It's perfectly safe to modify the values.) The only potential side-effect I can think of is that values returns aliases which means that modifying them will modify the contents of the hash. This is by design but may not be what you want in some circumstances.
John's accepted answer is good with one exception: the documentation is clear that it is not safe to add keys while iterating over a hash. It may work for some data sets but will fail for others depending on the hash order.
As already noted, it is safe to delete the last key returned by each. This is not true for keys as each is an iterator while keys returns a list.
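The aliasing behaviour of values mentioned above is easy to see in a short sketch. The hash contents here are invented for the example.

```perl
use strict;
use warnings;

my %h = (a => 1, b => 2);

# values returns aliases, so assigning to $_ changes the hash itself.
$_ *= 2 for values %h;

print join(', ', map { $_ . ' => ' . $h{$_} } sort keys %h), "\n";
```

Only the values change here; no keys are added or removed, so the iteration itself stays safe.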
I always use method 2 as well. The only benefit of using each is if you're just reading (rather than re-assigning) the value of the hash entry, you're not constantly de-referencing the hash.
I may get bitten by this one but I think that it's personal preference. I can't find any reference in the docs to each() being different than keys() or values() (other than the obvious "they return different things" answer). In fact the docs state that they use the same iterator and they all return actual list values instead of copies of them, and that modifying the hash while iterating over it using any call is bad.
All that said, I almost always use keys() because to me it is usually more self documenting to access the key's value via the hash itself. I occasionally use values() when the value is a reference to a large structure and the key to the hash was already stored in the structure, at which point the key is redundant and I don't need it. I think I've used each() 2 times in 10 years of Perl programming and it was probably the wrong choice both times =)
I usually use keys and I can't think of the last time I used or read a use of each.
Don't forget about map, depending on what you're doing in the loop!
map { print "$_ => $hash{$_}\n" } keys %hash;
I would say:
Use whatever's easiest to read/understand for most people (so keys, usually, I'd argue)
Use whatever you decide consistently throughout the whole code base.
This gives 2 major advantages:
It's easier to spot "common" code so you can re-factor into functions/methods.
It's easier for future developers to maintain.
I don't think it's more expensive to use keys over each, so no need for two different constructs for the same thing in your code.