Perl replace multiple strings simultaneously (case insensitive) - perl

Consider the following perl code which works perfectly:
%replacements = ("what" => "its", "lovely" => "bad");
($val = $sentence) =~ s/(#{[join "|", keys %replacements]})/$replacements{$1}/g;
stackoverflow user sresevoir brilliantly came up with that replacement code that involved using a hash, allowing you to find and replace multiple terms without iterating through a loop.
I've been throwing other various search and replace terms at it programmatically and I've started using it to highlight words that are the result of a search.
The problem (refer to problem code shown below):
Make it case insensitive by adding an "i" before the "g" at the end.
If the search term $thisterm and the search term word contained in $sentence has no difference in case, there are no problems. If the search term $thisterm (i.e. Stackoverflow) and the search term word contained in $sentence is a different case (i.e. stackoverflow), then the result returned is nothing for that term. It's as if I told it to
$sentence =~ s/$thisterm//g;
Here's the problem code:
foreach $thisterm (#searchtermarray) {
# The variable $thisterm has already gone through a filter to remove special characters.
$thistermtochange = $thisterm;
$replacements{$thistermtochange} = "<span style=\"background-color:#FFFFCC;\">$thistermtochange<\/span>";
}
$sentence =~ s/(#{[join "|", keys %replacements]})/$replacements{$1}/ig;
I also went back and duplicated the problem with the above original code. It seems the combination of adding the i modifier, using a hash reference, and different case is something Perl doesn't like.
What am I missing?
Thanks,
DB
P. S. I've benefited from stackoverflow for years; but I just signed up for this question and the site wouldn't let me directly comment to sresevoir. As a "brand new" user I don't have enough reputation points.

Keep all the keys of the hash in lower case, and do this:
s/(#{[join "|", keys %replacements]})/$replacements{ lc $1 }/ig
(note the addition of lc)
There are a few other things you ought to consider.
First, as is, if you are trying to replace both lovely and love with different replacements, lovely may or may not ever be found, depending on which key is returned by keys first. To prevent this, it's a good idea to sort by descending length:
s/(#{[join "|", sort { length $b <=> length $a } keys %replacements]})/$replacements{$1}/ig
Second, this technique only works with fixed strings; if your keys contain any regex metacharacters, for instance replacing how? with why?, it will fail, because $1 will never be how?. To allow metacharacters (interpreted as literal characters), quote them:
s/(#{[join "|", map quotemeta, sort { length $b <=> length $a } keys %replacements]})/$replacements{$1}/ig
From your comment, it seems to me that you want to find certain strings, all in one pass, and add stuff around them (that doesn't vary by which string). If so, you are going about it the hard way and shouldn't be using a hash at all. Have an array of the strings you want to search for and replace them:
s/(#{[join "|", map quotemeta, sort { length $b <=> length $a } #search_strings]})/<span style="background-color:#FFFFCC;">$1<\/span>/ig;

The problem is that, if you have a hash like this
my %replacements = (
word => '<span style="background-color:#FFFFCC;">word</span>'
)
then the substitution will look like
s/(word)/$replacements{$1}/ig;
But a case-independent regex pattern will match WORD as well, so the replacement expression $replacements{$1} will be $replacements{'WORD'} which doesn't exist.
While you may be pleased with his solution, sresevoir uses an ugly way of embedding a string expression within a regex. This
($val = $sentence) =~ s/(#{[join "|", keys %replacements]})/$replacements{$1}/g;
would be much better as
my $pattern = join '|', keys %replacements;
($val = $sentence) =~ s/($pattern)/$replacements{$1}/g;
But you have generalised this hash idea too far and it is the wrong way to make the changes that you need. If your replacement string is a simple function of the original string, as in this case, then it is best written directly as a replacement string using captures from the pattern. I would write it like this
my $pattern = join '|', #searchtermarray;
$sentence =~ s{($pattern)}{<span style="background-color:#FFFFCC;">$1</span>\n}ig;
But note that that, as it stands, the search will find any words that are substrings of anything in the text, and will also go awry if #searchtermarray has any strings that contain regex metacharacters. You don't say anything about your actual data so I can't really help you to resolve this.

Related

Matching special character (###!~`%^&()[]}{;') and replace it with _ (underscore) in perl

I want to remove all special characters except this 2 character .-
$name=~s/[^\w\d\.-]/_/g ;
But the line above it not only removes the special character but also non-alphabet characters e.g Arabic or other none alphabet characters.
How to remove only these characters (###!~`%^&()[]}{;',)
There are a few things to consider here.
First, do \d and \w really do what you think they do? Recent perls are Unicode aware (and in some cases locale aware), and those character classes aren't the same in every situation.
Since you know what you want to exclude, you can just put that directly into the character class. You need escape only the ] so it doesn't end the character class:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/[(###!~`%^&()[\]}{;',)]/_/g;
say $name;
Mark Jason Dominus has written about the "American" and "Prussian" approaches to cleansing data. You can specify what to exclude, or what to include.
If you specify the things to exclude, you potentially pass through some things that you should have excluded but did not. This may be because you forgot or didn't even know you should exclude it. These unintended situations may bite you.
If you specify only the things that are safe, you potentially miss out on things you should pass through, but bad things don't get through by mistakes of omission.
You then might try this, where you don't use the character class shortcuts:
$name =~ s/[^0-9A-Za-z.-]/_/g;
But the output is a bit weird because this also replaces whitespace. You might add the \s shortcut in there:
$name =~ s/[^0-9A-Za-z\s.-]/_/g;
But the meaning of \s has also changed over time too (vertical tab!) and is also Unicode aware. You could list the whitespace you would accept:
$name =~ s/[^0-9A-Za-z\x20.-]/_/g;
But no this is getting a bit weird. There's another way. You can go back to the ASCII versions of the character class shortcuts with the /a flag:
$name =~ s/[^\d\w\s.-]/_/ga;
The regex operator flags are in perlop since they apply to an operator. But, for as long as I've been using Perl and telling that to people in classes, someone I still go to perlre first.
Transliterate
Second, the substitution operator may be more than you need though. If you want to change single characters into other single characters, the transliteration operator may be what you need. It changes the character on the left with the corresponding character on the right:
$name =~ tr/abc/XYZ/; # a -> X, b -> Y, c -> Z
If you don't have enough characters to match up on the right, it reuses the last character:
$name =~ tr/abc/XY/; # a -> X, b -> Y, c -> Y
So, in your case with one underscore:
$name =~ tr/##!~`%^&()[]}{;',/_/;
Since the sequence of characters in tr/// aren't a regular expression, you don't worry about metacharacters.
Just for giggles
If this pattern is something you want to use in multiple places, you might want to give it a name with a user-defined Unicode property. Once it has a name, you use that everywhere and can update for everyone at the same time:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/\p{IsForbidden}/_/g;
say $name;
sub IsForbidden {
# see https://perldoc.perl.org/perlunicode#User-Defined-Character-Properties
state $exclude = q|##!~`%^&()[]}{;',|;
state $string =
join '',
map { sprintf "%X\n", ord }
split( //, $exclude );
return $string;
}
Building on Gene's comment, specify what you want to replace but I'd escape each special character. Note, to replace #, use \#\# in character array as shown in line 2:
$name = "# # R ! ~## ` % ^ & ( O ){{();,'`## { } ;!!! ' N , ";
$name =~ s/[\#\!\~\`\%\&\^\(\)\{\}\;\'\,\#\#]//g;
$name =~ s/ *//g;
print $name;
### Outputs RON

How can I remove all the vowels unless they are in word beginnings?

$text = "I like apples more than oranges\n";
#words = split /” “/, $text;
foreach (#words) [1..] {
if $words "AEIOUaeiou";
$words =~ tr/A E I O U a e i o u//d;
}
print "$words\n";
"I like apples more than oranges" will become "I lk appls mr thn orngs". "I" in "I", "a" in "appls" and "o" in "orngs" will stay because they are the first letter in the word.
This is my research assignment as a first year student. I am allowed to ask questions and later cite them. Please don't be mean.
I know you say you are not allowed to use a regex, but for everyone else that shows up here I'll show the use of proper tools. But, then I'll do something just as useful with tr///.
One of the tricks of programming (and mathematics) decomposing what look like hard problems into easier problems, especially if you already have solutions for the easy problems. (Read about Parnas decomposition, for example).
So, the question is "How can I remove all the vowels unless they are in word beginnings?" (after I made your title a bit shorter). This led the answers to think about words, so they split up the input, did some work to ensure they weren't working on the first character, and then reassembled the result.
But, another way to frame the problem is "How do I remove all the vowels that come after another letter?". The only letter that doesn't come after another letter is the first letter of a word.
The regex for a vowel that comes after another letter is simple (but I'll stick to ASCII here, although it is just as simple for any Unicode letter):
[a-z][aeiou]
That only matches when there is a vowel after the first letter. Now you want to replace all of those with nothing. Use the substitution operator, s///. The /g flag makes all global substitutions and the /i makes it case insensitive:
s/[a-z][aeiou]//gi;
But, there's a problem. It also replaces that leading letter. That's easy enough to fix. The \K in a substitution says to ignore the part of the pattern before it in the replacement. Anything before the \K is not replaced. So, this only replaces the vowels:
s/[a-z]\K[aeiou]//gi;
But, maybe there are vowels next to each other, so throw in the + quantifier for "one or more" of the preceding item:
s/[a-z]\K[aeiou]+//gi;
You don't need to care about words at all.
Some other ways
Saying that a letter must follow another letter has a special zero-width assertion: the non-word boundary, \B (although that also counts digits and underscore as "letters"):
s/\B[aeiou]+//gi;
The \K was introduced v5.10 and was really a nifty trick to have a variable-width lookbehind. But, the lookbehind here is fixed width: it's one character:
s/(?<=[a-z])[aeiou]+//gi;
But, caring about words
Suppose you need to handle each word separately, for some other requirement. It looks like you've mixed a little Python-ish sort of code, and it would be nice if Perl could do that :). The problem doesn't change that much because you can do the same thing for each individual word.
foreach my $word ( split /\s+/, $x ) {
.... # same thing for each word
}
But, here's an interesting twist? How do you put it all back together? The other solutions just use a single space assuming that's the separator. Maybe there should be two spaces, or tabs, or whatever. The split has a special "separator retention mode" that can keep whatever was between the pieces. When you have captures in the split pattern, those capture values are part of the output list:
my #words_and_separators = split /(\s+)/, $x;
Since you know that none of the separators will have vowels, you can make substitutions on them knowing they won't change. This means you can treat them just like the words (that is, there is no special case, which is another thing to think about as you decompose problems). To get your final string with the original spacing, join on the empty string:
my $ending_string = join '', #words_and_separators;
So, here's how that might all look put together. I'll add the /r flag on the substitution so it returns the modified copy instead of working on the original (don't modify the control variable!):
my #words;
foreach my $word ( split /(\s+)/, $x ) {
push #words, $word =~ s/\B[aeiou]+//gr;
}
my $ending_string = join '', #words;
But, that foreach is a bit annoying. This list pipeline is the same, and it's easier to read these bottom to top. Each thing produces a list that flows into the thing above it. This is how I'd probably express it in real code:
my $ending_string =
join '',
map { s/\B[aeiou]+//gr } # each item is in $_
split /(\s+)/, $x;
Now, here's the grand finale. What if we didn't split thing up on whitespace but on whitespace and the first letter of each word? With separator retention mode we know that we only have to affect every other item, so we count them as we do the map:
my $n = 0;
my $ending_string =
join '',
map { ++$n % 2 ? tr/aeiouAEIOU//dr : $_ }
split /((?:^|\s+)[a-z])/i, $x;
But, I wouldn't write this technique in this way because someone would ultimately find me and exact their revenge. Instead, that foreach I found annoying before may soothe the angry masses:
my $n = 0;
foreach ( split /((?:^|\s+)[a-z])/i, $x ) {
print ++$n % 2 ? tr/aeiouAEIOU//dr : $_;
}
This now remembers the actual separators from the original string and leaves alone the first character of the "word" because it's not in the element we will modify.
The code in the foreach doesn't need to use the conditional operator, ?: or some of the other features. The important part is skipping every other element. That split pattern is a bit of a puzzler if you haven't seen it before, but that's what you get with those sorts of requirements. I think modifying a portion of the substring is just as likely to trip up people on a first read.
I mean, if they are going to make you do it the wrong way in the homework, strike back with something that will take up a bit of their time. :)
Oh, this is fun
I had another idea, because tr/// has another task beyond transliteration. It also counts. Because it returns the number of replacements, if you replace anything with itself, you get a count of the occurrences of that thing. You can count vowels, for instance:
my $has_vowels = $string =~ tr/aeiou/aeiou/; # counts vowels
But, with a string of one letter, that means you have a way to tell if it is a vowel:
my $is_vowel = substr( $string, $i, 1 ) =~ tr/aeiou/aeiou/;
You also can know things about the previous character:
my $is_letter = substr( $string, $i - 1, 1 ) =~ tr/a-zA-Z/a-zA-Z/;
Put that together and you can look at any position and know if it's a vowel that follows a letter. If so, you skip that letter. Otherwise, you add that letter to the output:
use v5.10;
$x = "I like apples more than oranges oooooranges\n";
my $output = substr $x, 0, 1; # avoid the -1 trap (end of string!)
for( my $i = 1; $i < length $x; $i++ ) {
if( substr( $x, $i, 1 ) =~ tr/aeiou/aeiou/ ) { # is a vowel
next if substr( $x, $i - 1, 1 ) =~ tr/a-zA-Z/a-zA-Z/;
}
$output .= substr $x, $i, 1;
}
say $output;
This has the fun consequence of using the recommended operator but completely bypassing the intent. But, this is a proper and intended use of tr///.
It appears that you need to put a little more effort into learning Perl before taking on challenges like this. Your example contains a lot of code that simply isn't valid Perl.
$x = "I like apples more than oranges\n"; #the original sentence
foreach $i in #x[1..] {
You assign your text to the scalar variable $x, but then try to use the array variable #x. In Perl, these are two completely separate variables that have no connection whatsoever. Also, in Perl, the range operator (..) needs values at both ends.
If you had an array called #x (and you don't, you have a scalar) then you could do what you're trying to do here with foreach $i (#x)
if $i "AEIOUaeiou";
I'm not sure what you're trying to do here. I guess the nearest useful Perl expression I can see would be something like:
if ($i =~ /^[AEIOUaeiou]$/)
Which would test if $i is a vowel. But that's a regex, so you're not allowed to use it.
Obviously, I'd solve this problem with a regex, but as those are banned, I've reached for some slightly more obscure Perl features in my code below (that's so your teacher won't believe this is your solution if you just cut and paste it):
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $text = "I like apples more than oranges\n";
# Split the string into an array of words
my #words = split /\s+/, $text;
# For each word...
for (#words) {
# Get a substring that omits the first character
# and use tr/// to remove vowels from that substring
substr($_, 1) =~ tr/AEIOUaeiou//d;
}
# Join the array back together
$text = join ' ', #words;
say $text;
Update: Oh, and notice that I've used tr/AEIUOaeiou//d where you have tr/A E I O U a e i o u//d. It probably won't make any difference here (depending on your approach - but you'll probably be applying it to strings that don't contain spaces) but it's good practice to only include the characters that you want to remove.
We can go over the input string from the end and remove any vowel that's not preceded by a space. We go from right to left so we don't have to adjust the position after each deletion. We don't need to check the very first letter, it shouldn't be ever removed. To remove a vowel, we can use tr///d on the substr of the original string.
for my $i (reverse 1 .. length $x) {
substr($x, $i, 1) =~ tr/aeiouAEIOU//d
if substr($x, $i - 1, 1) ne ' ';
}
Firstly your if statement is wrong.
Secondly this is not a Perl code.
Here is a piece of code that will work, but there is a better way to do it
my $x = "I like apples more than oranges\n";
my $new = "";
my #arr;
foreach my $word (split(' ', $x)) {
#arr = split('', $word);
foreach (my $i; $i<scalar #arr; $i++){
if ($i == 0){
$new .= $arr[$i];
}
elsif (index("AEIOUaeiou", $arr[$i]) == -1) {
$new .= $arr[$i];
}
}
$new .= " ";
}
print "$new\n";
Here I am splitting the string in order to get an array, then I am checking if the given char is a vowel, if it's not, I am appending it to a new string.
Always include
use strict;
use warnings;
on top of your code.
Clearly this is an exercise in lvalues. Obviously. Indubitably!
#!/usr/bin/env perl
# any old perl will do
use 5.010;
use strict;
use warnings;
# This is not idomatic nor fantastic code. Idiotastic?
$_='I am yclept Azure-Orange, queueing to close a query. How are YOU?';
# My little paws typed "local pos" and got
# "Useless localization of match position" :(
# so a busy $b keeps/restores that value
while (/\b./g) {
substr($_,$b=pos,/\b/g && -$b+pos)
# Suggestion to use tr is poetic, not pragmatic,
# ~ tr is sometimes y and y is sometimes a vowel
=~ y/aeiouAEIOU//d;
pos=$b;
}
say
# "say" is the last word.
Was there an embargo against using s/// substitution, or against using all regex? For some reason I thought matching was OK, just not substitution. If matches are OK, I have an idea that "improves" upon this by removing $b through pattern matching side effects. Will see if it pans out. If not, should be pretty easy to replace /\b/ and pos with index and variables, though the definition of word boundary over-simplifies in that case.
(edit) here it is a little more legible with nary a regex
my $text="YO you are the one! The-only-person- asking about double spaces.
Unfortunate about newlines...";
for (my $end=length $text;
$end > 0 && (my $start = rindex $text,' ',$end);
$end = $start-1) {
# y is a beautiful letter, using it for vowels is poetry.
substr($text,2+$start,$end-$start) =~ y/aeiouUOIEA//d;
}
say $text;
Maybe more devious minds will succeed with vec, unpack, open, fork?
You can learn about some of these techniques via
perldoc -f substr
perldoc -f pos
perldoc re
As for my own implementer notes, the least important thing is ending without punctuation so nothing can go after

Perl Hashes and regex

I am working on a code that splits sentence into individual words, the words are then searched against hash keys for their presence. My code returns terms that are 100% same, after a match I tag the word from the sentence with the value that corresponds to the matching key. The problem is the code tags terms but with random values not with what I expect. Also, there are situations where the term and the hash key are similar but not 100% identical,
how can I write a regular expression to match my terms with the keys.
Note: I have stemmed the hash keys to their root forms.
I cam provide some examples: If the term from the sentence is Synergistic or anti-synergistic, and my hash key is Synerg, how can I match the above term with Synerg.
My code is as follows:
open IN, "C:\\Users\\Desktop\\TM\\clean_cells.txt" or die "import file absent";
my %hash=();
use Tie::IxHash;
tie %hash => "Tie::IxHash";
while(<IN>)
{
chomp $_;
$line=lc $_;
#Organs=split/\t/, $line;
$hash{$Organs[0]}=$Organs[1];
}
$Sentence="Lymphoma is Lymph Heart and Lung";
#list=split/ /,$Sentence;
#array=();
foreach $term(#list)
{
chomp $term;
for $keys(keys %hash)
{
if($hash{$term})
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(#array, $cell);
}
elsif($term=~m/\b\Q$keys(\w+)\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(#array, $cell);
}
elsif($term=~m/\b\Q(\w+)$keys\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(#array, $cell);
}
elsif($term=~m/\b\Q(\w+)$keys(\w+)\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(#array, $cell);
}
}
}
print #array;
for example: hash looks like this: %hash={
TF1 => Lymph
Thoracic_duct => Lymph
SK-MEL-1 => Lymph
Brain => Brain
Cerebellum => Brain
};
So if the term TF1 is found it should be substituted to Lymph TF1 /Lymph
I found two big problems that were preventing your code from working:
You are making the keys to your hash lowercase, but you are not doing
the same for the terms in $Sentence. Thus, uppercase words from
$Sentence will never match.
The \Q...\E modifier disables regex meta-characters. While it is often good to do this when interpolating a variable, you cannot use expressions like (\w+) in there--that will look for the literal characters (\w+). Those regexes need to be rewritten like this: m/\b\Q$keys\E(\w+)\b/.
There are other design issues with your code, as well:
You are using undeclared global variables all over the place. You should declare all variables with my. Always turn on use strict; use warnings;, which will force you to do this correctly.
There doesn't appear to be any reason for Tie::IxHash, which causes your hash to be ordered. You don't use this ordering in any way in your code. The output is ordered by #list. I would do away with this unnecessary module.
Your if/elsif statements are redundant. if($term=~m/\b\Q(\w*)$keys(\w*)\E\b/) will accomplish the same thing as all of them combined. Note that I replaced \w+ with \w*. This allows the groups before and after to match zero or more characters instead of one or more characters.
Note: I didn't bother testing with Tie::IxHash, since I don't have that module and it appears unnecessary. It's possible that using this module is also introducing other problems in your code.

Perl: Greedy nature refuses to work

I am trying to replace a string with another string, but the greedy nature doesn't seem to be working for me. Below is my code where "PERFORM GET-APLCY" is identified and replaced properly, but string "PERFORM GET-APLCY-SOI-CVG-WVR" and many other such strings are being replaced by the the replacement string for "PERFORM GET-APLCY".
s/PERFORM $func[$i]\.*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
where the full stop is optional during string match and replacement. I have also tried giving the pattern to be matched as $func[$i]\b
Please help me understand what the issue could be.
Thanks in advance,
Faez
Why GET-APLCY- should not match GET-APLCY., if the dot is optional?
Easy solution: sort your array by length in descending order.
#func = sort { length $b <=> length $a } #func
Testing script:
#!/usr/bin/perl
use warnings;
use strict;
use feature 'say';
my %hash = ('GET-APLCY' => 'REP1',
'GET-APLCY-SOI-CVG-WVR' => 'REP2',
'GET-APLCY-SOI-MNG-CVRW' => 'REP3',
);
my #func = sort { length $b <=> length $a } keys %hash;
while (<DATA>) {
chomp;
print;
print "\t -> \t";
for my $i (0 .. $#func) {
s/$func[$i]/$hash{$func[$i]}/;
}
say;
}
__DATA__
GET-APLCY param
GET-APLCY- param
GET-APLCY. param
GET-APLCY-SOI. param
GET-APLCY-SOI-CVG-WVR param
GET-APLCY-SOI-MNG-CVRW param
You appear to be looping over function names, and calling s/// for each one. An alternative is to use the e option, and do them all in one go (without a loop):
my %hash = (
'GET-APLCY' => 'replacement 1',
'GET-APLCY-SOI-CVG-WVR' => 'replacement 2',
);
s{
PERFORM \s+ # 'PERFORM' keyword
([A-Z-]+) # the original function name
\.? # an optional period
}{
"# PERFORM $1.\n" . $hash{$1};
}xmsge;
The e causes the replacement part to be evaluated as an expression. Basically, the first part finds all PERFORM calls (I'm assuming that the function names are all upper case with '-' between them – adjust otherwise). The second part replaces that line with the text you want to appear.
I've also used the x, m, and s options, which is what allows the comments in the regular expression, among other things. You can find more about these under perldoc perlop.
A plain version of the s-line should be:
s/PERFORM ([A-Z-]+)\.?/"# PERFORM $1.\n" . $hash{$1}/eg;
I guess that $func[$i] contains "GET-APLCY". If so, this is because the star only applies to the dot, an actual dot, not "any character". Try
s/PERFORM $func[$i].*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
I'm pretty sure you trying to do some kind of loop for $i. And in that case most likely
GET-APLCY is located in #func array before GET-APLCY-SOI-CVG-WVR. So I recommend to reverse sort #func before entering loop.

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven´t been able to make it work. Eventually i've found a way that works:
my #atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (#atoms,$atomi);
$natoms++;
}
#atoms = sort {$b cmp $a} #atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that sometimes input files are not always identical and they make contain white spaces before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated on the split command and I'm having some trouble to remove it. This is what I have done:
#info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (#info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array #info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my #tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', #tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc split)
Also, the first part of your program might be simpler as:
my #atoms = sort {$b cmp $a} keys %$atome;
my $natoms = #atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my #atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
if (grep { /$atom_id/ } #atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}
Lets warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my #atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar #atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly by the count of values in #atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
for your second question you could use this line:
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read alle the lines of the file and use them in a hash with $info[0] as the key and [#info[1..8]] as the value. Then you can lookup the entries by your index.
my %details;
while (<IN_LIG>) {
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ #info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).
thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. thanks!
As for the regex, I found something that could make all keys as part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, but I liked the join/quotemeta solution better
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys