Perl hash while loop cannot find the key value - perl

I am confused by a Perl problem. Does anyone have an idea?
I use a hash to store keys and values like:
$hash{1} -> a;
$hash{2} -> b;
$hash{3} -> c;
$hash{4} -> d;
....
with more than 1000 entries. The hash is named %hash.
I then loop over all the keys to check whether any of them matches a line read from a file.
For example, here is the file content:
first line 1
second line 2
nothing
another line 3
My logic is:
while (read line) {
    while (($key, $value) = each (%hash))
    {
        if ($line =~ /$key/i) {
            print "found";
        }
    }
}
So my expectation is:
first line 1 -> return found
second line 2 -> return found
nothing
another line 3 -> return found
....
However, during my testing only the first and second lines return found; for 'another line 3' the program does not print 'found'.
Note: the hash has more than 1000 records.
I tried to debug it by adding a counter inside the loop. In the cases that are found, the loop runs about 600 or 700 times, but in the 'another line 3' case it runs only around 300 times, then exits the loop without printing found.
Any idea why this happens?
I also ran one more test: if the hash is small, with only 10 keys, the logic works.
I also tried foreach, and it looks like foreach does not have this issue.

The pseudocode you give should work fine, but there might be a subtle problem.
If you leave the while loop early after finding your key and printing it (for example with last), the next call to each continues from where you left off. In other words, each is an iterator that stores its state in the hash it iterates over.
In http://blogs.perl.org/users/rurban/2014/04/do-not-use-each.html the author explains this in more detail. His conclusion:
So each should be treated as in php: Avoid it like a plague. Only use it in optimized cases where you know what you are doing.
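A minimal sketch of that behaviour, assuming the real code leaves the loop as soon as a key matches (the pseudocode in the question does not show this). The sub names scan_with_each and scan_with_reset and the sample hash are hypothetical, made up for illustration:

```perl
use strict;
use warnings;

# Search for $wanted with each(), leaving the loop as soon as it is found.
# Returns (found flag, number of keys visited in this call).
sub scan_with_each {
    my ($h, $wanted) = @_;
    my $visited = 0;
    while (my ($key, $value) = each %$h) {
        $visited++;
        return (1, $visited) if $key eq $wanted;   # early exit: iterator stays mid-hash
    }
    return (0, $visited);
}

# Same search, but reset the hash's iterator first.
sub scan_with_reset {
    my ($h, $wanted) = @_;
    keys %$h;                                      # keys in void context resets each()
    return scan_with_each($h, $wanted);
}

# Hypothetical stand-in for the 1000-entry hash from the question.
my %hash = map { $_ => "value$_" } 1 .. 1000;

my ($found1, $n1) = scan_with_each(\%hash, 3);     # finds key 3, stops early
my ($found2, $n2) = scan_with_each(\%hash, 3);     # resumes after key 3, never sees it
my ($found3, $n3) = scan_with_reset(\%hash, 3);    # starts from the beginning again

print "first scan:  found=$found1 after $n1 keys\n";
print "second scan: found=$found2 after $n2 keys\n";
print "with reset:  found=$found3 after $n3 keys\n";
```

The second scan only walks the keys the first scan had not reached yet, which would explain a loop that stops after a few hundred iterations without a match. A foreach over keys %hash builds a fresh key list each time, so it does not depend on where a previous loop stopped.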

The problem is not very well articulated by the OP, and the sample data provided are poor for demonstration purposes.
The following sample code is an attempt based on the OP's problem description.
It recreates the filter hash from the DATA block, composes $re_filter from the keys of the hash, and walks through the file given as a command-line argument, printing the lines that match $re_filter.
use strict;
use warnings;
my $data = do { local $/; <DATA> };
my %hash = split ' ', $data;
my $re_filter = join('|',keys %hash);
/$re_filter/ && print for <>;
__DATA__
1 a
2 b
3 c
4 d
Input data file content
first line 1
second line 2
nothing
another line 3
Output
first line 1
second line 2
another line 3
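One caveat with a plain alternation of keys: the key 1 also matches inside 10 or 31. A possible refinement, sketched here under the assumption that keys are whole words in the input, anchors the alternation with \b and captures the matched key so its value can be looked up. The sub names build_key_regex and lookup_line and the sample hash are hypothetical:

```perl
use strict;
use warnings;

# Build a single regex that matches any key of the hash as a whole word.
sub build_key_regex {
    my ($h) = @_;
    my $alternation = join '|', map { quotemeta($_) } keys %$h;
    return qr/\b($alternation)\b/;
}

# Return the hash value for the first key found in $line, or undef if none matches.
sub lookup_line {
    my ($re, $h, $line) = @_;
    return $line =~ $re ? $h->{$1} : undef;
}

# Hypothetical lookup table standing in for the large hash.
my %hash = (1 => 'a', 2 => 'b', 3 => 'c', 10 => 'j');
my $re   = build_key_regex(\%hash);

for my $line ("first line 1", "nothing", "another line 3", "line 31") {
    my $value = lookup_line($re, \%hash, $line);
    print "$line -> ", (defined $value ? "found ($value)" : "not found"), "\n";
}
```

Because the regex is compiled once, each input line is checked in a single match instead of one match per key.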

Related

delete previous and next lines in perl

I have the following file:
#TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
#TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
#TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
#TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
#TWEETY:150:000000000-ACFKE:1:2106:25318:19875
+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The lines are in groups of four (each time there is a name, starting with #TWEETY, a string of letters, a + character, and another string of letters).
The second and fourth lines should have the same number of characters.
But there are cases where the second line is empty, as in the last four lines.
In these cases, I would like to get rid of the whole block (the previous line before the empty line and the next two lines).
I have just started perl and have been trying to write a script for my problem, but am having a hard time. Does anyone have some feedback?
Thanks!
Keep an array buffer of the last four lines. When it's full, check the second line, print the lines or not, empty the buffer, repeat.
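#!/usr/bin/perl
use warnings;
use strict;
my @buffer;
# Print the buffered record unless its second line is empty
# (length 1 means it holds only the newline), then clear the buffer.
sub output {
    print @buffer unless 1 == length $buffer[1];
    @buffer = ();
}
while (<>) {
    if (4 == @buffer) {
        output();
    }
    push @buffer, $_;
}
output(); # Don't forget to process the last four lines.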
Yes. Start with looking at $/ and set it so you can work on a chunk at a time. I would suggest you can treat # as a record separator in your example.
Then iterate your records using a while loop. E.g. while ( <> ) {
Use split on \n to turn the current chunk into an array of lines.
Perform your test on the appropriate lines, and either print - or not - depending on whether it passed.
If you get stuck with that, then I'm sure a specific question including your code and where you're having problems will be well received here.
If you chunk the data correctly, this becomes almost trivial.
#!/usr/bin/perl
use strict;
use warnings;
# Use '#TWEETY' as the record separator to make it
# easy to chunk the data.
local $/ = '#TWEETY';
while (<DATA>) {
# The first entry will be empty (as the separator
# is the first thing in the file). Skip that record.
next unless /\S/;
# Skip any records with two consecutive newlines
# (as they will be the ones with the empty line 2)
next if /\n\n/;
# Print the remaining records
# (with $/ stuck back on the front)
print "$/$_";
}
__DATA__
#TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
#TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
#TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
#TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
#TWEETY:150:000000000-ACFKE:1:2106:25318:19875
+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Thanks everyone for the feedback!
It was all really useful. Thanks to your suggestions, I explored all the options and learned the unless statement.
The easiest solution given my existing code, was just to add an unless statement at the end.
### Write to output, but remove non-desired Gs
open OUT, ">$outfile";
my @accorder = @{$store0{"accorder"}};
foreach my $acc (@accorder){
# retrieve seq(2nd line) and qual(4th line)
my $seq = $store0{$acc}{"seq"};
my $qual = $store0{$acc}{"qual"};
# clean out polyG at end
$seq =~ s/G{3,}.{0,1}$//;
my $lenseq = length($seq);
my $lenqual = length($qual);
my $startqual = $lenqual - $lenseq;
$qual = substr($qual, 0, $lenseq);
#the above was in order to remove multiple G characters at the end of the
#second line, which is what led to empty lines (lines that were made up of
#only Gs got cut out)
# print to output, unless sequence has become empty
unless($lenseq == 0){ #this is the unless statement I added
print OUT "\#$acc\n$seq\n+\n$qual\n";
}
}
close(OUT);

Correct use of input file in perl?

database.Win.txt is a file that contains a multiple of 3 lines. The second of every three lines is a number. The code is supposed to print out the three lines (in a new order) on one line separated by tabs, but only if the second line is 1.
Am I, by this code, actually getting the loop to create an array with three lines of database.Win.txt each time it runs through the loop? That's my goal, but I suspect this isn't what the code does, since I get an error saying that the int() function expects a numeric value, and doesn't find one.
while(<database.Win.txt>){
$new_entry[0] = <database.Win.txt>;
$new_entry[1] = <database.Win.txt>;
$new_entry[2] = <database.Win.txt>;
if(int($new_entry[1]) == 1) {
chomp($new_entry);
print "$new_entry[1], \t $new_entry[2], \t $new_entry[0], \n"
}
}
I am a total beginner with Perl. Please explain as simply as possible!
I think you've got a good start on the solution. However, your while condition reads one line right before the next three lines are read (if those were <$file_handle> reads), so that line is lost. int isn't necessary, but chomp is, and it must happen before you check the value of $new_entry[1]; otherwise there's still a record separator at the end.
Given this, consider the following:
use strict;
use warnings;
my @entries;
open my $fh, '<', 'database.Win.txt' or die $!;
while (1) {
    last if eof $fh;
    chomp( $entries[$_] = <$fh> ) for 0 .. 2;
    if ( $entries[1] == 1 ) {
        print +( join "\t", @entries ), "\n";
    }
}
close $fh;
Always start with use strict; use warnings. Next, open the file using the three-argument form of open. A while (1) is used here, so three lines at a time can be read within the while loop. Since it's an 'infinite' while loop, the last if eof $fh; gives a way out, viz., if the next file read produces an end of file, it's the last. Right below that is a for loop that effectively does what you did: assign a file line to an array position. Note that chomp is used to remove the record separator during the assignment. The last part is also similar to yours, as it checks whether the second of the three lines is 1, and then the line is printed if it is.
Hope this helps!
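Two details from the question are worth a sketch. First, <database.Win.txt> is not a read from a file: Perl treats angle brackets around anything other than a filehandle or a simple scalar as a filename glob. Second, the question prints the lines in a new order (second, third, first). The variant below assumes that order and uses an array slice for it; the sub name select_records and the in-memory sample data are hypothetical stand-ins for the real file:

```perl
use strict;
use warnings;

# Read three-line records from $fh and return those whose second line is 1,
# each as a tab-joined string in the order: second, third, first line.
sub select_records {
    my ($fh) = @_;
    my @out;
    until (eof $fh) {
        my @entry = map { scalar <$fh> } 0 .. 2;
        last if grep { !defined $_ } @entry;    # incomplete trailing record
        chomp @entry;
        push @out, join("\t", @entry[1, 2, 0]) if $entry[1] == 1;
    }
    return @out;
}

# Hypothetical sample data standing in for database.Win.txt.
my $sample = "alpha\n1\nfirst note\nbeta\n0\nsecond note\ngamma\n1\nthird note\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for select_records($fh);
close $fh;
```

To run it against the real file, open my $fh, '<', 'database.Win.txt' would replace the in-memory open.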

How to merge files with line-skipping

I have two files:
file f1 has the following structure (the text after each # is a comment that is not in the file)
SomeText1 #Section name - one word [a-zA-Z]
acd:some text #code:text - the code contains only [a-z]
opo:some another text #variable number of code:text pairs
wed:text too #in the SomeText1 section are 3 pairs
SomeText2
xxx:textttt #here only 1 code:text pair
SomeText3
zzz:texxxxxxx #here only 1 code:text pair too
and file f2, which contains the following lines in the same order as the sections of the file above:
1000:acd:opo:wed:123.44:4545.23:1233.23 #3 codes - like in the above segment 1
304:xxx:10:11:12.12 #1 code - these lines contains only
4654:zzz:0 #codes and numbers
the desired output is
SomeText1:1000:acd:opo:wed:123.44:4545.23:1233.23
acd:some text:
opo:some another text:
wed:text too:
SomeText2:304:xxx:10:11:12
xxx:textttt:
SomeText3:4654:zzz:0
zzz:texxxxxxx:
So I need to append each line from f2 to the corresponding "section name" line. The codes in each line of f2 are the same as the codes in the code:text pairs of that section in f1.
I have no idea how to start, because
I can't use the paste command, since the two files don't have the same line count, and
I can't use join, because there are no common keys in the two files.
So I would be really happy if someone could tell me SOME ALGORITHM to start with, and I will program it myself.
I'm offering you a different approach: I provide the code, and you figure out how it works ;) :)
paste -d':' f1 <(perl -pe '$\="\n"x($c=()=/[a-z]+/g)' <f2)
produces exactly what you want from your inputs.
EDIT - Explanation:
The solution comes from your comment that the lines contain only codes and numbers, so it is easy to extract the codes from each line.
It is therefore enough to print as many empty lines after each line of f2 as that line has codes:
the /[a-z]+/g matches every code and returns them
the $c = () = is the "Rolex operator", which counts the list of matches
the count of matched codes gives the number of empty lines needed
the $\ = "\n" x NUMBER means: repeat the string before the x NUMBER times; e.g. with 3 codes, the "\n" (newline) character is repeated 3 times
the newlines are stored in the variable $\, the output record separator
because the -p switch processes the file line by line and prints every line as if by print $_ with $\ appended, the output record separator, which holds those newlines, is printed after every line
therefore we get the empty lines
I hope my English was good enough for the explanation.
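The counting idiom and the x repetition described above can be tried in isolation. This is a small sketch with hypothetical helper names (count_codes, pad_line); it reproduces what the one-liner prints for a single line of f2:

```perl
use strict;
use warnings;

# Count the lower-case codes in one line of f2.
sub count_codes {
    my ($line) = @_;
    # Assigning the match list to () in scalar context yields the number of matches.
    my $count = () = $line =~ /[a-z]+/g;
    return $count;
}

# The line itself followed by one empty line per code, as the one-liner prints it.
sub pad_line {
    my ($line) = @_;
    return "$line\n" . ("\n" x count_codes($line));
}

print pad_line("1000:acd:opo:wed:123.44:4545.23:1233.23");   # line + 3 empty lines
print pad_line("304:xxx:10:11:12.12");                       # line + 1 empty line
```

After this padding, f2 has the same number of lines as f1, which is what lets paste line the two files up.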
Or wholly in Perl:
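use strict;
use warnings;
open my $f1, '<', 'f1' or die "f1: $!";
open my $f2, '<', 'f2' or die "f2: $!";
my $skip = 0;    # code:text lines still to pass through in the current section
while (<$f1>) {
    chomp;
    my $suffix;
    if ($skip--) {
        $suffix = "\n";
    } else {
        # Section name line: append the next f2 line and count its codes.
        $suffix = <$f2>;
        $skip = () = $suffix =~ /[a-z]+/g;
    }
    print "$_:$suffix";
}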

Perl script for creating two arrays

Input: A list of numbers on command line
Output: Two lists of numbers ,one with input numbers that are greater than zero and one with those that are less than zero (Ignoring zero valued numbers)
here is my code
#!/usr/bin/perl
$i++ = 0;
$j++ = 0;
while ($number = <>)
{
if($number<0)
$first[$i++]=$number;
else
$second[$j++]=$number;
}
print "The numbers with value less than zero are\n";
foreach $number (@first)
print $number;
print "The numbers with value greater than zero are\n"
foreach $number(@second)
print $number;
I am getting the following silly errors which i am not able to rectify.The errors are
divide.pl: 2: ++: not found
divide.pl: 3: ++: not found
divide.pl: 5: Syntax error: ")" unexpected
Can anybody help me out with rectifying these errors please? I am new to perl script
Curly braces on compound statements are not optional in Perl.
Your statements:
$i++=0;
$j++=0;
don't make sense; you probably just want to delete the "++".
You're missing a semicolon on one of your print statements.
Once you've got those problems fixed, you should add
use strict;
use warnings;
after the #! line. This will introduce more error messages; you'll need to fix those as well. For example, you'll need to declare your variables using my().
The code you present will hardly compile. Loops should have {} around the main block, arrays are better created with push (or unshift), you should use strict and warnings, and you can't do increments at the same time as assignments (e.g. $i++ = 0).
use v5.10;
use strict;
use warnings;
my (@first, @second);
while (<STDIN>) { # <STDIN> clearer than <> in this case
    chomp;
    if ($_ < 0) {
        push @first, $_;
    } elsif ($_ > 0) {
        push @second, $_;
    }
}
say "Numbers less than zero:";
say "@first";
say "Numbers greater than zero:";
say "@second";
I don't know what $i++ = 0 is supposed to mean, but change that to $i = 0 to initialize the variables.
Also, the first thing you should do in the while loop is call chomp($number) to remove the trailing newline. Perl still reads "5\n" as 5 in a numeric comparison, but the stored values keep their newlines unless you chomp them.
Once you've fixed that, post any new errors that show up - I don't see any other problems though.
How are you executing this perl script? Beyond the errors mentioned about the code itself. It looks like you are attempting to evaluate the code using dash instead of perl.
The errors you should be seeing if you were executing it with Perl would be like:
Can't modify postincrement (++) in scalar assignment at /tmp/foo.pl
line 2, near "0;"
But instead, your errors are more in line with what dash outputs:
$ dash /tmp/foo.pl
/tmp/foo.pl: 2: ++: not found
/tmp/foo.pl: 3: ++: not found
Once you've verified that you are running your Perl script properly, you can start working through the other problems people have mentioned in your code. The easiest way to do this is to run it via perl divide.pl instead of whatever you are doing.

What does it mean to pre-increment $#array?

I've come across the following line of code. It has issues:
it is intended to do the same as push
it ought to have used push
it's hard to read, understand
I've since changed it to use push
it does something I thought was illegal, but clearly isn't
here it is:
$array [++$#array] = 'data';
My question is: what does it mean to pre-increment $#array? I always considered $#array to be an attribute of an array, and not writable.
perldata says:
"The length of an array is a scalar value. You may find the length of array @days by evaluating $#days , as in csh. However, this isn't the length of the array; it's the subscript of the last element, which is a different value since there is ordinarily a 0th element. Assigning to $#days actually changes the length of the array. Shortening an array this way destroys intervening values. Lengthening an array that was previously shortened does not recover values that were in those elements."
Modifying $#array is useful in some cases, but in this case, clearly push is better.
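The behaviour quoted from perldata can be checked with a short sketch. The array contents are made up for illustration:

```perl
use strict;
use warnings;

my @days = qw(Mon Tue Wed Thu Fri);

$#days = 1;                        # shorten: only Mon and Tue remain
my $after_shrink = scalar @days;

$#days = 4;                        # lengthen again: the old values do not come back
my $after_grow = scalar @days;

print "after shrink: $after_shrink, after grow: $after_grow\n";
print defined($days[2]) ? "Wed came back\n" : "element 2 is undef\n";

$days[ ++$#days ] = 'Sat';         # the idiom from the question
print "last element: $days[-1], count: ", scalar(@days), "\n";
```

The last assignment grows the array by one slot and fills it, which is the same effect as push @days, 'Sat'.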
A post-increment returns the variable's value first and then increments it.
If you used post-increment you would be modifying the last element, since its index is returned first, and then pushing an empty element onto the end. On the second pass you would be modifying the element before that empty one and pushing a new empty one for later. So it wouldn't work like a push at all.
The pre-increment increments the variable and then returns it. That way your example always writes to a new last element of the array and works like push. Example below:
my (@post, @pre);
# Post-increment: the old last index is used, then an empty slot is appended.
$post[$#post++] = '1';
$post[$#post++] = '2';
$post[$#post++] = '3';
# Pre-increment: the array grows first, then the new last slot is filled.
$pre[++$#pre] = '1';
$pre[++$#pre] = '2';
$pre[++$#pre] = '3';
print "post keys: ".@post."\n";
print "post: @post\n";
print "pre keys: ".@pre."\n";
print "pre: @pre\n";
outputs:
post keys: 3
post: 2 3
pre keys: 3
pre: 1 2 3
Assigning a value larger than the current array length to $#array extends the array.
This code works too:
$ perl -le 'my @a; $a[@a]="Hello"; $a[@a]=" world!"; print @a'
Hello world!
Perl arrays are dynamic and grow when you assign beyond the current end.
First of all, that's foul.
That said, I'm also surprised that it works. I would have guessed that ++$#array would have gotten the "Can't modify constant" error you get when trying to increment a number. (Not that I ever accidentally do that, of course.) But, I guess that's exactly where we were wrong: $#array isn't a constant (a number); it's a variable expression. As such you can mess with it. Consider the following:
my @array = qw/1 2 3/;
++$#array;
$array[$#array] = qw/4/;
print "@array\n";
And even, for extra fun, this:
my @array = qw/1 2 3/;
$#array += 5;
foreach my $wtf (@array) {
    if (defined $wtf) {
        print "$wtf\n";
    }
    else {
        print "undef\n";
    }
}
And, yeah, the Perl Cookbook is happy to mess with $#array to grow or truncate arrays (Chapter 4, recipe 3). I still find it ugly, but maybe that's just a lingering "but it's a number" prejudice.