Counting the frequency of bases using while loop and substr with Perl - perl

I am trying to write a Perl script that counts each of the A/C/G/T bases in a DNA sequence, but I can't figure out what is wrong with my code. "ATCTAGCTAGCTAGCTA" is the kind of data I am given.
#!usr/bin/perl
use strict;
use warnings;
my $in_file = <$ARGV[0]>;
open( my $FH_IN, "<", $in_file );
my $dna = <$FH_IN>;
my $index = 0;
my ( $freq_a, $freq_c, $freq_g, $freq_t ) = 0;
my $dna_length = length($dna);
while ( $index < $dna_length ) {
my $base = substr( $dna, $index, 1 );
if ( $base eq "A" ) {
$freq_a++;
$index++;
next;
} elsif ( $base eq "C" ) {
$freq_c++;
$index++;
next;
} elsif ( $base eq "G" ) {
$freq_g++;
$index++;
next;
} elsif ( $base eq "T" ) {
$freq_t++;
$index++;
next;
} else {
next;
}
}
print "$freq_a\n$freq_c\n$freq_g\n$freq_t\n";
exit;
I know there are a lot of ways to do it, but what I want to know is what I did wrong so I can learn from mistakes.

Perl has a special file handle to use with these kinds of problems: the diamond operator <>. It reads input from the files named on the command line, if any, and from standard input otherwise.
Secondly, since you are only interested in ACGT, might as well look for only them, using a regex: /([ACGT])/g.
Thirdly, using a hash is the idiomatic way to count characters in Perl: $count{A}++
So your script becomes:
use strict;
use warnings;
my %count;
while (<>) {
while (/([ACGT])/g) {
$count{$1}++;
}
}
print "$_\n" for @count{qw(A C G T)};
Usage:
script.pl input.txt

Okay, you've done well so far and there's only one problem that stops your program from working.
It's far from obvious, but each line that's read from the file has a newline character "\n" at the end. So what's happening is that $index reaches the newline in the string, which is processed by the else clause (because it's not A, C, G or T) which just does a next, so the same character is processed over and over again. Your program just hangs, right?
You could remove the newline with chomp, but a proper fix is to increment $index in the else clause just as you do with all the other characters. So it looks like
else {
++$index;
next;
}
As you've suspected, there are much better ways to write this. There are also a couple of other nasties in your code, but that change should get you on your way for now.

It would be instructive for you to print out the values in $dna_length, $index and $base each time you go round the loop - immediately after you assign a value to $base.
Your code would be more robust if you moved the incrementing of $index to the end of the loop (outside of the if/elsif/else block) and removed all of your next statements.
An alternative "quick fix" is to chomp() the input line before you start processing it.
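Putting those suggestions together, a minimal sketch of the loop might look like this. The sub name count_bases and the hash %freq are illustrative choices, not part of the original script. Using a hash also sidesteps one of the other problems in the posted code: my ( $freq_a, $freq_c, $freq_g, $freq_t ) = 0; assigns 0 only to $freq_a and leaves the other three undefined.

```perl
use strict;
use warnings;

# Count A/C/G/T in one sequence string (count_bases is a hypothetical helper).
sub count_bases {
    my ($dna) = @_;
    chomp $dna;    # drop the trailing newline left by readline
    my %freq = ( A => 0, C => 0, G => 0, T => 0 );
    my $index = 0;
    while ( $index < length $dna ) {
        my $base = substr( $dna, $index, 1 );
        $freq{$base}++ if exists $freq{$base};
        $index++;    # advance exactly once per pass, whatever the character was
    }
    return %freq;
}

my %freq = count_bases("ATCTAGCTAGCTAGCTA\n");
print "$_: $freq{$_}\n" for qw(A C G T);
```

Because the increment sits outside any branch, an unexpected character such as a newline or N can no longer stall the loop.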

Related

Perl sub skips foreach within which it is called

I'm having some problem with a subroutine that locates certain files and extracts some data out of them.
This subroutine is called inside a foreach loop, but whenever the call is made the loop skips to its next iteration. So I am wondering whether any of the next;'s are somehow escaping from the subroutine to the foreach loop where it is called?
To my knowledge the sub looks solid though so I'm hoping if anyone can see something I'm missing?
sub FindKit{
opendir(DH, "$FindBin::Bin\\data");
my @kitfiles = readdir(DH);
closedir(DH);
my $nametosearch = $_[0];
my $numr = 1;
foreach my $kitfile (@kitfiles)
{
# skip . and .. and Thumbs.db and non-K-files
if($kitfile =~ /^\.$/) {shift @kitfiles; next;}
if($kitfile =~ /^\.\.$/) {shift @kitfiles; next;}
if($kitfile =~ /Thumbs\.db/) {shift @kitfiles; next;}
if($kitfile =~ /^[^K]/) {shift @kitfiles; next;}
# $kitfile is the file used on this iteration of the loop
open (my $fhkits,"<","data\\$kitfile") or die "$!";
while (<$fhkits>) {}
if ($. <= 1) {
print " Empty File!";
next;
}
seek($fhkits,0,0);
while (my $kitrow = <$fhkits>) {
if ($. == 0 && $kitrow =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/g) {
close $fhkits;
return $1;
}
}
$numr++;
close $fhkits;
}
return 0;
}
To summarize comments, the refactored code:
use File::Glob ':bsd_glob';
sub FindKit {
my $nametosearch = $_[0];
my @kitfiles = glob "$FindBin::Bin/data/K*"; # files that start with K
foreach my $kitfile (@kitfiles)
{
open my $fhkits, '<', $kitfile or die "$!";
my $kitrow_first_line = <$fhkits>; # read first line
return if eof; # next read is end-of-file so it was just header
my ($result) = $kitrow_first_line =~
/Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
return $result if $result;
}
return 0;
}
I use core File::Glob and enable :bsd_glob option, which can handle spaces in filenames. I follow the docs note to use "real slash" on Win32 systems.
I check whether there is only a header line using eof.†
I do not see how this can affect the calling code, other than by its return value. Also, I don't see how the posted code can make the caller skip the beat, either. That problem is unlikely to be in this sub.
Please let me know if I missed some point with the above rewrite.
† Previous version used to check whether there is just one (header) line by
1 while <$fhkits>; # check number of lines ...
return if $. == 1; # there was only one line, the header
That is also correct, but eof is much better.
The thing that is almost certainly screwing you here, is that you are shifting the list that you are iterating.
That's bad news, as you're deleting elements ... but in places you aren't necessarily thinking.
For example:
#!/usr/bin/env perl
use strict;
use warnings;
my @list = qw ( one two three );
my $count;
foreach my $value ( @list ) {
print "Iteration ", ++$count," value is $value\n";
if ( $value eq 'two' ) { shift @list; next };
}
print "@list";
How many times do you think that should iterate, and which values should end up in the array?
Because you shift you never process element 'three' and you delete element 'one'. That's almost certainly what's causing you problems.
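One way to avoid modifying the array you are looping over is to build a filtered copy first. The following is only a sketch: the @entries list is a made-up stand-in for what readdir would return.

```perl
use strict;
use warnings;

# Hypothetical directory listing, standing in for the result of readdir.
my @entries = ( '.', '..', 'Thumbs.db', 'K001.txt', 'notes.txt', 'K002.txt' );

# Build a new list instead of shifting the one being iterated.
my @kitfiles = grep { /^K/ } @entries;

foreach my $kitfile (@kitfiles) {
    print "would process $kitfile\n";
}
```

The original list is left untouched, so the loop visits every remaining element exactly once.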
You also:
open using a relative path, when your opendir used an absolute one.
skip a bunch of files, and then skip anything that doesn't start with K. Why not just search for things that do start with K?
read the file twice, and one is to just check if it's empty. The perl file test -z will do this just fine.
you set $kitrow for each line in the file, but don't really use it for anything other than pattern matching. It'd probably work better using implicit variables.
You only actually do anything on the first line - so you don't ever need to iterate the whole file. ($numr seems to be discarded).
you use a global match, but only use one result. The g flag seems redundant here.
I'd suggest a big rewrite, and do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use FindBin;
sub FindKit{
my ($nametosearch) = @_;
my $numr = 1;
foreach my $kitfile (glob "$FindBin::Bin\\data\\K*" )
{
if ( -z $kitfile ) {
print "$kitfile is empty\n";
next;
}
# $kitfile is the file used on this iteration of the loop
open (my $fhkits,"<", $kitfile) or die "$!";
# read only the first line, from the file handle rather than the file name
<$fhkits> =~ m/Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/
and return $1;
}
# no file matched
return 0;
}
As a big fan of the Path::Tiny module (I always have it installed and use it in every project), my solution would be:
use strict;
use warnings;
use Path::Tiny;
my $found = FindKit('mykit');
print "$found\n";
sub FindKit {
my($nametosearch) = @_;
my $datadir = path($0)->realpath->parent->child('data');
die "$datadir doesn't exist" unless -d $datadir;
for my $file ($datadir->children( qr /^K/ )) {
next if -z $file; #skip empty
my @lines = $file->lines;
return $1 if $lines[0] =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
}
return;
}
Some comments and open issues:
With Path::Tiny you can always use forward slashes in path names, regardless of the OS (UNIX/Windows); e.g. data/file works on Windows too.
AFAIK FindBin is considered broken, so the above uses $0 and realpath instead.
What if the kit is in multiple files? The above always returns the first match found.
my @lines = $file->lines; reads all lines, which is unnecessary, but for small files it is no big deal.
In reality this function returns the argument of Maakartikel, so a better name would probably be find_articel_by_kit or find_articel :)
It is easy to switch to UTF-8: just change $file->lines to $file->lines_utf8;

Extract and filter a range of lines from the input using Perl

I'm quite new to Perl and I have some problems in skipping lines using a foreach loop. I want to copy some lines of a text file to a new one.
When the first words of a line are FIRST ITERATION, skip two more lines and print everything following until the end of the file or an empty line is encountered.
I've tried to find out a similar post but nobody talks about working with text files.
This is the form I thought of
use 5.010;
use strict;
use warnings;
open( INPUT, "xxx.txt" ) or die("Could not open log file.");
open( OUT, ">>yyy.txt" );
foreach my $line (<INPUT>) {
if ( $line =~ m/^FIRST ITERATION/ ) {
# print OUT
}
}
close(OUT);
close(INFO);
I tried using next and $line++ but my program prints only the line that begins with FIRST ITERATION.
I may try to use a for loop but I don't know how many lines my file may have, nor do I know how many lines there are between "First Iteration" and the next empty line.
The simplest way is to process the file a line at a time and keep a state flag. The flag is set to 1 if the current line begins with FIRST ITERATION and to 0 if the line is blank; otherwise, if it is already positive, it is incremented, so that it holds the line number within the current block.
This solution expects the path to the input file as a parameter on the command line and prints its output to STDOUT, so you will need to redirect the output to a file on the command line as necessary.
Note that the regex pattern /\S/ checks whether there is a non-blank character anywhere in the current line, so not /\S/ is true if the line is empty or consists only of blank characters.
use strict;
use warnings;
my $lines = 0;
while ( <> ) {
if ( /^FIRST ITERATION/ ) {
$lines = 1;
}
elsif ( not /\S/ ) {
$lines = 0;
}
elsif ( $lines > 0 ) {
++$lines;
}
print if $lines > 3;
}
This can be simplified substantially by using Perl's built-in range operator, which keeps its own internal state and returns the number of times it has been evaluated. So the above may be written
use strict;
use warnings;
while ( <> ) {
my $s = /^FIRST ITERATION/ ... not /\S/;
print if $s and $s > 3;
}
And the last can be rewritten as a one-line command line program like this
$ perl -ne '$s = /^FIRST ITERATION/ ... not /\S/; print if $s and $s > 3' myfile.txt
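To see what the range operator actually returns, a small trace can help. This is a sketch over made-up input lines (the @input list is an assumption for illustration). In scalar context the operator yields an empty string while it is false, then sequence numbers starting at 1, and the last number in a range has E0 appended.

```perl
use strict;
use warnings;

# Made-up input, standing in for the lines of a log file.
my @input = (
    "header\n", "FIRST ITERATION\n", "skip1\n", "skip2\n",
    "foo\n", "bar\n", "\n", "after\n",
);

my @trace;
for (@input) {
    # Three dots: the right-hand test is not tried on the line that starts the range.
    my $s = /^FIRST ITERATION/ ... !/\S/;
    chomp( my $text = $_ );
    push @trace, "[$s] $text";
}
print "$_\n" for @trace;
```

The lines foo and bar get the numbers 4 and 5, which is why the test $s > 3 selects them, and the closing blank line gets 6E0.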
Use an additional counter that tells you when to print a line. Something like this:
$skipCounter = 3;
And in foreach:
if ($skipCounter == 2) {
# print OUT
}
if ( $line =~ m/^FIRST ITERATION/) {
$skipCounter = 0;
}
$skipCounter++;
Advice: Use STDIN and STDOUT instead of files; this allows you to change them without modifying the script.
Code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
open(INPUT, "xxx.txt" ) or die "Could not open log file: $!.";
open(OUT, ">yyy.txt") or die "Could not open output file: $!";
while( my $line = <INPUT> )
{
if ( $line =~ m/^FIRST ITERATION/) {
<INPUT>; # skip line
<INPUT>; # skip line
while( $line = <INPUT>) # print till empty line
{
last if $line eq "\n";
print OUT $line;
}
};
};
close (OUT);
close (INPUT);
You're on the right track. What you need to use is the flip-flop operator (which is basically the range operator) ... It will toggle for you between two matches, so you get everything in between. After that, it's a matter of keeping track of the lines you want to skip.
So basically we are checking for FIRST ITERATION and for an empty line, and grab everything in between those. $skip is used to remember how many lines were skipped. It starts at 0 and gets incremented for the first two lines after we start being in the flip-flop if block. In the else case, where we are after the flip-flop, it gets reset to 0 so we can start over with the next block.
Since you know how to open and write files, I'll skip that.
use strict;
use warnings;
my $skip = 0;
while (<DATA>) {
if (/^FIRST ITERATION$/ .. /^$/) {
next if $skip++ <= 2;
print $_;
} else {
$skip = 0;
}
}
__DATA__
FIRST ITERATION
skip1
skip2
foo
bar
baz

don't print this
The output of this is:
foo
bar
baz
To stick with your own code, here's a very verbose solution that uses a foreach and no flip-flop. It does the same thing, just with a lot more words.
my $skip = 0; # skip lines
my $match = 0; # keep track of if we're in between the borders
foreach my $line (<DATA>) {
if ( $line =~ m/^FIRST ITERATION/ ) {
$match = 1; # we are inside the match
next;
}
if ($line =~ m/^$/) {
$match = 0; # we are done matching
next;
}
if ($match) {
$skip++; # count skip-lines
if ($skip <= 2) {
next; # ... and skip the first two
}
print $line; # this is the content we want
}
}
Using paragraph mode (which returns blocks separated by blank lines rather than lines):
local $/ = ""; # Paragraph mode.
while (<>) {
s/\n\K\n+//; # Get rid of trailing empty lines.
print /^FIRST ITERATION\n[^\n]*\n[^\n]*\n(.*)/ms;
}
Using the flip-flop operator:
while (<>) {
if (my $line_num = /^FIRST ITERATION$/ .. /^$/) {
print if $line_num > 3 && $line_num !~ /E0/;
}
}
$line_num !~ /E0/ is true when the flip-flop is flopping (i.e. for the first empty line after FIRST ITERATION). This is checked to avoid printing the blank line.

eof in perl- going in infinite loop

When I am checking for eof condition for being false, it's going in infinite loop.
Here is what I am doing.
tie my @lines, 'Tie::File' , "abc.txt" or die;
for (my $i=0; ; $i++) {
# if I put if(!eof())condition instead of the below one,
# it goes in infinite loop. May I know why?
if ($lines[$i] =~ /^[\s]*hello/) {
splice @lines, $i, 1, "#$lines[$i]", "hi";
last;
}
#Below eof condition is working
if(eof){
last;
}
}
An eof without an argument uses the last file read. Using eof() with empty parentheses is very different. It refers to the pseudo file formed from the files listed on the command line and accessed via the <> operator.
EDIT:
Normally you don't need to use eof, because the input operators typically return undef when they run out of data, or if there was an error.
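As a small illustration of that point, here is a sketch that reads every line without calling eof at all. It uses an in-memory filehandle (open on a reference to a scalar) so that it runs without any file on disk; the sample text is made up.

```perl
use strict;
use warnings;

# Made-up content; opening a reference to a scalar gives an in-memory filehandle.
my $text = "one\ntwo\nthree\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $count = 0;
while ( my $line = <$fh> ) {    # the loop ends when readline returns undef
    $count++;
}
close $fh;

print "read $count lines\n";
```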
CODE:
Maybe the problem is that you don't use the second condition of the for loop.
tie my @lines, 'Tie::File' , "abc.txt" or die;
for(my $i=0;$i<=$#lines; $i++) {
if ($lines[$i] =~ /^[\s]*hello/){
splice @lines, $i, 1, "#$lines[$i]", "hi";
last;
}
}
Tie::File presents a file as an array, and I would be very wary of using eof (and tell and seek for that matter) on a tied array. The module's documentation says nothing about these functions and you shouldn't assume anything as, when you write $lines[$i] the data may be returned from a cache, and be completely different from the last line read from the file.
I would make use of the properties of the array, and write simply
last if $i == $#lines
which is guaranteed to work because the size of the tied array varies consistently with the number of lines in the file. In fact the whole loop could be written better by putting the test on the value of $i in the for loop, like this.
for (my $i = 0; $i < @lines; ++$i) {
if ($lines[$i] =~ /^[\s]*hello/) {
splice @lines, $i, 1, "#$lines[$i]", "hi";
last;
}
}
or, perhaps using List::MoreUtils:
use strict;
use warnings;
use List::MoreUtils 'first_index';
use Tie::File;
tie my @lines, 'Tie::File' , "abc.txt" or die $!;
my $i = first_index { /^\s*hello/ } @lines;
splice @lines, $i, 1, "#$lines[$i]", "hi" if $i >= 0;

Perl if equals sign

I need to detect if the first character in a file is an equals sign (=) and display the line number. How should I write the if statement?
$i=0;
while (<INPUT>) {
my($line) = $_;
chomp($line);
$findChar = substr $_, 0, 1;
if($findChar == "=")
$output = "$i\n";
print OUTPUT $output;
$i++;
}
Idiomatic perl would use a regular expression (^ meaning beginning of line) plus one of the dreaded builtin variables which happens to mean "line in file":
while (<INPUT>) {
print "$.\n" if /^=/;
}
See also perldoc -v '$.'
Use $findChar eq "=". In Perl:
== and != are numeric comparisons. They will convert both operands to a number.
eq and ne are string comparisons. They will convert both operands to a string.
Yes, this is confusing. Yes, I still write == when I mean eq ALL THE TIME. Yes, it takes me forever to spot my mistake too.
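A quick way to see the difference is to compare the same two strings both ways. This is only a demonstration sketch; the variable names are illustrative. Under numeric comparison both "a" and "=" convert to 0, so == reports a match where eq does not.

```perl
use strict;
use warnings;

my $first_char = "a";

# String comparison: "a" and "=" are different strings.
my $string_match = ( $first_char eq "=" ) ? 1 : 0;

# Numeric comparison: both strings convert to 0, so the test passes.
my $numeric_match;
{
    no warnings 'numeric';    # silence the "isn't numeric" warnings for this demo
    $numeric_match = ( $first_char == "=" ) ? 1 : 0;
}

print "eq: $string_match\n";
print "==: $numeric_match\n";
```

With warnings enabled and not silenced, Perl reports the non-numeric arguments, which is one more reason to turn warnings on.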
It looks like you are not using strict and warnings. Use them, especially since you do not know Perl. You might also want to add diagnostics to the list of must-use pragmas.
You are keeping track of the input line number in a separate variable $i. Perl has various builtin variables documented in perlvar. Some of these, such as $., are very useful; use them.
You are using my($line) = $_; in the body of the while loop. Instead, avoid $_ and assign to $line directly as in while ( my $line = <$input> ).
Note that bareword filehandles such as INPUT are package global. With the exception of the DATA filehandle, you are better off using lexical filehandles to properly limit the scope of your filehandles.
In your posts, include sample data in the __DATA__ section so others can copy, paste and run your code without further work.
With these comments in mind, you can print all lines that do not start with = using:
#!/usr/bin/perl
use strict; use warnings;
while (my $line = <DATA> ) {
my $first_char = substr $line, 0, 1;
if ( $first_char ne '=' ) {
print "$.:$first_char\n";
}
}
__DATA__
=
=
a
=
+
However, I would be inclined to write:
while (my $line = <DATA> ) {
# this will skip blank lines
if ( my ($first_char) = $line =~ /^(.)/ ) {
print "$.:$first_char\n" unless $first_char eq '=';
}
}

How can I write Perl that doesn't look like C?

My co-workers complain that my Perl looks too much like C, which is natural since I program in C most of the time, and Perl just a bit. Here's my latest effort. I'm interested in Perl that is easy to understand. I'm a bit of a Perl critic, and have little tolerance for cryptic Perl. But with readability in mind, how could the following code be more Perlish?
Its goal is to do a traffic analysis and find which IP addresses are within the ranges given in the file "ips". Here's my effort:
#!/usr/bin/perl -w
# Process the files named in the arguments, which will contain lists of IP addresses, and see if
# any of them are in the ranges spelled out in the local file "ip", which has contents of the
# form start-dotted-quad-ip-address,end-dotted-quad-ip_address,stuff_to_be_ignored
use English;
open(IPS,"ips") or die "Can't open 'ips' $OS_ERROR";
# Increment a dotted-quad ip address
# Ignore the fact that part1 could get erroneously large.
sub increment {
$ip = shift;
my ($part_1, $part_2, $part_3, $part_4) = split (/\./, $ip);
$part_4++;
if ( $part_4 > 255 ) {
$part_4 = 0;
($part_3++);
if ( $part_3 > 255 ) {
$part_3 = 0;
($part_2++);
if ( $part_2 > 255 ) {
$part_2 = 0;
($part_1++);
}
}
}
return ("$part_1.$part_2.$part_3.$part_4");
}
# Compare two dotted-quad ip addresses.
sub is_less_than {
$left = shift;
$right = shift;
my ($left_part_1, $left_part_2, $left_part_3, $left_part_4) = split (/\./, $left);
my ($right_part_1, $right_part_2, $right_part_3, $right_part_4) = split (/\./, $right);
if ($left_part_1 != $right_part_1 ) {
return ($left_part_1 < $right_part_1);
}
if ($left_part_2 != $right_part_2 ) {
return ($left_part_2 < $right_part_2);
}
if ($left_part_3 != $right_part_3 ) {
return ($left_part_3 < $right_part_3);
}
if ($left_part_4 != $right_part_4 ) {
return ($left_part_4 < $right_part_4);
}
return (false); # They're equal
}
my %addresses;
# Parse all the ip addresses and record them in a hash.
while (<IPS>) {
my ($ip, $end_ip, $junk) = split /,/;
while (is_less_than($ip, $end_ip) ) {
$addresses{$ip}=1;
$ip = increment($ip);
}
}
# print IP addresses in any of the found ranges
foreach (@ARGV) {
open(TRAFFIC, $_) or die "Can't open $_ $OS_ERROR";
while (<TRAFFIC> ) {
chomp;
if (defined $addresses{$_}) {
print "$_\n";
}
}
close (TRAFFIC);
}
From years of seeing Perl code written by C programmers, here's some generic advice:
Use hashes. Use lists. USE HASHES! USE LISTS! Use list operations (map, grep, split, join), especially for small loops. Don't use fancy list algorithms; pop, splice, push, shift and unshift are cheaper. Don't use trees; hashes are cheaper. Hashes are cheap, make them, use them and throw them out! Use the iterator for loop, not the 3-arg one. Don't call things $var1, $var2, $var3; use a list instead. Don't call things $var_foo, $var_bar, $var_baz; use a hash instead. Use $foo ||= "default". Don't use $_ if you have to type it.
Don't use prototypes, IT'S A TRAP!!
Use regexes, not substr() or index(). Love regexes. Use the /x modifier to make them readable.
Write statement if $foo when you want a block-less conditional. There's almost always a better way to write a nested condition: try recursion, try a loop, try a hash.
Declare variables when you need them, not at the top of the subroutine. use strict. use warnings, and fix them all. use diagnostics. Write tests. Write POD.
Use CPAN. Use CPAN! USE CPAN! Someone's probably already done it, better.
Run perlcritic. Run it with --brutal just for kicks. Run perltidy. Think about why you do everything. Change your style.
Use the time not spent fighting the language and debugging memory allocation to improve your code.
Ask questions. Take style commentary on your code graciously. Go to a Perl Mongers meeting. Go onto perlmonks.org. Go to YAPC or a Perl Workshop. Your Perl knowledge will grow by leaps and bounds.
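A few of these tips can be sketched in a handful of lines. The data and the option names below are made up purely for illustration.

```perl
use strict;
use warnings;

# One array instead of $part_1 .. $part_4.
my @parts = split /\./, '192.168.0.300';

# grep replaces a hand-written filtering loop.
my @bad = grep { $_ > 255 } @parts;

# A hash instead of $opt_sep, $opt_verbose, ...
my %opt = ( verbose => 0 );
$opt{sep} ||= '.';    # use the default unless a true value is already set

# Iterator-style for loop instead of the three-argument form.
for my $part (@parts) {
    print "octet: $part\n" if $opt{verbose};
}

print join( $opt{sep}, @parts ), " has ", scalar(@bad), " out-of-range octet(s)\n";
```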
Most of writing code to be "Perlish" would be taking advantage of the built-in functions in Perl.
For instance, this:
my ($part_1, $part_2, $part_3, $part_4) = split (/\./, $ip);
$part_4++;
if ( $part_4 > 255 ) {
$part_4 = 0;
($part_3++);
if ( $part_3 > 255 ) {
$part_3 = 0;
($part_2++);
if ( $part_2 > 255 ) {
$part_2 = 0;
($part_1++);
}
}
}
I would rewrite something like:
my @parts = split (/\./, $ip);
foreach my $part(reverse @parts){
$part++;
last unless ($part > 255 && !($part = 0));
}
That does what your code posted above does but is a little cleaner.
Are you sure the code does what you want though? Just to me it looks a little strange that you only move to the previous 'part' of the IP if the one after it is > 255.
Sometimes the most Perlish thing to do is to turn to CPAN instead of writing any code at all.
Here is a quick and dirty example using Net::CIDR::Lite and Net::IP::Match::Regexp:
#!/path/to/perl
use strict;
use warnings;
use English;
use IO::File;
use Net::CIDR::Lite;
use Net::IP::Match::Regexp qw(create_iprange_regexp match_ip);
my $cidr = Net::CIDR::Lite->new();
my $ips_fh = IO::File->new();
$ips_fh->open("ips") or die "Can't open 'ips': $OS_ERROR";
while (my $line = <$ips_fh>) {
chomp $line;
my ($start, $end) = split /,/, $line;
my $range = join('-', $start, $end);
$cidr->add_range($range);
}
$ips_fh->close();
my $regexp = create_iprange_regexp($cidr->list());
foreach my $traffic_fn (@ARGV) {
my $traffic_fh = IO::File->new();
$traffic_fh->open($traffic_fn) or die "Can't open '$traffic_fn': $OS_ERROR";
while (my $ip_address = <$traffic_fh>) {
chomp $ip_address;
if (match_ip($ip_address, $regexp)) {
print $ip_address, "\n";
}
}
$traffic_fh->close();
}
DISCLAIMER: I just banged that out, it's had minimal testing and no benchmarking. Sanity checks, error handling and comments omitted to keep the line count down. I didn't scrimp on the whitespace, though.
As for your code: There is no need to define your functions before you use them.
Another example rewrite:
sub is_less_than {
my $left = shift; # I'm sure you just "forgot" to put the my() here...
my $right = shift;
my ($left_part_1, $left_part_2, $left_part_3, $left_part_4) = split (/\./, $left);
my ($right_part_1, $right_part_2, $right_part_3, $right_part_4) = split (/\./, $right);
if ($left_part_1 != $right_part_1 ) {
return ($left_part_1 < $right_part_1);
}
if ($left_part_2 != $right_part_2 ) {
return ($left_part_2 < $right_part_2);
}
if ($left_part_3 != $right_part_3 ) {
return ($left_part_3 < $right_part_3);
}
if ($left_part_4 != $right_part_4 ) {
return ($left_part_4 < $right_part_4);
}
return (false); # They're equal
}
To this:
sub is_less_than {
my @left = split(/\./, shift);
my @right = split(/\./, shift);
# one way to do it...
for(0 .. 3) {
if($left[$_] != $right[$_]) {
return $left[$_] < $right[$_];
}
}
# another way to do it - let's avoid so much indentation...
for(0 .. 3) {
return $left[$_] < $right[$_] if $left[$_] != $right[$_];
}
# yet another way to do it - classic Perl unreadable one-liner...
$left[$_] == $right[$_] or return $left[$_] < $right[$_] for 0 .. 3;
# just a note - that last one uses the short-circuit logic to condense
# the if() statement to one line, so the for() can be added on the end.
# Perl doesn't allow things like do_this() if(cond) for(0 .. 3); You
# can only postfix one conditional. This is a workaround. Always use
# 'and' or 'or' in these spots, because they have the lowest precedence.
return 0 == 1; # false is not a keyword, or a boolean value.
# though honestly, it wouldn't hurt to just return 0 or "" or undef()
}
Also, here:
my ($ip, $end_ip, $junk) = split /,/;
$junk might need to be @junk to capture all the junk, or you can probably leave it off - if you assign an unknown-sized array to an "array" of two elements, it will silently discard all the extra stuff. So
my($ip, $end_ip) = split /,/;
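A short check of that behaviour, using a made-up input line in the format the question describes:

```perl
use strict;
use warnings;

# Hypothetical line: start address, end address, then fields to ignore.
my $line = "10.0.0.1,10.0.0.9,stuff,to,ignore\n";

# Only the first two fields are kept; the rest is silently dropped.
my ( $ip, $end_ip ) = split /,/, $line;

print "$ip to $end_ip\n";
```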
And here:
foreach (@ARGV) {
open(TRAFFIC, $_) or die "Can't open $_ $OS_ERROR";
while (<TRAFFIC> ) {
chomp;
if (defined $addresses{$_}) {
print "$_\n";
}
}
close (TRAFFIC);
}
Instead of TRAFFIC, use a variable to store the filehandle. Also, in general, you should use exists() to check if a hash element exists, rather than defined() - it might exist but be set to undef (this shouldn't happen in your program, but it's a nice habit for when your program gets more complicated):
foreach (@ARGV) {
open(my $traffic, $_) or die "Can't open $_ $OS_ERROR";
while (<$traffic> ) {
chomp;
print "$_\n" if exists $addresses{$_};
}
# $traffic goes out of scope, and implicitly closes
}
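The difference between exists and defined only shows up when a key is present with an undefined value. A minimal sketch with made-up addresses:

```perl
use strict;
use warnings;

# Hypothetical data: the second key is present but its value is undef.
my %addresses = ( '10.0.0.1' => 1, '10.0.0.2' => undef );

for my $ip ( '10.0.0.1', '10.0.0.2', '10.0.0.3' ) {
    my $exists  = exists $addresses{$ip}  ? 'yes' : 'no';
    my $defined = defined $addresses{$ip} ? 'yes' : 'no';
    print "$ip exists=$exists defined=$defined\n";
}
```

Only 10.0.0.2 gives different answers to the two tests: it exists in the hash but its value is not defined.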
Of course, you could also use Perl's wonderful <> operator, which opens each element of @ARGV for reading, and acts as a filehandle that iterates through them:
while(<>) {
chomp;
print "$_\n" if exists $addresses{$_};
}
As has been noted before, try to avoid using the English module unless you load it as use English qw( -no_match_vars ); to avoid the significant performance penalty of those evil match_vars in there. And as hasn't been noted yet, but should be...
ALWAYS ALWAYS ALWAYS always use strict; and use warnings; or else Larry Wall will descend from heaven and break your code. I see you have -w - this is enough, because even off of Unix, Perl parses the shebang line, and will find your -w and will use warnings; like it should. However, you need to use strict;. This will catch a lot of serious errors in your code, like not declaring variables with my or using false as a language keyword.
Making your code work under strict as well as warnings will result in MUCH cleaner code that never breaks for reasons you can't figure out. You'll spend hours at the debugger debugging and you'll probably end up using strict and warnings anyway just to figure out what the errors are. Only remove them if (and only if) your code is finished and you're releasing it and it never generates any errors.
While doing this certainly is one way to do it in Perl.
use strict;
use warnings;
my $new_ip;
{
my @parts = split ('\.', $ip);
foreach my $part(reverse @parts){
$part++;
if( $part > 255 ){
$part = 0;
next;
}else{
last;
}
}
$new_ip = join '.', reverse @parts;
}
This is how I would actually implement it.
use NetAddr::IP;
my $new_ip = ''.(NetAddr::IP->new($ip,0) + 1) or die;
I can't say that this solution will make your program more Perl-ish, but it might simplify your algorithm.
Rather than treating an IP address as a dotted-quad, base-256 number which needs the nested-if structure to implement the increment function, consider an IP address to be a 32-bit integer. Convert an IP of the form a.b.c.d into an integer with this (not tested):
sub ip2int {
my $ip = shift;
if ($ip =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/) {
return ($1 << 24) + ($2 << 16) + ($3 << 8) + $4;
} else {
return undef;
}
}
Now it's easy to determine if an IP falls between two endpoint IPs. Just do simple integer arithmetic and comparisons.
$begin = "192.168.5.0";
$end = "192.168.10.255";
$target = "192.168.6.2";
if (ip2int($target) >= ip2int($begin) && ip2int($target) <= ip2int($end)) {
print "$target is between $begin and $end\n";
} else {
print "$target is not in range\n";
}
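For a version that can be run and checked as is, here is a sketch of the same idea. It swaps the shifts for pack and unpack, which is a different technique from the one shown above, and the helper name ip_in_range is hypothetical.

```perl
use strict;
use warnings;

# Convert a dotted quad to a 32-bit integer; returns undef for malformed input.
sub ip2int {
    my ($ip) = @_;
    my @octets = $ip =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/ or return;
    return if grep { $_ > 255 } @octets;
    # Pack four bytes, then read them back as one big-endian unsigned integer.
    return unpack 'N', pack 'C4', @octets;
}

# Hypothetical helper: true if $target lies between $begin and $end inclusive.
sub ip_in_range {
    my ( $target, $begin, $end ) = @_;
    my $t  = ip2int($target);
    my $lo = ip2int($begin);
    my $hi = ip2int($end);
    return 0 unless defined $t && defined $lo && defined $hi;
    return ( $t >= $lo && $t <= $hi ) ? 1 : 0;
}

for my $target ( '192.168.6.2', '192.168.11.1' ) {
    my $verdict = ip_in_range( $target, '192.168.5.0', '192.168.10.255' ) ? 'in' : 'out of';
    print "$target is $verdict range\n";
}
```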
Tell your coworkers that their perl looks too much like line noise. Please don't obfuscate your code just for the sake of obfuscation - it's asinine development goals like that which give perl such a bad reputation for being unreadable, when it's really bad programmers (apparently, like your coworkers) who write sloppy code. Nicely structured, indented, and logical code is a good thing. C is a good thing.
Seriously, though - the best place to figure out how to write perl is in the O'Reilly "Perl Best Practices", by Damian Conway. It tells you how he thinks you should do things, and he always gives good reasons for his position as well as occasionally giving good reasons to disagree. I do disagree with him on some points, but his reasoning is sound. The odds that you work with anyone who knows perl better than Mr. Conway are pretty slim, and having a printed book (or at least a Safari subscription) gives you some more solid backing for your arguments. Pick up a copy of the Perl Cookbook while you're at it, as looking at code examples for solving common problems should get you on the right track. I hate to say "buy the book", but those are exceptionally good books that any perl developer should read.
With regards to your specific code, you're using foreach, $_, split with no parens, shift, etc. It looks plenty perl-ish to my eyes - which have been developing with perl for quite a while. One note, though - I hate the English module. If you must use it, do it like use English qw( -no_match_vars );. The match_vars option slows down regexp parsing measurably, and the $PREMATCH / $POSTMATCH variables it provides aren't usually useful.
There is only one piece of advice: use strict. The rest is hardly relevant.
I know exactly how you feel. My first language was FORTRAN and like a good FORTRAN programmer, I wrote FORTRAN in every language since :).
I have this really wonderful book Effective Perl Programming that I keep re-reading every now and then, especially a chapter called "Idiomatic Perl". Here are a few things I use to keep my Perl looking like Perl: list operators like for, map and grep, slices and hash slices, and the quote operators.
Another thing that keeps my Perl from looking like FORTRAN/C is a regular reading of module sources especially those of the masters.
You could use Acme::Bleach or Acme::Morse
While this would work:
use strict;
use warnings;
use 5.010;
use NetAddr::IP;
my %addresses;
# Parse all the ip addresses and record them in a hash.
{
open( my $ips_file, '<', 'ips') or die;
local $_; # or my $_ on Perl 5.10 or later
while( my $line = <$ips_file> ){
my ($ip, $end_ip) = split ',', $line;
next unless $ip and $end_ip;
$ip = NetAddr::IP->new( $ip, 0 ) or die;
$end_ip = NetAddr::IP->new( $end_ip ) or die;
while( $ip <= $end_ip ){
$addresses{$ip->addr} = 1;
$ip++;
}
}
close $ips_file
}
# print IP addresses in any of the found ranges
use English;
for my $arg (@ARGV) {
open(my $traffic, '<',$arg) or die "Can't open $arg $OS_ERROR";
while( my $ip = <$traffic> ){
chomp $ip;
if( $addresses{$ip} ){
say $ip
}
}
close ($traffic);
}
I would if possible use netmasks, because it gets even simpler:
use Modern::Perl;
use NetAddr::IP;
my #addresses;
{
open( my $file, '<', 'ips') or die;
while( my $ip = <$file> ){
chomp $ip;
$ip =~ s/,.*//; # keep only the part before the first comma, if any
next unless $ip;
$ip = NetAddr::IP->new( $ip ) or die;
push @addresses, $ip;
}
close $file
}
for my $filename (@ARGV) {
open( my $traffic, '<', $filename )
or die "Can't open $filename";
while( my $ip = <$traffic> ) {
chomp $ip;
next unless $ip;
$ip = NetAddr::IP->new($ip) or next; # skip line on error
my @match;
for my $cmp ( @addresses ){
if( $ip->within($cmp) ){
push @match, $cmp;
#last;
}
}
say "$ip => @match" if @match;
say "# no match for $ip" unless @match;
}
close ($traffic);
}
Test ips file:
192.168.0.1/24
192.168.0.0
0:0:0:0:0:0:C0A8:0/128
Test traffic file:
192.168.1.0
192.168.0.0
192.168.0.5
Output:
# no match for 192.168.1.0/32
192.168.0.0/32 => 192.168.0.1/24 192.168.0.0/32 0:0:0:0:0:0:C0A8:0/128
192.168.0.5/32 => 192.168.0.1/24
Instead of doing this:
if ($left_part_1 != $right_part_1 ) {
return ($left_part_1 < $right_part_1);
}
you could do this:
return $left_part_1 < $right_part_1 if($left_part_1 != $right_part_1);
Also, you could use the Fatal module (or its successor, autodie) so that failed calls such as open die on their own and you don't have to check each one for errors.
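As a rough sketch of that idea, here is the same pattern using autodie, the lexical replacement for Fatal. The scratch file and its contents are hypothetical, created only so the example runs on its own:

```perl
use strict;
use warnings;
use autodie;                       # open and close now die on failure
use File::Temp qw( tempfile );

# Hypothetical scratch file, created only for this demonstration.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "192.168.0.1,192.168.0.9\n";
close $tmp_fh;

open( my $in, '<', $tmp_name );    # no "or die" needed
my $first_line = <$in>;
close $in;
chomp $first_line;
print "$first_line\n";

# A failing open raises an exception that eval can catch.
my $ok = eval { open( my $missing, '<', "$tmp_name.does-not-exist" ); 1 };
print "open failed as expected: ", ( $ok ? "no" : "yes" ), "\n";
```

Note that print is not among the calls autodie wraps by default, so write errors still need their own check if you care about them.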
The only criterion I use for "how my code looks" is how easy it is to read and understand the purpose of the code (especially by programmers unfamiliar with Perl), not whether it follows a particular style.
If a Perl language feature makes some logic easier to understand then I use it, if not I don't - even if it can do it in less code.
Your co-workers may think my code is extremely "un-Perl-ish", but I'll bet they would understand exactly what the code is doing and could modify it to fix or extend it without any trouble:
my version:
#******************************************************************************
# Load the allowable ranges into a hash
#******************************************************************************
my %ipRanges = loadIPAddressFile("../conf/ip.cfg");
#*****************************************************************************
# Get the IP to check on the command line
#*****************************************************************************
my ( $in_ip_address ) = @ARGV;
# Convert it to number for comparison
my $ipToCheckNum = 1 * sprintf("%03d%03d%03d%03d", split(/\./, $in_ip_address));
#*****************************************************************************
# Loop through the ranges and see if the number is in any of them
#*****************************************************************************
my $startIp;
my $endIp;
my $msg = "IP [$in_ip_address] is not in range.\n";
foreach $startIp (keys(%ipRanges))
{
$endIp = $ipRanges{$startIp};
if ( $startIp <= $ipToCheckNum and $endIp >= $ipToCheckNum )
{
$msg = "IP [$in_ip_address] is in range [$startIp] to [$endIp]\n";
}
}
print $msg;
#******************************************************************************
# Function: loadIPAddressFile()
# Author: Ron Savage
# Date: 04/10/2009
#
# Description:
# loads the allowable IP address ranges into a hash from the specified file.
# Hash key is the starting value of the range, value is the end of the range.
#******************************************************************************
sub loadIPAddressFile
{
my $ipFileHandle;
my $startIP;
my $endIP;
my $startIPnum;
my $endIPnum;
my %rangeList;
#***************************************************************************
# Get the arguments sent
#***************************************************************************
my ( $ipFile ) = @_;
if ( open($ipFileHandle, "< $ipFile") )
{
while (<$ipFileHandle>)
{
( $startIP, $endIP ) = split(/\,/, $_ );
# Convert them to numbers for comparison
$startIPnum = 1 * sprintf("%03d%03d%03d%03d", split(/\./, $startIP));
$endIPnum = 1 * sprintf("%03d%03d%03d%03d", split(/\./, $endIP));
$rangeList{$startIPnum} = $endIPnum;
}
close($ipFileHandle);
}
else
{
print "Couldn't open [$ipFile].\n";
}
return(%rangeList);
}
(Note: the extra "#" lines are in there to preserve my freakin' spacing, which always gets whacked when posting code here)
Am I missing something... will any of the above array versions work? The mods are performed on variables local to the for loop. I think Brad Gilbert's Net::IP solution would be my choice. Chris Lutz pretty much cleaned the rest the way I would've.
As an aside - some of the comments about readability strike me as curious. Are there fewer [vigorous] complaints about the readability of Erlang/Lisp syntax because there is ONLY ONE way to write code in them?
This is probably more like C, but it is also simpler:
use Socket qw(inet_aton inet_ntoa);
my $ip = ("192.156.255.255");
my $ip_1 = inet_ntoa(pack("N", unpack("N", inet_aton($ip))+1));
print "$ip $ip_1\n";
Update: I posted this before reading all of the code in the question. The code here just does the incrementing of the ip address.
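For completeness, the same packing trick can probably be stretched to cover the range check itself. This is only a sketch under my own assumptions: ip_to_int and ip_in_range are hypothetical helper names, it handles IPv4 dotted quads only, and the addresses are sample values:

```perl
use strict;
use warnings;
use Socket qw( inet_aton );

# Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
# (Hypothetical helper; returns undef if the address cannot be parsed.)
sub ip_to_int {
    my ($ip) = @_;
    my $packed = inet_aton($ip);
    return defined $packed ? unpack( 'N', $packed ) : undef;
}

# True if $ip lies between $start and $end, inclusive.
sub ip_in_range {
    my ( $ip, $start, $end ) = @_;
    my ( $n, $lo, $hi ) = map { ip_to_int($_) } $ip, $start, $end;
    return 0 if grep { !defined } $n, $lo, $hi;
    return ( $lo <= $n && $n <= $hi ) ? 1 : 0;
}

print ip_in_range( '192.168.0.5', '192.168.0.1', '192.168.0.9' ), "\n";
print ip_in_range( '192.168.1.0', '192.168.0.1', '192.168.0.9' ), "\n";
```

Comparing the unpacked integers avoids building zero-padded strings, at the cost of being IPv4-only.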