How can I write Perl that doesn't look like C? - perl

My co-workers complain that my Perl looks too much like C, which is natural since I program in C most of the time, and Perl just a bit. Here's my latest effort. I'm interest in Perl that is easy to understand. I'm a bit of a Perl critic, and have little tolerance for cryptic Perl. But with readability in mind, how could the following code be more Perlish?
It's goal is to do a traffic analysis and find which IP addresses are within the ranges given in the file "ips". Here's my effort:
#!/usr/bin/perl -w
# Process the files named in the arguments, which will contain lists of IP addresses, and see if
# any of them are in the ranges spelled out in the local file "ip", which has contents of the
# form start-dotted-quad-ip-address,end-dotted-quad-ip_address,stuff_to_be_ignored
use English;
open(IPS,"ips") or die "Can't open 'ips' $OS_ERROR";
# Increment a dotted-quad ip address
# Ignore the fact that part1 could get erroneously large.
sub increment {
$ip = shift;
my ($part_1, $part_2, $part_3, $part_4) = split (/\./, $ip);
$part_4++;
if ( $part_4 > 255 ) {
$part_4 = 0;
($part_3++);
if ( $part_3 > 255 ) {
$part_3 = 0;
($part_2++);
if ( $part_2 > 255 ) {
$part_2 = 0;
($part_1++);
}
}
}
return ("$part_1.$part_2.$part_3.$part_4");
}
# Compare two dotted-quad ip addresses.
sub is_less_than {
$left = shift;
$right = shift;
my ($left_part_1, $left_part_2, $left_part_3, $left_part_4) = split (/\./, $left);
my ($right_part_1, $right_part_2, $right_part_3, $right_part_4) = split (/\./, $right);
if ($left_part_1 != $right_part_1 ) {
return ($left_part_1 < $right_part_1);
}
if ($left_part_2 != $right_part_2 ) {
return ($left_part_2 < $right_part_2);
}
if ($left_part_3 != $right_part_3 ) {
return ($left_part_3 < $right_part_3);
}
if ($left_part_4 != $right_part_4 ) {
return ($left_part_4 < $right_part_4);
}
return (false); # They're equal
}
my %addresses;
# Parse all the ip addresses and record them in a hash.
while (<IPS>) {
my ($ip, $end_ip, $junk) = split /,/;
while (is_less_than($ip, $end_ip) ) {
$addresses{$ip}=1;
$ip = increment($ip);
}
}
# print IP addresses in any of the found ranges
foreach (#ARGV) {
open(TRAFFIC, $_) or die "Can't open $_ $OS_ERROR";
while (<TRAFFIC> ) {
chomp;
if (defined $addresses{$_}) {
print "$_\n";
}
}
close (TRAFFIC);
}

From years of seeing Perl code written by C programmers, here's some generic advice:
Use hashes. Use lists. USE HASHES! USE LISTS! Use list operations (map, grep, split, join), especially for small loops. Don't use fancy list algorithms; pop, splice, push, shift and unshift are cheaper. Don't use trees; hashes are cheaper. Hashes are cheap, make them, use them and throw them out! Use the iterator for loop, not the 3-arg one. Don't call things $var1, $var2, $var3; use a list instead. Don't call things $var_foo, $var_bar, $var_baz; use a hash instead. Use $foo ||= "default". Don't use $_ if you have to type it.
Don't use prototypes, IT'S A TRAP!!
Use regexes, not substr() or index(). Love regexes. Use the /x modifier to make them readable.
Write statement if $foo when you want a block-less conditional. There's almost always a better way to write a nested condition: try recursion, try a loop, try a hash.
Declare variables when you need them, not at the top of the subroutine. use strict. use warnings, and fix them all. use diagnostics. Write tests. Write POD.
Use CPAN. Use CPAN! USE CPAN! Someone's probably already done it, better.
Run perlcritic. Run it with --brutal just for kicks. Run perltidy. Think about why you do everything. Change your style.
Use the time not spent fighting the language and debugging memory allocation to improve your code.
Ask questions. Take style commentary on your code graciously. Go to a Perl Mongers meeting. Go onto perlmonks.org. Go to YAPC or a Perl Workshop. Your Perl knowledge will grow by leaps and bounds.

Most of writing code to be "Perlish" would be taking advantage of the built-in functions in Perl.
For instance, this:
my ($part_1, $part_2, $part_3, $part_4) = split (/\./, $ip);
$part_4++;
if ( $part_4 > 255 ) {
$part_4 = 0;
($part_3++);
if ( $part_3 > 255 ) {
$part_3 = 0;
($part_2++);
if ( $part_2 > 255 ) {
$part_2 = 0;
($part_1++);
}
}
}
I would rewrite something like:
my #parts = split (/\./, $ip);
foreach my $part(reverse #parts){
$part++;
last unless ($part > 255 && !($part = 0));
}
That does what your code posted above does but is a little cleaner.
Are you sure the code does what you want though? Just to me it looks a little strange that you only move to the previous 'part' of the IP if the one after it is > 255.

Sometimes the most Perlish thing to do is to turn to CPAN instead of writing any code at all.
Here is a quick and dirty example using Net::CIDR::Lite and Net::IP::Match::Regexp:
#!/path/to/perl
use strict;
use warnings;
use English;
use IO::File;
use Net::CIDR::Lite;
use Net::IP::Match::Regexp qw(create_iprange_regexp match_ip);
my $cidr = Net::CIDR::Lite->new();
my $ips_fh = IO::File->new();
$ips_fh->open("ips") or die "Can't open 'ips': $OS_ERROR";
while (my $line = <$ips_fh>) {
chomp $line;
my ($start, $end) = split /,/, $line;
my $range = join('-', $start, $end);
$cidr->add_range($range);
}
$ips_fh->close();
my $regexp = create_iprange_regexp($cidr->list());
foreach my $traffic_fn (#ARGV) {
my $traffic_fh = IO::File->new();
$traffic_fh->open($traffic_fn) or die "Can't open '$traffic_fh': $OS_ERROR";
while (my $ip_address = <$traffic_fh>) {
chomp $ip_address;
if (match_ip($ip_address, $regexp)) {
print $ip_address, "\n";
}
}
$traffic_fh->close();
}
DISCLAIMER: I just banged that out, it's had minimal testing and no benchmarking. Sanity checks, error handling and comments omitted to keep the line count down. I didn't scrimp on the whitespace, though.
As for your code: There is no need to define your functions before you use them.

Another example rewrite:
sub is_less_than {
my $left = shift; # I'm sure you just "forgot" to put the my() here...
my $right = shift;
my ($left_part_1, $left_part_2, $left_part_3, $left_part_4) = split (/\./, $left);
my ($right_part_1, $right_part_2, $right_part_3, $right_part_4) = split (/\./, $right);
if ($left_part_1 != $right_part_1 ) {
return ($left_part_1 < $right_part_1);
}
if ($left_part_2 != $right_part_2 ) {
return ($left_part_2 < $right_part_2);
}
if ($left_part_3 != $right_part_3 ) {
return ($left_part_3 < $right_part_3);
}
if ($left_part_4 != $right_part_4 ) {
return ($left_part_4 < $right_part_4);
}
return (false); # They're equal
}
To this:
sub is_less_than {
my #left = split(/\./, shift);
my #right = split(/\./, shift);
# one way to do it...
for(0 .. 3) {
if($left[$_] != $right[$_]) {
return $left[$_] < $right[$_];
}
}
# another way to do it - let's avoid so much indentation...
for(0 .. 3) {
return $left[$_] < $right[$_] if $left[$_] != $right[$_];
}
# yet another way to do it - classic Perl unreadable one-liner...
$left[$_] == $right[$_] or return $left[$_] < $right[$_] for 0 .. 3;
# just a note - that last one uses the short-circuit logic to condense
# the if() statement to one line, so the for() can be added on the end.
# Perl doesn't allow things like do_this() if(cond) for(0 .. 3); You
# can only postfix one conditional. This is a workaround. Always use
# 'and' or 'or' in these spots, because they have the lowest precedence.
return 0 == 1; # false is not a keyword, or a boolean value.
# though honestly, it wouldn't hurt to just return 0 or "" or undef()
}
Also, here:
my ($ip, $end_ip, $junk) = split /,/;
$junk might need to be #junk to capture all the junk, or you can probably leave it off - if you assign an unknown-sized array to an "array" of two elements, it will silently discard all the extra stuff. So
my($ip, $end_ip) = split /,/;
And here:
foreach (#ARGV) {
open(TRAFFIC, $_) or die "Can't open $_ $OS_ERROR";
while (<TRAFFIC> ) {
chomp;
if (defined $addresses{$_}) {
print "$_\n";
}
}
close (TRAFFIC);
}
Instead of TRAFFIC, use a variable to store the filehandle. Also, in general, you should use exists() to check if a hash element exists, rather than defined() - it might exist but be set to undef (this shouldn't happen in your program, but it's a nice habit for when your program gets more complicated):
foreach (#ARGV) {
open(my $traffic, $_) or die "Can't open $_ $OS_ERROR";
while (<$traffic> ) {
chomp;
print "$_\n" if exists $addresses{$_};
}
# $traffic goes out of scope, and implicitly closes
}
Of course, you could also use Perl's wonderful <> operator, which opens each element of #ARGV for reading, and acts as a filehandle that iterates through them:
while(<>) {
chomp;
print "$_\n" if exists $addresses{$_};
}
As has been noted before, try to avoid useing English unless you use English qw( -no_match_vars ); to avoid the significant performance penalty of those evil match_vars in there. And as hasn't been noted yet, but should be...
ALWAYS ALWAYS ALWAYS always use strict; and use warnings; or else Larry Wall will descend from heaven and break your code. I see you have -w - this is enough, because even off of Unix, Perl parses the shebang line, and will find your -w and will use warnings; like it should. However, you need to use strict;. This will catch a lot of serious errors in your code, like not declaring variables with my or using false as a language keyword.
Making your code work under strict as well as warnings will result in MUCH cleaner code that never breaks for reasons you can't figure out. You'll spend hours at the debugger debugging and you'll probably end up using strict and warnings anyway just to figure out what the errors are. Only remove them if (and only if) your code is finished and you're releasing it and it never generates any errors.

While doing this certainly is one way to do it in Perl.
use strict;
use warnings;
my $new_ip;
{
my #parts = split ('\.', $ip);
foreach my $part(reverse #parts){
$part++;
if( $part > 255 ){
$part = 0;
next;
}else{
last;
}
}
$new_ip = join '.', reverse #parts;
}
This is how I would actually implement it.
use NetAddr::IP;
my $new_ip = ''.(NetAddr::IP->new($ip,0) + 1) or die;

I can't say that this solution will make your program more Perl-ish, but it might simplify your algorithm.
Rather than treating an IP address as a dotted-quad, base-256 number which needs the nested-if structure to implement the increment function, consider an IP address to be a 32-bit integer. Convert an IP of the form a.b.c.d into an integer with this (not tested):
sub ip2int {
my $ip = shift;
if ($ip =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/) {
return ($1 << 24) + ($2 << 16) + ($3 << 8) + $4;
} else {
return undef;
}
}
Now it's easy to determine if an IP falls between two endpoint IPs. Just do simple integer arithmetic and comparisons.
$begin = "192.168.5.0";
$end = "192.168.10.255";
$target = "192.168.6.2";
if (ip2int($target) >= ip2int($begin) && ip2int($target) <= ip2int($end)) {
print "$target is between $begin and $end\n";
} else {
print "$target is not in range\n";
}

Tell your coworkers that their perl looks too much like line noise. Please don't obfuscate your code just for the sake of obfuscation - it's asinine development goals like that which give perl such a bad reputation for being unreadable, when it's really bad programmers (apparently, like your coworkers) who write sloppy code. Nicely structured, indented, and logical code is a good thing. C is a good thing.
Seriously, though - the best place to figure out how to write perl is in the O'Reilly "Perl Best Practices", by Damian Conway. It tells you how he thinks you should do things, and he always gives good reasons for his position as well as occasionally giving good reasons to disagree. I do disagree with him on some points, but his reasoning is sound. The odds that you work with anyone who knows perl better than Mr. Conway are pretty slim, and having a printed book (or at least a Safari subscription) gives you some more solid backing for your arguments. Pick up a copy of the Perl Cookbook while you're at it, as looking at code examples for solving common problems should get you on the right track. I hate to say "buy the book", but those are exceptionally good books that any perl developer should read.
With regards to your specific code, you're using foreach, $_, split with no parens, shift, etc. It looks plenty perl-ish to my eyes - which have been developing with perl for quite a while. One note, though - I hate the English module. If you must use it, do it like use English qw( -no_match_vars );. The match_vars option slows down regexp parsing measurably, and the $PREMATCH / $POSTMATCH variables it provides aren't usually useful.

There is only 1 advice: use strict. Rest of it is hardly relevant.

I know exactly how you feel. My first language was FORTRAN and like a good FORTRAN programmer, I wrote FORTRAN in every language since :).
I have this really wonderful book Effective Perl Programming that I keep re-reading every now and then. Especially a chapter called "Idiomatic Perl". Here are a few things I use to keep my Perl looking like Perl: List Operators like for map and grep, slices and hash slices, the quote operators.
Another thing that keeps my Perl from looking like FORTRAN/C is a regular reading of module sources especially those of the masters.

You could use Acme::Bleach or Acme::Morse

While this would work:
use strict;
use warnings;
use 5.010;
use NetAddr::IP;
my %addresses;
# Parse all the ip addresses and record them in a hash.
{
open( my $ips_file, '<', 'ips') or die;
local $_; # or my $_ on Perl 5.10 or later
while( my $line = <$ips_file> ){
my ($ip, $end_ip) = split ',', $line;
next unless $ip and $end_ip;
$ip = NetAddr::IP->new( $ip, 0 ) or die;
$end_ip = NetAddr::IP->new( $end_ip ) or die;
while( $ip <= $end_ip ){
$addresses{$ip->addr} = 1;
$ip++;
}
}
close $ips_file
}
# print IP addresses in any of the found ranges
use English;
for my $arg (#ARGV) {
open(my $traffic, '<',$arg) or die "Can't open $arg $OS_ERROR";
while( my $ip = <$traffic> ){
chomp $ip;
if( $addresses{$ip} ){
say $ip
}
}
close ($traffic);
}
I would if possible use netmasks, because it gets even simpler:
use Modern::Perl;
use NetAddr::IP;
my #addresses;
{
open( my $file, '<', 'ips') or die;
while( (my $ip = <$file>) =~ s(,.*){} ){
next unless $ip;
$ip = NetAddr::IP->new( $ip ) or die;
push #addresses, $ip
}
close $file
}
for my $filename (#ARGV) {
open( my $traffic, '<', $filename )
or die "Can't open $filename";
while( my $ip = <$traffic> ) {
chomp $ip;
next unless $ip;
$ip = NetAddr::IP->new($ip) or next; # skip line on error
my #match;
for my $cmp ( #addresses ){
if( $ip->within($cmp) ){
push #match, $cmp;
#last;
}
}
say "$ip => #match" if #match;
say "# no match for $ip" unless #match;
}
close ($traffic);
}
Test ips file:
192.168.0.1/24
192.168.0.0
0:0:0:0:0:0:C0A8:0/128
Test traffic file:
192.168.1.0
192.168.0.0
192.168.0.5
Output:
# no match for 192.168.1.0/32
192.168.0.0/32 => 192.168.0.1/24 192.168.0.0/32 0:0:0:0:0:0:C0A8:0/128
192.168.0.5/32 => 192.168.0.1/24

Instead of doing this :
if ($left_part_1 != $right_part_1 ) {
return ($left_part_1 < $right_part_1);
}
you could do this :
return $left_part_1 < $right_part_1 if($left_part_1 != $right_part_1);
Also, you could use the Fatal module, to avoid checking stuff for errors.

The only criteria I use for "how my code looks" is how easy it is to read and understand the purpose of the code (especially by programmers unfamiliar with Perl), not whether it follows a particular style.
If a Perl language feature makes some logic easier to understand then I use it, if not I don't - even if it can do it in less code.
Your co-workers may think my code is extremely "un perl-ish", but I'll bet they understood exactly what the code is doing and could modify it to fix / extend it without any trouble:
my version:
#******************************************************************************
# Load the allowable ranges into a hash
#******************************************************************************
my %ipRanges = loadIPAddressFile("../conf/ip.cfg");
#*****************************************************************************
# Get the IP to check on the command line
#*****************************************************************************
my ( $in_ip_address ) = #ARGV;
# Convert it to number for comparison
my $ipToCheckNum = 1 * sprintf("%03d%03d%03d%03d", split(/\./, $in_ip_address));
#*****************************************************************************
# Loop through the ranges and see if the number is in any of them
#*****************************************************************************
my $startIp;
my $endIp;
my $msg = "IP [$in_ip_address] is not in range.\n";
foreach $startIp (keys(%ipRanges))
{
$endIp = $ipRanges{$startIp};
if ( $startIp <= $ipToCheckNum and $endIp >= $ipToCheckNum )
{
$msg = "IP [$in_ip_address] is in range [$startIp] to [$endIp]\n";
}
}
print $msg;
#******************************************************************************
# Function: loadIPAddressFile()
# Author: Ron Savage
# Date: 04/10/2009
#
# Description:
# loads the allowable IP address ranges into a hash from the specified file.
# Hash key is the starting value of the range, value is the end of the range.
#******************************************************************************
sub loadIPAddressFile
{
my $ipFileHandle;
my $startIP;
my $endIP;
my $startIPnum;
my $endIPnum;
my %rangeList;
#***************************************************************************
# Get the arguments sent
#***************************************************************************
my ( $ipFile ) = #_;
if ( open($ipFileHandle, "< $ipFile") )
{
while (<$ipFileHandle>)
{
( $startIP, $endIP ) = split(/\,/, $_ );
# Convert them to numbers for comparison
$startIPnum = 1 * sprintf("%03d%03d%03d%03d", split(/\./, $startIP));
$endIPnum = 1 * sprintf("%03d%03d%03d%03d", split(/\./, $endIP));
$rangeList{$startIPnum} = $endIPnum;
}
close($ipFileHandle);
}
else
{
print "Couldn't open [$ipFile].\n";
}
return(%rangeList);
}
(Note: the extra "#" lines are in there to preserve my freakin' spacing, which always gets whacked when posting code here)

Am I missing something... will any of the above array versions work? The mods are performed on variables local to the for loop. I think Brad Gilbert's Net::IP solution would be my choice. Chris Lutz pretty much cleaned the rest the way I would've.
As an aside - some of the comments about readability strike me as curious. Are there fewer [vigorous] complaints about the readability of Erlang/Lisp syntax because there is ONLY ONE way to write code in them?

This is probably more like C, but is also more simple:
use Socket qw(inet_aton inet_ntoa);
my $ip = ("192.156.255.255");
my $ip_1 = inet_ntoa(pack("N", unpack("N", inet_aton($ip))+1));
print "$ip $ip_1\n";
Update: I posted this before reading all of the code in the question. The code here just does the incrementing of the ip address.

Related

Perl sub skips foreach within which it is called

I'm having some problem with a subroutine that locates certain files and extracts some data out of them.
This subroutine is called inside a foreach loop, but whenever the call is made the loop skips to its next iteration. So I am wondering whether any of the next;'s are somehow escaping from the subroutine to the foreach loop where it is called?
To my knowledge the sub looks solid though so I'm hoping if anyone can see something I'm missing?
sub FindKit{
opendir(DH, "$FindBin::Bin\\data");
my #kitfiles = readdir(DH);
closedir(DH);
my $nametosearch = $_[0];
my $numr = 1;
foreach my $kitfile (#kitfiles)
{
# skip . and .. and Thumbs.db and non-K-files
if($kitfile =~ /^\.$/) {shift #kitfiles; next;}
if($kitfile =~ /^\.\.$/) {shift #kitfiles; next;}
if($kitfile =~ /Thumbs\.db/) {shift #kitfiles; next;}
if($kitfile =~ /^[^K]/) {shift #kitfiles; next;}
# $kitfile is the file used on this iteration of the loop
open (my $fhkits,"<","data\\$kitfile") or die "$!";
while (<$fhkits>) {}
if ($. <= 1) {
print " Empty File!";
next;
}
seek($fhkits,0,0);
while (my $kitrow = <$fhkits>) {
if ($. == 0 && $kitrow =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/g) {
close $fhkits;
return $1;
}
}
$numr++;
close $fhkits;
}
return 0;
}
To summarize comments, the refactored code:
use File::Glob ':bsd_glob';
sub FindKit {
my $nametosearch = $_[0];
my #kitfiles = glob "$FindBin::Bin/data/K*"; # files that start with K
foreach my $kitfile (#kitfiles)
{
open my $fhkits, '<', $kitfile or die "$!";
my $kitrow_first_line = <$fhkits>; # read first line
return if eof; # next read is end-of-file so it was just header
my ($result) = $kitrow_first_line =~
/Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
return $result if $result;
}
return 0;
}
I use core File::Glob and enable :bsd_glob option, which can handle spaces in filenames. I follow the docs note to use "real slash" on Win32 systems.
I check whether there is only a header line using eof.†
I do not see how this can affect the calling code, other than by its return value. Also, I don't see how the posted code can make the caller skip the beat, either. That problem is unlikely to be in this sub.
Please let me know if I missed some point with the above rewrite.
† Previous version used to check whether there is just one (header) line by
1 while <$fhkits>; # check number of lines ...
return if $. == 1; # there was only one line, the header
Also correct but eof is way better
The thing that is almost certainly screwing you here, is that you are shifting the list that you are iterating.
That's bad news, as you're deleting elements ... but in places you aren't necessarily thinking.
For example:
#!/usr/bin/env perl
use strict;
use warnings;
my #list = qw ( one two three );
my $count;
foreach my $value ( #list ) {
print "Iteration ", ++$count," value is $value\n";
if ( $value eq 'two' ) { shift #list; next };
}
print "#list";
How many times do you think that should iterate, and which values should end up in the array?
Because you shift you never process element 'three' and you delete element 'one'. That's almost certainly what's causing you problems.
You also:
open using a relative path, when your opendir used an absolute one.
skip a bunch of files, and then skip anything that doesn't start with K. Why not just search for things that do start with K?
read the file twice, and one is to just check if it's empty. The perl file test -z will do this just fine.
you set $kitrow for each line in the file, but don't really use it for anything other than pattern matching. It'd probably work better using implicit variables.
You only actually do anything on the first line - so you don't ever need to iterate the whole file. ($numr seems to be discarded).
you use a global match, but only use one result. The g flag seems redundant here.
I'd suggest a big rewrite, and do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use FindBin;
sub FindKit{
my ($nametosearch) = #_;
my $numr = 1;
foreach my $kitfile (glob "$FindBin::Bin\\data\\K*" )
{
if ( -z $kitfile ) {
print "$kitfile is empty\n";
next;
}
# $kitfile is the file used on this iteration of the loop
open (my $fhkits,"<", $kitfile) or die "$!";
<$kitfile> =~ m/Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/
and return $1;
return 0;
}
}
As a big fan of the Path::Tiny module (me have it always installed and using it in every project) my solution would be:
use strict;
use warnings;
use Path::Tiny;
my $found = FindKit('mykit');
print "$found\n";
sub FindKit {
my($nametosearch) = #_;
my $datadir = path($0)->realpath->parent->child('data');
die "$datadir doesn't exists" unless -d $datadir;
for my $file ($datadir->children( qr /^K/ )) {
next if -z $file; #skip empty
my #lines = $file->lines;
return $1 if $lines[0] =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
}
return;
}
Some comments and still opened issues:
Using the Path::Tiny you could always use forward slashes in the path-names, regardless of the OS (UNIX/Windows), e.g. the data/file will work on windows too.
AFAIK the FindBin is considered broken - so the above uses the $0 and realpath ...
what if the Kit is in multiple files? The above always returns on the 1st found one
the my #lines = $file->lines; reads all lines - unnecessary - but on small files doesn't big deal.
the the reality this function returns the arg for the Maakartikel, so probably better name would be find_articel_by_kit or find_articel :)
easy to switch to utf8 - just change the $file->lines to $file->lines_utf8;

Counting the frequency of bases using while loop and substr with Perl

I am trying to write in Perl to count the number of each A/C/G/T bases in a DNA sequence. But couldn't figure out what went wrong in my code. "ATCTAGCTAGCTAGCTA" is the kind of data I am given.
#!usr/bin/perl
use strict;
use warnings;
my $in_file = <$ARGV[0]>;
open( my $FH_IN, "<", $in_file );
my $dna = <$FH_IN>;
my $index = 0;
my ( $freq_a, $freq_c, $freq_g, $freq_t ) = 0;
my $dna_length = length($dna);
while ( $index < $dna_length ) {
my $base = substr( $dna, $index, 1 );
if ( $base eq "A" ) {
$freq_a++;
$index++;
next;
} elsif ( $base eq "C" ) {
$freq_c++;
$index++;
next;
} elsif ( $base eq "G" ) {
$freq_g++;
$index++;
next;
} elsif ( $base eq "T" ) {
$freq_t++;
$index++;
next;
} else {
next;
}
}
print "$freq_a\n$freq_c\n$freq_g\n$freq_t\n";
exit;
I know there are a lot of ways to do it, but what I want to know is what I did wrong so I can learn from mistakes.
Perl has a special file handle to use with these kinds of problems: The diamond operator <>. It will read input from either a file name, if provided, and standard input if not.
Secondly, since you are only interested in ACGT, might as well look for only them, using a regex: /([ACGT])/g.
Thirdly, using a hash is the idiomatic way to count characters in Perl: $count{A}++
So your script becomes:
use strict;
use warnings;
my %count;
while (<>) {
while (/([ACGT])/g) {
$count{$1}++;
}
}
print "$_\n" for #count{qw(A C G T)};
Usage:
script.pl input.txt
Okay, you've done well so far and there's only one problem that stops your program from working.
It's far from obvious, but each line that's read from the file has a newline character "\n" at the end. So what's happening is that $index reaches the newline in the string, which is processed by the else clause (because it's not A, C, G or T) which just does a next, so the same character is processed over and over again. Your program just hangs, right?
You could remove the newline with chomp, but a proper fix is to increment $index in the else clause just as you do with all the other characters. So it looks like
else {
++$index;
next;
}
As you've suspected, there are much better ways to write this. There are also a couple of other nasties in your code, but that change should get you on your way for now.
It would be instructive for you to print out the values in $dna_length, $index and $base each time you go round the loop - immediately after you assign a value to $base.
Your code would be more robust if you moved the incrementing of $index to the end of the loop (outside of the if/elsif/else block) and removed all of your next statements.
An alternative "quick fix" is to chomp() the input line before you start processing it.

calculate average of small parts of column, grouped by key in perl?

This question is quite similar to this one How can I get the average and standard deviations grouped by key? but I don't manage to modify it to fit my problem.
I have a lot of files (.csv) with 7 columns, the last three columns look like this:
col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
1409,2,345
1409,3,456
1409,4,567
1409,5,678
1409,6,789
...
N,0,123
N,1,234
N,2,345
N,3,456
N,4,567
N,5,678
N,6,789
What I want to do is to calculate the average of the last column (col7) for all the values that have the same value in column 5 (col5), so 1408, 1409, 1410, ... until N and I don't know N. I want to print this average value next to the line (in col8) which contains a 3 in column 6 (col6). Do note that the value in column 6 (col6) goes from 0 to 6, but the first number of the file is not always 0. So what I want is:
col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456
...
bla,bla,bla,bla,N,3,456,456
I have some script I can use to calculate the average, but I have to be able to put my values into an array for that. Below is what I tried to do, but it doesn't work. Also, I'm just trying to learn Perl on my own, so if it looks like crap, I'm just trying!
open (FILE, "<", $dir.$file) or die;
my #lines = <FILE>;
foreach my $line(#lines) {
my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
push #arrays5, $col5;
}
foreach my $array5(#arrays5) {
foreach my $line(#lines) {
my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
if ($array5 == $col5) {
push #arrays7, $col7;
}
}
}
close(FILE);
One way using Text::CSV_XS module. It's not a built-in one, so it has to be installed from CPAN or similar tool.
Content of script.pl:
use warnings;
use strict;
use Text::CSV_XS;
my ($offset, $col_total, $row3, $rows_processed);
## Check arguments to the script.
die qq[Usage: perl $0 <input-file>\n] unless #ARGV == 1;
## Open input file.
open my $fh, q[<], shift or die qq[Open error: $!\n];
## Create the CSV object.
my $csv = Text::CSV_XS->new or
die qq[ERROR: ] . Text::CSV_XS->error_diag();
## Read file content seven lines each time.
while ( my $rows = $csv->getline_all( $fh, $offset, 7 ) ) {
## End when there is no more rows.
last unless #$rows;
## For each row in the group of seven...
for my $row ( 0 .. $#{$rows} ) {
## Get value of last column.
my $last_col_value = $rows->[ $row ][ $#{$rows->[$row]} ];
## If last column is not a number it is the header, so print it
## appending the eigth column and read next one.
unless ( $last_col_value =~ m/\A\d+\Z/ ) {
$csv->print( \*STDOUT, $rows->[ $row ] );
printf qq[,%s\n], q[col8];
next;
}
## Acumulate total amount for last column.
$col_total += $last_col_value;
## Get third row. The output will be this row with the
## average appended.
if ( $rows->[ $row ][-2] == 3 ) {
$row3 = [ #{ $rows->[ $row ] } ];
}
## Count processed rows.
++$rows_processed;
}
## Print row with its average.
if ( $rows_processed > 0 && ref $row3 ) {
$csv->print( \*STDOUT, $row3 );
printf qq[,%g\n], $col_total / $rows_processed;
}
## Initialize variables.
$col_total = $rows_processed = 0;
undef $row3;
}
Content of infile:
col1,col2,col3,col4,col5,col6,col7
bla,bla,bla,bla,1408,1,123
bla,bla,bla,bla,1408,2,234
bla,bla,bla,bla,1408,3,345
bla,bla,bla,bla,1408,4,456
bla,bla,bla,bla,1408,5,567
bla,bla,bla,bla,1408,6,678
bla,bla,bla,bla,1409,0,123
bla,bla,bla,bla,1409,1,234
bla,bla,bla,bla,1409,2,345
bla,bla,bla,bla,1409,3,456
bla,bla,bla,bla,1409,4,567
bla,bla,bla,bla,1409,5,678
bla,bla,bla,bla,1409,6,789
Run it like:
perl script.pl infile
With following output:
col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456
Before we try to complete the answer, would you try this and tell me how close it comes to what you want?
#!/usr/bin/perl
use warnings;
use strict;
my $target = 3;
my %summary;
while(<>) {
chomp;
my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split /\,/;
$summary{$col5}{total} += $col7;
++$summary{$col5}{count};
$summary{$col5}{line} = $_ if $col6 == $target;
}
$summary{$_}{average} = $summary{$_}{total} / $summary{$_}{count}
for keys %summary;
print "${summary{$_}{line}},${summary{$_}{average}}\n"
for sort keys %summary;
If close enough, then you may wish to finish on your own. If not, then we can discuss the matter further.
Note that you can replace the <> with <FILE> if you prefer to read from your data file rather than from standard input.
IMPLEMENTATION NOTES
The code relies on Perl's autovivification feature. Observe for instance the line ++$summary{$col5}{count};, which seems initially to increment a nonexistent counter. However, this is actually standard Perl idiom. If you try to do something arithmetical (like incrementation) to an object that does not exist, Perl implicitly creates the object, initializes it to zero, and then does the thing you wanted (like incrementation) to it.
It would probably be unwise for a more sober programming language like C++ to autovivify, but years of experience suggest that autovivification strikes the right balance between order and convenience in a slightly less sober language like Perl.
On a more elementary level, the code will probably make sense only to those used to Perl's hashes. However, if you've not used Perl's hashes before, this would be as good a chance as any to learn them. The hash is a central pillar of the language, and the above makes a fairly typical example of its use.
In this case, we have a hash of hashes, which again is fairly typical.
This should do the trick. Replace Cols[index] appropriately.
use Data::Dumper ;
open (FILE, "<", '/tmp/myfile') or die;
my #lines ;
my (%Sum,%Count);
chomp(#lines = <FILE>);
foreach my $line(#lines) {
next if $line =~ /col/;
my #Cols = split /,/, $line;
$Sum{$Cols[0]} += $Cols[2] ;
$Count{$Cols[0]}++;
}
foreach my $line(#lines) {
if($line=~/col/) {
print "$line,colX\n" ;
next;
}
my #Cols = split /,/, $line;
if($Cols[1]==3) {
print "$line,",$Sum{$Cols[0]}/$Count{$Cols[0]},"\n" ;
} else {
print "$line,-1\n";
}
}
Sample input /tmp/myfile
col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
Sample output
col5,col6,col7,colX
1408,1,123,-1
1408,2,234,-1
1408,3,345,400.5
1408,4,456,-1
1408,5,567,-1
1408,6,678,-1
1409,0,123,-1
1409,1,234,-1

Statistics in Perl Script

I have the following question:
I want to create a perl script that reads from a text file (file with several columns of numbers) and calculate some statistics (mean, median, sd, variance). I already built one script, but as I am not in love yet with perl, I can't fix the problems of syntax on it...
Here is my perl script..
#!/usr/bin/perl -w
use strict;
open(FILEHANDLE, data.txt);
while (<FILEHANDLE>) {
shift #ARGV;
my #array = split(\t,$_);
}
close(FILEHANDLE);
###### mean, sum and size
$N = $sum = 0;
$array[$x-1];
$N++;
$sum += $array[$x-1];
###### minimum and the maximum
($min = 0, $max = 0);
$max = $array[$x-1] if ($max < $array[$x-1]), (my#sorted = sort { $a <=> $b } #samples) {
print join(" ",#sorted);
}
##### median
if ($N % 2==1) {
print "$median = $sorted[int($N/2)]\n"; ## check this out
};
else ($median = ($sorted[$N/2] + $sorted[($N/2)-1]) / 2)) {
print "$median\n"; # check this out
};
##### quantiles 1º and 3º
if $qt1 = $sorted[$r25-1] {
print "\n"; # check this out
};
else $qt1 = $fr*($sorted[$ir] - $sorted[$ir-1]) + $sorted[$ir-1] {
print "\n"; # check this out
};
##### variance
for (my $i=0;
$i<scalar(#samples);
$i++)
{
$Var += ($samples[$i]-$mean)**2;
$Var = $Var/($N-1);
};
###### standard error
($Std = sqrt($Var)/ sqrt($N));
############################################################
print "$min\n";
print "$max\n";
print "$mean\n";
print "$median\n";
print "$qt1\n";
print "$var\n";
print "$std\n";
exit(0);
I want to get it working. Please help. THANKS IN ADVANCE!
Errors in your code:
open(FILEHANDLE, data.txt);
data.txt needs to be quoted. You are not checking the return value of the open, e.g. ... or die $!. You should use a lexical filehandle and three argument open, e.g. open my $fh, '<', "data.txt" or die $!.
shift #ARGV;
This does nothing except remove the first value from you argument list, which is then promptly discarded.
my #array = split(\t,$_);
You are using \t as a bareword, it should be a regex, /\t/. Your #array is declared inside a lexical scope of the while loop, and will be undefined outside this block.
$N = $sum = 0;
Both variables are not declared, which will cause the script to die when you use strict (which is a very good idea). Use my $N to solve that. Also, $N is not a very good variable name.
$array[$x-1];
This will do nothing. $x is not declared (see above), and also undefined. The whole statement does nothing, it is like having a line 3;. I believe you will get an error such as Useless use of variable in void context.
$N++;
This increments $N to 1, which is a useless thing to do, since you only a few lines above initialized it to 0.
Well.. the list goes on. I suggest you start smaller, use strict and warnings since they are very good tools, and work out the errors one by one. A very good idea would be to make subroutines of your calculations, e.g.:
sub sum {
# code here
return $sum;
}
Go to perldoc.perl.org and read the documentation. Especially useful would be the syntax related ones and perlfunc.
Also, you should be aware that this functionality can be found in modules, which you can find at CPAN.
Your main problem is you have not declared your variables such as $N, $max, etc.
You need to introduce all new variables with my the first time you reference them. Just like you did with $array and $i. So for example
$N = $sum = 0;
Should become
my( $N, $sum ) = ( 0, 0 );

Perl if equals sign

I need to detect if the first character in a file is an equals sign (=) and display the line number. How should I write the if statement?
$i=0;
while (<INPUT>) {
my($line) = $_;
chomp($line);
$findChar = substr $_, 0, 1;
if($findChar == "=")
$output = "$i\n";
print OUTPUT $output;
$i++;
}
Idiomatic perl would use a regular expression (^ meaning beginning of line) plus one of the dreaded builtin variables which happens to mean "line in file":
while (<INPUT>) {
print "$.\n" if /^=/;
}
See also perldoc -v '$.'
Use $findChar eq "=". In Perl:
== and != are numeric comparisons. They will convert both operands to a number.
eq and ne are string comparisons. They will convert both operands to a string.
Yes, this is confusing. Yes, I still write == when I mean eq ALL THE TIME. Yes, it takes me forever to spot my mistake too.
It looks like you are not using strict and warnings. Use them, especially since you do not know Perl, you might also want to add diagnostics to the list of must-use pragmas.
You are keeping track of the input line number in a separate variable $i. Perl has various builtin variables documented in perlvar. Some of these, such as $. are very useful use them.
You are using my($line) = $_; in the body of the while loop. Instead, avoid $_ and assign to $line directly as in while ( my $line = <$input> ).
Note that bareword filehandles such as INPUT are package global. With the exception of the DATA filehandle, you are better off using lexical filehandles to properly limit the scope of your filehandles.
In your posts, include sample data in the __DATA_ section so others can copy, paste and run your code without further work.
With these comments in mind, you can print all lines that do not start with = using:
#!/usr/bin/perl
use strict; use warnings;
while (my $line = <DATA> ) {
my $first_char = substr $line, 0, 1;
if ( $first_char ne '=' ) {
print "$.:$first_char\n";
}
}
__DATA__
=
=
a
=
+
However, I would be inclined to write:
while (my $line = <DATA> ) {
# this will skip blank lines
if ( my ($first_char) = $line =~ /^(.)/ ) {
print "$.:$first_char\n" unless $first_char eq '=';
}
}