perl cookbook fixstyle2 perplexed by & 1 - perl

From once again perl cookbook, I know what this program does and I understand most of it but below code is escapes me.
It is using basically if else but what is ( $i++ & 1 ) mean??
#!/usr/bin/perl -w
# fixstyle2 - like fixstyle but faster for many many matches
use strict;
my $verbose = (#ARGV && $ARGV[0] eq '-v' && shift);
my %change = ();
while (<DATA>) {
chomp;
my ($in, $out) = split /\s*=>\s*/;
next unless $in && $out;
$change{$in} = $out;
}
if (#ARGV) {
$^I = ".orig";
} else {
warn "$0: Reading from stdin\n" if -t STDIN;
}
while (<>) {
my $i = 0;
s/^(\s+)// && print $1; # emit leading whitespace
for (split /(\s+)/, $_, -1) { # preserve trailing whitespace
print( ($i++ & 1) ? $_ : ($change{$_} || $_));
}
}
__END__
analysed analyzed

$i++ returns the value of $i and increments $i afterwards. & is the "bitwise and" operator, so it takes the before mentioned value of $i and checks its last bit (as 1 in binary is 00..01).
As $i is incremented by 1 in each iteration, in binary its last bit changes from 1 to 0 and vice versa in each step, therefore the expression just determines odd versus even words.

Related

Is there a better way to count occurrence of char in a string?

I felt there must a better way to count occurrence instead of writing a sub in perl, shell in Linux.
#/usr/bin/perl -w
use strict;
return 1 unless $0 eq __FILE__;
main() if $0 eq __FILE__;
sub main{
my $str = "ru8xysyyyyyyysss6s5s";
my $char = "y";
my $count = count_occurrence($str, $char);
print "count<$count> of <$char> in <$str>\n";
}
sub count_occurrence{
my ($str, $char) = #_;
my $len = length($str);
$str =~ s/$char//g;
my $len_new = length($str);
my $count = $len - $len_new;
return $count;
}
If the character is constant, the following is best:
my $count = $str =~ tr/y//;
If the character is variable, I'd use the following:
my $count = length( $str =~ s/[^\Q$char\E]//rg );
I'd only use the following if I wanted compatibility with versions of Perl older than 5.14 (as it is slower and uses more memory):
my $count = () = $str =~ /\Q$char/g;
The following uses no memory, but might be a bit slow:
my $count = 0;
++$count while $str =~ /\Q$char/g;
Counting the occurences of a character in a string can be performed with one line in Perl (as compared to your 4 lines). There is no need for a sub (although there is nothing wrong with encapsulating functionality in a sub). From perlfaq4 "How can I count the number of occurrences of a substring within a string?"
use warnings;
use strict;
my $str = "ru8xysyyyyyyysss6s5s";
my $char = "y";
my $count = () = $str =~ /\Q$char/g;
print "count<$count> of <$char> in <$str>\n";
In a beautiful* Bash/Coreutils/Grep one-liner:
$ str=ru8xysyyyyyyysss6s5s
$ char=y
$ fold -w 1 <<< "$str" | grep -c "$char"
8
Or maybe
$ grep -o "$char" <<< "$str" | wc -l
8
The first one works only if the substring is just one character long; the second one works only if the substrings are non-overlapping.
* Not really.
toolic has given a correct answer, but you might consider not hardcoding your values to make the program reusable.
use strict;
use warnings;
die "Usage: $0 <text> <characters>" if #ARGV < 1;
my $search = shift; # the string you are looking for
my $str; # the input string
if (#ARGV && -e $ARGV[0] || !#ARGV) { # if str is file, or there is no str
local $/; # slurp input
$str = <>; # use diamond operator
} else { # else just use the string
$str = shift;
}
my $count = () = $str =~ /\Q$search\E/gms;
print "Found $count of '$search' in '$str'\n";
This will allow you to use the program to count for the occurrence of a character, or a string, inside a string, a file, or standard input. For example:
count.pl needles haystack.txt
some_process | count.pl foo
count.pl x xyzzy

How do I speed up pattern recognition in perl

This is the program as it stands right now, it takes in a .fasta file (a file containing genetic code), creates a hash table with the data and prints it, however, it is quite slow. It splits a string an compares it against all other letters in the file.
use strict;
use warnings;
use Data::Dumper;
my $total = $#ARGV + 1;
my $row;
my $compare;
my %hash;
my $unique = 0;
open( my $f1, '<:encoding(UTF-8)', $ARGV[0] ) or die "Could not open file '$ARGV[0]' $!\n";
my $discard = <$f1>;
while ( $row = <$f1> ) {
chomp $row;
$compare .= $row;
}
my $size = length($compare);
close $f1;
for ( my $i = 0; $i < $size - 6; $i++ ) {
my $vs = ( substr( $compare, $i, 5 ) );
for ( my $j = 0; $j < $size - 6; $j++ ) {
foreach my $value ( substr( $compare, $j, 5 ) ) {
if ( $value eq $vs ) {
if ( exists $hash{$value} ) {
$hash{$value} += 1;
} else {
$hash{$value} = 1;
}
}
}
}
}
foreach my $val ( values %hash ) {
if ( $val == 1 ) {
$unique++;
}
}
my $OUTFILE;
open $OUTFILE, ">output.txt" or die "Error opening output.txt: $!\n";
print {$OUTFILE} "Number of unique keys: " . $unique . "\n";
print {$OUTFILE} Dumper( \%hash );
close $OUTFILE;
Thanks in advance for any help!
It is not clear from the description what is wanted from this script, but if you're looking for matching sets of 5 characters, you don't actually need to do any string matching: you can just run through the whole sequence and keep a tally of how many times each 5-letter sequence occurs.
use strict;
use warnings;
use Data::Dumper;
my $str; # store the sequence here
my %hash;
# slurp in the whole file
open(IN, '<:encoding(UTF-8)', $ARGV[0]) or die "Could not open file '$ARGV[0]' $!\n";
while (<IN>) {
chomp;
$str .= $_;
}
close(IN);
# not sure if you were deliberately omitting the last two letters of sequence
# this looks at all the sequence
my $l_size = length($str) - 4;
for (my $i = 0; $i < $l_size; $i++) {
$hash{ substr($str, $i, 5) }++;
}
# grep in a scalar context will count the values.
my $unique = grep { $_ == 1 } values %hash;
open OUT, ">output.txt" or die "Error opening output.txt: $!\n";
print OUT "Number of unique keys: ". $unique."\n";
print OUT Dumper(\%hash);
close OUT;
It might help to remove searching for information that you already have.
I don't see that $j depends upon $i so you're actually matching values to themselves.
So you're getting bad counts as well. It works for 1, because 1 is the square of 1.
But if for each five-character string you're counting strings that match, you're going
to get the square of the actual number.
You would actually get better results if you did it this way:
# compute it once.
my $lim = length( $compare ) - 6;
for ( my $i = 0; $i < $lim; $i++ ){
my $vs = substr( $compare, $i, 5 );
# count each unique identity *once*
# if it's in the table, we've already counted it.
next if $hash{ $vs };
$hash{ $vs }++; # we've found it, record it.
for ( my $j = $i + 1; $j < $lim; $j++ ) {
my $value = substr( $compare, $j, 5 );
$hash{ $value }++ if $value eq $vs;
}
}
However, it could be an improvement on this to do an index for your second loop
and let the c-level of perl do your matching for you.
my $pos = $i;
while ( $pos > -1 ) {
$pos = index( $compare, $vs, ++$pos );
$hash{ $vs }++ if $pos > -1;
}
Also, if you used index, and wanted to omit the last two characters--as you do, it might make sense to remove those from the characters you have to search:
substr( $compare, -2 ) = ''
But you could do all of this in one pass, as you loop through file. I believe the code
below is almost an equivalent.
my $last_4 = '';
my $last_row = '';
my $discard = <$f1>;
# each row in the file after the first...
while ( $row = <$f1> ) {
chomp $row;
$last_row = $row;
$row = $last_4 . $row;
my $lim = length( $row ) - 5;
for ( my $i = 0; $i < $lim; $i++ ) {
$hash{ substr( $row, $i, 5 ) }++;
}
# four is the maximum we can copy over to the new row and not
# double count a strand of characters at the end.
$last_4 = substr( $row, -4 );
}
# I'm not sure what you're getting by omitting the last two characters of
# the last row, but this would replicate it
foreach my $bad_key ( map { substr( $last_row, $_ ) } ( -5, -6 )) {
--$hash{ $bad_key };
delete $hash{ $bad_key } if $hash{ $bad_key } < 1;
}
# grep in a scalar context will count the values.
$unique = grep { $_ == 1 } values %hash;
You may be interested in this more concise version of your code that uses a global regex match to find all the subsequences of five characters. It also reads the entire input file in one go, and removes the newlines afterwards.
The path to the input file is expected as a parameter on the command line, and the output is sent to STDIN, and can be redirected to a file on the command line, like this
perl subseq5.pl input.txt > output.txt
I've also used Data::Dump instead of Data::Dumper because I believe it to be vastly superior. However it is not a core module, and so you will probably need to install it.
use strict;
use warnings;
use open qw/ :std :encoding(utf-8) /;
use Data::Dump;
my $str = do { local $/; <>; };
$str =~ tr|$/||d;
my %dups;
++$dups{$1} while $str =~ /(?=(.{5}))/g;
my $unique = grep $_ == 1, values %dups;
print "Number of unique keys: $unique\n";
dd \%dups;

Automate paired analysis

I have a growing number of files to process using a simple Perl script I wrote. The script takes two files as input and prints an output. I want to use a bash script (or anything really) to automate the following usage:
perl Program.pl GeneLevels_A GeneLevels_B > GeneLevels_A_B
with every paired, non-directional combination of files in a particular directory.
Here is the Perl script:
#!/usr/bin/perl
use strict;
use warnings;
die "Usage: $0 <File_1> <File_2>\n" unless #ARGV == 2;
my $file1 = shift #ARGV;
my $file2 = shift #ARGV;
my %hash1;
my %hash2;
my $counter = 0;
my $amt = 25;
my $start = 244 - $amt;
open (REF, $file1);
while (<REF>) {
my $line = $_;
chomp $line;
if ($counter < $start) {
$counter++;
next;
}
my #cells = split('\t', $line);
my $ID = $cells[2];
my $row = $cells[0];
$hash1{$ID} = $row;
$counter++;
}
close REF;
$counter = 0;
open (FILE, $file2);
while (<FILE>) {
my $line = $_;
chomp $line;
if ($counter < $start) {
$counter++;
next;
}
my #cells = split('\t', $line);
my $ID = $cells[2];
my $row = $cells[0];
$hash2{$ID} = $row;
$counter++;
}
close FILE;
while ( my ($key, $value) = each(%hash1) ) {
if ( exists $hash2{$key} ) {
print "$key\t$value\t$hash2{$key}\n";
}
}
A good solution would allow me to run the Perl script on every file with an appropriate suffix.
An even better solution would assess the suffixes of existing files to determine which pairs of files have already been processed this way and omit those. For example if File_A, File_B, File_C, and File_B_C exist then only File_A_B and File_A_C would be produced. Note that File_A_B and File_B_A are equivalent.
This should work. Better checks for bad arguments would be a good thing to add:
#!/bin/bash
if [ $# != 2 ]; then
echo "usage: pair <suffix1> <suffix2>"
exit
fi
suffix1=$1
suffix2=$2
for file1 in *_${suffix1}; do
fileCheck=$(echo $file1 | sed -e "s#_$suffix2##")
if [ "$fileCheck" = "$file1" ]; then
file2=${file1/_$suffix1/_$suffix2}
if [[ ( ! -f ${file1}_${suffix2} ) && ( ! -f ${file2}_${suffix1} ) ]]; then
echo processing ${file1}_${suffix2}
perl Program.pl $file1 $file2 > ${file1}_${suffix2}
fi
fi
done

Combining two one line Perl commands into a script

I am trying to combine the following two one line perl codes into a single perl script that carries out both on a line of a file before progressing to the next line. Note that this is not my own original code, it was very thoughtfully provided here: Adding a blank line between unrelated data entries
1
perl -pae 'print $/ if (defined $x && $x ne $F[0]); $x = $F[0];' DF-data2pfa.csv >DF-data2pfb.txt
2
perl -pae 'print $/ if (defined $x && $x ne $F[3]); $x = $F[3];' DF-data2pfb.txt >DF-data2pfc.txt
The script does exactly what I want it to (compares the F[0] field of a line in my dataset to the F[0] of the previous line and adds a blank line between those entries if they are different), except I realized that I need it to look at F[0] and F[3] on a single line and compare both to the previous line. Much to my embarrassment I tried just running one after another, and did not realize that this was adding an extra blank line every time the script encountered the blank line added by the previous script, which is unacceptable to the program I am trying to input this data to.
So I tried using the Deparse tool to convert both into script format and than use an elsif statement to add the second to the first. This became messy. Also I'm not sure how to achieve the pae function of the command line in the script. I'm not sure the e is necessary in the script but it seems like first printing each line and then splitting it into an array ( with pa ) is a rather integral component to this whole code and I'm not sure how to achieve that here.
Here's what I got:
while (defined($_ = <ARGV>)) {
our(#F) = split(' ', $_, 0);
$x = $F[0];
$y = $F[3];
if defined $x and $x ne $F[0];
elsif defined $y and $y ne $F[3];
print $/
}
continue {
die "-p destination: $!\n" unless print $_;
}
I'm also open to not using the deparse module if that's unnecessary here. Thanks for any help/explanations you can provide!
It's getting a bit wordy for a one-liner, but you could do this:
perl -pae 'print $/ if ((defined $x && $x ne $F[0]) && (defined $y && $y ne $F[3])); $x = $F[0]; $y = $F[3]' DF-data2pfa.csv >DF-data2pfb.txt
or as a script
open my $fh, "<", "input_file_name";
open my $out, ">", "output_file_name";
my ($x, $y);
foreach (<$fh>) {
my #F = split(' ', $_);
if ( ( defined($x) && $x ne $F[0] ) && (defined($y) && $y ne $F[3]) ) {
print $OUT $\;
}
$x = $F[0];
$y = $F[3];
print $OUT $_;
}
I'm not sure that I'm reading your requirements correctly - if you need to print an extra line if either $F[0] or $F[3] matches the previous row, then the conditional would be:
( ( defined($x) && $x ne $F[0] ) || (defined($y) && $y ne $F[3]) )
I'm not 100% sure what you are doing and so this script may not be exactly what you want, but it hopefully can get you started. It uses the strict and warnings pragmas which will help you prevent certain errors.
#!/usr/bin/env perl
use strict;
use warnings;
my ($x, $y, #F);
while ( <> ) {
#F = split ' ';
if ( defined $x and $x ne $F[0] ) {
print $/;
} elsif ( defined $y and $y ne $F[3] ) {
print $/;
}
$x = $F[0];
$y = $F[3];
print;
}
This implicitly uses the $_ variable (while implicitly sets it, split implicitly uses it). It also shows how your conditional statements should look; when not used in posfix style, the conditions NEED round braces. I have left in the continue block, but in practice I've never needed to use one, that is probably a remnant of the deparse and probably could go at the end of the while loop (and print can implicitly use $_ too). Finally the <> operator is the magic-open/read operator, it will use the files in ARGV sequentially or use STDIN as needed.
If you need more help just ping.

How to display user input on one line in a palandrome assignment?

In perl, I have to determine whether user input is a palindrome or not and it must display like this:
Enter in 7 characters: ghghghg #one line here #
Palindrome! #second line answer#
But instead this is what it does:
Enter in 7 characters: g #one line#
h #second line#
g #third line#
h #fourth line#
g #fifth line#
h #sixth line#
g Palindrom
e! #seventh line#
My problem seems to be on the chomp lines with all the variables but I just can't figure out what to do and I've been at if for hours. I need a simple solution, but have not progressed to arrays yet so need some simple to fix this. Thanks
And here is what i have so far, the formula seems to work but it keeps printing a new line for each character:
use strict;
use warnings;
my ($a, $b, $c, $d, $e, $f, $g);
print "Enter in 7 characters:";
chomp ($a = <>); chomp ($b = <>); chomp ($c = <>); chomp ($d = <>); chomp ($e = <>); chomp ($f = <>); chomp ($g = <>);
if (($a eq $g) && ($b eq $f) && ($c eq $e) && ($d eq $d) && ($e eq $c) && ($f eq $b) && ($g eq $a))
{print "Palindrome! \n";}
else
{print "Not Palindrome! \n";}
If you're going to determine if a word is the same backwards, may I suggest using reverse and lc?
chomp(my $word = <>);
my $reverse = reverse $word;
if (lc($word) eq lc($reverse)) {
print "Palindrome!";
} else {
print "Not palindrome!";
}
Perl is famous for its TIMTOWTDI. Here are two more ways of doing it:
print "Enter 7 characters: ";
chomp(my $i= <STDIN>);
say "reverse: ", pal_reverse($i) ? "yes" : "no";
say "regex: ", pal_regex($i) ? "yes" : "no";
sub pal_reverse {
my $i = (#_ ? shift : $_);
return $i eq reverse $i;
}
sub pal_regex {
return (#_ ? shift() : $_) =~ /^(.?|(.)(?1)\2)$/ + 0;
}
use strict;
use warnings;
use feature 'say';
print "Please enter 7 characters : ";
my $input = <>; # Read in input
chomp $input; # To remove trailing "\n"
# Season with input validation
warn 'Expected 7 characters, got ', length $input, ' instead'
unless length $input == 7;
# Determine if it's palindromic or not
say $input eq reverse $input
? 'Palindrome'
: 'Not palindrome' ;
TIMTOWTDI for the recursion-prone:
sub is_palindrome {
return 1 if length $_[0] < 2; # Whole string is palindromic
goto \&is_palindrome
if substr $_[0], 0, 1, '' eq substr $_[0], -1, 1, ''; # Check next chars
return; # Not palindromic if we reach here
}
say is_palindrome( 'ghghghg' ) ? 'Palindromic' : 'Not palindromic' ;
And perldoc perlretut for those who aren't :)
Recursive patterns
This feature (introduced in Perl 5.10) significantly extends the power
of Perl's pattern matching. By referring to some other capture group
anywhere in the pattern with the construct (?group-ref), the pattern
within the referenced group is used as an independent subpattern in
place of the group reference itself. Because the group reference may
be contained within the group it refers to, it is now possible to
apply pattern matching to tasks that hitherto required a recursive
parser.
To illustrate this feature, we'll design a pattern that matches if a
string contains a palindrome. (This is a word or a sentence that,
while ignoring spaces, interpunctuation and case, reads the same
backwards as forwards. We begin by observing that the empty string or
a string containing just one word character is a palindrome. Otherwise
it must have a word character up front and the same at its end, with
another palindrome in between.
/(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x
Adding \W* at either end to eliminate what is to be ignored, we
already have the full pattern:
my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
print "'$s' is a palindrome\n" if $s =~ /$pp/;
}