Perl : $_ and $array[$_] issue - perl

Below is my code , please review.
use strict;
my #people = qw{a b c d e f};
foreach (#people){
print $_,"$people[$_]\n";
}
Below is the output,
[~/perl]$ perl test.pl
aa #why the output of $people[$_] is not same with $_?
ba
ca
da
ea
fa
Thanks for your asking.

$_ is the actual element you're looking at. $people[$_] is getting the $_th element out of #people. It's intended to be used with numerical indices, so it numifies its argument. Taking a letter and "converting" it to a number just converts to zero since it can't parse a number out of it. So $people[$_] is just $people[0] when $_ is not a number. If you try the same experiment where some of the list elements are actually numerical, you'll get some more interesting results.
Try:
use strict;
my #people = qw{a b c 3 2 1};
foreach (#people){
print $_,"$people[$_]\n";
}
Output:
aa
ba
ca
33
2c
1b
Since, in the first three cases, we couldn't parse a, b, or c as numbers so we got the zeroth element. Then, we can actually convert 3, 2, and 1 to numbers, so we get elements 3, 2, and then 1.
EDIT: As mentioned in a comment by #ikegami, put use warnings at the top of your file in addition to use strict to get warned about this sort of thing.

Related

Sort the column values and search the value

Input to my script is this file which contains data as below.
A food 75
B car 136
A car 69
A house 179
B food 75
C car 136
C food 85
For each distinct value of the second column, I want to print any line where the number in the third column is different.
Example output
C food 85
A car 69
Here is my Perl code.
#! /usr/local/bin/perl
use strict;
use warning;
my %data = ();
open FILE, '<', 'data.txt' or die $!;
while ( <FILE> ) {
chomp;
$data{$1} = $2 while /\s*(\S+),(\S+)/g;
}
close FILE;
print $_, '-', $data{$_}, $/ for keys %data;
I am able to print the hash keys and values, but not able to get the desired output.
Any pointers on how to do that using Perl?
As far as I can tell from your question, you want a list of all the lines where there is an "odd one out" with the same item type and a different number in the third column from all the rest
I think this is what you need
It reads all the data into hash %data, so that $data{$type}{$n} is a (reference to an) array of all the data lines that use that object type and number
Then the hash is scanned again, looking for and printing all instances that have only a single line with the given type/number and where there are other values for the same object type (otherwise it would be the only entry and not an "odd one out")
use strict;
use warnings 'all';
use autodie;
my %data;
open my $fh, '<', 'data.txt';
while ( <$fh> ) {
my ( $label, $type, $n) = split;
push #{ $data{$type}{$n} }, $_;
}
for my $type ( keys %data ) {
my $items = $data{$type};
next unless keys %$items > 1;
for my $n ( keys %$items ) {
print $items->{$n}[0] if #{ $items->{$n} } == 1;
}
}
output
C food 85
A car 69
Note that this may print multiple lines for a given object type if the input looks like, say
B car 22
A car 33
B car 136
C car 136
This has two "odd ones out" that appear only once for the given object type, so both B car 22 and A car 33 will be printed
Here are the pointers:
First, you need to remember lines somewhere before outputting them.
Second, you need to discard previously remembered line for the object according to the rules you set.
In your case, the rule is to discard when the number for the object differs from the previous remembered.
Both tasks can be accomplished with the hash.
For each line:
my ($letter, $object, $number)=split /\s+/, $line;
if (!defined($hash{$object}) || $hash{$object}[0]!=$number) {
$hash{$object}=[$number, $line];
}
Third, you need to output the hash:
for my $object(keys %hash) {
print $hash{$object}[1];
}
But there is the problem: a hash is an unordered structure, it won't return its keys in the order you put them into the hash.
So, the fourth: you need to add the ordering to your hash data, which can be accomplished like this:
$hash{$object}=[$number,$line,$.]; # $. is the row number over all the input files or STDIN, we use it for sorting
And in the output part you sort with the stored row number
(see sort for details about $a, $b variables):
for my $object(sort { $hash{$a}[2]<=>$hash{$b}[2] } keys %hash) {
print $hash{$object}[1];
}
Regarding the comments
I am certain that my code does not contain any errors.
If we look at the question before it was edited by some high rep users, it states:
[cite]
Now where if Numeric column(Third) has different value (Where in 2nd column matches) ...Then print only the mismatched number line. example..
A food 75
B car 136
A car 69
A house 179
B food 75
B car 136
C food 85
Example output (As number columns are not matching)
C food 85
[/cite]
I can only interpret that print only the mismatched number line as: to print the last line for the object where the number changed. That clearly matches the example the OP provided.
Even so, in my answer I addressed the possibility of misinterpretation, by stating that line omitting is done according to whatever rules the OP wants.
And below that I indicated what was the rule by that time in my opinion.
I think it well addressed the OP problem, because, after all, the OP wanted the pointers.
And now my answer is critiqued because it does not match the edited (long after and not by OP) requirements.
I disagree.
Regarding the whitespace: specifying /\s+/ for split is not an error here, despite of some comments trying to assert that.
While I agree that " " is common for split, I would disagree that there are a lot of cases where you must use " " instead of /\s+/.
/\s+/ is a regular expression which is the conventional argument for split, while " " is the shorthand, that actually masks the meaning.
With that I decided to use explicit split /\s+/, $line in my example instead of just split " ", $line or just split specifically to show the innerworkings of perl.
I think it is important to any one new to perl.
It is perfectly ok to use /\s+/, but be careful if you expect to have leading whitespace in your data, consult perldoc -f split and decide whether /\s+/ suits your needs or not.

Splitting an Array into n accessible parts within perl?

My goal is to take an array of letters and cut it up into "n" parts. In this case no more than 10 letters each piece. But I want these arrays to be stored into an array reference which I can access on a counter.
For example, I have the following script to split an array of English alphabetical letters into 1 array of 10 letters. But since the English Alphabet has 26 letters, I need 2 more arrays to access in an array reference.
#!/usr/bin/env perl
#split an array into parts.
use strict;
use warnings;
use feature 'say';
my #letters = ('A' .. 'Z');
say "These are my letters:";
for(#letters){print "$_ ";}
my #letters_selected = splice(#letters, 0, 10);
say "\nThese are my selected letters:";
for(#letters_selected){print "$_ ";}
The output is this:
These are my letters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
These are my selected letters:
A B C D E F G H I J
This little script only gives me one piece of 10 letters of the alphabet. But I want all three pieces of 10 letters of the alphabet, so I would like to know how I can achieve this:
Goal:
Have an array reference called letters_selected of letters which contains all letters A - Z. But ... I can access all three pieces of size less than or equal to 10 letters like this.
foreach(#{$letters_selected[0]}){say "$_ ";}
returns: A B C D E F G H I J # These are the initial 10 elements of the alphabet.
foreach(#{$letters_selected[1]}){say "$_ ";}
returns: K L M N O P Q R S T # The next 10 after that.
foreach(#{$letters_selected[2]}){say "$_ ";}
returns: U V W X Y Z # The next no more than 10 after that.
Since splice is destructive to its target you can keep applying it
use warnings;
use strict;
use feature 'say';
my #letters = 'A'..'Z';
my #letter_groups;
push #letter_groups, [ splice #letters, 0, 10 ] while #letters;
say "#$_" for #letter_groups;
After this #letters is empty. So make a copy of it and work with that if you will need it.
Every time through, splice removes and returns elements from #letters and [ ] makes an anonymous array of that list. This reference is pushed on #letter_groups.
Since splice takes as many elements as there are (if there aren't 10) once fewer than 10 remain splice removes and returns that, the #letters gets emptied, and while terminates.

Perl forming string random string combination

I have a file with around 25000 records, each records has more than 13 entries are drug names. I want to form all the possible pair combination for these entries. Eg: if a line has three records A, B, C. I should form combinations as 1) A B 2) A C 3)B C. Below is the code I got from internet, it works only if a single line is assigned to an array:
use Math::Combinatorics;
my #n = qw(a b c);
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
The code I am using, it doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
chomp $_;
#Drugs = split /\t/, $_;
#n = $Drugs[1];
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
print "\n";
}
Can you please suggest me a solution to this problem?
You're setting #n to be an array containing the second value of the #Drugs array, try just using data => \#Drugs in the Math::Combinatorics constructor.
Also, use strict; use warnings; blahblahblah.
All pairs from an array are straightforward to compute. Using drugs A, B, and C as from your question, you might think of them forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
my #a = #_;
my #pairs;
foreach my $i (0 .. $#a) {
foreach my $j ($i+1 .. $#a) {
push #pairs, [ #a[$i,$j] ];
}
}
wantarray ? #pairs : \#pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
print "#$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium
I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.

How to parse a file, create records and perform manipulations on records including frequency of terms and distance calculations

I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages forums. I am not advanced with Perl subs or modules (including Bioperl) so please limit responses to an appropriate 'beginner level' so that I can understand and learn from your suggestions and/or code (also limit "Magic" please).
The requirements of the program are as follows:
Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:
• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )
Prompt for one of three commands: freq, length, density d (d is some number):
• freq - how many of each type of atom is in the file (example Nitrogen, Sodium, etc would be displayed like this: N: 918 S: 23
• length - The distances among coordinates
• density d (where d is a number) - program will prompt for the name of a file to save computations to and will containing the distance between that atom and every other atom. If that distance is less than or equal to the number d, it increments the count of the number of atoms that are within that distance, unless that count is zero into the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.
I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.
The program structure and function descriptions as I see it:
$^W = 1; # turn on warnings
use strict; # behave!
my #fields;
my #recs;
while ( <DATA> ) {
chomp;
#fields = split(/\s+/);
push #recs, makeRecord(#fields);
}
for (my $i = 0; $i < #recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
freq => \&freq,
length => \&length,
density => \&density,
help => \&help,
quit => \&quit
);
print "Enter a command: ";
while ( <STDIN> ) {
chomp;
my #line = split( /\s+/);
my $command = shift #line;
if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
print "Command must be: freq, length, density or quit\n";
}
else {
$command_table{$command}->();
}
print "Enter a command: ";
}
sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
my %record =
(
serialnumber => shift,
aminoacid => shift,
coordinates => shift,
element => [ #_ ]
);
return\%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequences of
# these atoms in the array.
sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.
sub density
# take an array of atom records and a number d and will return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
print "To use this program, type either\n",
"freq\n",
"length\n",
"density followed by a number, d,\n",
"help\n",
"quit\n";
}
sub quit
{
exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.
Enable warnings with use warnings rather than $^W = 1. The former is self-documenting and has the advantage being local to the enclosing block rather than being a global setting.
Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special $_. For example:
while (my $input_record = <DATA>){
}
In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your length subroutine a different name. That name is already used by the built-in length function.
Regarding your question about makeRecord, one approach is to ignore the filtering issue inside makeRecord. Instead, makeRecord could include an additional hash field, and the filtering logic would reside elsewhere. For example:
my $record = makeRecord(#fields);
push #recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
An illustration of some of the points above:
use strict;
use warnings;
run();
sub run {
my $atom_data = load_atom_data();
print_records($atom_data);
interact_with_user($atom_data);
}
...
sub interact_with_user {
my $atom_data = shift;
my %command_table = (...);
while (1){
print "Enter a command: ";
chomp(my $reply = <STDIN>);
my ($command, #line) = split /\s+/, $reply;
if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
# Run the command.
}
else {
# Print usage message for user.
}
}
}
...
FM's answer is pretty good. I'll just mention a couple of additional things:
You already have a hash with the valid commands (which is a good idea). There's no need to duplicate that list in a regex. I'd do something like this:
if (my $routine = $command_table{$command}) {
$routine->(#line);
} else {
print "Command must be: freq, length, density or quit\n";
}
Notice I'm also passing #line to the subroutine, because you'll need that for the density command. Subroutines that don't take arguments can just ignore them.
You could also generate the list of valid commands for the error message by using keys %command_table, but I'll leave that as an exercise for you.
Another thing is that the description of the input file mentions column numbers, which suggests that it's a fixed-width format. That's better parsed with substr or unpack. If a field is ever blank or contains a space, then your split will not parse it correctly. (If you use substr, be aware that it numbers columns starting at 0, when people often label the first column 1.)

Why do I get an error when I try to use the reptition assignment operator with an array?

#!/usr/bin/perl
use strict;
use warnings;
my #a = qw/a b c/;
(#a) x= 3;
print join(", ", #a), "\n";
I would expect the code above to print "a, b, c, a, b, c, a, b, c\n", but instead it dies with the message:
Can't modify private array in repeat (x) at z.pl line 7, near "3;"
This seems odd because the X <op>= Y are documented as being equivalent to X = X <op> Y, and the following code works as I expect it to:
#!/usr/bin/perl
use strict;
use warnings;
my #a = qw/a b c/;
(#a) = (#a) x 3;
print join(", ", #a), "\n";
Is this a bug in Perl or am I misunderstanding what should happen here?
My first thought was that it was a misunderstanding of some subtlety on Perl's part, namely that the parens around #a made it parse as an attempt to assign to a list. (The list itself, not normal list assignment.) That conclusion seems to be supported by perldiag:
Can't modify %s in %s
(F) You aren't allowed to assign to the item indicated, or otherwise try to
change it, such as with an auto-increment.
Apparently that's not the case, though. If it were this should have the same error:
($x) x= 3; # ok
More conclusively, this gives the same error:
#a x= 3; # Can't modify private array in repeat...
Ergo, definitely a bug. File it.
My guess is that Perl is not a language with full symbolic transformations. It tries to figure out what you mean. If you "list-ify" #a, by putting it in parens, it sort of loses what you wanted to assign it to.
Notice that this does not do what we want:
my #b = #a x 3; # we'll get scalar( #a ) --> '3' x 3 --> '333'
But, this does:
my #b = ( #a ) x 3;
As does:
( #a ) = ( #a ) x 3;
So it seems that when the expression actally appears on both sides Perl interprets them in different contexts. It knows that we're assigning something, so it tries to find out what we're assigning to.
I'd chalk it up to a bug, from a very seldom used syntax.
The problem is that you're trying to modify #a in place, which Perl evidently doesn't allow you to do. Your second example is doing something subtly different, which is to create a new array that consists of #a repeated three times, then overwriting #a with that value.
Arguably the first form should be transparently translated to the second form, but that isn't what actually happens. You could consider this a bug... file it in the appropriate places and see what happens.