Sort the column values and search the value - perl

Input to my script is this file which contains data as below.
A food 75
B car 136
A car 69
A house 179
B food 75
C car 136
C food 85
For each distinct value of the second column, I want to print any line where the number in the third column is different.
Example output
C food 85
A car 69
Here is my Perl code.
#! /usr/local/bin/perl
use strict;
use warning;
my %data = ();
open FILE, '<', 'data.txt' or die $!;
while ( <FILE> ) {
chomp;
$data{$1} = $2 while /\s*(\S+),(\S+)/g;
}
close FILE;
print $_, '-', $data{$_}, $/ for keys %data;
I am able to print the hash keys and values, but not able to get the desired output.
Any pointers on how to do that using Perl?

As far as I can tell from your question, you want a list of all the lines where there is an "odd one out" with the same item type and a different number in the third column from all the rest
I think this is what you need
It reads all the data into hash %data, so that $data{$type}{$n} is a (reference to an) array of all the data lines that use that object type and number
Then the hash is scanned again, looking for and printing all instances that have only a single line with the given type/number and where there are other values for the same object type (otherwise it would be the only entry and not an "odd one out")
use strict;
use warnings 'all';
use autodie;
my %data;
open my $fh, '<', 'data.txt';
while ( <$fh> ) {
my ( $label, $type, $n) = split;
push #{ $data{$type}{$n} }, $_;
}
for my $type ( keys %data ) {
my $items = $data{$type};
next unless keys %$items > 1;
for my $n ( keys %$items ) {
print $items->{$n}[0] if #{ $items->{$n} } == 1;
}
}
output
C food 85
A car 69
Note that this may print multiple lines for a given object type if the input looks like, say
B car 22
A car 33
B car 136
C car 136
This has two "odd ones out" that appear only once for the given object type, so both B car 22 and A car 33 will be printed

Here are the pointers:
First, you need to remember lines somewhere before outputting them.
Second, you need to discard previously remembered line for the object according to the rules you set.
In your case, the rule is to discard when the number for the object differs from the previous remembered.
Both tasks can be accomplished with the hash.
For each line:
my ($letter, $object, $number)=split /\s+/, $line;
if (!defined($hash{$object}) || $hash{$object}[0]!=$number) {
$hash{$object}=[$number, $line];
}
Third, you need to output the hash:
for my $object(keys %hash) {
print $hash{$object}[1];
}
But there is the problem: a hash is an unordered structure, it won't return its keys in the order you put them into the hash.
So, the fourth: you need to add the ordering to your hash data, which can be accomplished like this:
$hash{$object}=[$number,$line,$.]; # $. is the row number over all the input files or STDIN, we use it for sorting
And in the output part you sort with the stored row number
(see sort for details about $a, $b variables):
for my $object(sort { $hash{$a}[2]<=>$hash{$b}[2] } keys %hash) {
print $hash{$object}[1];
}
Regarding the comments
I am certain that my code does not contain any errors.
If we look at the question before it was edited by some high rep users, it states:
[cite]
Now where if Numeric column(Third) has different value (Where in 2nd column matches) ...Then print only the mismatched number line. example..
A food 75
B car 136
A car 69
A house 179
B food 75
B car 136
C food 85
Example output (As number columns are not matching)
C food 85
[/cite]
I can only interpret that print only the mismatched number line as: to print the last line for the object where the number changed. That clearly matches the example the OP provided.
Even so, in my answer I addressed the possibility of misinterpretation, by stating that line omitting is done according to whatever rules the OP wants.
And below that I indicated what was the rule by that time in my opinion.
I think it well addressed the OP problem, because, after all, the OP wanted the pointers.
And now my answer is critiqued because it does not match the edited (long after and not by OP) requirements.
I disagree.
Regarding the whitespace: specifying /\s+/ for split is not an error here, despite of some comments trying to assert that.
While I agree that " " is common for split, I would disagree that there are a lot of cases where you must use " " instead of /\s+/.
/\s+/ is a regular expression which is the conventional argument for split, while " " is the shorthand, that actually masks the meaning.
With that I decided to use explicit split /\s+/, $line in my example instead of just split " ", $line or just split specifically to show the innerworkings of perl.
I think it is important to any one new to perl.
It is perfectly ok to use /\s+/, but be careful if you expect to have leading whitespace in your data, consult perldoc -f split and decide whether /\s+/ suits your needs or not.

Related

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match as they contain the same data in a diffrent order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparision for your fields, but after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
chomp;
my #fields = split;
# create the normalized representation by lowercasing and sorting the fields
my #normalized_fields = sort map lc, #fields;
# find or create the path in the lookup
my $pointer = \$cache;
$pointer = \${$pointer}->{$_} for #normalized_fields;
# if this is an unknow value, make it known, and output the line
unless (defined $$pointer) {
$$pointer = 1; # set some defined value
print "$_\n"; # emit the unique line
}
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-seperated values here, but you could replace the split with a call to Text::CSV or something.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for #normalized_fields;
unless (defined $$pointer->[1]) {
$$pointer->[1] = 1; ...
}
or introduce a default value for nonexistant field (e.g. the seperator of the original file). Here an example with the NUL character:
my $fields = 3;
...;
die "record too long" if #fields > $fields;
...; # make normalized fields
push #normalized_fields, ("\x00") x ($fields - #normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
chomp;
my $key = join ',', sort map lc, split /,/;
push #{$data{$key}}, $.;
}
foreach my $list (values %data) {
next unless #$list > 1;
print "Lines ", join(', ', #$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
Make two hash tables A and B
Stream through your input one line at a time
For the first line pair x and y, use each as key and the other as value for both hash tables (e.g., $A->{x} = y; $B->{y} = x;)
For the second and subsequent line pairs, test if the second field's value exists as a key for either A or B — if it does, you have a reverse match — if not, then repeat the addition process from step 3 to add it to the hash tables
To do a version of amon's answer without a hash table, if your data are numerical, you could:
Stream through input line by line, sorting fields one and two by numerical ordering
Pipe result to UNIX sort on first and second fields
Stream through sorted output line by line, checking if current line matches the previous line (reporting a reverse match, if true)
This has the advantage of using less memory than hash tables, but may take more time to process.
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
warn "Found duplicate: $_" if $_ =~ $re;
next unless /^(.*),(.*)$/;
die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
$re = qr/^\Q$2,$1\E$|$re/
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

Perl forming string random string combination

I have a file with around 25000 records, each records has more than 13 entries are drug names. I want to form all the possible pair combination for these entries. Eg: if a line has three records A, B, C. I should form combinations as 1) A B 2) A C 3)B C. Below is the code I got from internet, it works only if a single line is assigned to an array:
use Math::Combinatorics;
my #n = qw(a b c);
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
The code I am using, it doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
chomp $_;
#Drugs = split /\t/, $_;
#n = $Drugs[1];
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
print "\n";
}
Can you please suggest me a solution to this problem?
You're setting #n to be an array containing the second value of the #Drugs array, try just using data => \#Drugs in the Math::Combinatorics constructor.
Also, use strict; use warnings; blahblahblah.
All pairs from an array are straightforward to compute. Using drugs A, B, and C as from your question, you might think of them forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
my #a = #_;
my #pairs;
foreach my $i (0 .. $#a) {
foreach my $j ($i+1 .. $#a) {
push #pairs, [ #a[$i,$j] ];
}
}
wantarray ? #pairs : \#pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
print "#$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium
I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven´t been able to make it work. Eventually i've found a way that works:
my #atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (#atoms,$atomi);
$natoms++;
}
#atoms = sort {$b cmp $a} #atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that sometimes input files are not always identical and they make contain white spaces before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated on the split command and I'm having some trouble to remove it. This is what I have done:
#info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (#info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array #info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my #tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', #tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc split)
Also, the first part of your program might be simpler as:
my #atoms = sort {$b cmp $a} keys %$atome;
my $natoms = #atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my #atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
if (grep { /$atom_id/ } #atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}
Lets warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my #atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar #atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly by the count of values in #atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
for your second question you could use this line:
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read alle the lines of the file and use them in a hash with $info[0] as the key and [#info[1..8]] as the value. Then you can lookup the entries by your index.
my %details;
while (<IN_LIG>) {
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ #info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).
thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. thanks!
As for the regex, I found something that could make all keys as part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, but I liked the join/quotemeta solution better
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys

How to parse a file, create records and perform manipulations on records including frequency of terms and distance calculations

I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages forums. I am not advanced with Perl subs or modules (including Bioperl) so please limit responses to an appropriate 'beginner level' so that I can understand and learn from your suggestions and/or code (also limit "Magic" please).
The requirements of the program are as follows:
Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:
• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )
Prompt for one of three commands: freq, length, density d (d is some number):
• freq - how many of each type of atom is in the file (example Nitrogen, Sodium, etc would be displayed like this: N: 918 S: 23
• length - The distances among coordinates
• density d (where d is a number) - program will prompt for the name of a file to save computations to and will containing the distance between that atom and every other atom. If that distance is less than or equal to the number d, it increments the count of the number of atoms that are within that distance, unless that count is zero into the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.
I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.
The program structure and function descriptions as I see it:
$^W = 1; # turn on warnings
use strict; # behave!
my #fields;
my #recs;
while ( <DATA> ) {
chomp;
#fields = split(/\s+/);
push #recs, makeRecord(#fields);
}
for (my $i = 0; $i < #recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
freq => \&freq,
length => \&length,
density => \&density,
help => \&help,
quit => \&quit
);
print "Enter a command: ";
while ( <STDIN> ) {
chomp;
my #line = split( /\s+/);
my $command = shift #line;
if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
print "Command must be: freq, length, density or quit\n";
}
else {
$command_table{$command}->();
}
print "Enter a command: ";
}
sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
my %record =
(
serialnumber => shift,
aminoacid => shift,
coordinates => shift,
element => [ #_ ]
);
return\%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequences of
# these atoms in the array.
sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.
sub density
# take an array of atom records and a number d and will return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
print "To use this program, type either\n",
"freq\n",
"length\n",
"density followed by a number, d,\n",
"help\n",
"quit\n";
}
sub quit
{
exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.
Enable warnings with use warnings rather than $^W = 1. The former is self-documenting and has the advantage being local to the enclosing block rather than being a global setting.
Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special $_. For example:
while (my $input_record = <DATA>){
}
In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your length subroutine a different name. That name is already used by the built-in length function.
Regarding your question about makeRecord, one approach is to ignore the filtering issue inside makeRecord. Instead, makeRecord could include an additional hash field, and the filtering logic would reside elsewhere. For example:
my $record = makeRecord(#fields);
push #recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
An illustration of some of the points above:
use strict;
use warnings;
run();
sub run {
my $atom_data = load_atom_data();
print_records($atom_data);
interact_with_user($atom_data);
}
...
sub interact_with_user {
my $atom_data = shift;
my %command_table = (...);
while (1){
print "Enter a command: ";
chomp(my $reply = <STDIN>);
my ($command, #line) = split /\s+/, $reply;
if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
# Run the command.
}
else {
# Print usage message for user.
}
}
}
...
FM's answer is pretty good. I'll just mention a couple of additional things:
You already have a hash with the valid commands (which is a good idea). There's no need to duplicate that list in a regex. I'd do something like this:
if (my $routine = $command_table{$command}) {
$routine->(#line);
} else {
print "Command must be: freq, length, density or quit\n";
}
Notice I'm also passing #line to the subroutine, because you'll need that for the density command. Subroutines that don't take arguments can just ignore them.
You could also generate the list of valid commands for the error message by using keys %command_table, but I'll leave that as an exercise for you.
Another thing is that the description of the input file mentions column numbers, which suggests that it's a fixed-width format. That's better parsed with substr or unpack. If a field is ever blank or contains a space, then your split will not parse it correctly. (If you use substr, be aware that it numbers columns starting at 0, when people often label the first column 1.)

Reading a large file into Perl array of arrays and manipulating the output for different purposes

I am relatively new to Perl and have only used it for converting small files into different formats and feeding data between programs.
Now, I need to step it up a little. I have a file of DNA data that is 5,905 lines long, with 32 fields per line. The fields are not delimited by anything and vary in length within the line, but each field is the same size on all 5905 lines.
I need each line fed into a separate array from the file, and each field within the line stored as its own variable. I am having no problems storing one line, but I am having difficulties storing each line successively through the entire file.
This is how I separate the first line of the full array into individual variables:
my $SampleID = substr("#HorseArray", 0, 7);
my $PopulationID = substr("#HorseArray", 9, 4);
my $Allele1A = substr("#HorseArray", 14, 3);
my $Allele1B = substr("#HorseArray", 17, 3);
my $Allele2A = substr("#HorseArray", 21, 3);
my $Allele2B = substr("#HorseArray", 24, 3);
...etc.
My issues are: 1) I need to store each of the 5905 lines as a separate array. 2) I need to be able to reference each line based on the sample ID, or a group of lines based on population ID and sort them.
I can sort and manipulate the data fine once it is defined in variables, I am just having trouble constructing a multidimensional array with each of these fields so I can reference each line at will. Any help or direction is much appreciated. I've poured over the Q&A sections on here, but have not found the answer to my questions yet.
Do not store each line in it's own array. You need to construct a data structure. Start by reading the following tutorials form perldoc:
perlreftut
perldsc
perllol
Here's some starter code:
use strict;
use warnings;
# Array of data samples. We could use a hash as well; which is better
# depends on how you want to use the data.
my #sample;
while (my $line = <DATA>) {
chomp $line;
# Parse the input line
my ($sample_id, $population_id, $rest) = split(/\s+/, $line, 3);
# extract A/B allele pairs
my #pairs;
while ($rest =~ /(\d{1,3})(\d{3})|(\d{1,3}) (\d{1,2})/g) {
push #pairs, {
A => defined $1 ? $1 : $3,
B => defined $2 ? $2 : $4,
};
}
# Add this sample to the list of samples. Store it as a hashref so
# we can access attributes by name
push #sample, {
sample => $sample_id,
population => $population_id,
alleles => \#pairs,
};
}
# Print out all the values of alleles 2A and 2B for the samples in
# population py18. Note that array indexing starts at 0, so allele 2
# is at index 1.
foreach my $sample (grep { $_->{population} eq 'py18' } #sample) {
printf("%s: %d / %d\n",
$sample->{sample},
$sample->{alleles}[1]{A},
$sample->{alleles}[1]{B},
);
}
__DATA__
00292-97 py17 97101 129129 152164 177177 100100 134136 163165 240246 105109 124124 166166 292292 000000 000000 000000
00293-97 py18 89 97 129139 148154 179179 84 90 132134 167169 222222 105105 126128 164170 284292 000000 000000 000000
00294-97 py17 91 97 129133 152154 177183 100100 134140 161163 240240 103105 120128 164166 290292 000000 000000 000000
00295-97 py18 97 97 131133 148162 177179 84100 132134 161167 240252 111111 124128 164166 284290 000000 000000 000000
I'd start by looping through the lines and parsing each into a hash of fields, and I'd build a hash for each index along the way.
my %by_sample_id; # this will be a hash of hashes
my %by_population_id; # a hash of lists of hashes
foreach (<FILEHANDLE>) {
chomp; # remove newline
my %h; # new hash
$h{SampleID} = substr($_, 0, 7);
$h{PopulationID} = substr($_, 9, 4);
# etc...
$by_sample_id{ $h{SampleID} } = \%h; # a reference to %h
push #{$by_population_id{ $h{PopulationID} }}, \%h; # pushes hashref onto list
}
Then, you can use either index to access the data in which you're interested:
say "Allele1A for sample 123123: ", $by_sample_id{123123}->{Allele1A};
say "all the Allele1A values for population 432432: ",
join(", ", map {$_->{Allele1A}} #{$by_population_id{432432}});
I'm going to assume this isn't a one-off program, so my approach would be slightly different.
I've done a fair amount of data-mashing, and after a while, I get tired of writing queries against data structures.
So -
I would feed the data into a SQLite database(or other sql DB), and then write Perl queries off of that, using Perl DBI. This cranks up the complexity to well past a simple 'parse-and-hack', but after you've written several scripts doing queries on the same data, it becomes obvious that this is a pain, there must be a better way.
You would have a schema that looks similar to this
create table brians_awesome_data (id integer, population_id varchar(32), chunk1 integer, chunk2 integer...);
Then, after you used some of mobrule and Michael's excellent parsing, you'd loop and do some INSERT INTO your awesome_data table.
Then, you could use a CLI for your SQL program and do "select ... where ..." queries to quickly get the data you need.
Or, if it's more analytical/pipeliney, you could Perl up a script with DBI and get the data into your analysis routines.
Trust me, this is the better way to do it than writing queries against data structures over and over.