Reading a large file into Perl array of arrays and manipulating the output for different purposes - perl

I am relatively new to Perl and have only used it for converting small files into different formats and feeding data between programs.
Now, I need to step it up a little. I have a file of DNA data that is 5,905 lines long, with 32 fields per line. The fields are not delimited by anything and vary in length within the line, but each field is the same size on all 5905 lines.
I need each line fed into a separate array from the file, and each field within the line stored as its own variable. I am having no problems storing one line, but I am having difficulties storing each line successively through the entire file.
This is how I separate the first line of the full array into individual variables:
my $SampleID = substr("@HorseArray", 0, 7);
my $PopulationID = substr("@HorseArray", 9, 4);
my $Allele1A = substr("@HorseArray", 14, 3);
my $Allele1B = substr("@HorseArray", 17, 3);
my $Allele2A = substr("@HorseArray", 21, 3);
my $Allele2B = substr("@HorseArray", 24, 3);
...etc.
My issues are: 1) I need to store each of the 5905 lines as a separate array. 2) I need to be able to reference each line based on the sample ID, or a group of lines based on population ID and sort them.
I can sort and manipulate the data fine once it is defined in variables; I am just having trouble constructing a multidimensional array with each of these fields so I can reference each line at will. Any help or direction is much appreciated. I've pored over the Q&A sections on here, but have not found the answer to my questions yet.

Do not store each line in its own array. You need to construct a data structure. Start by reading the following tutorials from perldoc:
perlreftut
perldsc
perllol
Here's some starter code:
use strict;
use warnings;
# Array of data samples. We could use a hash as well; which is better
# depends on how you want to use the data.
my @sample;
while (my $line = <DATA>) {
    chomp $line;
    # Parse the input line
    my ($sample_id, $population_id, $rest) = split(/\s+/, $line, 3);
    # extract A/B allele pairs
    my @pairs;
    while ($rest =~ /(\d{1,3})(\d{3})|(\d{1,3}) (\d{1,2})/g) {
        push @pairs, {
            A => defined $1 ? $1 : $3,
            B => defined $2 ? $2 : $4,
        };
    }
    # Add this sample to the list of samples. Store it as a hashref so
    # we can access attributes by name
    push @sample, {
        sample     => $sample_id,
        population => $population_id,
        alleles    => \@pairs,
    };
}
# Print out all the values of alleles 2A and 2B for the samples in
# population py18. Note that array indexing starts at 0, so allele 2
# is at index 1.
foreach my $sample (grep { $_->{population} eq 'py18' } @sample) {
    printf("%s: %d / %d\n",
        $sample->{sample},
        $sample->{alleles}[1]{A},
        $sample->{alleles}[1]{B},
    );
}
__DATA__
00292-97 py17 97101 129129 152164 177177 100100 134136 163165 240246 105109 124124 166166 292292 000000 000000 000000
00293-97 py18 89 97 129139 148154 179179 84 90 132134 167169 222222 105105 126128 164170 284292 000000 000000 000000
00294-97 py17 91 97 129133 152154 177183 100100 134140 161163 240240 103105 120128 164166 290292 000000 000000 000000
00295-97 py18 97 97 131133 148162 177179 84100 132134 161167 240252 111111 124128 164166 284290 000000 000000 000000
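If I've traced the parsing correctly, running this starter code against the sample data should print:
00293-97: 129 / 139
00295-97: 131 / 133
since 00293-97 and 00295-97 are the two py18 samples, and index 1 selects their second allele pair.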

I'd start by looping through the lines and parsing each into a hash of fields, and I'd build a hash for each index along the way.
my %by_sample_id;     # this will be a hash of hashes
my %by_population_id; # a hash of lists of hashes
foreach (<FILEHANDLE>) {
    chomp;  # remove newline
    my %h;  # new hash
    $h{SampleID} = substr($_, 0, 7);
    $h{PopulationID} = substr($_, 9, 4);
    # etc...
    $by_sample_id{ $h{SampleID} } = \%h;  # a reference to %h
    push @{ $by_population_id{ $h{PopulationID} } }, \%h;  # pushes hashref onto list
}
Then, you can use either index to access the data in which you're interested:
say "Allele1A for sample 123123: ", $by_sample_id{123123}->{Allele1A};
say "all the Allele1A values for population 432432: ",
join(", ", map {$_->{Allele1A}} #{$by_population_id{432432}});

I'm going to assume this isn't a one-off program, so my approach would be slightly different.
I've done a fair amount of data-mashing, and after a while, I get tired of writing queries against data structures.
So -
I would feed the data into a SQLite database (or another SQL database) and then write Perl queries against it using the Perl DBI. This cranks up the complexity well past a simple 'parse-and-hack', but after you've written several scripts doing queries on the same data structures, it becomes obvious that that is a pain and there must be a better way.
You would have a schema that looks similar to this
create table brians_awesome_data (id integer, population_id varchar(32), chunk1 integer, chunk2 integer...);
Then, after you used some of mobrule and Michael's excellent parsing, you'd loop and do some INSERT INTO your awesome_data table.
Then, you could use a CLI for your SQL program and do "select ... where ..." queries to quickly get the data you need.
Or, if it's more analytical/pipeliney, you could Perl up a script with DBI and get the data into your analysis routines.
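To make this concrete, here is a minimal sketch of the whole round trip with DBI and SQLite. The database file name and the example row are illustrative, and it assumes the DBD::SQLite driver is installed:
use strict;
use warnings;
use DBI;
# Connect to (or create) the SQLite database file.
my $dbh = DBI->connect('dbi:SQLite:dbname=awesome.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS brians_awesome_data
          (id integer, population_id varchar(32), chunk1 integer, chunk2 integer)');
# The INSERT INTO step: one row per parsed record.
my $ins = $dbh->prepare('INSERT INTO brians_awesome_data VALUES (?, ?, ?, ?)');
$ins->execute(292, 'py17', 97, 101);
# The "select ... where ..." step, done from Perl rather than the CLI.
my $rows = $dbh->selectall_arrayref(
    'SELECT id, chunk1, chunk2 FROM brians_awesome_data WHERE population_id = ?',
    {}, 'py17');
printf "%d: %d / %d\n", @$_ for @$rows;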
Trust me, this is the better way to do it than writing queries against data structures over and over.

Related

Sort the column values and search the value

Input to my script is this file which contains data as below.
A food 75
B car 136
A car 69
A house 179
B food 75
C car 136
C food 85
For each distinct value of the second column, I want to print any line where the number in the third column is different.
Example output
C food 85
A car 69
Here is my Perl code.
#! /usr/local/bin/perl
use strict;
use warnings;
my %data = ();
open FILE, '<', 'data.txt' or die $!;
while ( <FILE> ) {
    chomp;
    $data{$1} = $2 while /\s*(\S+),(\S+)/g;
}
close FILE;
print $_, '-', $data{$_}, $/ for keys %data;
I am able to print the hash keys and values, but not able to get the desired output.
Any pointers on how to do that using Perl?
As far as I can tell from your question, you want a list of all the lines where there is an "odd one out" with the same item type and a different number in the third column from all the rest
I think this is what you need
It reads all the data into hash %data, so that $data{$type}{$n} is a (reference to an) array of all the data lines that use that object type and number
Then the hash is scanned again, looking for and printing all instances that have only a single line with the given type/number and where there are other values for the same object type (otherwise it would be the only entry and not an "odd one out")
use strict;
use warnings 'all';
use autodie;
my %data;
open my $fh, '<', 'data.txt';
while ( <$fh> ) {
    my ( $label, $type, $n ) = split;
    push @{ $data{$type}{$n} }, $_;
}
for my $type ( keys %data ) {
    my $items = $data{$type};
    next unless keys %$items > 1;
    for my $n ( keys %$items ) {
        print $items->{$n}[0] if @{ $items->{$n} } == 1;
    }
}
output
C food 85
A car 69
Note that this may print multiple lines for a given object type if the input looks like, say
B car 22
A car 33
B car 136
C car 136
This has two "odd ones out" that appear only once for the given object type, so both B car 22 and A car 33 will be printed
Here are the pointers:
First, you need to remember lines somewhere before outputting them.
Second, you need to discard previously remembered line for the object according to the rules you set.
In your case, the rule is to discard when the number for the object differs from the previous remembered.
Both tasks can be accomplished with the hash.
For each line:
my ($letter, $object, $number) = split /\s+/, $line;
if (!defined($hash{$object}) || $hash{$object}[0] != $number) {
    $hash{$object} = [$number, $line];
}
Third, you need to output the hash:
for my $object (keys %hash) {
    print $hash{$object}[1];
}
But there is a problem: a hash is an unordered structure; it won't return its keys in the order you put them in.
So, the fourth: you need to add the ordering to your hash data, which can be accomplished like this:
$hash{$object}=[$number,$line,$.]; # $. is the row number over all the input files or STDIN, we use it for sorting
And in the output part you sort with the stored row number
(see sort for details about $a, $b variables):
for my $object (sort { $hash{$a}[2] <=> $hash{$b}[2] } keys %hash) {
    print $hash{$object}[1];
}
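Assembled into a complete, runnable sketch (reading from a DATA section for demonstration), the fragments above look like this; it prints the last remembered line for each object, in input order, per the interpretation discussed below:
#!/usr/bin/perl
use strict;
use warnings;
my %hash;
while (my $line = <DATA>) {
    my ($letter, $object, $number) = split /\s+/, $line;
    if (!defined($hash{$object}) || $hash{$object}[0] != $number) {
        $hash{$object} = [$number, $line, $.];  # $. records the input order
    }
}
for my $object (sort { $hash{$a}[2] <=> $hash{$b}[2] } keys %hash) {
    print $hash{$object}[1];
}
__DATA__
A food 75
B car 136
A car 69
A house 179
B food 75
C car 136
C food 85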
Regarding the comments
I am certain that my code does not contain any errors.
If we look at the question before it was edited by some high rep users, it states:
[cite]
Now where if Numeric column(Third) has different value (Where in 2nd column matches) ...Then print only the mismatched number line. example..
A food 75
B car 136
A car 69
A house 179
B food 75
B car 136
C food 85
Example output (As number columns are not matching)
C food 85
[/cite]
I can only interpret "print only the mismatched number line" as: print the last line for the object where the number changed. That clearly matches the example the OP provided.
Even so, in my answer I addressed the possibility of misinterpretation by stating that lines are omitted according to whatever rules the OP wants.
And below that I indicated what, in my opinion, the rule was at that time.
I think it addressed the OP's problem well, because, after all, the OP wanted pointers.
And now my answer is critiqued because it does not match the requirements as edited (long after, and not by the OP).
I disagree.
Regarding the whitespace: specifying /\s+/ for split is not an error here, despite some comments asserting that it is.
While I agree that " " is common for split, I would disagree that there are a lot of cases where you must use " " instead of /\s+/.
/\s+/ is a regular expression, which is the conventional argument for split, while " " is a shorthand that actually masks the meaning.
With that in mind, I decided to use an explicit split /\s+/, $line in my example, instead of just split " ", $line or just split, specifically to show the inner workings of Perl.
I think that is important to anyone new to Perl.
It is perfectly ok to use /\s+/, but be careful if you expect to have leading whitespace in your data, consult perldoc -f split and decide whether /\s+/ suits your needs or not.
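The difference shows up with leading whitespace; a one-line demonstration:
my @a = split /\s+/, "  A food 75";  # ("", "A", "food", "75") - leading empty field
my @b = split " ",   "  A food 75";  # ("A", "food", "75")     - leading whitespace stripped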

Shortest Perl solution for outputting 4 random words

I have this one-line Unix shell script
for i in 1 2 3 4; do
    sed "$(tr -dc '0-9' < /dev/urandom | fold -w 5 | awk '$0>=35&&$0<=65570' | head -1)q;d" "$0"
done | perl -p00e 's/\n(?!\Z)/ /g'
The script has 65K words in it, one per line, from line 35 to 65570. The code and the data are in the same file.
This script outputs 4 space-separated random words from this list with a newline at the end. For example
first fourth third second
How can I make this one-liner much shorter with Perl, keeping the
tr -dc '0-9' < /dev/urandom
part?
Keeping it is important since it provides Cryptographically Secure Pseudo-Random Numbers (CSPRNs) for all Unix OSs. Of course, if Perl can get numbers from /dev/urandom then the tr can be replaced with Perl too, but the numbers from urandom need to stay.
For convenience, I shared the base script with 65K words
65kwords.txt
Please use only core modules. It would be used for generating "human memorable passwords".
Later, the iteration count of the hashing scheme used to store these passwords would be extremely high, so brute force would be very slow, even with many GPUs/FPGAs.
You mention needing a CSPRN, which makes this a non-trivial exercise: if you need cryptographic randomness, then using the built-in rand is not a good choice, as its implementation is highly variable across platforms.
But you've got Rand::Urandom which looks like it does the trick:
By default it uses getentropy() (only available in Linux > 3.17) and falls back to /dev/arandom then /dev/urandom.
#!/usr/bin/env perl
use strict;
use warnings;
use Rand::Urandom;
chomp( my @words = <DATA> );
print $words[ rand @words ], " " for 1 .. 4;
print "\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
water
watch
walk
wake
voice
Failing that though - you can just read bytes from /dev/urandom directly:
#!/usr/bin/env perl
use strict;
use warnings;
my $number_of_words = 4;
chomp( my @words = <DATA> );
open( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
my $bytes;
read( $urandom, $bytes, 2 * $number_of_words );  # 2 bytes each: 0 - 65535
# for testing:
# unpack 'n' is an unsigned short (16-bit);
# unpack 'n*' in list context returns a list of these.
foreach my $value ( unpack( "n*", $bytes ) ) {
    print $value, "\n";
}
# actually print the words.
# note - this assumes that you have the right number of words in your list.
# you could add a % @words to the map, e.g. $words[$_ % @words],
# but that would mean wrapping occurs, which alters the frequency distribution.
# a more robust solution would be to fetch additional bytes if the 'slot' is
# empty.
print join " ", ( map { $words[$_] } unpack( "n*", $bytes ) ), "\n";
__DATA__
yarn
yard
wound
worst
#etc.
Note - the above relies on the fact that your wordlist matches the range of two bytes (16 bits, i.e. 65536 entries) - if this assumption isn't true, you'll need to deal with 'missed' words. A crude approach would be to take a modulo, but that would mean some wrapping and therefore not quite a truly even distribution of word picks. Otherwise you can bit-mask and reroll, as indicated below:
On a related point though - have you considered not using a wordlist, and instead using consonant-vowel-consonant groupings?
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
# uses /dev/urandom to fetch bytes.
# generates consonant-vowel-consonant groupings.
# each is 11.22 bits of entropy, meaning a 4-group password is 45 bits.
# (20 * 6 * 20 = 2400, and log2(2400) = 11.22;
# log2(2400 ** 4) = 44.91)
# but because it's generated 'true random' it's a known-entropy string.
my $num    = 4;
my $format = "CVC";
my %letters = (
    V => [qw ( a e i o u y )],
    C => [ grep { not /[aeiouy]/ } "a" .. "z" ],
);
my %bitmask_for;
foreach my $type ( keys %letters ) {
    # find the next power of 2 for the number of 'letters' in the set.
    # So - for the 20 letter group, that's 31 (0x1F),
    # and for the 6 letter group that's 7 (0x07).
    $bitmask_for{$type} = ( 2 << log( @{ $letters{$type} } ) / log 2 ) - 1;
}
open( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
for ( 1 .. $num ) {
    for my $type ( split //, $format ) {
        my $value;
        while ( not defined $value or $value >= @{ $letters{$type} } ) {
            my $byte;
            read( $urandom, $byte, 1 );
            # byte is 0-255. Our key space is 20 or 6.
            # So rather than modulo, which would lead to an uneven distribution,
            # we just bitmask and discard any value that is 'too high'.
            $value = ( unpack "C", $byte ) & $bitmask_for{$type};
        }
        print $letters{$type}[$value];
    }
    print " ";
}
print "\n";
close($urandom);
This generates 3-character CVC symbols with a known entropy level (11.22 bits per 'group') for making reasonably robust passwords: 45 bits as opposed to the 64 bits of your original, although obviously you can add extra 'groups' to gain 11.22 bits each time.
This answer is not cryptographically safe!
I would do this completely in Perl. No need for a one-liner. Just grab your word-list and put it into a Perl program.
use strict;
use warnings;
my @words = qw(
    first
    second
    third
    fourth
);
print join( q{ }, map { $words[ int rand @words ] } 1 .. 4 ), "\n";
This grabs four random words from the list and outputs them.
rand @words evaluates @words in scalar context, which gives the number of elements, and creates a random floating point value greater than or equal to 0 and smaller than that number. int cuts off the decimals. This is used as the index to grab an element out of @words. We repeat this four times with the map statement, where the 1 .. 4 is the same as passing a list of (1, 2, 3, 4) into map as an argument. This argument is ignored, but instead our random word is picked. map returns a list, which we join on one space. Finally we print the resulting string, and a newline.
The word list is created with the quoted words qw() operator, which returns a list of quoted words. It's shorthand so you don't need to type all the quotes ' and commas ,.
If you'd want to have the word list at the bottom you could either put the qw() in a sub and call it at the top, or use a __DATA__ section and read from it like a filehandle.
The particular method using tr and fold on /dev/urandom is a lot less efficient than it could be, so let's fix it up a little bit, while keeping the /dev/urandom part.
Assuming that available memory is enough to contain your script (including wordlist):
chomp(@words = <DATA>);
open urandom, "/dev/urandom" or die;
read urandom, $randbytes, 4 * 2 or die;
print join(" ", map $words[$_], unpack "S*", $randbytes), "\n";
__DATA__
word
list
goes
here
This goes for brevity and simplicity without outright obfuscation — of course you could make it shorter by removing whitespace and such, but there's no reason to. It's self-contained and will work with several decades of perls (yes, those bareword filehandles are deliberate :-P)
It still expects exactly 65536 entries in the wordlist, because that way we don't have to worry about introducing bias to the random number choice using a modulus operator. A slightly more ambitious approach might be to read 48 bytes from urandom for each word, turning it into a floating-point value between 0 and 1 (portable to most systems) and multiplying it by the size of the word list, allowing for a word list of any reasonable size.
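For illustration, here is a sketch of that floating-point idea, scaled down to 4 random bytes per word for simplicity (note that unless the list size divides 2**32 exactly, a tiny bias remains):
#!/usr/bin/perl
use strict;
use warnings;
chomp(my @words = <DATA>);
open my $urandom, '<:raw', '/dev/urandom' or die $!;
my @picks;
for (1 .. 4) {
    read $urandom, my $bytes, 4 or die "short read on /dev/urandom";
    my $f = unpack("N", $bytes) / 2**32;    # uniform float in [0, 1)
    push @picks, $words[int($f * @words)];  # scales to any list size
}
print "@picks\n";
__DATA__
word
list
goes
here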
A lot of nonsense is talked about password strength, and I think you're overestimating the worth of several of your requirements here
I don't understand your preoccupation with making your code "much shorter with perl". (Why did you pick Perl?) Savings here can only really be useful to make the script quicker to read and compile, but they will be dwarfed by the half megabyte of data following the code which must also be read
In this context, the usefulness to a hacker of a poor random number generator depends on prior knowledge of the construction of the password together with the passwords that have been most recently generated. With a sample of only 65,000 words, even the worst random number generator will show insignificant correlation between successive passwords
In general, a password is more secure if it is longer, regardless of its contents. Forming a long password out of a sequence of English words is purely a way of making the sequence more memorable
"Of course later, the (hashing) iteration count ... would be extreme high, so brute-force [hacking?] would be very slow"
This doesn't follow at all. Cracking algorithms won't try to guess the four words you've chosen: they will see only a thirty-character (or so) string consisting only of lower-case letters and spaces, and whose origin is insignificant. It will be no more or less crackable than any other password of the same length with the same character set
I suggest that you should rethink your requirements and so make things easier for yourself. I don't find it hard to think of four English words, and don't need a program to do it for me. Hint: pilchard is a good one: they never guess that!
If you still insist, then I would write something like this in Perl. I've used only the first 18 lines of your data for the example
use strict;
use warnings 'all';
use List::Util 'shuffle';
my @s = map /\S+/g, ( shuffle( <DATA> ) )[ 0 .. 3 ];
print "@s\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
output
wind wise winter yarn
You could use Data::Random::rand_words()
perl -MData::Random -E 'say join $/, Data::Random::rand_words(size => 4)'

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match, as they contain the same data in a different order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparison for your fields, but only after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
    chomp;
    my @fields = split;
    # create the normalized representation by lowercasing and sorting the fields
    my @normalized_fields = sort map lc, @fields;
    # find or create the path in the lookup
    my $pointer = \$cache;
    $pointer = \${$pointer}->{$_} for @normalized_fields;
    # if this is an unknown value, make it known, and output the line
    unless (defined $$pointer) {
        $$pointer = 1;  # set some defined value
        print "$_\n";   # emit the unique line
    }
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as the value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example I used space-separated values here, but you could replace the split with a call to Text::CSV or something similar.
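For actual CSV input, the split could be swapped for Text::CSV along these lines (a sketch; Text::CSV is a CPAN module):
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
while (my $line = <DATA>) {
    chomp $line;
    $csv->parse($line) or die $csv->error_diag;
    my @fields = $csv->fields;
    # ... lowercase, sort, and do the nested-hash lookup as above ...
}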
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for @normalized_fields;
unless (defined $$pointer->[1]) {
    $$pointer->[1] = 1; ...
}
or introduce a default value for nonexistent fields (e.g. the separator of the original file). Here is an example with the NUL character:
my $fields = 3;
...;
die "record too long" if #fields > $fields;
...; # make normalized fields
push #normalized_fields, ("\x00") x ($fields - #normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
    chomp;
    my $key = join ',', sort map lc, split /,/;
    push @{ $data{$key} }, $.;
}
foreach my $list (values %data) {
    next unless @$list > 1;
    print "Lines ", join(', ', @$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
Make two hash tables A and B
Stream through your input one line at a time
For the first line pair x and y, use each as a key and the other as the value in both hash tables (e.g., $A->{$x} = $y; $B->{$y} = $x;)
For the second and subsequent line pairs, test if the second field's value exists as a key for either A or B — if it does, you have a reverse match — if not, then repeat the addition process from step 3 to add it to the hash tables
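A sketch of those steps (it also compares the stored value, not just the key, so unrelated pairs that merely share a field are not flagged):
#!/usr/bin/perl
use strict;
use warnings;
my (%A, %B);  # %A: first field => second; %B: the reverse
while (my $line = <DATA>) {
    chomp $line;
    my ($x, $y) = split /,/, $line;
    if ((defined $A{$y} && $A{$y} eq $x) || (defined $B{$x} && $B{$x} eq $y)) {
        print "Reverse match: $line\n";
    }
    else {
        $A{$x} = $y;  # remember the pair in both directions
        $B{$y} = $x;
    }
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3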
To do a version of amon's answer without a hash table, if your data are numerical, you could:
Stream through input line by line, sorting fields one and two by numerical ordering
Pipe result to UNIX sort on first and second fields
Stream through sorted output line by line, checking if current line matches the previous line (reporting a reverse match, if true)
This has the advantage of using less memory than hash tables, but may take more time to process.
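A sketch of that pipeline (assuming, per step 1, two comma-separated numeric fields; the file name is illustrative):
#!/usr/bin/perl
# normalize.pl - print each pair in a canonical (sorted) order
use strict;
use warnings;
while (<>) {
    chomp;
    my @f = sort { $a <=> $b } split /,/;
    print join(',', @f), "\n";
}
# Then, on the shell:
#   perl normalize.pl input.csv | sort | uniq -d
# prints one line for every pair that occurs more than once in either order.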
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
    warn "Found duplicate: $_" if $_ =~ $re;
    next unless /^(.*),(.*)$/;
    die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
    $re = qr/^\Q$2,$1\E$|$re/;
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

How to parse a file, create records and perform manipulations on records including frequency of terms and distance calculations

I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages the use of forums. I am not advanced with Perl subs or modules (including Bioperl), so please limit responses to an appropriate 'beginner level' so that I can understand and learn from your suggestions and/or code (also limit "magic", please).
The requirements of the program are as follows:
Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:
• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )
Prompt for one of three commands: freq, length, density d (d is some number):
• freq - how many of each type of atom is in the file (for example, Nitrogen, Sodium, etc. would be displayed like this: N: 918 S: 23)
• length - The distances among coordinates
• density d (where d is a number) - the program will prompt for the name of a file to save the computations to. For each atom it will compute the distance between that atom and every other atom; if that distance is less than or equal to the number d, it increments the count of atoms that are within that distance. Each atom's serial number and its count, unless that count is zero, are written to the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.
I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.
The program structure and function descriptions as I see it:
$^W = 1;    # turn on warnings
use strict; # behave!
my @fields;
my @recs;
while ( <DATA> ) {
    chomp;
    @fields = split(/\s+/);
    push @recs, makeRecord(@fields);
}
for (my $i = 0; $i < @recs; $i++) {
    printRec( $recs[$i] );
}
my %command_table = (
    freq    => \&freq,
    length  => \&length,
    density => \&density,
    help    => \&help,
    quit    => \&quit
);
print "Enter a command: ";
while ( <STDIN> ) {
    chomp;
    my @line = split( /\s+/ );
    my $command = shift @line;
    if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
        print "Command must be: freq, length, density or quit\n";
    }
    else {
        $command_table{$command}->();
    }
    print "Enter a command: ";
}
sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
    my %record = (
        serialnumber => shift,
        aminoacid    => shift,
        coordinates  => shift,
        element      => [ @_ ]
    );
    return \%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequences of
# these atoms in the array.
sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.
sub density
# take an array of atom records and a number d and will return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
    print "To use this program, type either\n",
          "freq\n",
          "length\n",
          "density followed by a number, d,\n",
          "help\n",
          "quit\n";
}
sub quit
{
    exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.
Enable warnings with use warnings rather than $^W = 1. The former is self-documenting and has the advantage being local to the enclosing block rather than being a global setting.
Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special $_. For example:
while (my $input_record = <DATA>){
}
In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your length subroutine a different name. That name is already used by the built-in length function.
Regarding your question about makeRecord, one approach is to ignore the filtering issue inside makeRecord. Instead, makeRecord could include an additional hash field, and the filtering logic would reside elsewhere. For example:
my $record = makeRecord(@fields);
push @recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
An illustration of some of the points above:
use strict;
use warnings;
run();
sub run {
    my $atom_data = load_atom_data();
    print_records($atom_data);
    interact_with_user($atom_data);
}
...
sub interact_with_user {
    my $atom_data = shift;
    my %command_table = (...);
    while (1){
        print "Enter a command: ";
        chomp(my $reply = <STDIN>);
        my ($command, @line) = split /\s+/, $reply;
        if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
            # Run the command.
        }
        else {
            # Print usage message for user.
        }
    }
}
...
FM's answer is pretty good. I'll just mention a couple of additional things:
You already have a hash with the valid commands (which is a good idea). There's no need to duplicate that list in a regex. I'd do something like this:
if (my $routine = $command_table{$command}) {
    $routine->(@line);
} else {
    print "Command must be: freq, length, density or quit\n";
}
Notice I'm also passing @line to the subroutine, because you'll need that for the density command. Subroutines that don't take arguments can just ignore them.
You could also generate the list of valid commands for the error message by using keys %command_table, but I'll leave that as an exercise for you.
Another thing is that the description of the input file mentions column numbers, which suggests that it's a fixed-width format. That's better parsed with substr or unpack. If a field is ever blank or contains a space, then your split will not parse it correctly. (If you use substr, be aware that it numbers columns starting at 0, when people often label the first column 1.)
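For instance, a makeRecord built on unpack might look like this sketch; the template encodes the 1-based column ranges from the assignment ('x' skips columns, 'A' extracts them) and should be checked against the real file:
sub makeRecord {
    my ($line) = @_;
    # x6 A5        -> skip cols 1-6,   take cols 7-11  (serial number)
    # x6 A3        -> skip cols 12-17, take cols 18-20 (amino acid)
    # x10 A8 A8 A8 -> skip cols 21-30, take cols 31-54 (x, y, z)
    # x22 A2       -> skip cols 55-76, take cols 77-78 (element)
    my ($serial, $aa, $x, $y, $z, $element) =
        unpack 'x6 A5 x6 A3 x10 A8 A8 A8 x22 A2', $line;
    return {
        serialnumber => $serial,
        aminoacid    => $aa,
        coordinates  => [ $x, $y, $z ],
        element      => $element,
    };
}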

How do I set up the data structure to make pie charts in GD::Graph?

I am writing a Perl script to create pie graphs using GD::Graph::pie with these arrays:
@Array1 = ("A", "B", "C", "D");
$array2 = [
    ['upto 100 values'],
    ['upto 100 values'],
    ['upto 100 values'],
    ['upto 100 values']
];
As per my understanding, to get this done I have to create an array with references to the above arrays, like:
my @graph_data = (\@Array1, @$array2);
I have also tried to use a foreach loop but am not getting good results. I want to create a pie graph with the first value in @Array1 against the first value in $array2, the second value in @Array1 against the second value in $array2, and so on. Also, I want to title each graph with the corresponding value in @Array1.
eg.
my @graph_data1 = (\@Array1[0], @$array2[0]);
Can anyone please suggest a better way to do this?
Before getting into pie charts and stuff like that, I suggest you get yourself updated on basic Perl data structures and references. Please read perlreftut; you should be able to solve this problem yourself afterwards.
I'm not sure I understand what you are trying to do, but this example will produce 3 pie charts, all of them using the same set of categories. I would second Manni's advice: spend some time with perlreftut and perldsc. Also, if you download the GD::Graph module, it provides many examples, including pie charts (see the samples subdirectory).
use strict;
use warnings;
use GD::Graph::pie;
my @categories = qw(foo bar fubb buzz);
my @data = (
    [  25,  32,  10,  44 ], # Data values for chart #1
    [ 123, 221, 110, 142 ], # Data values for chart #2
    [ 225, 252, 217, 264 ], # etc.
);
for my $i (0 .. $#data){
    my $chart = GD::Graph::pie->new;
    my @pie_data = ( \@categories, $data[$i] );
    $chart->plot(\@pie_data);
    open(my $fh, '>', "pie_chart_$i.gif") or die $!;
    binmode $fh;
    print $fh $chart->gd->gif;
    close $fh;
}
To state in plainer English what the other answers say less directly:
my @graph_data = (\@Array1, @$array2);
my @graph_data1 = (\@Array1[0], @$array2[0]);
looks mad. You almost certainly mean:
my @graph_data = (\@Array1, $array2);
# you want the first element of each list in the same datastructure?
my @graph_data1 = ([$Array1[0]], [$array2->[0]]); # (['A'], [[..numbers..]])
# Note *two* [ and ] in 2nd bit
# ... or you want a different datastructure?
my @graph_data1 = ($Array1[0], $array2->[0]); # ('A', [..numbers..])
@Array1 is an array; you want a reference to it, and that would be \@Array1.
$array2 is a reference to an array already. It contains references to arrays, and I assume you want a list containing the reference to the array at index 0. Thus: $array2->[0] is the first indexed element via an array reference, and it's already an array reference.
I found the solution to this problem using the code below.
my @pairs = map { "$Array1[$_]@$array2[$_]," } 0 .. $#Array1;
After this, the values from the array @pairs can be used to create the graphs.