print function in Perl - perl

This is a perl code I use for compiling pressure data.
$data_ct--;
mkdir "365Days", 0777 unless -d "365Days";
my $file_no = 1;
my $j = $num_levels;
for ($i = 0; $i < $data_ct; $i++) {
if ($j == $num_levels) {
close OUT;
$j = 0;
my $file = "365days/wind$file_no";
$file_no++;
open OUT, "> $file" or die "Can't open $file: $!";
}
{
$wind_direction = (270-atan2($vwind[$i], $uwind[$i])*(180/pi))%360;
}
$wind_speed = sqrt($uwind[$i]*$uwind[$i]+$vwind[$i]*$vwind[$i]);
printf OUT "%.0f %.0f %.1f\n", $level[$i], $wind_direction, $wind_speed;
$j++;
}
$file_no--;
print STDERR "Wrote out $file_no wind files.\n";
print STDERR "Done\n";
The problem I am having is when it prints out the numbers, I want it to be in this format
Level Wind direction windspeed
250 320 1.5
870 56 4.6
Right now when I run the script the columns names do not show up rather just the numbers. Can someone direct me as to how to rectify the script?

There are several ways to do this in Perl. First, Perl has built in form ability. It's been a part of Perl since version 3.0 (about 20 years old). However, it is rarely used. In fact, it is so rarely used I am not even going to attempt to write an example with it because I'd have to spend way too much time relearning it. It's there and documented.
You can try to figure it out for yourself. Or, maybe some old timer Perl programmer might wake up from his nap and help out. (All bets are off if it's meatloaf night at the old age home, though).
Perl has evolved greatly in the last few decades, and this old forms bit represents a much older way of writing Perl programs. It just isn't pretty.
Another way this can be done and is more popular is to use the printf function. If you're not familiar with C and printf from there, it can be a bit intimidating to use. It depends upon formatting codes (the things that start with % to specify what you want to print (strings, integers, floating point numbers, etc.), and how you want those values formatted.
Fortunately, printf is so useful, that most programming languages have their own version of printf. Once you learn it, your knowledge is transferable to other places. There's an equivalent sprintf for setting variable with formats.
# Define header and line formats
my $header_fmt = "%-5.5s %-14.14s %-9.9s\n";
my $data_fmt = "%5d %14d %9.1f\n";
# Print my table header
printf $header_fmt, "Level", "Wind direction", "windspeed";
my $level = 230;
my $direction = 120;
my $speed = 32.3;
# Print my table data
printf $data_fmt, $level, $direction, $speed;
This prints out:
Level Wind direction windspeed
230 120 32.3
I like defining the format of my printed lines all together, so I can tweak them to get what I want. It's a great way to make sure your data line lines up with your header.

Okay, Matlock wasn't on tonight, so this crusty old Perl programmer has plenty of time.
In my previous answer, I said there was an old way of doing forms in Perl, but I didn't remember how it went. Well, I spent some time and got you an example of how it works.
Basically, you sort of need globalish variables. I thought you needed our variables for this to work, but I can get my variables to work if I define them on the same level as my format statements. It's not pretty.
You use GLOBS to define your formats with _TOP appended for your headers. Since I'm printing these on STDOUT, I define STDOUT_TOP for my heading and STDOUT for my data lines.
The format must start at the beginning of a column. The lone . on the end ends the format definition. You notice I write the entire thing with just a single write statement. How does it know to print out the heading? Perl tracks the number of lines printed and automatically writes a Form Feed character and a new heading when Perl thinks it's at the bottom of a page. I am assuming Perl uses 66 line pages as a default.
You can in Perl set your own form names via select. Perl uses $= as the number of lines on a page, and $- on the number of lines left. These variables are global, but are set by the selected format via the select statement. You can use IO::Handle for better variable naming.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
my #data = (
{
level => 250,
direction => 320,
speed => 1.5,
},
{
level => 870,
direction => 55,
speed => 4.5,
},
);
my $level;
my $direction;
my $speed;
for my $item_ref ( #data ) {
$level = $item_ref->{level};
$direction = $item_ref->{direction};
$speed = $item_ref->{speed};
write;
}
format STDOUT_TOP =
Level Wind Direction Windspeed
===== ============== =========
.
format STDOUT =
##### ############## ######.##
$level, $direction, $speed
.
This prints:
Level Wind Direction Windspeed
===== ============== =========
250 320 1.50
870 55 4.50

#Gunnerfan : Can you replace the line from your code as shown below
Your line of code: printf OUT "%.0f %.0f %.1f\n",$level[$i], wind_direction, $wind_speed;
Replacement code:
if($i==0) {
printf OUT "\n\t%s%-20s %-10s%-12s %-20s%s\n", 'Level' , 'Wind direction' , 'windspeed');
}
printf OUT "\t%s%-20s%s %-10s%s%-12s%s %-20s\n",$level[$i],$wind_direction, $wind_speed;

Related

PERL: Jumping to lines in a huge text file

I have a very large text file (~4 GB).
It has the following structure:
S=1
3 lines of metadata of block where S=1
a number of lines of data of this block
S=2
3 lines of metadata of block where S=2
a number of lines of data of this block
S=4
3 lines of metadata of block where S=4
a number of lines of data of this block
etc.
I am writing a PERL program that read in another file,
foreach line of that file (where it must contain a number),
search the huge file for a S-value of that number minus 1,
and then analyze the lines of data of the block belongs to that S-value.
The problem is, the text file is HUGE, so processing each line with a
foreach $line {...} loop
is very slow. As the S=value is strictly increasing, are there any methods to jump to a particular line of the required S-value?
are there any methods to jump to a particular line of the required S-value?
Yes, if the file does not change then create an index. This requires reading the file in its entirety once and noting the positions of all the S=# lines using tell. Store it in a DBM file with the key being the number and the value being the byte position in the file. Then you can use seek to jump to that point in the file and read from there.
But if you're going to do that, you're better off exporting the data into a proper database such as SQLite. Write a program to insert the data into the database and add normal SQL indexes. This will probably be simpler than writing the index. Then you can query the data efficiently using normal SQL, and make complex queries. If the file change you can either redo the export, or use the normal insert and update SQL to update the database. And it will be easy for anyone who knows SQL to work with, as opposed to a bunch of custom indexing and search code.
I know the op has already accepted an answer, but a method that's served me well is to slurp the file into an array, based on changing the "record separator" ($/).
If you do something like this (not tested, but this should be close):
$/ = "S=";
my #records=<fh>;
print $records[4];
The output should be the entire fifth record (the array starts at 0, but your data starts at 1), starting with the record number (5) on a line by itself (you might need to strip that out later), following by all the remaining lines in that record.
It's very simple and fast, although it is a memory pig...
If the blocks of text are of the same length (in bytes or characters) you can calculate the position of the needed S-value in the file and seek there, then read. Otherwise, in principle you need to read lines to find the S value.
However, if there are only a few S-values to find you can estimate the needed position and seek there, then read enough to capture an S-value. Then analyze what you read to see how far off you are, and either seek again or read lines with <> to get to the S-value.
use warnings;
use strict;
use feature 'say';
use Fcntl qw(:seek);
my ($file, $s_target) = #ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
$s_target //= 5; #/ default, S=5
open my $fh, '<', $file or die $!;
my $est_text_len = 1024;
my $jump_by = $est_text_len * $s_target; # to seek forward in file
my ($buff, $found);
seek $fh, $jump_by, SEEK_CUR; # get in the vicinity
while (1) {
my $rd = read $fh, $buff, $est_text_len;
warn "error reading: $!" if not defined $rd;
last if $rd == 0;
while ($buff =~ /S=([0-9]+)/g) {
my $s_val = $1;
# Analyze $s_val and $buff:
# (1) if overshot $s_target adjust $jump_by and seek back
# (2) if in front of $s_target read with <> to get to it
# (3) if $s_target is in $buff extract needed text
if ($s_val == $s_target) {
say "--> Found S=$s_val at pos ", pos $buff, " in buffer";
seek $fh, - $est_text_len + pos($buff) + 1, SEEK_CUR;
while (<$fh>) {
last if /S=[0-9]+/; # next block
print $_;
}
$found = 1;
last;
}
}
last if $found;
}
Tested with your sample, enlarged and cleaned up (change S=n in text as it is the same as the condition!), with $est_text_len and $jump_by set at 100 and 20.
This is a sketch. A full implementation needs to negotiate over and under seeking as outlined in comments in code. If text-block sizes don't vary much it can get in front of the needed S-value in two seek-and-reads, and then read with <> or use regex as in the example.
Some comments
The "analysis" sketched above need be done carefully. For one, a buffer may contain multiple S-value lines. Also, note that the code keeps reading if an S-value isn't in buffer.
Once you are close enough and in front of $s_target read lines by <> to get to it.
The read may not get as much as requested so you should really put that in a loop. There are recent posts with that.
Change to sysread from read for efficiency. In that case use sysseek, and don't mix with <> (which is buffered).
The code above presumes one S-value to find; adjust for more. It absolutely assumes that S-values are sorted.
This is clearly far more complex than reading lines but it does run much faster, with a very large file and only a few S-values to find. If there are many values then this may not help.
The foreach (<$fh>), indicated in the question, would cause the whole file to be read first (to build the list for foreach to go through); use while (<$fh>) instead.
If the file doesn't change (or the same file need be searched many times) you can first process it once to build an index of exact positions of S-values. Thanks to Danny_ds for a comment.
Binary search of a sorted list is an O(log N) operation. Something like this using seek:
open my $fh, '>>+', $big_file;
$target = 123_456_789;
$low = 0;
$high = -s $big_file;
while ($high - $low > 0.01 * -s $big_file) {
$mid = ($low + $high) / 2;
seek $fh, $mid, 0;
while (<$fh>) {
if (/^S=(\d+)/) {
if ($1 < $target) { $low = $mid; }
else { $high = $mid }
last;
}
}
}
seek $fh, $low, 0;
while (<$fh>) {
# now you are searching through the 1% of the file that contains
# your target S
}
Sort the numbers in the second file. Now you can proceed thru the huge file in order, processing each S-value as needed.

Shortest Perl solution for outputing 4 random words

I have this one-line Unix shell script
for i in 1 2 3 4; do sed "$(tr -dc '0-9' < /dev/urandom | fold -w 5 |
awk '$0>=35&&$0<=65570' | head -1)q;d" "$0"; done | perl -p00e
's/\n(?!\Z)/ /g'
The script has 65K words in it, one per line, from line 35 to 65570. The code and the data are in the same file.
This script outputs 4 space-separated random words from this list with a newline at the end. For example
first fourth third second
How can I make this one-liner much shorter with Perl, keeping the
tr -dc '0-9' < /dev/urandom
part?
Keeping it is important since it provides Cryptographically Secure Pseudo-Random Numbers (CSPRNs) for all Unix OSs. Of course, if Perl can get numbers from /dev/urandom then the tr can be replaced with Perl too, but the numbers from urandom need to stay.
For convenience, I shared the base script with 65K words
65kwords.txt
or
65kwords.txt
Please use only core modules. It would be used for generating "human memorable passwords".
Later, the (hashing) iteration count, where we would use this to store the passwords would be extremely high, so brute-force would be very slow, even with many many GPUs/FPGAs.
You mention needing a CSPRN, which makes this a non trivial exercise - if you need cryptographic randomness, then using built in stuff (like rand) is not a good choice, as the implementation is highly variable across platforms.
But you've got Rand::Urandom which looks like it does the trick:
By default it uses the getentropy() (only available in > Linux 3.17) and falls back to /dev/arandom then /dev/urandom.
#!/usr/bin/env perl
use strict;
use warnings;
use Rand::Urandom;
chomp ( my #words = <DATA> );
print $words[rand #words], " " for 1..4;
print "\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
water
watch
walk
wake
voice
Failing that though - you can just read bytes from /dev/urandom directly:
#!/usr/bin/env perl
use strict;
use warnings;
my #number_of_words = 4;
chomp ( my #words = <DATA> );
open ( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
my $bytes;
read ( $urandom, $bytes, 2 * $number_of_words ); #2 bytes 0 - 65535
#for testing
#unpack 'n' is n An unsigned short (16-bit)
# unpack 'n*' in a list context returns a list of these.
foreach my $value ( unpack ( "n*", $bytes ) ) {
print $value,"\n";
}
#actually print the words.
#note - this assumes that you have the right number in your list.
# you could add a % #words to the map, e.g. $words[$_ % #words]
#but that will mean wrapping occurs, and will alter the frequency distribution.
#a more robust solution would be to fetch additional bytes if the 'slot' is
#empty.
print join " ", ( map { $words[$_] } unpack ( "n*", $bytes )),"\n";
__DATA__
yarn
yard
wound
worst
#etc.
Note - the above relies on the fact that your wordlist is the same size as two bytes (16 bits) - if this assumption isn't true, you'll need to deal with 'missed' words. A crude approach would be to take a modulo, but that would mean some wrapping and therefore not quite truly even distribution of word picks. Otherwise you can bit-mask and reroll, as indicated below:
On a related point though - have you considered not using a wordlist, and instead using consonant-vowel-consonant groupings?
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
#uses /dev/urandom to fetch bytes.
#generates consonant-vowel-consonant groupings.
#each are 11.22 bits of entropy, meaning a 4-group is 45 bits.
#( 20 * 6 * 20 = 2400, which is 11.22 bits of entropy log2 2400
#log2(2400 ^ 4) = 44.91
#but because it's generated 'true random' it's a know entropy string.
my $num = 4;
my $format = "CVC";
my %letters = (
V => [qw ( a e i o u y )],
C => [ grep { not /[aeiouy]/ } "a" .. "z" ], );
my %bitmask_for;
foreach my $type ( keys %letters ) {
#find the next power of 2 for the number of 'letters' in the set.
#So - for the '20' letter group, that's 31. (0x1F)
#And for the 6 letter group that's 7. (0x07)
$bitmask_for{$type} = ( 2 << log ( #{$letters{$type}} ) / log 2 ) - 1 ;
}
open( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
for ( 1 .. $num ) {
for my $type ( split //, $format ) {
my $value;
while ( not defined $value or $value >= #{ $letters{$type} } ) {
my $byte;
read( $urandom, $byte, 1 );
#byte is 0-255. Our key space is 20 or 6.
#So rather than modulo, which would lead to an uneven distribution,
#we just bitmask and discard and 'too high'.
$value = (unpack "C", $byte ) & $bitmask_for{$type};
}
print $letters{$type}[$value];
}
print " ";
}
print "\n";
close($urandom);
This generates 3 character CVC symbols, with a known entropy level (11.22 per 'group') for making reasonably robust passwords. (45 bits as opposed to the 64 bits of your original, although obviously you can add extra 'groups' to gain 11.22 bits per time).
This answer is not cryptographically safe!
I would do this completely in Perl. No need for a one-liner. Just grab your word-list and put it into a Perl program.
use strict;
use warnings;
my #words = qw(
first
second
third
fourth
);
print join( q{ }, map { $words[int rand #words] } 1 .. 4 ), "\n";
This grabs four random words from the list and outputs them.
rand #words evaluates #words in scalar context, which gives the number of elements, and creates a random floating point value between 0 and smaller than that number. int cuts off the decimals. This is used as the index to grab an element out of #words. We repeat this four times with the map statement, where the 1 .. 4 is the same as passing a list of (1, 2, 3, 4) into map as an argument. This argument is ignored, but instead our random word is picked. map returns a list, which we join on one space. Finally we print the resulting string, and a newline.
The word list is created with the quoted words qw() operator, which returns a list of quoted words. It's shorthand so you don't need to type all the quotes ' and commas ,.
If you'd want to have the word list at the bottom you could either put the qw() in a sub and call it at the top, or use a __DATA__ section and read from it like a filehandle.
The particular method using tr and fold on /dev/urandom is a lot less efficient than it could be, so let's fix it up a little bit, while keeping the /dev/urandom part.
Assuming that available memory is enough to contain your script (including wordlist):
chomp(#words = <DATA>);
open urandom, "/dev/urandom" or die;
read urandom, $randbytes, 4 * 2 or die;
print join(" ", map $words[$_], unpack "S*", $randbytes), "\n";
__DATA__
word
list
goes
here
This goes for brevity and simplicity without outright obfuscation — of course you could make it shorter by removing whitespace and such, but there's no reason to. It's self-contained and will work with several decades of perls (yes, those bareword filehandles are deliberate :-P)
It still expects exactly 65536 entries in the wordlist, because that way we don't have to worry about introducing bias to the random number choice using a modulus operator. A slightly more ambitious approach might be to read 48 bytes from urandom for each word, turning it into a floating-point value between 0 and 1 (portable to most systems) and multiplying it by the size of the word list, allowing for a word list of any reasonable size.
A lot of nonsense is talked about password strength, and I think you're overestimating the worth of several of your requirements here
I don't understand your preoccupation with making your code "much shorter with perl". (Why did you pick Perl?) Savings here can only really be useful to make the script quicker to read and compile, but they will be dwarfed by the half megabyte of data following the code which must also be read
In this context, the usefulness to a hacker of a poor random number generator depends on prior knowledge of the construction of the password together with the passwords that have been most recently generated. With a sample of only 65,000 words, even the worst random number generator will show insignificant correlation between successive passwords
In general, a password is more secure if it is longer, regardless of its contents. Forming a long password out of a sequence of English words is purely a way of making the sequence more memorable
"Of course later, the (hashing) iteration count ... would be extreme high, so brute-force [hacking?] would be very slow"
This doesn't follow at all. Cracking algorithms won't try to guess the four words you've chosen: they will see only a thirty-character (or so) string consisting only of lower-case letters and spaces, and whose origin is insignificant. It will be no more or less crackable than any other password of the same length with the same character set
I suggest that you should rethink your requirements and so make things easier for yourself. I don't find it hard to think of four English words, and don't need a program to do it for me. Hint: pilchard is a good one: they never guess that!
If you still insist, then I would write something like this in Perl. I've used only the first 18 lines of your data for
use strict;
use warnings 'all';
use List::Util 'shuffle';
my #s = map /\S+/g, ( shuffle( <DATA> ) )[ 0 .. 3 ];
print "#s\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
output
wind wise winter yarn
You could use Data::Random::rand_words()
perl -MData::Random -E 'say join $/, Data::Random::rand_words(size => 4)'

using printf to create columnar data

I am new to perl and scripting in general. I have five variables that hold data and I need to print them as five columns next to each other. Here is the code I have now.
$i = 0;
foreach $line (<inf>){
chomp $line;
#line=split / +/, $line;
$i = $i + 1;
if ($i > $n+1) {
$i = 1;
$numdata = $numdata + 1;
}
if ($i == 1) {
printf "%20s\n", $n, "\n";
} else {
print $i-1, "BEAD", $line[$col], $line[$col+1], $line[$col+2], "\n";
}
# other statistics
}
The output I get from this looks like:
5
1BEAD0.00000e+000.00000e+000.00000e+00
2BEAD0.00000e+000.00000e+000.00000e+00
3BEAD0.00000e+000.00000e+000.00000e+00
4BEAD0.00000e+000.00000e+000.00000e+00
5BEAD0.00000e+000.00000e+000.00000e+00
5
1BEAD9.40631e-02-3.53254e-022.09369e-01
2BEAD-6.69662e-03-3.13492e-012.62915e-01
3BEAD2.98822e-024.60254e-023.61680e-01
4BEAD-1.45631e-013.45979e-021.50167e-01
5BEAD-5.57204e-02-1.51673e-012.95947e-01
5
1BEAD8.14225e-028.10216e-022.76423e-01
2BEAD2.36992e-02-2.74023e-014.47334e-01
3BEAD1.23492e-011.12571e-012.59486e-01
4BEAD-2.05375e-011.25304e-011.85252e-01
5BEAD5.54441e-02-1.30280e-015.82256e-01
I have tried using "%6d %9d %15.6f %28.6f %39.6f\n" before the variables in my print statement to try to space the data out; however, this did not give me the columns I hoped for. Any help/ suggestions are appreciated.
If you're using Perl and doing more complex stuff, you may want to look into perlform, which is designed for this kind of thing, or a module like Text::Table.
As for using printf though, you can use the padding specifiers to get consistent spacing. For instance, using the Perl docs on it, make sure the field width is before the .: your printf string should probably look something more like this (check out the "precision, or maximum width" section):
printf "%6.d %9.d %15.6f %28.6f %39.6f"
Also, if your things are in an array, you can just pass the array the second argument to printf and save yourself typing everything out. I've also prepended the two other items from your example with unshift:
unshift(#line, $i-1, "BEAD");
printf "%6.d %10s %15.6f %28.6f %39.6f\n", $line;
Note that the %s placeholders don't have the . precision specifier, so leave it out for that. If you want the e-notation for the numbers, use %e or %g instead of %f (%39.6e).
Also, for Perl questions, always check out Perl Monks - much of this answer was culled from a question there.
P.S. Given one of your example columns, here's the proof-of-concept script I tried to make sure everything worked:
perl -e '#line = (8.14225e-02,8.10216e-02,2.76423e-01);
unshift(#line, 4, "BEAD");
printf "%6.d %10s %15.6f %28.6f %39.6e\n", #line;'

how to put a file into an array and save it in perl

Hello everyone I'm a beginner in perl and I'm facing some problems as I want to put my strings starting from AA to \ in to an array and want to save it. There are about 2000-3000 strings in a txt file starting from same initials i.e., AA to / I'm doing it by this way plz correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
# End of group - save to data array and start new hash
$data[$nrecs++] = $hashref;
$hashref = { };
}
else
{
m/^([A-Z]+)\s+(.*)$/;
$hashref->{$1} = $2;
}
}
foreach my $i (0..$nrecs-1)
{
print "Record $i:\n";
foreach my $key (sort keys $data[$i])
{
print " $key = $data[$i]->{$key}\n";
}
}
print Data::Dumper->Dump([ \#data ], [ '#data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$#data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out because you do not reset the flag whenever you find an break-line. You'll have to parse you input and exit the loop if necessary.
And second you never read any input within this loop and thus process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
append item to array
else
save array to file
start with new array
end
I believe what you want is to split the files content at \ though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandlers read perlvar
#!perl
use strict;
use warnings;
my $content;
{
open my $fh, '<', 'test.txt';
local $/; # slurp mode
$content = <$fh>;
close $fh;
}
my #blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my #blocks;
{
open my $fh, '<', 'test.txt';
local $/ = '\\'; # seperate at \
#blocks = <$fh>;
close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the #data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA",
and save that into $line
Concatenate every new line from the
file into $line
When you find a line that begins with
a backslash \, stop concatenating
lines and save $line into #data.
Then, find the next line that begins
with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;
my $line="";
my #data;
while (<DATA>) {
if (/^AA/) {
$line = $_;
while (<DATA>) {
$line .= $_;
last if /^\\/;
}
}
push #data, $line;
}
use Data::Dumper;
print Dumper \#data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\

How to parse a file, create records and perform manipulations on records including frequency of terms and distance calculations

I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages forums. I am not advanced with Perl subs or modules (including Bioperl) so please limit responses to an appropriate 'beginner level' so that I can understand and learn from your suggestions and/or code (also limit "Magic" please).
The requirements of the program are as follows:
Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:
• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )
Prompt for one of three commands: freq, length, density d (d is some number):
• freq - how many of each type of atom is in the file (example Nitrogen, Sodium, etc would be displayed like this: N: 918 S: 23
• length - The distances among coordinates
• density d (where d is a number) - program will prompt for the name of a file to save computations to and will containing the distance between that atom and every other atom. If that distance is less than or equal to the number d, it increments the count of the number of atoms that are within that distance, unless that count is zero into the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.
I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.
The program structure and function descriptions as I see it:
$^W = 1; # turn on warnings
use strict; # behave!
my #fields;
my #recs;
while ( <DATA> ) {
chomp;
#fields = split(/\s+/);
push #recs, makeRecord(#fields);
}
for (my $i = 0; $i < #recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
freq => \&freq,
length => \&length,
density => \&density,
help => \&help,
quit => \&quit
);
print "Enter a command: ";
while ( <STDIN> ) {
chomp;
my #line = split( /\s+/);
my $command = shift #line;
if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
print "Command must be: freq, length, density or quit\n";
}
else {
$command_table{$command}->();
}
print "Enter a command: ";
}
sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
my %record =
(
serialnumber => shift,
aminoacid => shift,
coordinates => shift,
element => [ #_ ]
);
return\%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequences of
# these atoms in the array.
sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.
sub density
# take an array of atom records and a number d and will return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
print "To use this program, type either\n",
"freq\n",
"length\n",
"density followed by a number, d,\n",
"help\n",
"quit\n";
}
sub quit
{
exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.
Enable warnings with use warnings rather than $^W = 1. The former is self-documenting and has the advantage being local to the enclosing block rather than being a global setting.
Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special $_. For example:
while (my $input_record = <DATA>){
}
In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your length subroutine a different name. That name is already used by the built-in length function.
Regarding your question about makeRecord, one approach is to ignore the filtering issue inside makeRecord. Instead, makeRecord could include an additional hash field, and the filtering logic would reside elsewhere. For example:
my $record = makeRecord(#fields);
push #recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
An illustration of some of the points above:
use strict;
use warnings;
run();
sub run {
my $atom_data = load_atom_data();
print_records($atom_data);
interact_with_user($atom_data);
}
...
sub interact_with_user {
my $atom_data = shift;
my %command_table = (...);
while (1){
print "Enter a command: ";
chomp(my $reply = <STDIN>);
my ($command, #line) = split /\s+/, $reply;
if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
# Run the command.
}
else {
# Print usage message for user.
}
}
}
...
FM's answer is pretty good. I'll just mention a couple of additional things:
You already have a hash with the valid commands (which is a good idea). There's no need to duplicate that list in a regex. I'd do something like this:
if (my $routine = $command_table{$command}) {
$routine->(#line);
} else {
print "Command must be: freq, length, density or quit\n";
}
Notice I'm also passing #line to the subroutine, because you'll need that for the density command. Subroutines that don't take arguments can just ignore them.
You could also generate the list of valid commands for the error message by using keys %command_table, but I'll leave that as an exercise for you.
Another thing is that the description of the input file mentions column numbers, which suggests that it's a fixed-width format. That's better parsed with substr or unpack. If a field is ever blank or contains a space, then your split will not parse it correctly. (If you use substr, be aware that it numbers columns starting at 0, when people often label the first column 1.)