Reading two lines of data from a file - perl

I have a file which I would like to read data from. This is a sample of the data:
NODELOAD 28 27132 3.29108E+04 7.94536E+04 0.00000E+00
NODELOAD 29 27083 9.89950E+04 9.89950E+04 0.00000E+00
NODELOAD 29 27132 6.08112E+04 6.08112E+04 0.00000E+00
NODELOAD 30 27083 1.29343E+05 5.35757E+04 0.00000E+00
NODELOAD 30 27132 7.94536E+04 3.29108E+04 0.00000E+00
NODELOAD 31 68 4.80185E+04 -5.47647E+04 -1.17033E+04
-1.27646E+03 1.18350E+04 -2.03885E+03
NODELOAD 31 1114 1.20706E+05 -3.31323E+04 -7.17280E+04
2.28198E+03 2.75582E+04 5.74460E+02
I have this code and am able to read all values of a single line and save them to an array:
foreach my $line (@input) {
    if ($line =~ /^\s*NODELOAD\s+/i) {
        $line =~ s/^\s*//;
        my @a = split(/\s+/, $line);
        $modelData{"NODELOAD"}->{$a[1]}->{$a[2]}->{"Fx"} = $a[3];
        $modelData{"NODELOAD"}->{$a[1]}->{$a[2]}->{"Fy"} = $a[4];
        $modelData{"NODELOAD"}->{$a[1]}->{$a[2]}->{"Fz"} = $a[5];
    }
}
However, some "NODELOAD" definitions in the file span two lines and have 6 load values instead of 3 (the first two numbers after NODELOAD are identifiers; the following 3 or 6 are data).
Is the easiest approach an if statement that saves the data when the following line does not begin with "NODELOAD" but does contain numbers? The very last line after this part of the text file will not contain any numbers, but may be blank or contain text.

Yes, the easiest approach is to keep the values from the previous line in a variable; then, if /NODELOAD/ doesn't match, you have just 3 values and can process them using the identifiers from the previous line (and previous loop iteration). You can also skip the regexp in the if and check the first element of the split result instead:
my @last_values;
foreach my $line (@input) {
    $line =~ s/^\s+//;
    my @values = split(/\s+/, $line);
    if ( $values[0] ne 'NODELOAD' ) {
        unshift( @values, @last_values[0..2] ); # Get first 3 values from previous line
        # Then process it however you'd like to
        $modelData{"NODELOAD"}->{$values[1]}->{$values[2]}->{"Fx2"} = $values[3];
    }
    else {
        # process like previously...
        $modelData{"NODELOAD"}->{$values[1]}->{$values[2]}->{"Fx"} = $values[3];
        $modelData{"NODELOAD"}->{$values[1]}->{$values[2]}->{"Fy"} = $values[4];
        $modelData{"NODELOAD"}->{$values[1]}->{$values[2]}->{"Fz"} = $values[5];
        @last_values = @values; # and save for future reference
    }
}
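For completeness, here's a minimal self-contained sketch of that idea (the Fy2/Fz2 key names for the extra three components are my assumption, and it reads from DATA just for the demo):

use strict;
use warnings;

my %modelData;
my @last_values;

while (my $line = <DATA>) {
    $line =~ s/^\s+//;
    next unless $line =~ /\S/;                  # skip blank lines
    my @values = split /\s+/, $line;
    if ($values[0] eq 'NODELOAD') {
        # first line of a record: 2 identifiers + 3 loads
        my (undef, $case, $node, @f) = @values;
        @{ $modelData{NODELOAD}{$case}{$node} }{qw(Fx Fy Fz)} = @f;
        @last_values = @values;                 # remember the identifiers
    }
    elsif ($values[0] =~ /^[-\d.]/) {
        # continuation line: 3 more loads for the previous record
        my ($case, $node) = @last_values[1, 2];
        @{ $modelData{NODELOAD}{$case}{$node} }{qw(Fx2 Fy2 Fz2)} = @values[0 .. 2];
    }
    else {
        last;                                   # trailing text ends the block
    }
}

print "Fz2 for case 31, node 68: $modelData{NODELOAD}{31}{68}{Fz2}\n";

__DATA__
NODELOAD 30 27132 7.94536E+04 3.29108E+04 0.00000E+00
NODELOAD 31 68 4.80185E+04 -5.47647E+04 -1.17033E+04
-1.27646E+03 1.18350E+04 -2.03885E+03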

Add new hash keys and then print in a new file

Previously, I posted a question looking for an answer on using a regex to match specific sequence identifiers (IDs).
Now I'm looking for some recommendations on how to print the data I'm looking for.
If you want to see the complete file, here's a GitHub link.
The script takes two files. The first file is something like this (this is only a part of the file):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
What matters in this file is the value in the sixth column: when it is >= 0.5, my script takes the first column (an ID, to match against the second file) and the fourth column (the position of a letter in the second file).
Here's my second file (this is only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
Like I said previously, the script takes the ID in the first file and matches it with the ID in the second file; when they are the same, it searches for the position (fourth column) in the sequence data.
Here's an example: in file one, the fourth row has a positive value (>= 0.5) and the position in the fourth column is 23.
The script then searches for position 23 in the sequence of the second file; here, position 23 is the letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
When the script finds that letter, it takes the 2 letters to the right and the 2 letters to the left of the position of interest:
IVTLP
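To make the indexing explicit (positions in file one are 1-based, while Perl's substr is 0-based), the 5-letter window around position 23 starts at offset $position - 3; a quick check:

my $seq      = "MTYSVFPLMCLLTFIGANAKIVTLPGNDA";
my $position = 23;                          # 1-based position of the T
print substr($seq, $position - 3, 5), "\n"; # prints "IVTLP"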
In the previous post, thanks to the help of some people on Stack Overflow, I solved a problem caused by a difference between the IDs in each file (a difference like this: AGY29650_2_NA in file one vs. AGY29650.2 in file two).
Now I'm looking for help to obtain the output that I need to complete the script.
The script is incomplete because I couldn't find a way to print the output of interest: in this case, 5 letters from the second file (the letter at the position given in file one, plus the 2 letters to its right and the 2 to its left).
I have thousands of files like these two, so I need some help to complete the script, with whatever approach you recommend.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my @array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
                                -format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
    my $dd = $seq_obj->id;
    my $ss = $seq_obj->seq;
    ###my $ee = $seq_obj->desc;
    $fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Select values in column 6 (index 5) with the following condition: >= 0.5

# Open file
open (F, $file) or die; ### open the file or end the analysis
while (my $one = <F>) { ### read line from F
    $one =~ s/\n//g;
    $one =~ s/\r//g;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar(@cols) == 7); ### the line must have 7 columns to add to the array
    my $val = $cols[5];

    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 6);
        }
    }
}
close F;
I'm thinking of adding a push to collect the new data and then print it to a new file.
My expected output is, for each positive value (>= 0.5), the letter at that position (here T, at position 23) plus the 2 letters to its right and the 2 to its left.
In this case, with the data example in GitHub (link above) the expected output is:
IVTLP
Any recommendation or help is welcome.
Thanks!
The main problem seems to be that the lines have 8 columns, not 7 as assumed in the script. Another small issue is that the extracted substring should have 5 characters, not 6 as assumed by the script. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analysis
while (my $one = <F>) { ### read line from F
    chomp $one;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
    my $val = $cols[5];
    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 5);
            print $subresidues, "\n";
        }
    }
}
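Assuming the script is saved as, say, extract_sites.pl and the two inputs are named as below (all three names are hypothetical), it is run with the table file first and the FASTA file second, since the script reads $ARGV[0] as the table and $ARGV[1] as the sequences:

perl extract_sites.pl scores.txt sequences.fasta

With the example data from the question, this prints IVTLP.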

Perl variable not assigned in foreach: scope issues

I am trying to normalize some scores from a .txt file by dividing the score for each possible sense (e.g. take#v#2, referred to as $tokpossense in my code) by the sum of all scores for a wordtype (e.g. take#v, referred to as $tokpos). The difficulty is in grouping the wordtypes together while processing the file line by line, so that the normalized scores are printed upon finding a new wordtype/$tokpos. I used two hashes and an if block to achieve this.
Currently, the problem seems to be that $tokpos is undefined as a key in $SumHash{$tokpos} at line 20, resulting in a division by zero. However, I believe $tokpos is properly defined within the scope of this block. What exactly is the problem, and how would I best solve it? I would also gladly hear alternative approaches to this problem.
Here's an example input file:
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
    take#v#1: 17
    take#v#10: 158
    take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
    check#n#1: 198
    check#n#2: 117
    check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
    bank#n#1: 81
    bank#n#2: 202
    bank#n#3: 68
    bank#n#4: 37
Winning score: 202
My erroneous code:
@files = @ARGV;
foreach $file (@files) {
    open(IN, $file);
    @lines = <IN>;
    foreach (@lines) {
        chomp;
        #store tokpossense (eg. "take#v#1") and rawscore (eg. 4)
        if (($tokpossense,$rawscore) = /^\s{4}(.+): (\d+)/) {
            #split tokpossense for recombination
            ($tok,$pos,$sensenr) = split(/#/,$tokpossense);
            #tokpos (eg. take#v) will be a unique identifier when calculating normalized score $tokpos="$tok\#$pos";
            #block for when new tokpos(word) is found in inputfile
            if (defined($prevtokpos) and
                ($tokpos ne $prevtokpos)) {
                # normalize hash: THE PROBLEM LIES IN $SumHash{$tokpos} which is returned as zero > WHY?
                foreach (keys %ScoreHash) {
                    $normscore = $ScoreHash{$_}/$SumHash{$tokpos};
                    #print the results to a file
                    print "$_\t$ScoreHash{$_}\t$normscore\n";
                }
                #empty hashes
                undef %ScoreHash;
                undef %SumHash;
            }
            #prevtokpos is assigned to tokpos for condition above
            $prevtokpos = $tokpos;
            #store the sum of scores for a tokpos identifier for normalization
            $SumHash{$tokpos} += $rawscore;
            #store the scores for a tokpossense identifier for normalization
            $ScoreHash{$tokpossense} = $rawscore;
        }
        #skip the irrelevant lines of inputfile
        else { next; }
    }
}
Extra info: I am doing word sense disambiguation using Pedersen's WordNet WSD tool, which uses Wordnet::Similarity::AllWords. The output file is generated by that package, and the scores it finds have to be normalized for implementation in our toolset.
You don't assign anything to $tokpos: the assignment is part of a comment. Syntax highlighting in your editor should've told you; strict would've told you, too.
Also, you should probably use $prevtokpos in the division: $tokpos is the new value that you haven't met before. To get the output for the last token, you have to process it outside the loop, as there's no following $tokpos to trigger it. To avoid code repetition, use a subroutine for that:
#!/usr/bin/perl
use warnings;
use strict;

my %SumHash;
my %ScoreHash;

sub output {
    my $token = shift;
    for (keys %ScoreHash) {
        my $normscore = $ScoreHash{$_} / $SumHash{$token};
        print "$_\t$ScoreHash{$_}\t$normscore\n";
    }
    undef %ScoreHash;
    undef %SumHash;
}

my $prevtokpos;
while (<DATA>) {
    chomp;
    if (my ($tokpossense, $rawscore) = /^\s{4}(.+): (\d+)/) {
        my ($tok, $pos, $sensenr) = split /#/, $tokpossense;
        my $tokpos = "$tok\#$pos";
        if (defined $prevtokpos && $tokpos ne $prevtokpos) {
            output($prevtokpos);
        }
        $prevtokpos = $tokpos;
        $SumHash{$tokpos} += $rawscore;
        $ScoreHash{$tokpossense} = $rawscore;
    }
}
output($prevtokpos);
__DATA__
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
    take#v#1: 17
    take#v#10: 158
    take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
    check#n#1: 198
    check#n#2: 117
    check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
    bank#n#1: 81
    bank#n#2: 202
    bank#n#3: 68
    bank#n#4: 37
Winning score: 202
You're confusing yourself by trying to print the results as soon as $tokpos changes. For one thing, it's the values for $prevtokpos that are complete, but you're trying to output the data for $tokpos; and you're also never going to display the last block of data, because you require a change in $tokpos to trigger the output.
It's far easier to accumulate all the data for a given file and then print it when the end of file is reached. This program works by keeping the three values $tokpos, $sense, and $rawscore for each line of the output in array @results, together with the total score for each value of $tokpos in %totals. Then it's simply a matter of dumping the contents of @results with an extra column that divides each value by the corresponding total.
use strict;
use warnings;
use 5.014;    # For non-destructive substitution

for my $file ( @ARGV ) {
    open my $fh, '<', $file or die $!;
    my (@results, %totals);
    while ( <$fh> ) {
        chomp;
        next unless my ($tokpos, $sense, $rawscore) = / ^ \s{4} ( [^#]+ \# [^#]+ ) \# (\d+) : \s+ (\d+) /x;
        push @results, [ $tokpos, $sense, $rawscore ];
        $totals{$tokpos} += $rawscore;
    }
    print "** $file **\n";
    for my $item ( @results ) {
        my ($tokpos, $sense, $rawscore) = @$item;
        printf "%s#%s\t%s\t%6.4f\n", $tokpos, $sense, $rawscore, $rawscore / $totals{$tokpos};
    }
    print "\n";
}
output
** tokpos.txt **
take#v#1 17 0.0487
take#v#10 158 0.4527
take#v#17 174 0.4986
check#n#1 198 0.5546
check#n#2 117 0.3277
check#n#3 42 0.1176
bank#n#1 81 0.2088
bank#n#2 202 0.5206
bank#n#3 68 0.1753
bank#n#4 37 0.0954

Perl - while loop not working

I'm a Perl rookie and don't know how to do this...
My input file:
random text 00:02 23
random text 00:04 25
random text 00:06 53
random text 00:07 56
random text 00:12 34
... etc until 23:59
I would like to have the following output:
00:00
00:01
00:02 23
00:03
00:04
00:05
00:06 53
00:07 56
00:08
00:09
00:10
00:11
00:12 34
... etc until 23:59
So I want an output file with a timestamp for every minute and the corresponding value when one is found in the input file. My input file starts at 00:00 and ends at 23:59.
My code so far:
use warnings;
use strict;

my $found;
my @event;
my $count2;

open (FILE, '<./input/input.txt');
open (OUTPUT, '>./output/output.txt');

while (<FILE>) {
    for ($count2 = 0; $count2 < 60; $count2++) {
        my($line) = $_;
        if ($line =~ m|.*(00:$count2).*|) {
            $found = "$1 \n";
            push @event, $found;
        }
        if (@event) {
        }
        else {
            $found2 = "00:$count2,";
            push @event, $found2;
        }
    }
}
print OUTPUT (@event);
close (FILE);
close (OUTPUT);
Here's one approach to your task:
use strict;
use warnings;

my %hash;

open my $inFH, '<', './input/input.txt' or die $!;
while (<$inFH>) {
    my ( $hr_min, $sec ) = /(\d\d:\d\d)\s+(.+)$/;
    push @{ $hash{$hr_min} }, $sec;
}
close $inFH;

open my $outFH, '>', './output/output.txt' or die $!;
for my $hr ( 0 .. 23 ) {
    for my $min ( 0 .. 59 ) {
        my $hr_min = sprintf "%02d:%02d", $hr, $min;
        my $sec = defined $hash{$hr_min} ? " ${ $hash{$hr_min} }[-1]" : '';
        print $outFH "$hr_min$sec\n";
    }
}
close $outFH;
The first part reads your input file and uses a regex to grab the time at the end of each string. A hash of arrays (HoA) is built, with HH:MM as the key and the seconds in the array. For example:
09:14 => ['21','45']
This means that at 09:14 there were two seconds entries: one at 21 seconds and the other at 45 seconds. Since the times in the input file are in ascending order, the highest one in the array can be obtained with the [-1] subscript.
Next, two loops are set up: the outer is (0..23) and the inner (0..59), and sprintf is used to format the HH:MM. When a key is found in the hash that corresponds to the current HH:MM of the loops, the HH:MM and the last item in the array (the largest seconds) are printed to the file (e.g., 00:02 23). If there isn't a corresponding HH:MM in the hash, just the loop's HH:MM is printed (e.g., 00:03):
Sample output:
00:00
00:01
00:02 23
00:03
00:04 25
00:05
00:06 53
00:07 56
00:08
00:09
00:10
00:11
00:12 34
...
23:59
Hope this helps!
This is best done with a hash, as Kenosis has already shown. There are some simplifications/improvements that can be made, however.
By using plain assignment = we store the latest value for each time, because identical hash keys overwrite each other.
The range operator .. can also increment strings, so we can get a range of strings such as 00, 01, ... 59.
The defined-or operator // is a more concise way to check whether a value for a certain time is defined.
Using \d+ rather than .+ is much safer, as it prevents a line such as hindsight is 20:20 at 01:23 45 from incorrectly matching 20:20.
We do not hardcode file names; instead we use shell redirection and arguments.
In the example code below, I used a smaller range of numbers for demonstration purposes. I also used the DATA file handle so that this code can be copied, pasted, and tried out. To try it, change <DATA> to <> and run it like this:
perl script.pl input.txt > output.txt
Code:
use strict;
use warnings;
use feature 'say';
my %t;
while (<DATA>) {
    if (/((\d{2}:\d{2})\s+\d+)$/) {
        $t{$2} = $1; # store most recent value
    }
}
for my $h ('00' .. '00') {
    for my $m ('00' .. '12') {
        my $time = "$h:$m";
        say $t{$time} // $time; # say defined $t{$time} ? $t{$time} : $time;
    }
}
__DATA__
random text 00:02 23
random text 00:04 25
random text 00:06 53
random text 00:07 56
random text 00:12 34
random text 00:12 39
Output:
00:00
00:01
00:02 23
00:03
00:04 25
00:05
00:06 53
00:07 56
00:08
00:09
00:10
00:11
00:12 39

some help on the following perl script

I need help with merging/concatenating/combining/binding files.
I have several ASCII files, each defining one variable, which I have converted to single-column arrays.
I have such column data for many variables, so I need to perform a column bind, like R does, and make one single file.
I can do the same in R, but there are too many files; being able to do it with one script will help save a lot of time.
I'm using the following code. I'm new to Perl and need help with this.
@filenames = ("file1.txt","file2.txt");
open F2, ">file_combined.txt" or die;
for ($j = 0; $j < scalar @filenames; $j++) {
    open F1, $filenames[$j] or die;
    for ($i = 1; $i <= 6; $i++) { $line = <F1>; }
    while ($line = <F1>) {
        chomp $line;
        @spl = split '\s+', $line;
        for ($i = 0; $i < scalar @spl; $i++) {
            print F2 "$spl[$i]\n";
            paste "file_bio1.txt","file_bio2.txt" > file_combined.txt;
        }
    }
    close F1;
}
The input files here are ASCII text files of a raster. They look like this:
32 12 34 21 32 21 22 23
12 21 32 43 21 32 21 12
The above code, without the paste line, converts these files into a single column:
32
12
34
21
32
21
22
23
12
21
32
43
21
32
21
12
The output should look like this
12 21 32
32 23 23
32 21 32
12 34 12
43 32 32
32 23 23
32 34 21
21 32 23
Each column represents a different ASCII file.
I need to combine around 15 such ASCII files into one data frame. I can do the same in R, but it consumes a lot of time, as the number of files and regions of interest is large and the files are fairly big too.
Let's step through what you have...

# files you want to open for reading..
@filenames = ("file1.txt","file2.txt");

# I would use the 3-arg, lexically scoped open.
# I think you want to open this for 'append' as well:
# open($fh, ">>", "file_combined.txt") or die "cannot open";
open F2, ">file_combined.txt" or die;

# @filenames is best thought of as a 'list'
# for my $file (@filenames) {
for ($j = 0; $j < scalar @filenames; $j++) {

    # see the example of 'open' above
    # - replace $filenames[$j] with $file
    open F1, $filenames[$j] or die;

    # what are you trying to do here? You're overriding
    # $line in the next 'while' loop
    for ($i = 1; $i <= 6; $i++) { $line = <F1>; }

    # while (<$fh1>) {
    while ($line = <F1>) {
        chomp $line;

        # @spl is short for split?
        # give the '@spl' list a meaningful name
        @spl = split '\s+', $line;

        # again, @spl is a list...
        # for my $word (@spl) {
        for ($i = 0; $i < scalar @spl; $i++) {

            # this whole block is a bit confusing.
            # 'F2' is 'file_combined.txt'. Then you try to merge
            # (and overwrite the file) with the paste afterwards...
            print F2 "$spl[$i]\n";

            # is this a 'system call'?
            # Missing 'backticks' or 'system'
            paste "file_bio1.txt","file_bio2.txt" > file_combined.txt;
        }
    }
    # close $fh1
    close F1;
}
# I'm assuming there's a 'close F2' somewhere here..
It looks like you're trying to do this:
@filenames = ("file1.txt","file2.txt");
$outfile = "combined_text.txt";
`paste $filenames[0] $filenames[1] > $outfile`;
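If you'd rather stay in pure Perl (for example, on a system without paste), a minimal sketch of the same column bind could look like this; it assumes every input file has already been flattened to one value per line, as your inner loop does:

use strict;
use warnings;

my @filenames = ("file1.txt", "file2.txt");

# open every input file
my @handles;
for my $name (@filenames) {
    open my $fh, '<', $name or die "cannot open $name: $!";
    push @handles, $fh;
}

open my $out, '>', 'file_combined.txt' or die "cannot open output: $!";

# read one line from each file per output row, like paste(1)
OUTER: while (1) {
    my @row;
    for my $fh (@handles) {
        my $line = <$fh>;
        last OUTER unless defined $line;   # stop at the shortest file
        chomp $line;
        push @row, $line;
    }
    print $out join(" ", @row), "\n";
}
close $out;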

Optimize Perl script - runs too slow on 40GB+ files

I made the following Perl script to handle some file manipulation at work, but it's currently running far too slowly to be put into production.
I don't know Perl very well (it's not one of my languages), so can someone help me identify and replace the parts of this script that would be slow, given that it's processing ~40 million lines?
Data being piped in is in the format:
col1|^|col2|^|col3|!|
col1|^|col2|^|col3|!|
... 40 million of these.
The date_cols array is calculated before this part of the script and basically holds the indexes of the columns containing dates in the pre-converted format.
Here's the part of the script that is executed for every input row. I've cleaned it up a little and added comments, but let me know if anything else is needed:
## Read from STDIN until no more lines are available.
while (<STDIN>)
{
    ## Split by field delimiter
    my @fields = split('\|\^\|', $_, -1);

    ## Remove the terminating delimiter from the final field so it doesn't
    ## interfere with date processing.
    $fields[-1] = (split('\|!\|', $fields[-1], -1))[0];

    ## Cycle through all column numbers in date_cols and convert date
    ## to yyyymmdd
    foreach $col (@date_cols)
    {
        if ($fields[$col] ne "")
        {
            $fields[$col] = formatTime($fields[$col]);
        }
    }

    print(join('This is an unprintable ASCII control code', @fields), "\n");
}

## Format the input time to yyyymmdd from a 'Dec 26 2012 12:00AM' like format.
sub formatTime($)
{
    my $col = shift;

    if (substr($col, 4, 1) eq " ") {
        substr($col, 4, 1) = "0";
    }
    return substr($col, 7, 4).$months{substr($col, 0, 3)}.substr($col, 4, 2);
}
If written purely for efficiency, I'd write your code like this:
sub run_loop {
    local $/ = "|!|\n";   # set the input record separator
                          # to the record separator of our problem space
    while (<STDIN>) {
        # remove the separator
        chomp;

        # Split by field delimiter
        my @fields = split m/\|\^\|/, $_, -1;

        # Cycle through all column numbers in date_cols and convert date
        # to yyyymmdd
        foreach $col (@date_cols) {
            if ($fields[$col] ne "") {
                # $fields[$col] = formatTime($fields[$col]);
                my $temp = $fields[$col];
                if (substr($temp, 4, 1) eq " ") {
                    substr($temp, 4, 1) = "0";
                }
                $fields[$col] = substr($temp, 7, 4).$months{substr($temp, 0, 3)}.substr($temp, 4, 2);
            }
        }
        print join("\022", @fields) . "\n";
    }
}
The optimizations are:
Using chomp to remove the |!|\n string at the end.
Inlining the formatTime sub.
Subroutine calls are extremely expensive in Perl. If subs have to be called very efficiently, prototype checking can be disabled with the &subroutine(@args) syntax. If @args is omitted, the current arguments @_ are visible to the called sub. This can lead to bugs, or to additional performance. Use wisely. The goto &subroutine; syntax can be used as well, but this meddles with return (it is basically a tail call). Do not use it.
Further optimizations could include removing the %months hash lookup, as hashing is expensive.
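If you want to check whether these micro-optimizations actually pay off on your data, the core Benchmark module can compare the two variants; here's a rough sketch (the sample date and the stripped-down %months are mine):

use strict;
use warnings;
use Benchmark qw(cmpthese);

my %months = (Dec => '12');   # abbreviated for the example
my $date   = 'Dec 26 2012 12:00AM';

sub format_time {
    my $col = shift;
    substr($col, 4, 1) = "0" if substr($col, 4, 1) eq " ";
    return substr($col, 7, 4) . $months{substr($col, 0, 3)} . substr($col, 4, 2);
}

# run each variant for at least 2 CPU seconds and compare rates
cmpthese(-2, {
    sub_call => sub { my $x = format_time($date) },
    inline   => sub {
        my $col = $date;
        substr($col, 4, 1) = "0" if substr($col, 4, 1) eq " ";
        my $x = substr($col, 7, 4) . $months{substr($col, 0, 3)} . substr($col, 4, 2);
    },
});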
You'll have to benchmark on your data set to compare, but you can throw a regex at it. (Made all the worse by your very regex-unfriendly field and record separators!)
my $i = 0;
our %months = map { $_ => sprintf('%02d', ++$i) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

while (<DATA>) {
    s! \|\^\| !\022!xg;  # convert field separator
    s/ \| !\| $ //xg;    # strip record terminator
    s/\b(\w{3}) ( \d|\d\d) (\d{4}) \d\d:\d\d[AP]M\b/${3} . $months{$1} . sprintf('%02d', $2)/eg;
    print;
}
Won't do what you want if one of the non-@date_cols fields matches the date regex.
At my work I sometimes need to grep error logs etc. from 350+ frontends. I use a script template I call "SMP grep" ;) It's simple:
stat the file, get the file length
Get the "chunk length" = file_length / num_processors
Adjust the chunk starts and ends so they fall on a "\n". Just read(), find "\n", and calculate the offsets.
fork() to make num_processors workers, each working on its own chunk
This can help if you use regexps in your grep or other CPU-heavy operations (as in your case, I think). Admins complain that this script eats disk throughput, but that is the only bottleneck here if the server has 8 CPUs =) Also, obviously, if you need to parse a week of data you can divide it between servers.
Tomorrow I can post the code if you're interested.
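In the meantime, here's a minimal sketch of that idea (assumptions: a fixed worker count of 8, and a simple line counter standing in for the real per-chunk grep work):

use strict;
use warnings;

my $file    = shift or die "usage: $0 file\n";
my $workers = 8;                 # assumed CPU count
my $size    = -s $file;
my $chunk   = int($size / $workers);

# compute chunk boundaries, snapped forward to the next "\n"
open my $fh, '<', $file or die $!;
my @offsets = (0);
for my $i (1 .. $workers - 1) {
    seek $fh, $i * $chunk, 0;
    <$fh>;                       # read to the end of the current line
    push @offsets, tell $fh;
}
push @offsets, $size;
close $fh;

# fork one worker per chunk
for my $i (0 .. $workers - 1) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    next if $pid;                # parent keeps forking
    open my $in, '<', $file or die $!;
    seek $in, $offsets[$i], 0;
    my $count = 0;
    while (<$in>) {
        $count++;                # the real work (grep, regex) goes here
        last if tell($in) >= $offsets[$i + 1];
    }
    print "worker $i: $count lines\n";
    exit;
}
wait for 1 .. $workers;          # reap all workers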