Math::BaseConvert with strange basenumbers like 23 or 1000 - perl

I just found the perl module Math::BaseConvert. I have to the task to convert numbers to very strange number with a different base. Not only base 2, 8, 16, but also 23, 134 up to 1000. (This is a partial task to balance a tree of files in a directory)
I could not make it. Reading the tests for the module in CPAN also confused me. So I wrote a little test, maybe you can tell me what's wrong, the result is:
ok 1 - use Math::BaseConvert;
ok 2 - Convert number '23' (base10) to '27' (base8)
not ok 3 - Convert number '23' (base10) to '23' (base32)
# Failed test 'Convert number '23' (base10) to '23' (base32)'
# at test_math_baseconvert.pl line 35.
# got: 'N'
# expected: '23'
not ok 4 - Convert number '64712' (base10) to '64:712' (base1000)
# Failed test 'Convert number '64712' (base10) to '64:712' (base1000)'
# at test_math_baseconvert.pl line 35.
# got: '-1'
# expected: '64:712'
1..4
# Looks like you failed 2 tests of 4.
The testprogram is this:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;
use_ok( 'Math::BaseConvert', '1.7' );
my #lines = (
{
# http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+16
old_number => '23',
old_base => 10,
new_number => '27',
new_base => 8,
},
{
# http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+32
old_number => '23',
old_base => 10,
new_number => '23', # stays same
new_base => 32,
},
{
# http://www.wolframalpha.com/input/?i=64712+from+base+10+to+base+1000
old_number => '64712',
old_base => 10,
new_number => '64:712',
new_base => 1000,
},
);
for my $line (#lines) {
cmp_ok(
Math::BaseConvert::cnv(
$line->{old_number}, $line->{old_base}, $line->{new_base}
),
'eq',
$line->{new_number},
sprintf(
"Convert number '%s' (base%d) to '%s' (base%d)",
$line->{old_number}, $line->{old_base},
$line->{new_number}, $line->{new_base}
)
);
}
done_testing();

Wolfram Alpha's method of showing bases greater than base 16 is to separate digits with a colon. There's nothing wrong with that, as they're displaying those numbers using css styling that lessons the shading of the colon to make it more obvious what they're doing. But they also add a message stating exactly how many digits they're showing since "1:1617 (2 digits)" isn't obvious enough.
The method Math::BaseConvert and other such modules use is to expand the character set for digits just like is done with hex numbers 0-9A-F to include the first 6 letters in the alphabet. For the case of base 32 numbers the character set is 0-9A-V. Given N is the 14th letter in the alphabet, it is the appropriate representation for 23 in base 32.
If you want to use the colon representation for numbers greater than 16, you can either use the module or just roll your own solution.
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;
use_ok( 'Math::BaseConvert', '1.7' );
my #lines = (
# Old_Number Old_Base New_Number New_Base
[qw(23 10 27 8)], # http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+16
[qw(23 10 23 32)], # http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+32
[qw(64712 10 64:712 1000)], # http://www.wolframalpha.com/input/?i=64712+from+base+10+to+base+1000
);
for my $line (#lines) {
cmp_ok(
base10toN(#$line[0,3]),
'eq',
$line->[2],
sprintf("Convert number '%s' (base%d) to '%s' (base%d)", $line->[0], 10, #$line[2,3])
);
}
sub base10toN {
my ($num, $base) = #_;
return Math::BaseConvert::cnv($num, 10, $base)
if $base <= 16;
my #digits = ();
while (1) {
my $remainder = $num % $base;
unshift #digits, $remainder;
$num = ($num - $remainder) / $base
or last;
}
return join ':', #digits;
}
done_testing();

You seem to be expecting decimal output, with "digits" being decimal numbers separated by :.
Math::BaseConvert doesn't do that. It only supports having a single character per digit.
By default, the digits used are '0'..'9', 'A'..'Z', 'a'..'z', '.', '_' though you can supply your own list instead (and you would have to do so to support up to base 1000).

Related

Perl: Get position and length of element in a string

Say I have a string like:
my $refseq="CCCC-TGA---ATAAAC--TCCAT-GCTCCCCC--------------------AAGC";
I want to detect the positions where "-" occurs and the number of contiguous "-". I want to end up with a hash with "-" position as key, and extension length as value, for this example above:
%POSLENGTH = (5 => 1, 8 => 3, 14 => 2, 19 => 1, 27 => 20);
Note that the positions should be given based on the string without "-".
Check for #- array in perlval
my $refseq = "CCCC-TGA---ATAAAC--TCCAT-GCTCCCCC--------------------AAGC";
my %POSLENGTH;
$POSLENGTH{ $-[0] +1 } = length($1) while $refseq =~ s/(-+)//;
use Data::Dumper; print Dumper \%POSLENGTH;
output
$VAR1 = {
'14' => 2,
'8' => 3,
'27' => 20,
'19' => 1,
'5' => 1
};
You can do this using the built-in #- and #+ arrays. Together they hold the start and end offsets of the last successful pattern match in element 0 (and of any captures in elements 1 onwards) so clearly the length of the last match is $+[0] - $-[0].
They're documented under Variables related to regular expressions in perldoc perlvar.
I've used Data::Dump here just to display the contents of the hash that is built
On a side note, I'm very doubtful that a hash is a useful structure for this information as I can't imagine a situation where you know the start position of a substring and need to know its length. I would have thought it was better represented as just an array of pairs
use strict;
use warnings;
use Data::Dump;
my $refseq="CCCC-TGA---ATAAAC--TCCAT-GCTCCCCC--------------------AAGC";
my %pos_length;
while ( $refseq =~ /-+/g ) {
my ($pos, $len) = ( $-[0] + 1, $+[0] - $-[0] );
$pos_length{$pos} = $len;
}
dd \%pos_length;
output
{ 5 => 1, 9 => 3, 18 => 2, 25 => 1, 34 => 20 }

How to print least significant 4 bits of signed integer in a file using Perl?

For example:
I have $a= -1. If I print it using printf with %.4b or %b, it gives me 32-bit all 1's.
But, I only want to print the least significant 4 bits like 1111 in the file in binary.
Any ideas how to do it?
Thanks
-1 in binary is represented via 2s complement, so it is all 1s. (See here for more: What is “2's Complement”?)
If you want to 'limit' it, then the way you can do this is with a bitwise and.
Switching on 4 bits is
1+2+4+8 = 15.
Therefore:
use strict;
use warnings;
my $val = -1;
printf ( "%b", $val & 15 );
%.4b refers to fractional digits, %04b formats to at least 4 digits, padding leading 0s as needed.
To cater for negative integers, take the modulus by 16 ( 2^<number of least significant bits> ).
my #b = (12, 59, -1, 1 ); # sample of integers
#b = map { $_ % 16; } #b; # take modulus
printf ("4-bits: %04b" . (", %04b" x $#b) . ";\n", #b );
# output with computed number of placeholders

Perl grep not returning expected value

I have the following code:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;
while (<PIVOTFILE>) { # loop through each line in file
next if ($. == 1); # skip first line (contains business segment code)
next if ($. == 2); # skip second line (contains transaction amount text)
my #fields = split('\t',$_); # split fields for line into an array
print scalar(grep $_, #fields), "\n";
}
Given that the data in the text file is this:
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
I would expect the output from the perl script to be: 2 3 3 4 given the amount of defined elements in each line. The file is a tab delimited text file with 8 columns.
Instead I get 3 4 3 4 and I have no idea why!
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
I suspect you have spaces mixed with the tabs in some places, and your grep test will consider " " true.
What does:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper [<PIVOTFILE>];
show?
The problem should be in this line:
my #fields = split('\t',$_); # split fields for line into an array
The tab character doesn't get interpolated. And your file doesn't seem to be tab-only separated, at least here on SO. I changed the split regex to match arbitrary whitespace, ran the code on my machine and got the "right" result:
my #fields = split(/\s+/,$_); # split fields for line into an array
Result:
2
3
3
4
As a side note:
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
Now I understand why you use grep to count array elements. That's important when your array contains undefined values like here:
my #a;
$a[1] = 42; # #a contains the list (undef, 42)
say scalar #a; # 2
or when you manually deleted entries:
my #a = split /,/ => 'foo,bar'; # #a contains the list ('foo', 'bar')
delete $a[0]; # #a contains the list (undef, 'bar')
say scalar #a; # 2
But in many cases, especially when you're using arrays to just store list without operating on single array elements, scalar #a works perfectly fine.
my #a = (1 .. 17, 1 .. 25); # (1, 2, ..., 17, 1, 2, .., 25)
say scalar #a; # 42
It's important to understand, what grep does! In your case
print scalar(grep $_, #fields), "\n";
grep returns the list of true values of #fields and then you print how many you have. But sometimes this isn't what you want/expect:
my #things = (17, 42, 'foo', '', 0); # even '' and 0 are things
say scalar grep $_ => #things # 3!
Because the empty string and the number 0 are false values in Perl, they won't get counted with that idiom. So if you want to know how long an array is, just use
say scalar #array; # number of array entries
If you want to count true values, use this
say scalar grep $_ => #array; # number of true values
But if you want to count defined values, use this
say scalar grep defined($_) => #array; # number of defined values
I'm pretty sure you already know this from the other answers on the linked page. In hashes, the situation is a little bit more complex because setting something to undef is not the same as deleteing it:
my %h = (a => 0, b => 42, c => 17, d => 666);
$h{c} = undef; # still there, but undefined
delete $h{d}; # BAM! $h{d} is gone!
What happens when we try to count values?
say scalar grep $_ => values %h; # 1
because 42 is the only true value in %h.
say scalar grep defined $_ => values %h; # 2
because 0 is defined although it's false.
say scalar grep exists $h{$_} => qw(a b c d); # 3
because undefined values can exist. Conclusion:
know what you're doing instead of copy'n'pasting code snippets :)
There are not only tabs, but there are spaces as well.
trying out with splitting by space works
Look below
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
while (<DATA>) { # loop through each line in file
next if ($. == 1); # skip first line (contains business segment code)
next if ($. == 2); # skip second line (contains transaction amount text)
my #fields = split(" ",$_); # split fields by SPACE
print scalar(#fields), "\n";
}
__DATA__
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
Output
2
3
3
4
Your code works for me. The problem may be that the input file contains some "hidden" whitespace fields (eg. other whitespace than tabs). For instance
A<tab><space><CR> gives two fields, A and <space><CR>
A<tab>B<tab><CR> gives three, A, B, <CR> (remember, the end of line is part of the input!)
I suggest you to chomp every line you use; other than that, you will have to clean the array from whitespace-only fields. Eg.
scalar(grep /\S/, #fields)
should do it.
A lot of great help on this question, and quickly too!
After a long, drawn-out learning process, this is what I came up with that worked quite well, with intended results.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;
while (<PIVOTFILE>) { # loop through each line in file
next if ($. == 1); # skip first line (contains business segment code)
next if ($. == 2); # skip second line (contains transaction amount text)
chomp $_; # clean line of trailing \n and white space
my #fields = split(/\t/,$_); # split fields for line into an array
print scalar(grep $_, #fields), "\n";
}

Getting fixed width columnar output using 'printf' in Perl [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Formatting output with 'printf' in Perl
my #selections = ("Hamburger","Frankfurter","French Fries","Large Coke","Medium Coke","Small Coke","Onion Rings");
my #prices = (3.49, 2.19, 1.69, 1.79, 1.59, 1.39, 1.19);
my #quantity = (3, 0, 0, 4, 0, 0, 8);
printf("%s %10s %12s %10s\n", "Qty", "Desc.", "Unit \$", "Total");
for($meh = 0; $meh <= 6; $meh++)
{
if($quantity[$meh] != 0)
{
printf("%d %10s %9.2f %7.2f\n", $quantity[$meh], $selections[$meh], $prices[$meh], $prices[$meh]*$quantity[$meh])
}
}
I can't figure out how to make the columns line up. I followed the suggestions of another post, but it still isn't working.
The problem is that your strings are more than 10 characters long, and Perl won't cut them unless you specify a maximum width, which is given after the dot for strings (%10.10s). Also, you may want to use a negative number so they become aligned to the left (%-10.10s).
If you want the columns to be exactly aligned based on dynamic input data, you need to make two passes over the rows. The first time through, record the maximum length of each column. Then construct a format string using those lengths. Finally, print each row using that format string.
use strict;
use warnings;
my #selections = ("Hamburger","Frankfurter","French Fries","Large Coke","Medium Coke","Small Coke","Onion Rings");
my #prices = (3.49, 2.19, 1.69, 1.79, 1.59, 1.39, 1.19);
my #quantity = (3, 0, 0, 4, 0, 0, 8);
my #rows;
push #rows, ["Qty", "Desc.", "Unit \$", "Total"];
# construct table data as a two-dimensional array
for (my $meh = 0; $meh < #selections; $meh++) {
next unless $quantity[$meh];
push #rows, [$quantity[$meh], $selections[$meh], $prices[$meh], $prices[$meh]*$quantity[$meh]];
}
# first pass over rows: compute the maximum width for each column
my #widths;
for my $row (#rows) {
for (my $col = 0; $col < #$row; $col++) {
$widths[$col] = length $row->[$col] if length $row->[$col] > ($widths[$col] // 0);
}
}
# compute the format. for this data, it works out to "%-3s %-11s %-6s %-5s\n"
my $format = join(' ', map { "%-${_}s" } #widths) . "\n";
# second pass: print each row using the format
for my $row (#rows) {
printf $format, #$row;
}
That yields this output:
Qty Desc. Unit $ Total
3 Hamburger 3.49 10.47
4 Large Coke 1.79 7.16
8 Onion Rings 1.19 9.52
Long time ago, Perl was mainly used for formatting files. It still has this capabilities although I haven't seen it used in a program since Perl 4.x came out.
Check out the perlform documentation, the format function, and the write function.
I'd give you an example on what the code would look like except I haven't done it in years. Otherwise, use the printf statement. You can limit the size of a text field with a %-10.10s type of format. This says to left justify the string, and pad it out to 10 characters, but not more than 10 characters.
I also suggest you get a book on modern Perl. One that will teach you about references.
I've rewritten your program to use references. Notice that all of the data is now in a single array instead of spread over four separate arrays that you hope you keep the index together.
I can talk about the ENTREE of $item[1] by saying $item[1]->{ENTREE}. It's easier to read and easier to maintain.
Also note that I've changed your for loop. In yours, you had to know that you had seven items. If you added a new item, you'd have to change your loop. In mine, I use $#menu to get the last index of my menu. I then use (0..$#menu) to automatically loop from 0 to the last item in the #menu array.
use strict;
use warnings;
use Data::Dumper;
my #menu = (
{ ENTREE => "Hamburger", PRICE => 3.49, QUANTITY => 3 },
{ ENTREE => "Frankfurter", PRICE => 2.19, QUANTITY => 0 },
{ ENTREE => "French Fries", PRICE => 1.69, QUANTITY => 0 },
{ ENTREE => "Large Coke", PRICE => 1.79, QUANTITY => 4 },
{ ENTREE => "Medium Coke", PRICE => 1.59, QUANTITY => 0 },
{ ENTREE => "Small Coke", PRICE => 1.39, QUANTITY => 0 },
{ ENTREE => "Onion Rings", PRICE => 1.19, QUANTITY => 8 },
);
printf "%-3.3s %-10.10s %-6.6s %s\n\n", 'Qty', 'Desc.', 'Unit $', 'Total';
# Use $#menu to get the number of items in the array instead of knowing it's 6
foreach my $item (0..$#menu) {
# Dereference $menu[$item] to make $menu_item a hash
# This makes the syntax easier to read.
my %menu_item = %{ $menu[$item] };
if ( $menu_item{QUANTITY} ) {
printf "%3d %-10.10s %9.2f %7.2f\n",
$menu_item{QUANTITY}, $menu_item{ENTREE}, $menu_item{PRICE},
$menu_item{QUANTITY} * $menu_item{PRICE};
}
}
OUTPUT:
Qty Desc. Unit $ Total
3 Hamburger 3.49 10.47
4 Large Coke 1.79 7.16
8 Onion Ring 1.19 9.52

Strategies to handle a file with multiple fixed formats

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).
I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.
Here's an example (commentary on RHS):
# | Format | Level | Comment
# +--------+-------+---------
**DEVICE 109523.69142 # 1 1 file-specific
.981 561A # 2 1
10/MAY/2010 24.15.30,13.45.03 # 3 2 group of records
05:03:01 AB23X 15.67 101325.72 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
05:03:15 CR22X 16.72 101325.42 # 4 3 new record
* 14 29.16264 11 0 # 5 3
06:23:51 AW41X 15.67 101323.9 # 4 3
* 14 31.26493219 0 # 5 3
11/MAY/2010 24.07.13,13.44.63 # 3 2 group of new records
15:57:14 AB23X 15.67 101327.23 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
15:59:59 CR22X 16.72 101331.88 # 4 3 new record
* 14 29.16264 11 0 # 5
The logic I have at the moment is fragile:
I know, for instance, that a Format 2 always comes after a Format 1, and that they only span 2 lines.
I also know that Formats 4 and 5 always come in pairs as they correspond to a single record. The number of records may be variable
I'm using regular expressions to infer the format of each line. However, this is risky and does not lend to flexibility in the future (when someone decides to change the format of the output).
The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.
Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:
while (<>) {
given($_) {
when (#{[ map $pattern{$_}, #expect]}) {}
default {
die "$0: line $.: expected " . join("|" => #expect) . "; got\n$_";
}
}
}
As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.
From this, you can infer that #expect is a list of names of formats we expect to find on the current line.
For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:
This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.
This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
(?{ $device->{attr1} = $1;
#expect = qw< fmt2 >;
})
/x,
After processing the input from your question, $device contains
{
'attr1' => '109523.69142',
'attr2' => '.981',
'attr3' => '561A',
'groups' => [
{
'date' => '10/MAY/2010',
'nnn' => [ '24.15.30', '13.45.03' ],
'records' => [
[ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474', '13', '0' ],
[ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264', '11', '0' ],
[ '06:23:51', 'AW41X', '15.67', '101323.9', '14', '31.264932', '19', '0' ],
],
},
{
'date' => '11/MAY/2010',
'nnn' => [ '24.07.13', '13.44.63' ],
'records' => [
[ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
[ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
],
}
],
}
I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:
Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.
For completeness, a working program demonstrating the result is below.
#! /usr/bin/perl
use warnings;
use strict;
use feature ':5.10';
use re 'eval';
*ARGV = *DATA;
my $device;
my $record;
my #expect = qw/ fmt1 /;
my %pattern;
%pattern = (
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
(?{ $device->{attr1} = $1;
#expect = qw< fmt2 >;
})
/x,
fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
(?{ #{$device}{qw< attr2 attr3 >} = ($1,$2);
#expect = qw< fmt3 >;
})
/x,
# e.g., 10/MAY/2010 24.15.30,13.45.03
fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
(?{ my($date,$nnns) = ($1,$2);
push #{ $device->{groups} } =>
{ nnn => [ split m|,| => $nnns ],
date => $date };
#expect = qw< fmt4 >;
})
/x,
# e.g., 05:03:01 AB23X 15.67 101325.72
fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
(\S+) \s+ (\S+) \s+ (\S+)
\s*$
(?{ push #{ $device->{groups}[-1]{records} } =>
[ $1, $2, $3, $4 ];
#expect = qw< fmt4 fmt5 >;
})
/x,
# e.g., * 14 31.30474 13 0
fmt5 => qr/^\* \s+ (\d+) \s+
# tricky: possibly no whitespace after 9-char float
((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
(\d+) \s+ (\d+)
\s*$
(?{ push #{ $device->{groups}[-1]{records}[-1] } =>
$1, $2, $3, $4;
#expect = qw< fmt4 fmt3 fmt2 >;
})
/x,
);
while (<>) {
given($_) {
when (#{[ map $pattern{$_}, #expect]}) {}
default {
die "$0: line $.: expected " . join("|" => #expect) . "; got\n$_";
}
}
}
use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;
__DATA__
**DEVICE 109523.69142
.981 561A
10/MAY/2010 24.15.30,13.45.03
05:03:01 AB23X 15.67 101325.72
* 14 31.30474 13 0
05:03:15 CR22X 16.72 101325.42
* 14 29.16264 11 0
06:23:51 AW41X 15.67 101323.9
* 14 31.26493219 0
11/MAY/2010 24.07.13,13.44.63
15:57:14 AB23X 15.67 101327.23
* 14 31.30474 13 0
15:59:59 CR22X 16.72 101331.88
* 14 29.16264 11 0
This is a good question. Two suggestions occur to me.
(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used its several times and have been happy with the results in most cases.
(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.
First an observation about your data: if a set of formats always occurs as a pair, it effectively represent a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.
And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.
In unbreak_records.pl.
Omitting the she-bang and use strict/warnings.
while (<>){
chomp;
print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";
In add_record_types.pl
while (<>){
next unless /\S/;
my $rt = /^\*/ ? 1 :
/^..\// ? 2 : 3;
print $rt, ' ', $_;
}
At the command line.
./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
Output:
1 **DEVICE 109523.69142 .981 561A
2 10/MAY/2010 24.15.30,13.45.03
3 05:03:01 AB23X 15.67 101325.72 * 14 31.30474 13 0
3 05:03:15 CR22X 16.72 101325.42 * 14 29.16264 11 0
3 06:23:51 AW41X 15.67 101323.9 * 14 31.26493219 0
2 11/MAY/2010 24.07.13,13.44.63
3 15:57:14 AB23X 15.67 101327.23 * 14 31.30474 13 0
3 15:59:59 CR22X 16.72 101331.88 * 14 29.16264 11 0
The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.
Depending what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser, and get a datastructure out of it.
This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).
my $state = 'expect_fmt1';
while (defined $state) {
$state = $object->$state();
}
...
sub expect_fmt1 {
my $self = shift;
# read format 1, parse it, store it in object
return 'expect_fmt2';
}
Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:
If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.
If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:
my get_line {
my $self = shift;
my $cache = $self->{line_cache};
return shift #$cache if #$cache;
return $self->{filehandle}->getline;
}
my unget_line { my $self = shift; unshift #{ $self->{line_cache} }, #_ }
Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).
I would keep an additional state in one or more variables and update it per row.
Then you e. g. know if the last line was level 1, or if the last row was format 4 (and you can expect format 5), thus giving more security to your processing.
What I used to do in this case--if possible--is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after 1. But for the line following the first #2, you want to try either #2 or #3.
You could also have an alternation which combines #2 and #3:
my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|regex3/;
If #4 immediate follows 3, you'll want to apply regex #4 after #3, and regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.
while ( <> ) {
given ( $state ) {
when ( 1 ) { my ( $device_num ) = m/$regex1/; $state++; }
when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
when ( 3 ) {
my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
$state += $cap1 ? 1 : 2;
}
}
}
That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.