Perl Translation - perl

I want to do something like this
$string ='4791';
$string =~ tr/4791/(ma)(laya)(lam)(_baasha)/;
should give me
$string='malayalam_baasha';
i.e replace each character with n other characters. n may be different for each character.
Is there a one line solution for this translation ?

Assuming you always want to replace a single character with a specific string...
my %Replacement = (
'0' => 'abc',
'1' => 'def',
'2' => 'ghi',
'3' => 'jkl',
# ... whatever others you like ...
);
my $String = '0123';
print "$String\n"; # Prints "0123"
$String =~ s{(.)}
{exists($Replacement{$1}) ? $Replacement{$1} : $1}egx;
print "$String\n"; # Prints "abcdefghijkl"
Just make an entry in %Replacement for each character you want to swap out.
Re-reading your question, no, this isn't on one line, though it can be written as such (though messily) if you like. Constraining it to a single line will really kind of depend on how many different exchanges you want to have, though. After a certain point, it's going to get ugly.

The right answer is Brian Gerard's, but it can be done in one fairly short and almost readable line:
$string =~ s/(.)/{1 => "_baasha", 4 => "ma", 7 => "laya", 9 => "lam"}->{$1}/ge;
or one short unreadable line:
$string =~ s/(.)/{4,ma=>7,laya=>9,lam=>1,"_baasha"}->{$1}/ge;
or even shorter, but a bit more readable:
$string =~ s/(.)/qw(a _baasha a a ma a a laya a lam)[$1]/ge;
or the shortest I could get it (this one won't work with strict turned on):
$string =~ s/(.)/(a,_baasha,a,a,ma,a,a,laya,a,lam)[$1]/ge;

This
($i=0) || (#tr = qw |abc def ghi jkl| ) && (#string = map { $tr[$i++] } split //,'0123') && ($string =join '',#string);
OR
( %tr = ( 0 => 'abc' , 1 => 'def' , 2 => 'ghi' , 3 => 'jkl' ) ) && (#string = map { $tr{$_} } split //,'0123') && ($string =join '',#string);
should work! But I wouldn't use it!!

Related

how can I partition a line into code and comment using a single regex in perl?

I want to read through a text file and partition each line into the following three variables. Each variable must be defined, although it might be equal to the empty string.
$a1code: all characters up to and not including the first non-escaped percent sign. If there is no non-escaped percent sign, this is the entire line. As we see in the example below, this also could be the empty string in a line where the following two variables are non-empty.
$a2boundary: the first non-escaped percent sign, if there is one.
$a3cmnt: any characters after the first non-escaped percent sign, if there is one.
The script below accomplishes this but requires several lines of code, two hashes, and a composite regex, that is, 2 regex combined by |.
The composite seems necessary because the first clause,
(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)
does not match a line that is pure code, no comment.
Is there a more elegant way, using a single regex and fewer steps?
In particular, is there a way to dispense with the %match hash and somehow
fill the %+ hash with all three three variables in a single step?
#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;
my $count=0;
while(<DATA>)
{
$count++;
print "$count\t";
chomp;
my %match=(
a2boundary=>'',
a3cmnt=>'',
);
print "|$_|\n";
if($_=~/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)|(?<a1code>.*)/)
{
print "from regex:\n";
print Dumper \%+;
%match=(%match,%+,);
}
else
{
die "no match? coding error, should never get here";
}
if(scalar keys %+ != scalar keys %match)
{
print "from multiple lines of code:\n";
print Dumper \%match;
}
print "------------------------------------------\n";
}
__DATA__
This is 100\% text and below you find an empty line.
abba 5\% %comment 9\% %Borgia
%all comment
%
Result:
perl v5.34.0
1 |This is 100\% text and below you find an empty line. |
from regex:
$VAR1 = {
'a1code' => 'This is 100\\% text and below you find an empty line. '
};
from multiple lines of code:
$VAR1 = {
'a1code' => 'This is 100\\% text and below you find an empty line. ',
'a2boundary' => '',
'a3cmnt' => ''
};
------------------------------------------
2 ||
from regex:
$VAR1 = {
'a1code' => ''
};
from multiple lines of code:
$VAR1 = {
'a1code' => '',
'a2boundary' => '',
'a3cmnt' => ''
};
------------------------------------------
3 |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
'a1code' => 'abba 5\\% ',
'a2boundary' => '%',
'a3cmnt' => 'comment 9\\% %Borgia'
};
------------------------------------------
4 |%all comment|
from regex:
$VAR1 = {
'a1code' => '',
'a2boundary' => '%',
'a3cmnt' => 'all comment'
};
------------------------------------------
5 |%|
from regex:
$VAR1 = {
'a1code' => '',
'a2boundary' => '%',
'a3cmnt' => ''
};
------------------------------------------
You can use the following:
my ($a1code, $a2boundary, $a3cmnt) =
/
^
( (?: [^\\%]+ | \\. )* )
(?: (%) (.*) )?
\z
/sx;
It does not consider % escaped in abc\\%def since the preceding \ is escaped.
It requires no backtracking, and it always matches.
$a1code is always a string. It can be zero characters long (when the input is an empty string and when % is the first character), or the entire input string (when there is no unescaped %).
However, $a2boundary and $a3cmnt are only defined if there's an unescaped %. In other words, $a2boundary is equivalent to defined($a3cmnt) ? '%' : undef.
Explanation: [^\\%]+ matches non-escaped characters other than \ and %. \\. matches escaped characters. So (?: [^\\%]+ | \\. )* gets us the prefix, or the entire string if there are no unescaped %.
What about cases like this\\%string where the backslash before the percent sign is itself escaped?
Consider something like this, which instead of trying to use a regular expression to split the string into three groups, uses one to look where for it should be split, and substr to do the actual splitting:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
sub splitter {
my $line = shift;
if ($line =~ /
# Match either
(?<!\\)% # A % not preceded by a backslash
| # or
(?<=[^\\])(?:\\\\)+\K% # Any even number of backslashes followed by a %
/x) {
return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));
} else {
return ($line, '', '');
}
}
while (<DATA>) {
chomp;
# Assign to an array instead of individual scalars for demonstration purposes
my #vals = splitter $_;
print Dumper(\#vals);
}
__DATA__
This is 100\% text and below you find an empty line.
abba 5\% %comment 9\% %Borgia
%all comment
%
a tricky\\%test % case
another \\\%one % to mess with you

Regular Expression Matching Perl for first case of pattern

I have multiple variables that have strings in the following format:
some_text_here__what__i__want_here__andthen_someĀ 
I want to be able to assign to a variable the what__i__want_here portion of the first variable. In other words, everything after the FIRST double underscore. There may be double underscores in the rest of the string but I only want to take the text after the FIRST pair of underscores.
Ex.
If I have $var = "some_text_here__what__i__want_here__andthen_some", I would like to assign to a new variable only the second part like $var2 = "what__i__want_here__andthen_some"
I'm not very good at matching so I'm not quite sure how to do it so it just takes everything after the first double underscore.
my $text = 'some_text_here__what__i__want_here';
# .*? # Match a minimal number of characters - see "man perlre"
# /s # Make . match also newline - see "man perlre"
my ($var) = $text =~ /^.*?__(.*)$/s;
# $var is not defined when there is no __ in the string
print "var=${var}\n" if defined($var);
You might consider this an example of where split's third parameter is useful. The third parameter to split constrains how many elements to return. Here is an example:
my #examples = (
'some_text_here__what__i_want_here',
'__keep_this__part',
'nothing_found_here',
'nothing_after__',
);
foreach my $string (#examples) {
my $want = (split /__/, $string, 2)[1];
print "$string => ", (defined $want ? $want : ''), "\n";
}
The output will look like this:
some_text_here__what__i_want_here => what__i_want_here
__keep_this__part => keep_this__part
nothing_found_here =>
nothing_after__ =>
This line is a little dense:
my $want = (split /__/, $string, 2)[1];
Let's break that down:
my ($prefix, $want) = split /__/, $string, 2;
The 2 parameter tells split that no matter how many times the pattern /__/ could match, we only want to split one time, the first time it's found. So as another example:
my (#parts) = split /#/, "foo#bar#baz#buzz", 3;
The #parts array will receive these elements: 'foo', 'bar', 'baz#buzz', because we told it to stop splitting after the second split, so that we get a total maximum of three elements in our result.
Back to your case, we set 2 as the maximum number of elements. We then go one step further by eliminating the need for my ($throwaway, $want) = .... We can tell Perl we only care about the second element in the list of things returned by split, by providing an index.
my $want = ('a', 'b', 'c', 'd')[2]; # c, the element at offset 2 in the list.
my $want = (split /__/, $string, 2)[1]; # The element at offset 1 in the list
# of two elements returned by split.
You use brackets to capature then reorder the string, the first set of brackets () is $1 in the next part of the substitution, etc ...
my $string = "some_text_here__what__i__want_here";
(my $newstring = $string) =~ s/(some_text_here)(__)(what__i__want_here)/$3$2$1/;
print $newstring;
OUTPUT
what__i__want_here__some_text_here

How to write a Perl script to check x number of columns to print a value in another line for each returned value

I have data (IP addresses) in a CSV file that will be columns 9-13. If there aren't values in the other columns, then by default it should just print out what is in column 9. There is an output file that will print a set of values and the value of column 9 (and through 13 if a value exist) concatenated with a static value to create an alias value. My question is, how would you do this efficiently? I have this code that works:
my $alias0= "ComponentAliases=['ComputerSystem:$columns[9]'];\n";
my $alias1= "ComponentAliases=['ComputerSystem:$columns[9]','ComputerSystem:$columns[10]'];\n";
my $alias2= "ComponentAliases=['ComputerSystem:$columns[9]','ComputerSystem:$columns[10]','ComputerSystem:$columns[11]'];\n";
print BAROC "ComputerSystem;\n";
if(($columns[11] != '')&&($columns[10] != '')) { print BAROC $alias2 }
elsif(($columns[11] == '')&&($columns[10] != '')) { print BAROC $alias1 }
elsif(($columns[11] == '')&&($columns[10] == '')) { print BAROC $alias0 }
This works to do what I want it to do, but there is a chance the CSV file will have values in columns 9-13 or 9-11, etc. Easily I think statically writing this will be fine, but I would like to do it efficiently as well as understand and always apply best practices. I'm new to scripting Perl, but continually am drawn to it to solve problems at work. Suggestions?
This is the output, btw:
ComponentAliases=['ComputerSystem:10.1.0.225','ComputerSystem:10.200.252.77','ComputerSystem:10.100.252.77'];
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
my $csv_in = Text::CSV_XS->new
or die Text::CSV_XS->error_diag;
my $csv_out = Text::CSV_XS->new({
always_quote => 1,
quote_char => q{'},
}) or die Text::CSV_XS->error_diag;
while (my $row = $csv_in->getline(\*DATA)) {
my #aliases = map "ComputerSystem:$_",
grep defined && length, #$row[9 .. 13];
if ($csv_out->combine(#aliases)) {
printf "ComponentAliases=[%s];\n", $csv_out->string;
}
}
__DATA__
0,1,2,3,4,5,6,7,8,10.1.0.225,10.200.252.77,,,,,,,
0,1,2,3,4,5,6,7,8,10.1.0.225,10.200.252.77,10.100.252.77,,,,,
Output:
C:\temp> gn
ComponentAliases=['ComputerSystem:10.1.0.225','ComputerSystem:10.200.252.77'];
ComponentAliases=['ComputerSystem:10.1.0.225','ComputerSystem:10.200.252.77','Co
mputerSystem:10.100.252.77'];
Efficiently now means maintainable. Trying to save a command or two isn't going to save you to much time. In fact, it might actually make the program more inefficient if the compiler can't figure out what you're doing.
What is important is readability. Get rid of the $alias stuff. It simply makes it harder to see what your code is doing, and you could end up with all sorts of side effects doing stuff like this.
The lack of white space also makes your code much more difficult to figure out too. Once I reformatted your code, I immediately spotted an error. You did this:
if ( ($columns[11] != '') && ($columns[10] != '') )
However, this is a string comparison. You need to do this:
if ( ( $columns[11] ne '' ) && ( $columns[10] ne '' ) ) {
Or, you could simplify it even further:
if ( not $column[10] and not $column[11] ) {
This makes it very clear what you're looking for, and will work whether the columns contain a numeric zero, a null string, or are undefined.
This code snippet is using your logic, but I take advantage of the fact that print doesn't automatically add a \n at the end of the string. I simply continue building upon the line:
if ( $columns[9] ) {
print BAROC "ComputerSystem;\n";
print BAROC "ComponentAliases=['ComputerSystem:$columns[9]'";
if ( $columns[10] ) {
print BAROC ",ComputerSystem:$columns[10]";
}
if ( $columns[11] ) {
print BAROC ",ComputerSystem:$columns[11]";
}
print BAROC "];\n";
}
You mentioned that you might need columns 9 to 13 if these columns had data in them. Why not use a loop?
if ( $#columns >= 9 ) { #There are at least nine columns
print BAROC "ComputerSystem;\n";
print BAROC "ComponentAliases=[ComputerSystem:$columns[9]";
for my $column ( (10..$#columns) ) {
last if not $column[$column];
print BAROC ",ComputerSystem:$columns[$columns];
}
print BAROC "];\n";
}
If given more time, I'm sure I could clean up the logic a bit more. But, this will work whether there are 9, 10, 11, or 43 columns with data.
One liner (not so elegant, but somehow I like it):
print "ComponentAliases=[".join(",",map {"'ComputerSystem:$_'"} grep {$_ ne ""} #columns[9-13])."]\n";
Or if you prefer the same code in a more understandable way:
print(
"ComponentAliases=[",
join(
",",
map(
"'ComputerSystem:$_'",
grep (
$_ ne "",
#columns[9-13]
)
)
),
"]\n"
);

Perl: Debug for uninitialized s///?

I'm some trouble finding the problem with my program. Getting the error:
Use of uninitialized value in substitution (s///)
I realize this has been asked before, but that didn't help me. I realize $1 might be unitialized, but I was wondering if you guys could help me figure out why?
Here's the problem part of the code:
$one_match_ref->{'sentence'} = $1 if ($line =~ /^Parsing \[sent. \d+ len. \d+\]: \[(.+)\]/);
$one_match_ref->{'sentence'} =~ s/, / /g;
EDIT: I have declared the $one_match_ref->{'sentence'} like so:
my $sentence;
$one_match_ref = {
chapternumber => $chapternumber_value,
sentencenumber => $sentencenumber_value,
sentence => $sentence, ##Get from parsed text: remove commas
grammar_relation => $grammar_relation_value, ##Get from parsed text: split?
arg1 => $argument1, ##Get from parsed text: first_dependencyword
arg2 => $argument2 ##Get from parsed text: second_dependencyword
};
But none of these variables have anything assigned to them.
My attempts:
A. If I put: if( defined (one_match_ref->{'sentence'})) after the s///, it works. But this is cumbersome, and seems to be avoiding the problem instead of fixing it.
The last time I used that fix, it was because my loop had an "off-by-one" error, I don't think this is the case this time.
B. If I declare: my $sentence = ''; It prints, but with a lot of blank lines in between. How can I eliminate these?
EDIT: For interest and efficiency purposes: Is it better to use split to get what I want?
Thanks in advance for any help or advice. Let me know if you need an example of the file format.
Your code boils down to
my $sentence;
$one_match_ref = { sentence => $sentence };
() if ($line =~ /^Parsing \[sent. \d+ len. \d+\]: \[(.+)\]/);
$one_match_ref->{'sentence'} =~ s/, / /g;
You assign undef to $one_match_ref->{'sentence'}, then you try to remove the commas from it. That doesn't make any sense, thus the warning.
Maybe you want
my $sentence;
$one_match_ref = { sentence => $sentence };
if ($line =~ /^Parsing \[sent. \d+ len. \d+\]: \[(.+)\]/) {
$one_match_ref->{'sentence'} = $1;
$one_match_ref->{'sentence'} =~ s/, / /g;
}
I'm not sure it's $1 that's uninitialised here but rather $one_match_ref->{'sentence'}.
That value is set if and only if the line matches the regex. Otherwise it's not touched at all.
My reasoning is that it's complaining during the substitute rather than the assignment. You could possibly fix it by simply setting $one_match_ref->{'sentence'} to a known value before those two lines (such as the empty string).
But this depends on what you're actually using those values for.

Strategies to handle a file with multiple fixed formats

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).
I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.
Here's an example (commentary on RHS):
# | Format | Level | Comment
# +--------+-------+---------
**DEVICE 109523.69142 # 1 1 file-specific
.981 561A # 2 1
10/MAY/2010 24.15.30,13.45.03 # 3 2 group of records
05:03:01 AB23X 15.67 101325.72 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
05:03:15 CR22X 16.72 101325.42 # 4 3 new record
* 14 29.16264 11 0 # 5 3
06:23:51 AW41X 15.67 101323.9 # 4 3
* 14 31.26493219 0 # 5 3
11/MAY/2010 24.07.13,13.44.63 # 3 2 group of new records
15:57:14 AB23X 15.67 101327.23 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
15:59:59 CR22X 16.72 101331.88 # 4 3 new record
* 14 29.16264 11 0 # 5
The logic I have at the moment is fragile:
I know, for instance, that a Format 2 always comes after a Format 1, and that they only span 2 lines.
I also know that Formats 4 and 5 always come in pairs as they correspond to a single record. The number of records may be variable
I'm using regular expressions to infer the format of each line. However, this is risky and does not lend to flexibility in the future (when someone decides to change the format of the output).
The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.
Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:
while (<>) {
given($_) {
when (#{[ map $pattern{$_}, #expect]}) {}
default {
die "$0: line $.: expected " . join("|" => #expect) . "; got\n$_";
}
}
}
As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.
From this, you can infer that #expect is a list of names of formats we expect to find on the current line.
For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:
This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.
This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
(?{ $device->{attr1} = $1;
#expect = qw< fmt2 >;
})
/x,
After processing the input from your question, $device contains
{
'attr1' => '109523.69142',
'attr2' => '.981',
'attr3' => '561A',
'groups' => [
{
'date' => '10/MAY/2010',
'nnn' => [ '24.15.30', '13.45.03' ],
'records' => [
[ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474', '13', '0' ],
[ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264', '11', '0' ],
[ '06:23:51', 'AW41X', '15.67', '101323.9', '14', '31.264932', '19', '0' ],
],
},
{
'date' => '11/MAY/2010',
'nnn' => [ '24.07.13', '13.44.63' ],
'records' => [
[ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
[ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
],
}
],
}
I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:
Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.
For completeness, a working program demonstrating the result is below.
#! /usr/bin/perl
use warnings;
use strict;
use feature ':5.10';
use re 'eval';
*ARGV = *DATA;
my $device;
my $record;
my #expect = qw/ fmt1 /;
my %pattern;
%pattern = (
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
(?{ $device->{attr1} = $1;
#expect = qw< fmt2 >;
})
/x,
fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
(?{ #{$device}{qw< attr2 attr3 >} = ($1,$2);
#expect = qw< fmt3 >;
})
/x,
# e.g., 10/MAY/2010 24.15.30,13.45.03
fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
(?{ my($date,$nnns) = ($1,$2);
push #{ $device->{groups} } =>
{ nnn => [ split m|,| => $nnns ],
date => $date };
#expect = qw< fmt4 >;
})
/x,
# e.g., 05:03:01 AB23X 15.67 101325.72
fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
(\S+) \s+ (\S+) \s+ (\S+)
\s*$
(?{ push #{ $device->{groups}[-1]{records} } =>
[ $1, $2, $3, $4 ];
#expect = qw< fmt4 fmt5 >;
})
/x,
# e.g., * 14 31.30474 13 0
fmt5 => qr/^\* \s+ (\d+) \s+
# tricky: possibly no whitespace after 9-char float
((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
(\d+) \s+ (\d+)
\s*$
(?{ push #{ $device->{groups}[-1]{records}[-1] } =>
$1, $2, $3, $4;
#expect = qw< fmt4 fmt3 fmt2 >;
})
/x,
);
while (<>) {
given($_) {
when (#{[ map $pattern{$_}, #expect]}) {}
default {
die "$0: line $.: expected " . join("|" => #expect) . "; got\n$_";
}
}
}
use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;
__DATA__
**DEVICE 109523.69142
.981 561A
10/MAY/2010 24.15.30,13.45.03
05:03:01 AB23X 15.67 101325.72
* 14 31.30474 13 0
05:03:15 CR22X 16.72 101325.42
* 14 29.16264 11 0
06:23:51 AW41X 15.67 101323.9
* 14 31.26493219 0
11/MAY/2010 24.07.13,13.44.63
15:57:14 AB23X 15.67 101327.23
* 14 31.30474 13 0
15:59:59 CR22X 16.72 101331.88
* 14 29.16264 11 0
This is a good question. Two suggestions occur to me.
(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used its several times and have been happy with the results in most cases.
(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.
First an observation about your data: if a set of formats always occurs as a pair, it effectively represent a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.
And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.
In unbreak_records.pl.
Omitting the she-bang and use strict/warnings.
while (<>){
chomp;
print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";
In add_record_types.pl
while (<>){
next unless /\S/;
my $rt = /^\*/ ? 1 :
/^..\// ? 2 : 3;
print $rt, ' ', $_;
}
At the command line.
./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
Output:
1 **DEVICE 109523.69142 .981 561A
2 10/MAY/2010 24.15.30,13.45.03
3 05:03:01 AB23X 15.67 101325.72 * 14 31.30474 13 0
3 05:03:15 CR22X 16.72 101325.42 * 14 29.16264 11 0
3 06:23:51 AW41X 15.67 101323.9 * 14 31.26493219 0
2 11/MAY/2010 24.07.13,13.44.63
3 15:57:14 AB23X 15.67 101327.23 * 14 31.30474 13 0
3 15:59:59 CR22X 16.72 101331.88 * 14 29.16264 11 0
The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.
Depending what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser, and get a datastructure out of it.
This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).
my $state = 'expect_fmt1';
while (defined $state) {
$state = $object->$state();
}
...
sub expect_fmt1 {
my $self = shift;
# read format 1, parse it, store it in object
return 'expect_fmt2';
}
Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:
If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.
If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:
my get_line {
my $self = shift;
my $cache = $self->{line_cache};
return shift #$cache if #$cache;
return $self->{filehandle}->getline;
}
my unget_line { my $self = shift; unshift #{ $self->{line_cache} }, #_ }
Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).
I would keep an additional state in one or more variables and update it per row.
Then you e. g. know if the last line was level 1, or if the last row was format 4 (and you can expect format 5), thus giving more security to your processing.
What I used to do in this case--if possible--is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after 1. But for the line following the first #2, you want to try either #2 or #3.
You could also have an alternation which combines #2 and #3:
my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|regex3/;
If #4 immediate follows 3, you'll want to apply regex #4 after #3, and regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.
while ( <> ) {
given ( $state ) {
when ( 1 ) { my ( $device_num ) = m/$regex1/; $state++; }
when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
when ( 3 ) {
my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
$state += $cap1 ? 1 : 2;
}
}
}
That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.