Strategies to handle a file with multiple fixed formats - perl

This question is not Perl-specific (although the unpack function will most probably figure into my implementation).
I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.
Here's an example (commentary on RHS):
# | Format | Level | Comment
# +--------+-------+---------
**DEVICE 109523.69142 # 1 1 file-specific
.981 561A # 2 1
10/MAY/2010 24.15.30,13.45.03 # 3 2 group of records
05:03:01 AB23X 15.67 101325.72 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
05:03:15 CR22X 16.72 101325.42 # 4 3 new record
* 14 29.16264 11 0 # 5 3
06:23:51 AW41X 15.67 101323.9 # 4 3
* 14 31.26493219 0 # 5 3
11/MAY/2010 24.07.13,13.44.63 # 3 2 group of new records
15:57:14 AB23X 15.67 101327.23 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
15:59:59 CR22X 16.72 101331.88 # 4 3 new record
* 14 29.16264 11 0 # 5 3
The logic I have at the moment is fragile:
I know, for instance, that a Format 2 always comes after a Format 1, and that together they span only 2 lines.
I also know that Formats 4 and 5 always come in pairs, as they correspond to a single record. The number of records may be variable.
I'm using regular expressions to infer the format of each line. However, this is risky and does not lend itself to flexibility in the future (when someone decides to change the format of the output).
The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.

Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:
while (<>) {
    given ($_) {
        when (@{[ map $pattern{$_}, @expect ]}) {}
        default {
            die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
    }
}
As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of compiled regexes performs a short-circuiting search to find the first match.
From this, you can infer that @expect is a list of names of formats we expect to find on the current line.
For a while, I was stuck on the case of multiple possible expected formats and how to know which format had just matched, but then I remembered (?{ code }) in regular expressions:
This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.
This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
           (?{ $device->{attr1} = $1;
               @expect = qw< fmt2 >;
           })
          /x,
After processing the input from your question, $device contains
{
  'attr1' => '109523.69142',
  'attr2' => '.981',
  'attr3' => '561A',
  'groups' => [
    {
      'date' => '10/MAY/2010',
      'nnn' => [ '24.15.30', '13.45.03' ],
      'records' => [
        [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474', '13', '0' ],
        [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264', '11', '0' ],
        [ '06:23:51', 'AW41X', '15.67', '101323.9', '14', '31.264932', '19', '0' ],
      ],
    },
    {
      'date' => '11/MAY/2010',
      'nnn' => [ '24.07.13', '13.44.63' ],
      'records' => [
        [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
        [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
      ],
    }
  ],
}
I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:
Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.
For completeness, a working program demonstrating the result is below.
#! /usr/bin/perl

use warnings;
use strict;

use feature ':5.10';
use re 'eval';

*ARGV = *DATA;

my $device;
my $record;
my @expect = qw/ fmt1 /;

my %pattern;
%pattern = (
  fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
             (?{ $device->{attr1} = $1;
                 @expect = qw< fmt2 >;
             })
            /x,

  fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
             (?{ @{$device}{qw< attr2 attr3 >} = ($1, $2);
                 @expect = qw< fmt3 >;
             })
            /x,

  # e.g., 10/MAY/2010 24.15.30,13.45.03
  fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
             (?{ my($date,$nnns) = ($1,$2);
                 push @{ $device->{groups} } =>
                      { nnn  => [ split m|,| => $nnns ],
                        date => $date };
                 @expect = qw< fmt4 >;
             })
            /x,

  # e.g., 05:03:01 AB23X 15.67 101325.72
  fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
               (\S+) \s+ (\S+) \s+ (\S+)
               \s*$
             (?{ push @{ $device->{groups}[-1]{records} } =>
                      [ $1, $2, $3, $4 ];
                 @expect = qw< fmt4 fmt5 >;
             })
            /x,

  # e.g., * 14 31.30474 13 0
  fmt5 => qr/^\* \s+ (\d+) \s+
             # tricky: possibly no whitespace after 9-char float
             ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
             (\d+) \s+ (\d+)
             \s*$
             (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                      $1, $2, $3, $4;
                 @expect = qw< fmt4 fmt3 fmt2 >;
             })
            /x,
);

while (<>) {
    given ($_) {
        when (@{[ map $pattern{$_}, @expect ]}) {}
        default {
            die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
    }
}
use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;
__DATA__
**DEVICE 109523.69142
.981 561A
10/MAY/2010 24.15.30,13.45.03
05:03:01 AB23X 15.67 101325.72
* 14 31.30474 13 0
05:03:15 CR22X 16.72 101325.42
* 14 29.16264 11 0
06:23:51 AW41X 15.67 101323.9
* 14 31.26493219 0
11/MAY/2010 24.07.13,13.44.63
15:57:14 AB23X 15.67 101327.23
* 14 31.30474 13 0
15:59:59 CR22X 16.72 101331.88
* 14 29.16264 11 0

This is a good question. Two suggestions occur to me.
(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used it several times and have been happy with the results in most cases.
(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.
First an observation about your data: if a set of formats always occurs as a pair, it effectively represents a single data format and the two can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.
And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.
In unbreak_records.pl (omitting the shebang and use strict/warnings):
while (<>){
    chomp;
    print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";
In add_record_types.pl:
while (<>){
    next unless /\S/;
    my $rt = /^\*/   ? 1 :
             /^..\// ? 2 : 3;
    print $rt, ' ', $_;
}
At the command line:
./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
Output:
1 **DEVICE 109523.69142 .981 561A
2 10/MAY/2010 24.15.30,13.45.03
3 05:03:01 AB23X 15.67 101325.72 * 14 31.30474 13 0
3 05:03:15 CR22X 16.72 101325.42 * 14 29.16264 11 0
3 06:23:51 AW41X 15.67 101323.9 * 14 31.26493219 0
2 11/MAY/2010 24.07.13,13.44.63
3 15:57:14 AB23X 15.67 101327.23 * 14 31.30474 13 0
3 15:59:59 CR22X 16.72 101331.88 * 14 29.16264 11 0
The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.

Depending on what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser and get a data structure out of it.
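For a rough idea of the shape such a grammar might take (untested; the rule names are made up here, and the terminal regexes would need the same care as hand-written patterns):

```
file      : header group(s)
header    : /\*\*DEVICE/ /\S+/ /\S+/ /\S+/
group     : date /\S+/ record(s)
date      : m{\d\d/[A-Z]{3}/\d{4}}
record    : time /\S+/ /\S+/ /\S+/ star_line
star_line : '*' /\d+/ /[\d.]+/ /\d+/ /\d+/
time      : /\d\d:\d\d:\d\d/
```

You would compile the grammar with Parse::RecDescent->new($grammar) and then call the start rule as a method, e.g. $parser->file($input), attaching actions to rules to build the data structure.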

This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).
my $state = 'expect_fmt1';
while (defined $state) {
    $state = $object->$state();
}
...
sub expect_fmt1 {
    my $self = shift;
    # read format 1, parse it, store it in object
    return 'expect_fmt2';
}
Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:
If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.
If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:
sub get_line {
    my $self = shift;
    my $cache = $self->{line_cache};
    return shift @$cache if @$cache;
    return $self->{filehandle}->getline;
}
sub unget_line { my $self = shift; unshift @{ $self->{line_cache} }, @_ }
Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).

I would keep additional state in one or more variables and update it per line.
Then you know, e.g., whether the last line was level 1, or whether the last line was format 4 (and you can expect format 5), thus giving more security to your processing.

What I used to do in this case, if possible, is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after #1. But for the line following the first #2, you want to try either #2 or #3.
You could also have an alternation which combines #2 and #3:
my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|$regex3/;
If #4 immediately follows #3, you'll want to apply regex #4 after #3, and then regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.
while ( <> ) {
    given ( $state ) {
        when ( 1 ) { my ( $device_num ) = m/$regex1/; $state++; }
        when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
        when ( 3 ) {
            my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
            $state += $cap1 ? 1 : 2;
        }
    }
}
That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.


Compare 3 tab delimited files and print matches in Perl

I want to match column 1 of file 1 with column 1 of file 2 and then column 2 of file 1 with column 1 of file 3 and then print the matches. The columns in the files are separated by tabs. For example:
file 1:
fji01dde AIDJFMGKG
dlp02sle VMCFIJGM
cmr03lsp CKEIFJ
file 2:
fji01dde 25 30
dlp02sle 40 50
cmr03lsp 60 70
file 3:
AIDJFMGKG
CKEIFJ
output needs to be:
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
I only want lines that are common in all three files.
The below code works well for the first two files, but I need to incorporate the third file. Any ideas?
#!/usr/bin/env perl
use strict;

my (%file1,%file2);

## Open the 1st file
open(A,"file1");
while(<A>){
    chomp;
    ## Split the current line on tabs into the @F array.
    my @F=split(/\t/);
    push @{$file1{$F[0]}},@F[1..$#F];
}

## Open the 2nd file
open(B,"file2");
while(<B>){
    chomp;
    ## Split the current line on tabs into the @F array.
    my @F=split(/\t/);
    if (defined($file1{$F[0]})) {
        foreach my $col (@{$file1{$F[0]}}) {
            print "$F[0]\t$col\t@F[1..$#F]\n";
        }
    }
}
The algorithm seems to be...
for each line in 1
    if 1.1 and 2.1 match AND
       1.2 appears in 3.1
    then
        combine 1.1, 1.2, 2.2 and 2.3
Because there are plenty of edge cases in parsing CSV files, don't do it by hand. Use Text::CSV_XS. It can also handle turning CSV files into hashes for us, and it's super efficient.
What we'll do is parse all the files. The first file is left as a list, but the other two are put into hashes keyed on the columns that we're going to search on.
NOTE: The names $data are horrible, but I don't know what sort of data these files represent.
use strict;
use warnings;

use Text::CSV_XS qw(csv);

my @csv_files = @ARGV;

# Parse the first CSV file into an array of arrays.
my $data1 = csv( in => $csv_files[0], sep_char => "\t" );

# Parse the other CSV files into hashes of rows keyed on the columns we're going to search on.
my $data2 = csv( in       => $csv_files[1],
                 sep_char => "\t",
                 headers  => ["code", "num1", "num2"],
                 key      => "code"
);
my $data3 = csv( in       => $csv_files[2],
                 sep_char => "\t",
                 headers  => ["CODE"],
                 key      => "CODE"
);

for my $row1 (@$data1) {
    my $row2 = $data2->{$row1->[0]};
    my $row3 = $data3->{$row1->[1]};
    if( $row2 && $row3 ) {
        print join "\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2};
        print "\n";
    }
}
This reads all the files into memory. If the files are very large this can be a problem. You can reduce memory usage by iterating through file1 one row at a time instead of slurping it all in.
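A sketch of that row-at-a-time variant, with the two lookup hashes hard-coded in place of the csv()-built ones so the example is self-contained (the real code should keep Text::CSV_XS for the parsing):

```shell
out=$(printf 'fji01dde\tAIDJFMGKG\ndlp02sle\tVMCFIJGM\ncmr03lsp\tCKEIFJ\n' |
  perl -Mstrict -Mwarnings -e '
    # %data2 / %data3 stand in for the hashes csv() would build
    # from file2 and file3.
    my %data2 = ( fji01dde => ["25", "30"], cmr03lsp => ["60", "70"] );
    my %data3 = ( AIDJFMGKG => 1, CKEIFJ => 1 );
    while (my $line = <STDIN>) {        # one row of file1 at a time
        chomp $line;
        my ($col1, $col2) = split /\t/, $line;
        next unless $data2{$col1} && $data3{$col2};
        print join("\t", $col1, $col2, @{ $data2{$col1} }), "\n";
    }
')
printf '%s\n' "$out"
```

Only file2 and file3 are held in memory; file1, which drives the output order anyway, streams through line by line.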

Math::BaseConvert with strange basenumbers like 23 or 1000

I just found the Perl module Math::BaseConvert. I have the task of converting numbers to numbers with very strange bases. Not only base 2, 8, or 16, but also 23, 134, up to 1000. (This is a partial task in balancing a tree of files in a directory.)
I could not make it work. Reading the tests for the module on CPAN also confused me. So I wrote a little test; maybe you can tell me what's wrong. The result is:
ok 1 - use Math::BaseConvert;
ok 2 - Convert number '23' (base10) to '27' (base8)
not ok 3 - Convert number '23' (base10) to '23' (base32)
# Failed test 'Convert number '23' (base10) to '23' (base32)'
# at test_math_baseconvert.pl line 35.
# got: 'N'
# expected: '23'
not ok 4 - Convert number '64712' (base10) to '64:712' (base1000)
# Failed test 'Convert number '64712' (base10) to '64:712' (base1000)'
# at test_math_baseconvert.pl line 35.
# got: '-1'
# expected: '64:712'
1..4
# Looks like you failed 2 tests of 4.
The test program is this:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

use_ok( 'Math::BaseConvert', '1.7' );

my @lines = (
    {
        # http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+16
        old_number => '23',
        old_base   => 10,
        new_number => '27',
        new_base   => 8,
    },
    {
        # http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+32
        old_number => '23',
        old_base   => 10,
        new_number => '23', # stays same
        new_base   => 32,
    },
    {
        # http://www.wolframalpha.com/input/?i=64712+from+base+10+to+base+1000
        old_number => '64712',
        old_base   => 10,
        new_number => '64:712',
        new_base   => 1000,
    },
);

for my $line (@lines) {
    cmp_ok(
        Math::BaseConvert::cnv(
            $line->{old_number}, $line->{old_base}, $line->{new_base}
        ),
        'eq',
        $line->{new_number},
        sprintf(
            "Convert number '%s' (base%d) to '%s' (base%d)",
            $line->{old_number}, $line->{old_base},
            $line->{new_number}, $line->{new_base}
        )
    );
}

done_testing();
Wolfram Alpha's method of showing bases greater than 16 is to separate digits with a colon. There's nothing wrong with that; they display those numbers using CSS styling that lessens the shading of the colon to make it more obvious what they're doing. But they also add a message stating exactly how many digits they're showing, since something like "1:16" in base 17 being "2 digits" isn't obvious enough on its own.
The method Math::BaseConvert and other such modules use is to expand the character set for digits, just as hex numbers extend 0-9 with the first 6 letters of the alphabet to give 0-9A-F. For base 32 numbers the character set is 0-9A-V. Given that N is the 14th letter of the alphabet (representing the value 10 + 13 = 23), it is the appropriate representation for 23 in base 32.
If you want to use the colon representation for numbers greater than 16, you can either use the module or just roll your own solution.
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

use_ok( 'Math::BaseConvert', '1.7' );

my @lines = (
    # Old_Number Old_Base New_Number New_Base
    [qw(23    10 27     8)],    # http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+16
    [qw(23    10 23     32)],   # http://www.wolframalpha.com/input/?i=23+from+base+10+to+base+32
    [qw(64712 10 64:712 1000)], # http://www.wolframalpha.com/input/?i=64712+from+base+10+to+base+1000
);

for my $line (@lines) {
    cmp_ok(
        base10toN(@$line[0,3]),
        'eq',
        $line->[2],
        sprintf("Convert number '%s' (base%d) to '%s' (base%d)", $line->[0], 10, @$line[2,3])
    );
}

sub base10toN {
    my ($num, $base) = @_;

    return Math::BaseConvert::cnv($num, 10, $base)
        if $base <= 16;

    my @digits = ();
    while (1) {
        my $remainder = $num % $base;
        unshift @digits, $remainder;
        $num = ($num - $remainder) / $base
            or last;
    }
    return join ':', @digits;
}

done_testing();
You seem to be expecting decimal output, with "digits" being decimal numbers separated by :.
Math::BaseConvert doesn't do that. It only supports having a single character per digit.
By default, the digits used are '0'..'9', 'A'..'Z', 'a'..'z', '.', '_' though you can supply your own list instead (and you would have to do so to support up to base 1000).

Perl grep not returning expected value

I have the following code:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;

while (<PIVOTFILE>) {      # loop through each line in file
    next if ($. == 1);     # skip first line (contains business segment code)
    next if ($. == 2);     # skip second line (contains transaction amount text)
    my @fields = split('\t',$_); # split fields for line into an array
    print scalar(grep $_, @fields), "\n";
}
Given that the data in the text file is this:
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
I would expect the output from the Perl script to be 2 3 3 4, given the number of defined elements in each line. The file is a tab-delimited text file with 8 columns.
Instead I get 3 4 3 4, and I have no idea why!
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
I suspect you have spaces mixed with the tabs in some places, and your grep test will consider " " true.
What does:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper [<PIVOTFILE>];
show?
The problem should be in this line:
my @fields = split('\t',$_); # split fields for line into an array
That pattern only matches literal tab characters, and your file doesn't seem to be tab-only separated, at least here on SO. I changed the split regex to match arbitrary whitespace, ran the code on my machine and got the "right" result:
my @fields = split(/\s+/,$_); # split fields for line into an array
Result:
2
3
3
4
As a side note:
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
Now I understand why you use grep to count array elements. That's important when your array contains undefined values like here:
my @a;
$a[1] = 42;    # @a contains the list (undef, 42)
say scalar @a; # 2
or when you manually deleted entries:
my @a = split /,/ => 'foo,bar'; # @a contains the list ('foo', 'bar')
delete $a[0];                   # @a contains the list (undef, 'bar')
say scalar @a;                  # 2
But in many cases, especially when you're using arrays to just store lists without operating on single array elements, scalar @a works perfectly fine.
my @a = (1 .. 17, 1 .. 25); # (1, 2, ..., 17, 1, 2, .., 25)
say scalar @a;              # 42
It's important to understand what grep does! In your case
print scalar(grep $_, @fields), "\n";
grep returns the list of true values of @fields, and then you print how many you have. But sometimes this isn't what you want/expect:
my @things = (17, 42, 'foo', '', 0); # even '' and 0 are things
say scalar grep $_ => @things;       # 3!
Because the empty string and the number 0 are false values in Perl, they won't get counted with that idiom. So if you want to know how long an array is, just use
say scalar @array; # number of array entries
If you want to count true values, use this
say scalar grep $_ => @array; # number of true values
But if you want to count defined values, use this
say scalar grep defined($_) => @array; # number of defined values
I'm pretty sure you already know this from the other answers on the linked page. In hashes, the situation is a little bit more complex, because setting something to undef is not the same as deleting it:
my %h = (a => 0, b => 42, c => 17, d => 666);
$h{c} = undef; # still there, but undefined
delete $h{d};  # BAM! $h{d} is gone!
What happens when we try to count values?
say scalar grep $_ => values %h; # 1
because 42 is the only true value in %h.
say scalar grep defined $_ => values %h; # 2
because 0 is defined although it's false.
say scalar grep exists $h{$_} => qw(a b c d); # 3
because undefined values can exist. Conclusion:
know what you're doing instead of copy'n'pasting code snippets :)
There are not only tabs, but spaces as well. Splitting by whitespace works; see below.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

while (<DATA>) {       # loop through each line in file
    next if ($. == 1); # skip first line (contains business segment code)
    next if ($. == 2); # skip second line (contains transaction amount text)
    my @fields = split(" ",$_); # split fields by SPACE
    print scalar(@fields), "\n";
}
__DATA__
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
Output
2
3
3
4
Your code works for me. The problem may be that the input file contains some "hidden" whitespace fields (e.g. whitespace other than tabs). For instance:
A<tab><space><CR> gives two fields, A and <space><CR>
A<tab>B<tab><CR> gives three: A, B, <CR> (remember, the end of line is part of the input!)
I suggest you chomp every line you use; other than that, you will have to clean the array of whitespace-only fields. E.g.
scalar(grep /\S/, @fields)
should do it.
A lot of great help on this question, and quickly too!
After a long, drawn-out learning process, this is what I came up with that worked quite well, with intended results.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;

while (<PIVOTFILE>) {      # loop through each line in file
    next if ($. == 1);     # skip first line (contains business segment code)
    next if ($. == 2);     # skip second line (contains transaction amount text)
    chomp $_;              # clean line of trailing \n and white space
    my @fields = split(/\t/,$_); # split fields for line into an array
    print scalar(grep $_, @fields), "\n";
}

Perl Translation

I want to do something like this
$string = '4791';
$string =~ tr/4791/(ma)(laya)(lam)(_baasha)/;
should give me
$string = 'malayalam_baasha';
i.e. replace each character with n other characters, where n may be different for each character.
Is there a one-line solution for this translation?
Assuming you always want to replace a single character with a specific string...
my %Replacement = (
    '0' => 'abc',
    '1' => 'def',
    '2' => 'ghi',
    '3' => 'jkl',
    # ... whatever others you like ...
);

my $String = '0123';
print "$String\n"; # Prints "0123"

$String =~ s{(.)}
            {exists($Replacement{$1}) ? $Replacement{$1} : $1}egx;
print "$String\n"; # Prints "abcdefghijkl"
Just make an entry in %Replacement for each character you want to swap out.
Re-reading your question: no, this isn't on one line, though it can be written as such (messily) if you like. Whether it fits on a single line will really depend on how many different exchanges you want to have; after a certain point, it's going to get ugly.
The right answer is Brian Gerard's, but it can be done in one fairly short and almost readable line:
$string =~ s/(.)/{1 => "_baasha", 4 => "ma", 7 => "laya", 9 => "lam"}->{$1}/ge;
or one short unreadable line:
$string =~ s/(.)/{4,ma=>7,laya=>9,lam=>1,"_baasha"}->{$1}/ge;
or even shorter, but a bit more readable:
$string =~ s/(.)/qw(a _baasha a a ma a a laya a lam)[$1]/ge;
or the shortest I could get it (this one won't work with strict turned on):
$string =~ s/(.)/(a,_baasha,a,a,ma,a,a,laya,a,lam)[$1]/ge;
This
($i=0) || (@tr = qw |abc def ghi jkl| ) && (@string = map { $tr[$i++] } split //,'0123') && ($string = join '',@string);
OR
( %tr = ( 0 => 'abc' , 1 => 'def' , 2 => 'ghi' , 3 => 'jkl' ) ) && (@string = map { $tr{$_} } split //,'0123') && ($string = join '',@string);
should work! But I wouldn't use it!!

Help writing flexible splits, perl

A couple weeks ago I posted a question about trouble I was having parsing an irregularly-formatted data file. Here's a sample of the data:
01-021412 15/02/2007 207,000.00 14,839.00 18 -6 2 6 6 5 16 6 4 4 3 -28 -59 -88 -119
-149 -191 -215 -246
Atraso Promedio ---> 2.88
I need a program that will extract 01-021412 and 18, count and sum all the numbers in the subsequent series, and store the atraso promedio, and that can repeat this operation for over 40,000 entries. I received a very helpful response, and from that was able to write the code:
use strict;
use warnings;

# Create an output file
open(OUT, ">outFull.csv");
print OUT "loanID,nPayments,atrasoPromedio,atrasoAlt,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72\n";

open(MYINPUTFILE, "<DATOS HISTORICO ASPIRE2.txt");

my @payments;
my $numberOfPayments;
my $loanNumber;

while(<MYINPUTFILE>)
{
    if(/\b\d{2}-\d{6}\b/)
    {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    }
    elsif(m/---> *(\d*.\d*)/)
    {
        my (undef, undef, undef, $atrasoPromedio) = split;
        my $N = scalar @payments;
        print "$numberOfPayments,$N,$loanNumber\n";
        if($N == $numberOfPayments){
            my $total = 0;
            ($total += $_) for @payments;
            my $atrasoAlt = $total/$N;
            print OUT "$loanNumber,$numberOfPayments,$atrasoPromedio,$atrasoAlt,", join( ',', @payments), "\n";
        }
    }
    else
    {
        push(@payments, split);
    }
}
This would work fine, except for the fact that about 50 percent of entries include an '*' as follows:
* 01-051948 06/03/2009 424,350.00 17,315.00 48 0 6 -2 0 21 10 9 13 10 9 7 13 3 4
12 -3 14 8 6
Atraso Promedio ---> 3.02
The asterisk causes the program to fail because it interrupts the split pattern, causing incorrect variable assignments. Until now I've dealt with this by removing the asterisks from the input data file, but I just realized that by doing this the program actually omits these loans altogether. Is there an economical way to modify my script so that it handles entries with and without asterisks?
As an aside, if an entry does include an asterisk I would like to record this fact in the output data.
Many thanks in advance,
Aaron
Use an intermediate array:
my $has_asterisk;
# ...
if(/\b\d{2}-\d{6}\b/)
{
    my @fields = split;
    $has_asterisk = $fields[0] eq '*';
    shift @fields if $has_asterisk;
    ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = @fields;
}
You could discard the asterisk before doing the split:
while(<MYINPUTFILE>) {
    s/^\s*\*\s*//;
    if(/\b\d{2}-\d{6}\b/) {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    ...
And, apart from this, you should use the 3-arg open, lexical filehandles, and test open for failure:
my $file = 'DATOS HISTORICO ASPIRE2.txt';
open my $MYINPUTFILE, '<', $file or die "unable to open '$file' for reading : $!";
It looks like your first if statement's regex is not accounting for that '*', so how about we modify it? My Perl regex skills are a little rusty; note that this is untested.
if(/(?:\* )?\b\d{2}-\d{6}\b/)
* is a quantifier meaning "zero or more times", so we need to escape it: \*
(?: ) means "group this together but don't save it"; I use it here so I can apply the ? to both the space and the * at the same time.
At the beginning of the while loop, try this:
...
while(<MYINPUTFILE>)
{
    my $asterisk_exists = 0;
    if (s/^\* //) {
        $asterisk_exists = 1;
    }
    ...
In addition to removing the asterisk with the s/// substitution, you also keep track of whether the asterisk was there in the first place. With the asterisk removed, the rest of your script should function as normal.