How do I decipher this perl statement $params->{$1}{'Kx'} = $2 - hash

I am trying to decipher the intent of the Perl code given below:
sub get_input_params($) {
my ($file) = @_;
my $params = {};
open(IN, $file) or die "Input file not found";
while(<IN>)
{
if(/^\s*JOINT (\d{4}) (.{7})(.{7})(.{7})(.{7})(.{7})(.{7}) ELASTIC?\s*$/i)
{
$params->{$1}{'Kx'} = $2;
$params->{$1}{'Ky'} = $3;
$params->{$1}{'Kz'} = $4;
$params->{$1}{'Txx'} = $5;
$params->{$1}{'Tyy'} = $6;
$params->{$1}{'Tzz'} = $7;
}
}
close(IN);
$params;
};
In particular, I am stuck at this:
$params->{$1}{'Kx'} = $2;
Can someone give me a source for getting this clarified?

Suppose the current line of the file is
JOINT 1234 uuuuuuuvvvvvvvwwwwwwwxxxxxxxyyyyyyyzzzzzzz ELASTIC
      $1   $2     $3     $4     $5     $6     $7
the contents after ELASTIC being irrelevant.
The capture groups are referred to by number, as shown above.
$params is a hash reference, with $1 being used as a key mapped to another hash with keys Kx, Ky, etc. Using the example line above, the following assignments would be made
$params->{1234}{Kx} = "uuuuuuu"
$params->{1234}{Ky} = "vvvvvvv"
$params->{1234}{Kz} = "wwwwwww"
$params->{1234}{Txx} = "xxxxxxx"
$params->{1234}{Tyy} = "yyyyyyy"
$params->{1234}{Tzz} = "zzzzzzz"
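To make the mechanics concrete, here is a minimal, self-contained sketch of the same idea; the input line and field contents are made up for illustration:

```perl
use strict;
use warnings;

my $params = {};
my $line = "JOINT 1234 aaaaaaabbbbbbbcccccccdddddddeeeeeeefffffff ELASTIC";

if ($line =~ /^\s*JOINT (\d{4}) (.{7})(.{7})(.{7})(.{7})(.{7})(.{7}) ELASTIC\s*$/i) {
    # $1..$7 hold the capture groups from the match above;
    # $1 keys the outer hash, $2..$7 fill the inner hash.
    $params->{$1}{'Kx'}  = $2;
    $params->{$1}{'Ky'}  = $3;
    $params->{$1}{'Kz'}  = $4;
    $params->{$1}{'Txx'} = $5;
    $params->{$1}{'Tyy'} = $6;
    $params->{$1}{'Tzz'} = $7;
}

print $params->{1234}{Kx},  "\n";   # aaaaaaa
print $params->{1234}{Tzz}, "\n";   # fffffff
```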

Related

Perl: CAM::PDF makes changes, but these don't materialise in final doc

I've been using the CAM::PDF module to try editing PDF docs at work - essentially just trying to change the date on docs automatically to show they have been reviewed recently.
Unfortunately, despite my code telling me that I am making changes to the PDF objects ($pdf->{changes}),
and despite giving the PDFs the doc is attempting to change maximum accessibility (anyone can access, read, write),
the PDFs that are output never seem to materialise with these changes. I have also been grepping the object node tmp files I output en masse and found that none of them show any sign of the old date after running the code; yet when I view the PDF after running it, the old date is still there. Has anyone encountered this before, or can anyone suggest anything?
Just doing this manually isn't an option; I want a script I can run against multiple files at once (I have LOTS of these files to sort out at work), but other than changing the dates written on the doc, the doc has to remain looking much the same (by which I mean it would be OK if it changed in size a little, but not OK if it completely changed in appearance).
I strictly followed the example changepdfstring.pl (https://metacpan.org/pod/distribution/CAM-PDF/bin/changepdfstring.pl) from the author of the CAM::PDF module, then tried different variations of it to try to get things to work - so I'm bemused that nothing has worked in the end.
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use Data::Dumper;
my $pdf = CAM::PDF->new('Order fulfilment process flowchart.pdf');
if (!$pdf->canModify())
{
die "This PDF forbids modification\n";
}
my $olddate = "15.02.2019";
my $newdate = "22.02.2022";
foreach my $objectnumber (keys %{$pdf->{xref}}){
my $objectnode = $pdf->dereference($objectnumber);
$pdf->changeString($objectnode, {$olddate=>$newdate});
}
my $change = $pdf->{changes};
print Dumper($change);
my $count = 0;
foreach my $objectnumber (keys %{$pdf->{xref}}){
my $objectnode = $pdf->dereference($objectnumber);
$count++;
open (ONO, ">tmp.objectnode.$count");
print ONO Dumper($objectnode);
close (ONO);}
if (!scalar %{$pdf->{changes}})
{
die "no changes were made :(";
}
$pdf->preserveOrder();
$pdf->cleanoutput('pleasework.pdf');
Any help or advice would be greatly appreciated
A quick look at page 145 of the PDF specification [1] shows that there are two metadata fields that should allow a simple change to achieve what you are trying to do:
CreationDate
ModDate
Below you can find a quick script using CAM::PDF to set/modify the ModDate with the current date, thus giving the illusion of "modifying" the PDF.
The script can, if needed, be amended to use a specific date instead of the current time to set the modification date.
Please note that I'm not sure that CAM::PDF is the best option to get this task done.
The script is only a sample of what can be done within the limitations and simplicity of CAM::PDF.
[1] https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Local;
use CAM::PDF;
use CAM::PDF::Node;
my $infile = shift || die 'syntax...';
my $outfile = shift || die 'syntax...';
my $pdf = CAM::PDF->new($infile) || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info});
if ($info) {
my @time = localtime(time);
my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = @time;
$year += 1900;
$mon++;
my $gmt_offset_in_seconds = timegm(@time) - timelocal(@time);
my $gmt_offset_min = ($gmt_offset_in_seconds / 60) % 60;
my $gmt_offset_hour = abs(int($gmt_offset_in_seconds / (60*60)));
my $offset_char = "";
if ($gmt_offset_in_seconds < 0) {
$offset_char = "-";
} else {
$offset_char = "+";
}
my $date = sprintf("D:%04d%02d%02d%02d%02d%02d%s%02d'%02d'", $year, $mon, $mday, $hour, $min, $sec, $offset_char, $gmt_offset_hour, $gmt_offset_min);
my $objnum = undef;
for my $obj ('Creator', 'Producer', 'CreationDate') {
if (exists $info->{$obj} and exists $info->{$obj}->{objnum}) {
$objnum = $info->{$obj}->{objnum};
last;
}
}
die "Cannot find objnum, halting..." if not defined $objnum;
my $mod_date = $info->{ModDate};
if ($mod_date) {
$mod_date->{value} = $date;
} else {
my $mod_date = new CAM::PDF::Node('string',$date);
$mod_date->{gennum} = 0;
$mod_date->{objnum} = $objnum;
$info->{ModDate} = $mod_date;
}
$pdf->preserveOrder();
$pdf->cleanoutput($outfile);
} else {
print "Cannot find PDF info section, doing nothing!\n";
}
I'm the author of CAM::PDF. Without seeing the PDF, I can only guess but I'd wager that the problem is that $olddate simply doesn't match any text in the doc. Kerning can break strings into multiple parts, for example. Also, there are several different ways to encode strings that appear the same in the resulting doc. So the trick for you will be figuring out what is the pattern for the dates in your specific docs.
That said, I also like the clever idea that @Bruce Ramos offered in a separate answer. That approach won't change a date that's visible in the rendered PDF (for example, if you print it), but it should show up as metadata in nearly any PDF viewer.
I found that the line I was trying to edit was not actually a contiguous set of characters in the PDF; rather, it was inside a TJ operator in a BT block. I cannot see any provision in the CAM::PDF library for handling cases where the desired text is in TJ lines (although perhaps there is, @ChrisDolan?), hence it was unable to be operated on or "swapped out" by CAM::PDF. After decompressing all the streams (where applicable), I found this TJ line, which had the text I wished to operate on:
[(D)-20(a)24(t)62(e)-46(:)86( )-46(1)52(5)-37(.)70(0)-37(2)52(.)-20(2)52(0)-37(1)52(9)] TJ
I don't believe it would have been possible for CAM::PDF to act on TJ lines; perhaps it can only act on Tj lines.
For anyone looking for a quick answer to this same problem, this "dirty" script worked for me in this case:
#!/usr/bin/perl
use strict;
use Compress::Raw::Zlib;
use bytes;
open(OUT,'>', "newfromoldscript.pdf");
my $fname = 'Order fulfilment process flowchart.pdf';
open(FILE, '<:raw', $fname) || die("can't open($fname): $!");
$/ = undef;
my $file = <FILE>;
my $file_len = length($file);
my $i = 0;
my $offset = 0;
my $o;
do {
$o = doX(substr($file, $offset, $file_len), $i);
$offset+=$o;
$i++;
} while($o && $i< 100);
sub doX {
my $file = shift;
my $i = shift;
my $stream = index($file, "\nstream");
if ($stream < 0) {
print OUT $file;
return 0;
}
$stream++;
my $deflate = 1;
my $line_before = rindex(substr($file,0,$stream), "<<");
print OUT substr($file,0,$line_before);
my $x = substr($file, $line_before,$stream-$line_before);
if ($i == 22) {
print "";
}
my $stream_len;
if ($x =~ /FlateDecode\/Length (\d+)>>/) {
$stream_len = $1;
}
if ($x =~ /FlateDecode\/Length (\d+)\//) {
print "Warn Object $i has len/len what the even is this?\n";
$stream_len = $1;
}
if ($x =~ /XML\/Length (\d+)>>/) {
$deflate = 0;
$stream_len = $1;
}
if (!$stream_len) {
die("I fail with no stream len : $x");
}
print "-->$line_before,$i,$stream=$stream_len=$x<--\n";
my $bytes = substr($file, $stream+8,$stream_len);
my $orig_bytes = $bytes; # inflate seems to mangle bytes, so take a copy
my $o;
my $d=new Compress::Raw::Zlib::Inflate();
if ($deflate) {
$d->inflate($bytes,$o);
} else {
$o = $bytes;
}
my $orig_x = $x;
my $changes;
my %change = (
'-20(2)52(0)-37(.)52(.)' => '-20(2)52(0)-37(2)52(0)', #trialling different reg ex's here
'-37(1)52(9)'=>'-37(2)52(0)', #reg ex's
'Date: 15.02.2019'=>'Date: 12.02.2020',
'[(A)[\d-]+(p)[\d-]+(p)[\d-]+(r)[\d-]+(o)[\d-]+(ve)[\d-]+(d)[\d-]+( )[\d-]+(B[^\]]+\] TJ'=>'(Approved By: George W) Tj??G-TAG??' #scrap the whole TJ, replace for Tj
);
foreach my $re (keys %change) {
my $to = $change{$re};
$re =~ s/([\(\)])/\\\1/g; # escape round brackets
print $re;
open (GW, ">tmp.gw");
print GW $re;
close (GW);
if ($o=~/$re/m) {
$o =~ s/$re/$to/mg;
print $o;
$changes++;
}
}
if ($changes) {
print "\n MADE CHANGES\n";
#split, get rid of the ? mark tag
my @remains = split('\?\?G-TAG\?\?', $o);
my $firsthalf = $remains[0];
my $secondhalf = $remains[1];
#reverse the string
$firsthalf = scalar reverse ($firsthalf);
if ($firsthalf =~ m/fT 52\.8 2F/){print "FOUND THE REVERSE"}
$firsthalf =~ s/fT 52\.8 2F/fT 52\.8 0F/;
#reg ex to back track to the nearest and thus relevant Font/F and set it to F0
#put it back in correct orientation
$firsthalf = scalar reverse ($firsthalf);
$o = join("", $firsthalf, $secondhalf);
open (WEIRD, ">tmp.weird");
print WEIRD $firsthalf;
close (WEIRD);
$changes++;
my $d = new Compress::Raw::Zlib::Deflate();
my $obytes;
my $obytes2;
my $status = $d->deflate($o, $obytes);
$d->flush($obytes2);
$bytes = $obytes . $obytes2;
if (length($bytes) != $stream_len) {
my $l = length($bytes);
print "-->$x<--\n";
warn("what do we do here $l != $stream_len");
$orig_x =~ s/$stream_len/$l/;
}
print OUT $orig_x . "stream\r\n";
print OUT $bytes . "\r";
} else {
print OUT $orig_x . "stream\r\n";
print OUT $orig_bytes . "\r";
}
open(TMP,">out/tmp.$i.bytes");
print TMP $o;
close(TMP);
return $stream + 8 + $stream_len + 1;
}
Essentially I swap out the TJ for a Tj for changing someone else's name on the document to my name, which makes it simpler to insert my change (but potentially messy). To enable this to display with capitalised letters, I had to reverse the string and swap the font (F) it was under from F2 to F0.
For the TJ line relating to the date, I swapped out the TJ characters for the date I wished to change it to, which meant I had to abide by the "unfriendly" syntax that TJ operator lines follow.

Array elements to match regressively using a Perl script

My Input file:
my $inp = "sample.txt";
#Sample.txt
As the HF exchange `\mathcal{\mathsf{}}` operator adopted
in, the same HF exchange operator is adopted in without further
optimization. However, the remaining `\mathbb{\mathbbm{}}`
`\mathbm{\mathbf{}}`, `\mathbf{\mathit{}}`. When compared with those
adopted in the MR hybrid functionals developed by Henderson {\it et al.}
for different `\mathrm{\mathscr{}}`, `\mathsf{\mathfrak{}}`
My Array Elements:
my @arr = qw(boldsymbol mathbb mathbbm mathbf mathcal mathbf mathit mathbf mathcal mathfrak mathit mathrm mathscr mathsf);
My concern is need to check the below pattern:
\\$arr[0]{$arr[1] ... \\$arr[0]{$arr[2] .... \\$arr[0]{$arr[3] ... \\$arr[0]{$arr[13]
...
...
\\$arr[13]{$arr[0] ... \\$arr[13]{$arr[1] ... \\$arr[13]{$arr[2] ... \\$arr[13]{$arr[13]
For Example:
\boldsymbol{\mathbb} and \\boldsymbol{\mathbbm} ...
\mathbb{\boldsymbol} and \\mathbb{\mathbbm} ...
# My 1st attempt
readFileinString($inp,\$inpcnt);
my $i = 1; my $j = 1; my $cls = $#arr;
while($inpcnt=~m/\\$arr[$i]\{\\$arr[$j]/g)
{
print "LL: $&\n";
$j += 1;
if($j == $cls) { $i++; }
}
# My 2nd attempt
my (@arr1, @arr2);
while(<>)
{
chomp;
push(@arr1, $_);
}
my $join1 = join "|", @arr1;
my $join2 = join "|", @arr1;
#print "($join1)\{($join2)";
while($str=~m/($join1)\{($join2)/g)
{
print "Matched: $&\n";
}
#------------------>Reading a file
sub readFileinString
#------------------>
{
my $File = shift;
my $string = shift;
use File::Basename;
my $filenames = basename($File);
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}
__DATA__
\boldsymbol
\mathbb
\mathbbm
\mathbf
\mathcal
\mathbf
\mathit
\mathbf
\mathcal
\mathfrak
\mathit
\mathrm
\mathscr
\mathsf
Could anyone please guide me on where I am going wrong in this code flow?
Loop over the list of terms twice, nested. This will result in the cartesian product.
use 5.026;
use strictures;
use Data::Munge qw(list2re);
my @markup = qw(boldsymbol mathbb mathbbm mathbf mathcal mathbf mathit
mathbf mathcal mathfrak mathit mathrm mathscr mathsf);
my $BS = '\\'; # a single backslash
my @expressions;
for my $first_term (@markup) {
for my $second_term (@markup) {
push @expressions, "$BS${first_term}{$BS$second_term"
}
}
my $regex = list2re #expressions;
my $input = <<'';
As the HF exchange \mathcal{\mathsf{}} operator adopted
in, the same HF exchange operator is adopted in without further
optimization. However, the remaining \mathbb{\mathbbm{}}
\mathbm{\mathbf{}}, \mathbf{\mathit{}}. When compared with those
adopted in the MR hybrid functionals developed by Henderson {\it et al.}
for different \mathrm{\mathscr{}} \mathsf{\mathfrak{}}
my @results = $input =~ m/$regex/gms;
# (
# '\\mathcal{\\mathsf',
# '\\mathbb{\\mathbbm',
# '\\mathbf{\\mathit',
# '\\mathrm{\\mathscr',
# '\\mathsf{\\mathfrak'
# )

Read a file into two hashes in order to retain the order

I am trying to read a file with user information categorized under a location. I want to fill in some of the fields using user input and output the file while keeping the fields under each location intact. For example - file:
[California]
$;FIrst_Name =
$;Last_Name=
$;Age =
[NewYork]
$;FIrst_Name =
$;Last_Name=
$;Age =
[Washington]
$;FIrst_Name =
$;Last_Name=
$;Age =
Once user provides input from command line it should look it
[California]
$;FIrst_Name = Jack
$;Last_Name= Daner
$;Age = 27
[NewYork]
$;FIrst_Name = Jill
$;Last_Name= XYZ
$;Age = 30
[Washington]
$;FIrst_Name = Kim
$;Last_Name= ABC
$;Age = 25
The order of First_Name, Last_Name and Age within each location can change, and even the order of locations can change, but each location section should remain separate and intact. I wrote the following code; some of it works for taking the whole file into one hash, but I am not able to preserve each location section within it. I tried using two hashes - can someone please help me, as it is getting really complex for me? Thanks a lot. (I had another issue with a similar file as well, but unfortunately could not resolve it either.)
EDITED code
Open the file
use strict;
use warnings;
use Getopt::Long;
sub read_config {
my $phCmdLineOption = shift;
my $phConfig = shift;
my $sInputfile = shift;
open($input.file, "<$InputFile") or die "Error! Cannot open $InputFile for reading: $!";
while (<$input.file>) {
$_ =~ s/\s+$//;
next if ($_ =~ /^#/);
next if ($_ =~ /^$/);
if ($_ =~ m/^\[(\S+)\]$/) {
$sComponent = $1;
next;
}
elsif ($_ =~ m/^;;\s*(.*)/) {
$sDesc .= "$1.";
next;
}
elsif ($_ =~ m/\$;(\S+)\$;\s*=\s*(.*)/) {
$sParam = $1;
$sValue = $2;
if ((defined $sValue) && ($sValue !~ m/^\s*$/)) {
$phfield->{$sCategory}{$sParam} = ["$sValue", "$sDesc"];
}
else {
$field->{$sCategory}{$sParam} = [undef, "$sDesc"];
}
}
$sParam = $sValue = $sDesc = "";
next;
}
}
Write the new file -
sub write_config {
my $phCmdLineOption = shift;
my $phConfig = shift;
my $sOut = shift;
open(outfile, ">$sOut") or die " $!";
foreach $sCategory (sort {$a cmp $b} keys %{$fields}) {
print $outfile "[$sCategory]\n";
foreach $sParam (sort {$a cmp $b} keys %{$fields->{$sCategory}}) {
$sDesc = ((defined $phConfig->{$sCategory}{$sParam}[1]) ? $fields->{$sCategory}{$sParam}[1] : "");
print $outfile ";;$sDesc\n" if ((defined $sDesc) && ($sDesc !~ m/^$/));
$sValue = ((defined $fields->{$sCategory}{$sParam}[0]) ? $fields->{$sCategory}{$sParam}[0] : undef);
print $outfile "$sValue" if (defined $sValue);
print $outfile "\n";
}
print $outfile "\n";
}
close($outfile);
return;
Note - I have posted this question on PerlMonks forum as well. Thanks a lot!
I think you're getting lost in the detail and skipping over some basics, which is unnecessarily complicating the problem. Those basics are:
Indent your code properly (it's amazing the difference this makes)
Always use the /x modifier on regexes, and lots of whitespace, to increase readability
When using lots of regexes, use the quote-like regex operator, qr//, to separate regex definition from regex use
Apart from that, you were headed in the right direction but there are a couple of insights on the algorithm you were missing which further increased the complexity.
Firstly, for small-time parsing of data, look out for the possibility that matching one type of line immediately disqualifies matching of other types of line. All the elsif's aren't necessary since a line that matches a category is never going to match a LastName or Age and vice versa.
Secondly, when you get a match, see if you can do what's needed immediately rather than storing the result of the match for processing later. In this case, instead of saving a "component" or "category" in a variable, put it immediately into the hash you're building.
Thirdly, if you're updating text files that are not huge, consider working on a new version of the file and then, at the end of the program, declaring the current version old and the new version current. This reduces the chances of unintentionally modifying something in place and allows comparison of the update with the original after execution - if necessary, "rollback" of the change is trivially easy, which one of your users may be very grateful for one day.
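A minimal sketch of that "write a new version, then swap" pattern (the file names and the transformation are hypothetical; the sketch writes its own sample input so it is self-contained):

```perl
use strict;
use warnings;

# Create a sample "current" file so the sketch runs on its own.
my ($cur, $new, $old) = ('data.txt', 'data.new', 'data.old');
open my $seed, '>', $cur or die "$cur: $!";
print {$seed} "line 1\nline 2\n";
close $seed;

# Build the updated version alongside the original ...
open my $in,  '<', $cur or die "$cur: $!";
open my $out, '>', $new or die "$new: $!";
while (my $line = <$in>) {
    $line =~ s/line/LINE/;    # stand-in for the real update
    print {$out} $line;
}
close $in;
close $out;

# ... then swap: the original survives as data.old for comparison/rollback.
rename $cur, $old or die "rename: $!";
rename $new, $cur or die "rename: $!";
```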
Fourthly and most of all, you've only got a couple of attributes or components to worry about, so deal with them in the concrete rather than the abstract. You can see below that I've looped over qw( First_Name Last_Name Age) rather than all keys of the hash. Now obviously, if you have to deal with open-ended or unknown attributes you can't do it this way but in this case, AFAICT, your fields are fixed.
Here's a version that basically works given the above mentioned constraints.
#!/usr/bin/env perl
use v5.12 ;
use Getopt::Long ;
my %db ; # DB hash
my $dbf = "data.txt" ; # DB file name
my $dbu = "data.new" ; # updated DB file name
my $dbo = "data.old" ; # Old DB file name
my ($cat, $first, $last, $age) ; # Default is undef
GetOptions( 'cat=s' => \$cat ,
'first=s' => \$first ,
'last=s' => \$last ,
'age=i' => \$age
);
die "Category option (--cat=...) is compulsory\n" unless $cat ;
open my $dbh, '<', $dbf or die "$dbf: $!\n"; # DB Handle
open my $uph, '>', $dbu or die "$dbu: $!\n"; # UPdate Handle
# REs for blank line, category header and attribute specification
my $blank_re = qr/ ^ \s* $ /x ;
my $cat_re = qr/ ^ \[ (\w+) \] \s* $ /x ;
my $attr_re = qr/ ^ \$ ; (?<key>\w+) \s* = \s* (?<val>\N*) $ /x ;
while ( <$dbh> ) {
next unless /$cat_re/ ;
my %obj = ( cat => $1 ) ;
while ( <$dbh> ) {
$obj{ $+{key} } = $+{val} if /$attr_re/ ;
last if /$blank_re/
}
$db{ $obj{cat} } = \%obj
}
# Grab existing obj, otherwise presume we're adding a new one
my $obref = $db{ $cat } // { cat => $cat } ;
$obref->{ First_Name } = $first if defined $first ;
$obref->{ Last_Name } = $last if defined $last ;
$obref->{ Age } = $age if defined $age ;
# Update the DB with the modified/new one
$db{ $obref->{cat} } = $obref ;
for (sort keys %db) {
my $obref = $db{ $_ } ;
printf $uph "[%s]\n", $obref->{ cat } ;
for (qw( First_Name Last_Name Age )) {
printf $uph '$;' . "%s = %s\n", $_, $obref->{ $_ }
}
print $uph "\n"
}
close $dbh ;
close $dbu ;
rename $dbf , $dbo ;
rename $dbu , $dbf ;
exit 0
User input here needs to be organized, and for this we can use named options for each field, plus one for the state. Getopt's facility for reading options into a hash is useful here. We also need to associate the names of these options with field names. With that in hand, it is simple to process the file, since we have a ready mechanism to identify lines of interest.
By putting lines in an array held by reference we can keep their order as well; that arrayref is the value for the section key in the hash. The hash is not necessary but adds flexibility for future development. While we are at it, we can also keep the order of sections by using a simple auxiliary array.
use warnings;
use strict;
use Getopt::Long;
use feature qw(say);
# Translate between user input and field name ($;) in file
my ($o1, $o2, $o3) = qw(first last age);
my @tags = ('FIrst_Name', 'Last_Name', 'Age');
my %desc = ($tags[0] => $o1, $tags[1] => $o2, $tags[2] => $o3);
my (%input, $state);
GetOptions(\%input, "$o1=s", "$o2=s", "$o3=i", 'state=s' => \$state);
my $locinfo = 'loc_info.txt';
open my $in_fh, '<', $locinfo;
my (%conf, @sec_order, $section, $field);
while (my $line = <$in_fh>)
{
chomp($line);
next if $line =~ m/^\s*$/;
# New section ([]), for hash and order-array
if ($line =~ m/^\s*\[(.*)\]/) {
push @sec_order, $section = $1;
next;
}
# If we are in a wrong state just copy the line
if ($section ne $state) {
push @{$conf{$section}}, $line . "\n";
next;
}
if (($field) = $line =~ m/^\$;\s*(.*?)\s*=/ ) {
if (exists $input{$desc{$field}}) {
# Overwrite what is there or append
$line =~ s|^\s*(.*?=\s*)(.*)|$1 $input{$desc{$field}}|;
}
}
else { warn "Unexpected line: |$line| --" }
push @{$conf{$section}}, $line . "\n";
}
close $in_fh;
for (@sec_order) { say "[$_]"; say @{$conf{$_}}; }
Invocation
script.pl -state STATE -first FIRST_NAME -last LAST_NAME -age INT
Any option may be left out, in which case that field is not touched. A field supplied on the command line will be overwritten if it already has something. (This can be changed easily.) This works for a single-state entry as it stands, but it is simple to modify if needed.
This is a basic solution. The first improvement would be to read the field names from the file itself, instead of having them hard-coded. (This would avoid the need to spot the typo FIrst and the inconsistent spacing before =, for one thing.) But the more refinements are added, the more one gets into template development. At some point soon it will be a good idea to use a module.
Note The regex delimiter above is different than elsewhere (|) to avoid the editor coloring all red.

What is the most efficient way to parse a text file using Perl?

Although this is pretty basic, I can't find a similar question, so please link to one if you know of an existing question/solution on SO.
I have a .txt file that is about 2MB and about 16,000 lines long. Each record length is 160 characters with a blocking factor of 10. This is an older type of data structure which almost looks like a tab-delimited file, but the separation is by single-chars/white-spaces.
First, I glob a directory for .txt files - there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.
my $txt_file = glob "/some/cheese/dir/*.txt";
Then I open the file with this line:
open (F, $txt_file) || die ("Could not open $txt_file");
As per the data dictionary for this file, I'm parsing each "field" out of each line using Perl's substr() function within a while loop.
while ($line = <F>)
{
$nom_stat = substr($line,0,1);
$lname = substr($line,1,15);
$fname = substr($line,16,15);
$mname = substr($line,31,1);
$address = substr($line,32,30);
$city = substr($line,62,20);
$st = substr($line,82,2);
$zip = substr($line,84,5);
$lnum = substr($line,93,9);
$cl_rank = substr($line,108,4);
$ceeb = substr($line,112,6);
$county = substr($line,118,2);
$sex = substr($line,120,1);
$grant_type = substr($line,121,1);
$int_major = substr($line,122,3);
$acad_idx = substr($line,125,3);
$gpa = substr($line,128,5);
$hs_cl_size = substr($line,135,4);
}
This approach takes a lot of time to process each line and I'm wondering if there is a more efficient way of getting each field out of each line of the file.
Can anyone suggest a more efficient/preferred method?
It looks to me that you are working with fixed width fields here. Is that true? If it is, the unpack function is what you need. You provide the template for the fields and it will extract the info from those fields. There is a tutorial available, and the template information is found in the documentation for pack which is unpack's logical inverse. As a basic example simply:
my @values = unpack("A1 A15 A15 ...", $line);
where 'A' means any text character (as I understand it) and the number is how many. There is quite an art to unpack as some people use it, but I believe this will suffice for basic use.
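For instance, a small self-contained sketch (the record layout here is invented, not the asker's actual one):

```perl
use strict;
use warnings;

# A short fixed-width record: 1-char flag, 15-char last name, 15-char first name.
my $line = "ASmith          John           ";

# In the template, "A" is an ASCII field with trailing whitespace stripped;
# the number after it is the field width in characters.
my ($flag, $lname, $fname) = unpack "A1 A15 A15", $line;

print "$flag|$lname|$fname\n";   # A|Smith|John
```

If the record has gaps between fields (as the question's substr offsets suggest), "x" in the template skips bytes, e.g. "A1 x4 A15".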
A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:
         Rate unpack substr regexp
unpack 2.59/s     --   -59%   -67%
substr 6.23/s   141%     --   -21%
regexp 7.90/s   206%    27%     --
Input was a file with 20k lines, each line had the same 160 characters on it (16 repetitions of the characters 0123456789). So it's the same input size as the data you're working with.
The Benchmark::cmpthese() method outputs the subroutine calls from slowest to fastest. The first column tells us how many times per second the subroutine can be run. The regular expression approach is fastest - not unpack, as I stated previously. Sorry about that.
The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.
#!/usr/bin/env perl
use Benchmark qw(:all);
use strict;
sub use_substr() {
print "use_substr(): New iteration\n";
open(F, "<data.txt") or die $!;
while (my $line = <F>) {
my($nom_stat,
$lname,
$fname,
$mname,
$address,
$city,
$st,
$zip,
$lnum,
$cl_rank,
$ceeb,
$county,
$sex,
$grant_type,
$int_major,
$acad_idx,
$gpa,
$hs_cl_size) = (substr($line,0,1),
substr($line,1,15),
substr($line,16,15),
substr($line,31,1),
substr($line,32,30),
substr($line,62,20),
substr($line,82,2),
substr($line,84,5),
substr($line,93,9),
substr($line,108,4),
substr($line,112,6),
substr($line,118,2),
substr($line,120,1),
substr($line,121,1),
substr($line,122,3),
substr($line,125,3),
substr($line,128,5),
substr($line,135,4));
#print "use_substr(): \$lname = $lname\n";
#print "use_substr(): \$gpa = $gpa\n";
}
close(F);
return 1;
}
sub use_regexp() {
print "use_regexp(): New iteration\n";
my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(.{9})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(.{4})';
open(F, "<data.txt") or die $!;
while (my $line = <F>) {
if ( $line =~ m/$pattern/o ) {
my($nom_stat,
$lname,
$fname,
$mname,
$address,
$city,
$st,
$zip,
$lnum,
$cl_rank,
$ceeb,
$county,
$sex,
$grant_type,
$int_major,
$acad_idx,
$gpa,
$hs_cl_size) = ( $1,
$2,
$3,
$4,
$5,
$6,
$7,
$8,
$9,
$10,
$11,
$12,
$13,
$14,
$15,
$16,
$17,
$18);
#print "use_regexp(): \$lname = $lname\n";
#print "use_regexp(): \$gpa = $gpa\n";
}
}
close(F);
return 1;
}
sub use_unpack() {
print "use_unpack(): New iteration\n";
open(F, "<data.txt") or die $!;
while (my $line = <F>) {
my($nom_stat,
$lname,
$fname,
$mname,
$address,
$city,
$st,
$zip,
$lnum,
$cl_rank,
$ceeb,
$county,
$sex,
$grant_type,
$int_major,
$acad_idx,
$gpa,
$hs_cl_size) = unpack(
"(A1)(A15)(A15)(A1)(A30)(A20)(A2)(A5)(A9)(A4)(A6)(A2)(A1)(A1)(A3)(A3)(A5)(A4)(A*)", $line
);
#print "use_unpack(): \$lname = $lname\n";
#print "use_unpack(): \$gpa = $gpa\n";
}
close(F);
return 1;
}
# Benchmark it
my $itt = 50;
cmpthese($itt, {
'substr' => sub { use_substr(); },
'regexp' => sub { use_regexp(); },
'unpack' => sub { use_unpack(); },
}
);
exit(0)
Do a split on each line, like this:
my @values = split(/\s/, $line);
and then work with your values.
You could do something like:
while ($line = <F>){
if ($line =~ /(.{1}) (.{15}) ........ /){
$nom_stat = $1;
$lname = $2;
...
}
}
I think it's faster than your substr approach. I'm not sure whether it's the fastest solution, but it might very well be.

How can I skip some block content while reading in Perl

I plan to skip the block whose start line includes "MaterializeU4()", using the subroutine read_block() below, but it failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push @lines, $line;
}
}
return \@lines if @lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I print the value of index($line, "MaterializeU4"), it output 25.
Then I updated the code as below
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25) );
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push @lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
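Here is a small self-contained sketch of that flip-flop behaviour on data shaped like the question's (using an array instead of a filehandle, purely so the example stands alone):

```perl
use strict;
use warnings;

my @input = (
    "status DynTest = <dynamic 100>\n",
    "vid = 10002\n",
    "\n",
    "status VIDNAME9000 = <U4 MaterializeU4()>\n",
    "vid = 9000\n",
    "\n",
    "status DynTest = <U1 100>\n",
    "vid = 100\n",
);

my @kept;
for my $line (@input) {
    # The flip-flop is true from the line matching MaterializeU4
    # through the next blank line, so that whole block is skipped.
    next if $line =~ /MaterializeU4/ .. $line =~ /^\s*$/;
    push @kept, $line;
}
print @kept;
```

Only the MaterializeU4 block (and its terminating blank line) is dropped; the other two blocks pass through untouched.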
First, using a regex instead of index is probably better, since you can tune it to the exact format of the status string if you decide to be stricter than just "substring exists".
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (Materialize4U) block!!!
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status.*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my @all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (@all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift @all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
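A tiny demonstration, using a line shaped like the one in the question:

```perl
use strict;
use warnings;

my $line = 'status VIDNAME9000 = <U4 MaterializeU4()>';

# index returns the zero-based offset of the substring, or -1 if absent.
print index($line, 'MaterializeU4'), "\n";   # 25: found at offset 25
print index($line, 'NoSuchString'),  "\n";   # -1: not found
```

This is why the check `!= 0` matched almost every line, and why the asker's updated check against 25 happened to work for this particular line layout.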
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line =~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.