I have the following Perl code. I Know what the end result is: if I run it and pass in an x9.37 file, it will spit out each field of text. That's great, but I am trying to port this to another language, and I can't read Perl at all. If someone could turn this into some form of pseudocode (I don't need working Java - I can write that part) I just need someone to explain what is going on in the Perl below!
#!/usr/bin/perl -w
use strict;
use Encode;
my $tiff_flag = 0;
my $count = 0;
open(FILE,'<',$ARGV[0]) or die 'Error opening input file';
binmode(FILE) or die 'Error setting binary mode on input file';
while (read (FILE,$_,4)) {
my $rec_len = unpack("N",$_);
die "Bad record length: $rec_len" unless ($rec_len > 0);
read (FILE,$_,$rec_len);
if (substr($_,0,2) eq "\xF5\xF2") {
$_ = substr($_,0,117);
}
print decode ('cp1047', $_) . "\n";
}
close FILE;
read (FILE,$_,4) : read 4 bytes from FILE input stream and load into the variable $_
$rec_len = unpack("N",$_): interpret the first 4 bytes of the variable $_ as an unsigned 32-bit integer in big-endian order, assign to the variable $rec_len
read (FILE,$_,$rec_len): read $rec_len bytes from FILE stream into variable $_
substr($_,0,2): the first two characters of the variable $_
"\xF5\xF2": a two-character string consisting of the bytes 245 and 242
$_ = substr($_,0,117): set $_ to the first 117 characters of $_
use Encode;print decode ('cp1047', $_): interpret the contents of $_ with "code page 1047", i.e., EBCDIC and output to standard output
-w is the old way of enabling warnings.
my declares a lexically scoped variable.
open with < opens a file for reading, the filename is taken from the #ARGV array, i.e. the program's parameters. FILE is the file handle associated with the file.
read reads four bytes into the $_ variable. unpack interprets it as an unsigned 32-bit long (so the following condition can fail only when it's 0).
The next read reads that many bytes to $_ again. substr extracts a substring, and if the first two bytes there are "\xf5\xf2", it shortens the string to the first 117 bytes. It then converts the string to the code page 1047.
Related
Right now I am trying to do an assignment where I have to
- Extract information from an HTML file
- Save it to a scalar
- Run a regular expression to find the number of seats available in the designated course (the program argument is the course number for example 100 for ICS 100)
- If the course has multiple sessions, I have to find the sum of the seats available and print
- The output is just the number of seats available
The problem here is that when I was debugging and checking to make sure that my variable I have the program arg saved to was storing the correct value, it was storing the values with an extra 0 behind it.
ex.) perl filename.pl 100
ARGV[0] returns as 0100
I've tried storing the True regular expression values to an array, saving using multiple scalar variables, and changing my regular expression but none worked.
die "Usage: perl NameHere_seats.pl course_number" if (#ARGV < 1);
# This variable will store the .html file contents
my $fileContents;
# This variable will store the sum of seats available in the array #seatAvailable
my $sum = 0;
# This variable will store the program argument
my $courseNum = $ARGV[0];
# Open the file to read contents all at once
open (my $fh, "<", "fa19_ics_class_availability.html") or die ("Couldn't open 'fa19_ics_class_availability.html'\n");
# use naked brakets to limit the $/
{
#use local $/ to get <$fh> to read the whole file, and not one line
local $/;
$fileContents = <$fh>;
}
# Close the file handle
close $fh;
# Uncomment the line below to check if you've successfully extracted the text
# print $fileContents;
# Check if the course exists
die "No courses matched...\n" if ($ARGV[0] !~ m/\b(1[0-9]{2}[A-Z]?|2[0-8][0-9][A-Z]?|29[0-3])[A-Z]?\b/);
while ($fileContents =~ m/$courseNum(.+?)align="center">(\d)</) {
my $num = $2;
$sum = $sum + $num;
}
print $sum;
# Use this line as error checking to make sure #ARGV[0] is storing proper number
print $courseNum;
The current output I am receiving when program argument is 100 is just 0, and I assume it's because the regular expression is not catching any values as true therefore the sum remains at a value of 0. The output should be 15...
This is a link to the .html page > https://laulima.hawaii.edu/access/content/user/emeyer/ics/215/FA19/01/perl/fa19_ics_class_availability.html
You're getting "0100" because you have two print() statements.
print $sum;
...
print $courseNum;
And because there are no newlines or other output between them, you get the two values printed out next to each other. $sum is '0' and $courseNum is '100'.
So why is $sum zero? Well, that's because your regex isn't picking up the data you want it to match. Your regex looks like this:
m/$courseNum(.+?)align="center">(\d)</
You're looking for $courseNum followed by a number of other characters, followed by 'align="center">' and then your digit. This doesn't work for a number of reasons.
The string "100" appears many times in your text. Many times it doesn't even mean a course number (e.g. "100%"). Perhaps you should look for something more precise (ICS $coursenum).
The .+? doesn't do what you think it does. The dot doesn't match newline characters unless you use the /s option on the match operator.
But even if you fix those first two problems, it still won't work as there are a number of numeric table cells for each course and you're doing nothing to ensure that you're grabbing the last one. Your current code will get the "Curr. Enrolled" column, not the "Seats Avail" one.
This is a non-trivial HTML parsing problem. It shouldn't be addressed using regexes (HTML should never be parsed using regexes). You should look at one of the HTML parsing modules from CPAN - I think I'd use Web::Query.
Update: An example solution using Web::Query:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use File::Basename;
use Web::Query;
my $course_num = shift
or die 'Usage: perl ' . basename $0 . " course_number\n";
my $source = 'fa19_ics_class_availability.html';
open my $fh, '<', $source
or die "Cannot open '$source': $!\n";
my $html = do { local $/; <$fh> };
my $count_free;
wq($html)
# Get each table row in the table
->find('table.listOfClasses tr')
->each(sub {
my ($i, $elem) = #_;
my #tds;
# Get each <td> in the <tr>
$elem->find('td')->each(sub { push #tds, $_[1] });
# Ignore rows that don't have 13 columns
return if #tds != 13;
# Ignore rows that aren't about the right course
return if $tds[2]->text ne "ICS $course_num";
# Add the number of available places
$count_free += $tds[8]->text;
});
say $count_free;
I was trying to test file encryption using the chilkat functionality. Based on code found on this example page, I replaced the last part with this:
# Encrypt a string...
# The input string is 44 ANSI characters (i.e. 44 bytes), so
# the output should be 48 bytes (a multiple of 16).
# Because the output is a hex string, it should
# be 96 characters long (2 chars per byte).
my $input = "sample.pdf";
# create file handle for the pdf file
open my $fh, '<', $input or die $!;
binmode ($fh);
# the output should be sample.pdf.enc.dec
open my $ffh, '>', "$input.enc.dec" or die $!;
binmode $ffh;
my $encStr;
# read 16 bytes at a time
while (read($fh,my $block,16)) {
# encrypt the 16 bytes block using encryptStringEnc sub provided by chilkat
$encStr = $crypt->encryptStringENC($block);
# Now decrypt:
# decrypt the encrypted block
my $decStr = $crypt->decryptStringENC($encStr);
# print it in the sample.pdf.enc.dec file
print $ffh $decStr;
}
close $fh;
close $ffh;
Disclaimer:
I know the CBC mode is not recommended for file encryption because if one block is lost, the other blocks are lost too.
The output file is corrupted and when I look with beyond compare at the two files, there are chunks of the file which match and there are chunks of file which doesn't. What am I doing wrong?
You're trying to use character string encryption (encryptStringENC(), decryptStringENC()) for what is, at least partly, a binary file.
This worked for me:
my $input = "sample.pdf";
# create file handle for the pdf file
open my $fh, '<', $input or die $!;
binmode $fh;
# the output should be sample.pdf.enc.dec
open my $ffh, '>', "$input.enc.dec" or die $!;
binmode $ffh;
my $inData = chilkat::CkByteData->new;
my $encData = chilkat::CkByteData->new;
my $outData = chilkat::CkByteData->new;
# read 16 bytes at a time
while ( my $len = read( $fh, my $block, 16 ) ) {
$inData->clear;
$inData->append2( $block, $len );
$crypt->EncryptBytes( $inData, $encData );
$crypt->DecryptBytes( $encData, $outData );
print $ffh $outData->getData;
}
close $fh;
close $ffh;
You likely better off perusing the Chilkat site further though, there are sample codes for binary data.
I'm going to write and post a link to a sample that is much better than the examples posted here. The examples posted here are not quite correct. There are two important Chilkat Crypt2 properties that one needs to be aware of: FirstChunk and LastChunk. By default, both of these properties are true (or the value 1 in Perl). This means that for a given call to encrypt/decrypt, such as EncryptBytes, DecryptBytes, etc. it assumes the entire amount of data was passed. For CBC mode, this is important because the IV is used for the first chunk, and for the last chunk, the output is padded to the block size of the algorithm according to the value of the PaddingScheme property.
One can instead feed the input data to the encryptor chunk-by-chunk by doing the following:
For the 1st chunk, set FirstChunk=1, LastChunk=0.
For middle chunks, set FirstChunk=0, LastChunk=0.
For the final chunk (even if a 0-byte final chunk), set FirstChunk=0, LastChunk=1. This causes a final padded output block to be emitted.
When passing chunks using FirstChunk/LastChunk, one doesn't need to worry about passing chunks matching the block size of the algorithm. If a partial block is passed in, or if the bytes are not an exact multiple of the block size (16 bytes for AES), then Chilkat will buffer the input and the partial block will be added to the data passed in the next chunk. For example:
FirstChunk=1, LastChunk=0, pass in 23 bytes, output is 16 bytes, 7 bytes buffered.
FirstChunk=0, LastChunk=0, pass in 23 bytes, output is 16 bytes, (46-32 bytes) 14 bytes buffered
FirstChunk=0, LastChunk=1, pass in 5 bytes, output is 32 bytes, (14 buffered bytes + 5 more = 19 bytes. The 19 bytes is one full block (16 bytes) plus 3 bytes remainder, which is padded to 16, and thus the output is 32 bytes and the CBC stream is ended.
This example demonstrates using FirstChunk/LastChunk. Here's the example: https://www.example-code.com/perl/encrypt_file_chunks_cbc.asp
I am attempting to obfusicate some code, and using a short "pm" library. The code is in the cgi-bin. But I am getting "Identifier too long" errors ... yet the line it's deciphering is only 1016 chars in plain form.
Here is the library:
package dml;
use Filter::Simple;
FILTER {
$_=~tr/[hjndr 9_45|863!7]/[acefbd2461507938]/;
}
And the actual program itself ...
BEGIN {
$path = 'D:/home/cristofa/';
}
use lib $path;
use dml;
! 9_44! 96476_6_68!h9d9d6666669n4!4d454_4 4|4 9n4!4d4 9d4j45634d6|6_9d4 63 ...
no dml;
I have shortened the code for obvious reasons.
As well as the "identifier too long", I can change other bits, (I think removing filter::simple and using tr~ on its own) and then get "NO is not allowed" referring to the 'no dml' line. I tried putting the data into $_='! 9_44 ...' but that comes back re changing a read only value!!!
If you're curious, the first two figures above SHOULD convert to "3d". I step through the decoded string two at a time, and thus hex for the above is "=", (since the first line of the decoded file is "$f='xyz';" - and I ran into problems trying to substitute the Dollar back to a variable - I ended up using "=$f='xyz';" in the script and then using $data=~s/=\$/\$/g; when converting)
But my 'dilemma' is why that 1016 byte line is causing the script to throw a "wobbly" when I have another program using a library which decodes 2678 bytes with no problem.
$ perl -E'
$_ = "! 9_44! 96476_6_68!h9d9d6666669n4!4d454_4 4|4 9n4!4d4 9d4j45634d6|6_9d4 63 ...";
tr/[hjndr 9_45|863!7]/[acefbd2461507938]/;
say;
'
3d24663d27687474703a2f2f7777772e636f61646d656d2e636f6d2f6c61796f75742f6d79d...
That is indeed a very very long identifier.
That looks like hex. Let's try converting the sequence from hex into bytes and rendering them on a UTF-8 terminal.
$ perl -E'
$_ = "! 9_44! 96476_6_68!h9d9d6666669n4!4d454_4 4|4 9n4!4d4 9d4j45634d6|6_9d4 63 ...";
tr/[hjndr 9_45|863!7]/[acefbd2461507938]/;
$_ = pack("H*", $_);
say;
'
=$f='http://www.coadmem.com/layout/my<garbage>
Bingo! You forgot $_ = pack("H*", $_); in your filter.
By the way, tr/[abc]/[def]/ is equivalent tr/][abc/][def/, which is equivalent to tr/abc/def/ (except for the returned value you ignore). Get rid of [ and ]!
My script generates some very very huge files, and I am trying to print/save the output in a binary format to reduce the file size as much as possible!
Each time that script generates five values, like:
$a1 = 1.64729
$a2 = 4.33329
$a3 = 3.55724
$a4 = 1.45759
$a5 = 7.474700
It prints in the output like:
A:1.64729,4.33329,3.55724,1.45759,7.474700
I am not sure whether this is the best way, but I thought to pack each row when it is printing to the output! I used pack/unpack built-in function in Perl!
I had a look at perldoc, but I did not understand which format specifiers were proper (???)!
#!/usr/bin/perl
...
#A = ($a1,$a2,$a3,$a4,$a5);
print pack ("???", ("A:", join(",", map { sprintf "%.1f", $_ } #A)), "\n";
If you compress the file (instead of trying to write binary bytes) you will get a small file. That's because your entire file will have mostly the ten digit characters, plus a decimal point, and a comma.
You can compress a file as you write it via IO::Zlib. This will use either the Zlib library, or the gzip command.
However, if you want to use pack, go ahead. Get the Camel Book which gives much clearer documentation than the standard Perldoc.
It's not all that difficult:
my $output = "A:1.64729,4.33329,3.55724,1.45759,7.474700";
$output =~ s/^A://; #Remove the 'A:'
my #numbers = split /,/, $output # Make into an array
my $packed = pack "d5", #numbers; # Pack five inputs as floating point numbers
say join ",", "d5", $packed; # Unpacks those five decimal encoded numbers
You'll probably have to use syswrite and sysread since aren't reading and writing strings. This is unbuffered reading and writing, and you have to specify the number of bytes you're reading or writing.
One more thing: If you know where the decimal point is in the number (that is, it's always a number between 1 and up to 10) you can convert the number into an integer which will allow you to pack the number into an even smaller number of bytes:
my $output = "A:1.64729,4.33329,3.55724,1.45759,7.474700";
$output =~ s/^A://; #Remove the 'A:'
$output =~ s/,//g; #Remove all the decimal points
my #numbers = split /,/, $output # Make into an array
my $packed = pack "L5", #numbers; # Pack five inputs as unsigned long numbers
I download a CSV file from another server using perl script. After download I wish to check whether the file contains any corrupted data or not. I tried to use Encode::Detect::Detector to detect encoding but it returns 'undef' in both cases:
if the string is ASCII or
if the string is corrupted
So using the below program I can't differentiate between ASCII & Corrupted Data.
use strict;
use Text::CSV;
use Encode::Detect::Detector;
use XML::Simple;
use Encode;
require Encode::Detect;
my #rows;
my $init_file = "new-data-jp-2013-8-8.csv";
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, $init_file or die $init_file.": $!";
while ( my $row = $csv->getline( $fh ) ) {
my #fields = #$row; # get line into array
for (my $i=1; $i<=23; $i++){ # I already know that CSV file has 23 columns
if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef){
print "the encoding is undef in col".$i.
" where field is ".$fields[$i-1].
" and its length is ".length($fields[$i-1])." \n";
}
else {
my $string = decode("Detect", $fields[$i-1]);
print "this is string print ".$string.
" the encoding is ".Encode::Detect::Detector::detect($fields[$i-1]).
" and its length is ".length($fields[$i-1])."\n";
}
}
}
You have some bad assumptions about encodings, and some errors in your script.
foo() eq undef
does not make any sense. You cannot compare to string equality to undef, as undef isn't a string. It does, however, stringify to the empty string. You should use warnings to get error messages when you do such rubbish. To test whether a value is not undef, use defined:
unless(defined foo()) { .... }
The Encode::Detector::Detect module uses an object oriented interface. Therefore,
Encode::Detect::Detector::detect($foo)
is wrong. According to the docs, you should be doing
Encode::Detect::Detector->detect($foo)
You probably cannot do decoding on a field-by-field basis. Usually, one document has one encoding. You need to specify the encoding when opening the file handle, e.g.
use autodie;
open my $fh, "<:utf8", $init_file;
While CSV can support some degree of binary data (like encoded text), it isn't well suited for this purpose, and you may want to choose another data format.
Finally, ASCII data effectively does not need any de- or encoding. The undef result for encoding detection does make sense here. It cannot be asserted with certaincy that a document was encoded to ASCII (as many encodings are a superset of ASCII), but given a certain document it can be asserted that it isn't valid ASCII (i.e. has the 8th bit set) but must rather be a more complex encoding like Latin-1, UTF-8.