Unpack fields from IBM data file - perl

I have an EBCDIC-encoded data file from an IBM mainframe that needs to be parsed and converted to ASCII. I was able to convert the text by reading it byte by byte in hexadecimal and looking up the corresponding ASCII values.
My issue is that the EBCDIC-encoded file contains 30 bytes that are packed and need to be unpacked to get the actual values. I have tried the PHP pack/unpack functions as well as Perl, but had no luck: the values I get don't match the values I am looking for. I tried unpacking with C, c, H, h and N.
Assuming the file holds EBCDIC-encoded data:
packed fields are at positions 635-664, 30 bytes long
data1 = 9 bytes
data2 = 9 bytes
data3 = 3 bytes
data4 = 3 bytes
data5 = 3 bytes
data6 = 3 bytes
PHP:
$datafile = fopen("/var/www/data/datafile", "rb");
$regebcdicdata = fread($datafile, 634);
$packfields = fread($datafile, 30);
$arr= unpack('c9data1/c9bdata2/c3data3/C3data4/C3data5/C3data6',$packfields);
print_r($arr);
PERL:
open my $fh, '<:raw', '/var/www/html/PERL/test';
my $bytes_read = read $fh, my $bytes, 634;
my $bytes_read2 = read $fh, my $bytes2, 30;
my ($data1,$data2,$data3,$data4,$data5,$data6) = unpack 'C9 C9 C3 C3 C3 C3', $bytes2;
UPDATE:
I have already found a solution. Those 30 bytes were packed in a specified format, so I just unpack them using the PHP unpack function.
For the EBCDIC conversion, I read the file byte by byte, get the hexadecimal value using the bin2hex() function, find the matching ASCII hexadecimal value, and use the chr() function to get the ASCII character so the user can see it in a readable format.
I used conversion table at https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.3.0/com.ibm.swg.im.iis.ds.parjob.adref.doc/topics/r_deeadvrf_ASCII_and_EBCDIC_Conversion_Tables.html.

I can't possibly help you to unpack those thirty bytes without knowing how they have been packed. Surely you must have some idea?
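For illustration only: one format that mainframe files often use for "packed" numbers is IBM packed decimal (COMP-3), where each byte holds two decimal digits and the final nibble is a sign. Nothing in the question confirms that this is the format in use, so the sketch below is purely an assumption, and unpack_comp3 is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one IBM packed-decimal (COMP-3) field.
# ASSUMPTION: the field really is COMP-3; the question does not say so.
sub unpack_comp3 {
    my ($bytes) = @_;
    my $hex  = unpack 'H*', $bytes;    # two lowercase hex digits per byte
    my $sign = chop $hex;              # the last nibble carries the sign
    die "not packed decimal: $hex$sign\n"
        if $hex =~ /[^0-9]/ || $sign !~ /[a-f]/;
    $hex =~ s/^0+(?=\d)//;             # drop leading zeros, keep one digit
    # Sign nibbles D and B mean negative; C, F, A and E mean positive.
    return ( $sign =~ /[db]/ ? '-' : '' ) . $hex;
}

print unpack_comp3("\x01\x23\x4C"), "\n";    # 1234
print unpack_comp3("\x00\x98\x7D"), "\n";    # -987
```

The value is returned as a string so that long fields (a 9-byte field holds 17 digits) are not rounded by floating-point conversion.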
As for the regular EBCDIC text, you need to establish exactly which code page your document uses, and then you can simply use Perl IO to decode it.
Suppose you are dealing with code page 37; then you can open your file like this:
open my $fh, '<:encoding(cp37)', 'ebcdic_file' or die $!;
and then you can read the data as normal. It will be retrieved as Unicode characters.
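The same decoding can be tried on a byte string in memory, which is handy for checking that a code page gives sensible results before touching the real file. This is a minimal sketch; the sample bytes are made up for the demonstration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# "HELLO" as EBCDIC bytes (sample data invented for this demo).
my $ebcdic = "\xC8\xC5\xD3\xD3\xD6";

# decode() turns the raw bytes into a Perl character string.
my $text = decode( 'cp37', $ebcdic );

print "$text\n";    # HELLO
```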

This is a wild guess, as I know neither which EBCDIC code page you are using nor how the thirty bytes are packed. But there is a slim chance that it will do what you want.
Please try running this program and tell us the results:
use strict;
use warnings 'all';
use feature 'say';
my @data = do {
    open my $fh, '<:encoding(cp37)', '/var/www/html/PERL/test' or die $!;
    local $/;
    my $data = <$fh>;
    unpack '@634 A9 A9 A3 A3 A3 A3', $data;
};
say for @data;
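The template in the program above skips to an absolute offset before extracting the fields; in a pack/unpack template that is written as `@N`. A small sketch of the same idea on an invented string:

```perl
use strict;
use warnings;

# Invented record: a 6-character header followed by two fields.
my $record = 'HEADERalphabeta';

# '@6' jumps to offset 6 (counted from 0), then A5 and A4 take
# 5 and 4 characters, stripping any trailing spaces.
my @fields = unpack '@6 A5 A4', $record;

print join( '|', @fields ), "\n";    # alpha|beta
```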

Related

Using perl to split file in flat text

I have a flat file that is created with offsets, e.g. row 1: char 1-3 = ID, 4-19 = user name, 20-40 = last name, etc.
What's the best way to go about creating a Perl script to read this? And is there any way to make it flexible based on different offset groups? Thank you!
If the positions/lengths are in terms of Unicode Code Points:
# Use :encoding(UTF-8) on the file handle.
my @fields = unpack('A3 A16 A21', $decoded_line);
If the positions/lengths are in terms of bytes:
use Encode qw( decode );
sub trim_end(_) { $_[0] =~ s/\s+\z//r }
# Use :raw on the file handle.
my @fields =
map trim_end(decode("UTF-8", $_)),
unpack('a3 a16 a21', $encoded_line);
In both cases, trailing whitespace is trimmed.
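To make this flexible for different offset groups, one option is to keep the layout as data and build the unpack template from it. The sketch below assumes character-based widths; the field names and the parse_line helper are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical layout: [ field name => width ], in file order.
my @layout = ( [ id => 3 ], [ user_name => 16 ], [ last_name => 21 ] );

sub parse_line {
    my ( $line, @layout ) = @_;
    # Build a template such as 'A3 A16 A21' from the widths.
    my $template = join ' ', map { "A$_->[1]" } @layout;
    my %record;
    @record{ map { $_->[0] } @layout } = unpack $template, $line;
    return \%record;
}

# Sample fixed-width line built for the demo.
my $line = sprintf '%-3s%-16s%-21s', '042', 'jdoe', 'Doe';
my $rec  = parse_line( $line, @layout );

print "$rec->{id}|$rec->{user_name}|$rec->{last_name}\n";    # 042|jdoe|Doe
```

A different offset group then only needs a different @layout list; the parsing code stays the same.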

Split a string based on ASCII value

I need to parse a delimited file (generated by a mainframe job and FTPed over to Windows), but I have a few questions about splitting on the delimiter.
According to the documentation, the fields are separated by '1D'. But when I open the file in Notepad++ (the Encoding menu shows 'Encode in ANSI'), the separator looks like a 'vertical broken bar'. Q. What is '1D'?
open my $handle, '<', 'sample.txt';
chomp(my @lines = <$handle>);
close $handle;
my @a = unpack("C*", $lines[0]);
print Dumper \@a;
# $VAR1 = [65,166,66,166,67,166];
From the Dumper output, we see Perl considers the code for the vertical broken bar to be 166.
As per link1, 166 is indeed the vertical broken bar, whereas as per link2, 166 is the feminine ordinal indicator. Q. Any suggestion as to why they differ?
my $str = $lines[0];
print Dumper $str;
# $VAR1 = 'AªBªCª';
We can see that the output contains the 'feminine ordinal indicator', not the 'vertical broken bar'. Q. Not sure why Perl reads a 'bar' but then starts treating it as something else.
# I copied the vertical broken bar from notepad++ for use below
my @b = split(/¦/, $lines[0]);
print Dumper \@b;
# $VAR1 = [ 'AªBªCª' ];
Since Perl has started treating the bar as something else, there is, as expected, no split here. I thought I could split by giving the code 166 directly, but it seems split() doesn't accept a character code as an argument. Q. Any workaround to pass a character code to split()?
# I copied the vertical broken bar from notepad++ and created A¦B¦C
my @c = split(/¦/, 'A¦B¦C');
print Dumper \@c;
#$VAR1 = [ 'A','B','C']; # works as expected, added here just for completion
Any pointers will be a great help!
Update:
my @a = map {ord $_} split //, $lines[0]; print Dumper \@a;
# $VAR1 = [ 65,166,66,166,67,166];
When you receive an input file from an unknown source, the most important thing you need to know about it is "what character encoding does it use?" Without that information, any processing that you do on the file is based on guesswork.
The problem isn't helped by people who talk about "extended ASCII" as though it's a meaningful term. ASCII only contains 128 characters. There are many definitions of what the next 128 character codes represent, and many of them are contradictory.
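To see how the same byte can be two different characters, here is a small sketch that decodes byte 166 (0xA6) under two common 8-bit code pages. Which code pages the two linked tables actually describe is an assumption on my part; cp1252 and cp437 are just plausible candidates.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $byte = "\xA6";    # decimal 166

# ASSUMPTION: the two tables correspond to cp1252 and cp437.
my %code_point;
for my $cp (qw( cp1252 cp437 )) {
    $code_point{$cp} = ord decode( $cp, $byte );
    printf "%-6s U+%04X\n", $cp, $code_point{$cp};
}
# cp1252 maps the byte to U+00A6 (BROKEN BAR),
# cp437 maps it to U+00AA (FEMININE ORDINAL INDICATOR).
```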
It seems that you have a solution to your problem. Splitting on '¦' (copied from Notepad++) does what you want. So I suggest you do that. If you want to use the actual character code, then you can convert 166 to hexadecimal (0xA6) and use that:
split /\xA6/, ... ;
You should always decode your inputs and encode your outputs.
my $acp;
BEGIN {
require Win32;
$acp = "cp".Win32::GetACP();
}
use open ':std', ":encoding($acp)";
Now, @lines will contain strings of Unicode Code Points. As such, you can now use the following:
use utf8; # Source code is encoded using UTF-8.
my @b = split(/¦/, $lines[0]);
Alternatively, every one of the following will also work now:
my @b = split(/\N{BROKEN BAR}/, $lines[0]);
my @b = split(/\N{U+00A6}/, $lines[0]);
my @b = split(/\x{A6}/, $lines[0]);
my @b = split(/\xA6/, $lines[0]);
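A self-contained sketch of the decode-then-split approach, using an in-memory byte string instead of a file. It assumes the data is Windows-1252 (the usual meaning of "ANSI" on a Western European Windows system), and it also shows how to split on a separator given as a numeric character code.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Sample bytes as they would arrive from a handle opened with :raw.
my $raw_line = "A\xA6B\xA6C\xA6";

# ASSUMPTION: the file is encoded as Windows-1252.
my $line = decode( 'cp1252', $raw_line );

# Build the separator from its numeric code and quote it for the regex.
my $sep    = chr 0xA6;
my @fields = split /\Q$sep\E/, $line;

print join( ',', @fields ), "\n";    # A,B,C
```

split drops the trailing empty field by default; pass -1 as a third argument to keep it.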

Can someone explain this loop to me?

I have the following Perl code. I know what the end result is: if I run it and pass in an x9.37 file, it will spit out each field of text. That's great, but I am trying to port this to another language, and I can't read Perl at all. If someone could turn this into some form of pseudocode (I don't need working Java, I can write that part), I just need someone to explain what is going on in the Perl below!
#!/usr/bin/perl -w
use strict;
use Encode;
my $tiff_flag = 0;
my $count = 0;
open(FILE,'<',$ARGV[0]) or die 'Error opening input file';
binmode(FILE) or die 'Error setting binary mode on input file';
while (read (FILE,$_,4)) {
my $rec_len = unpack("N",$_);
die "Bad record length: $rec_len" unless ($rec_len > 0);
read (FILE,$_,$rec_len);
if (substr($_,0,2) eq "\xF5\xF2") {
$_ = substr($_,0,117);
}
print decode ('cp1047', $_) . "\n";
}
close FILE;
read (FILE,$_,4) : read 4 bytes from FILE input stream and load into the variable $_
$rec_len = unpack("N",$_): interpret the first 4 bytes of the variable $_ as an unsigned 32-bit integer in big-endian order, assign to the variable $rec_len
read (FILE,$_,$rec_len): read $rec_len bytes from FILE stream into variable $_
substr($_,0,2): the first two characters of the variable $_
"\xF5\xF2": a two-character string consisting of the bytes 245 and 242
$_ = substr($_,0,117): set $_ to the first 117 characters of $_
use Encode;print decode ('cp1047', $_): interpret the contents of $_ with "code page 1047", i.e., EBCDIC and output to standard output
-w is the old way of enabling warnings.
my declares a lexically scoped variable.
open with < opens a file for reading, the filename is taken from the @ARGV array, i.e. the program's parameters. FILE is the file handle associated with the file.
read reads four bytes into the $_ variable. unpack interprets them as an unsigned 32-bit big-endian integer (so the following condition can fail only when it's 0).
The next read reads that many bytes into $_ again. substr extracts a substring, and if the first two bytes there are "\xf5\xf2", it shortens the string to the first 117 bytes. It then decodes the string from code page 1047 (EBCDIC) into a Perl character string and prints it.
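The record structure described above (a 4-byte big-endian length followed by that many EBCDIC bytes) can be sketched on an in-memory stream, which may make the loop easier to port. The record contents are invented for the demo, and the 117-byte truncation of "\xF5\xF2" records is left out.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Build two length-prefixed EBCDIC records in memory (invented data).
my $stream = join '', map {
    my $body = encode( 'cp1047', $_ );
    pack( 'N', length $body ) . $body;    # 4-byte big-endian length + body
} '0103T', '99END';

open my $fh, '<:raw', \$stream or die $!;
my @records;
while ( read $fh, my $prefix, 4 ) {
    my $len = unpack 'N', $prefix;        # record length
    die "Bad record length: $len" unless $len > 0;
    read( $fh, my $body, $len ) == $len or die "Truncated record";
    push @records, decode( 'cp1047', $body );    # EBCDIC to characters
}
close $fh;

print "$_\n" for @records;
```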

Chilkat encryption doesn't work as expected

I was trying to test file encryption using the chilkat functionality. Based on code found on this example page, I replaced the last part with this:
# Encrypt a string...
# The input string is 44 ANSI characters (i.e. 44 bytes), so
# the output should be 48 bytes (a multiple of 16).
# Because the output is a hex string, it should
# be 96 characters long (2 chars per byte).
my $input = "sample.pdf";
# create file handle for the pdf file
open my $fh, '<', $input or die $!;
binmode ($fh);
# the output should be sample.pdf.enc.dec
open my $ffh, '>', "$input.enc.dec" or die $!;
binmode $ffh;
my $encStr;
# read 16 bytes at a time
while (read($fh,my $block,16)) {
# encrypt the 16 bytes block using encryptStringEnc sub provided by chilkat
$encStr = $crypt->encryptStringENC($block);
# Now decrypt:
# decrypt the encrypted block
my $decStr = $crypt->decryptStringENC($encStr);
# print it in the sample.pdf.enc.dec file
print $ffh $decStr;
}
close $fh;
close $ffh;
Disclaimer:
I know the CBC mode is not recommended for file encryption because if one block is lost, the other blocks are lost too.
The output file is corrupted, and when I compare the two files with Beyond Compare, there are chunks of the file which match and chunks which don't. What am I doing wrong?
You're trying to use character string encryption (encryptStringENC(), decryptStringENC()) for what is, at least partly, a binary file.
This worked for me:
my $input = "sample.pdf";
# create file handle for the pdf file
open my $fh, '<', $input or die $!;
binmode $fh;
# the output should be sample.pdf.enc.dec
open my $ffh, '>', "$input.enc.dec" or die $!;
binmode $ffh;
my $inData = chilkat::CkByteData->new;
my $encData = chilkat::CkByteData->new;
my $outData = chilkat::CkByteData->new;
# read 16 bytes at a time
while ( my $len = read( $fh, my $block, 16 ) ) {
$inData->clear;
$inData->append2( $block, $len );
$crypt->EncryptBytes( $inData, $encData );
$crypt->DecryptBytes( $encData, $outData );
print $ffh $outData->getData;
}
close $fh;
close $ffh;
You're likely better off perusing the Chilkat site further though; there is sample code for binary data.
I'm going to write and post a link to a sample that is much better than the examples posted here. The examples posted here are not quite correct. There are two important Chilkat Crypt2 properties that one needs to be aware of: FirstChunk and LastChunk. By default, both of these properties are true (or the value 1 in Perl). This means that for a given call to encrypt/decrypt, such as EncryptBytes, DecryptBytes, etc. it assumes the entire amount of data was passed. For CBC mode, this is important because the IV is used for the first chunk, and for the last chunk, the output is padded to the block size of the algorithm according to the value of the PaddingScheme property.
One can instead feed the input data to the encryptor chunk-by-chunk by doing the following:
For the 1st chunk, set FirstChunk=1, LastChunk=0.
For middle chunks, set FirstChunk=0, LastChunk=0.
For the final chunk (even if a 0-byte final chunk), set FirstChunk=0, LastChunk=1. This causes a final padded output block to be emitted.
When passing chunks using FirstChunk/LastChunk, one doesn't need to worry about passing chunks matching the block size of the algorithm. If a partial block is passed in, or if the bytes are not an exact multiple of the block size (16 bytes for AES), then Chilkat will buffer the input and the partial block will be added to the data passed in the next chunk. For example:
FirstChunk=1, LastChunk=0, pass in 23 bytes, output is 16 bytes, 7 bytes buffered.
FirstChunk=0, LastChunk=0, pass in 23 bytes, output is 16 bytes, 14 bytes buffered (46 - 32).
FirstChunk=0, LastChunk=1, pass in 5 bytes, output is 32 bytes (14 buffered bytes + 5 more = 19 bytes; the 19 bytes are one full block of 16 bytes plus a 3-byte remainder, which is padded to 16, so the output is 32 bytes and the CBC stream is ended).
This example demonstrates using FirstChunk/LastChunk. Here's the example: https://www.example-code.com/perl/encrypt_file_chunks_cbc.asp
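The byte accounting in the three steps above can be reproduced with a few lines of plain Perl. This is only a simulation of the sizes, not a call into Chilkat, and it assumes a padding scheme that always adds between 1 and 16 bytes to the final chunk (as PKCS#7-style padding does); chunk_output_sizes is a hypothetical helper.

```perl
use strict;
use warnings;

my $BLOCK = 16;    # AES block size in bytes

# Hypothetical helper: given input chunk sizes, return the output size
# produced for each chunk when partial blocks are buffered.
sub chunk_output_sizes {
    my @chunks   = @_;
    my $buffered = 0;
    my @out;
    for my $i ( 0 .. $#chunks ) {
        my $avail = $buffered + $chunks[$i];
        if ( $i == $#chunks ) {
            # Last chunk. ASSUMPTION: padding always adds 1..16 bytes.
            push @out, ( int( $avail / $BLOCK ) + 1 ) * $BLOCK;
            $buffered = 0;
        }
        else {
            my $full = $avail - $avail % $BLOCK;    # whole blocks only
            push @out, $full;
            $buffered = $avail - $full;             # carried to next chunk
        }
    }
    return @out;
}

print join( ', ', chunk_output_sizes( 23, 23, 5 ) ), "\n";    # 16, 16, 32
```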

How do I convert little Endian to Big Endian using a Perl Script?

I am using the Perl Win32::SerialPort module. With this particular module I sent over data using the input command. The data that I sent over to an embedded system was scalar data (numbers) using the transmit_char function (if it were C it would be integers, but since it's a scripting language I am not sure what the internal format is in Perl. My guess is that Perl always stores all numbers as 32-bit floating points, which are adjusted by the module when transmitting).
Then after sending the data I receive data using the input command. The data that I receive is probably in binary form, but Perl doesn't know how to interpret it. I use the unpack function like this:
my $binData = $PortObj->input;
my $hexData = unpack("H*",$binData);
Suppose I transmit 0x4294 over the serial cable, which is a command on the embedded system that I am communicating with, and I expect a response of 0x5245. Now the problem is with the endianness: when I unpack I get 0x4552, which is wrong. Is there a way to correct that by adjusting the binary data? I also tried h*, which gives me 0x5425, which is also not correct.
Note: the data I receive is sent over Byte at a time and the LSB is sent first
Endianness applies to the ordering of bytes of an integer (primarily). You need to know the size of the integer.
Example for 32-bit unsigned:
my $bytes = pack('H*', '1122334455667788');
my @n = unpack('N*', $bytes);
# @n = ( 0x11223344, 0x55667788 );
my $bytes = pack('H*', '4433221188776655');
my @n = unpack('V*', $bytes);
# @n = ( 0x11223344, 0x55667788 );
See pack. Note the "<" and ">" modifiers, which control the endianness of instructions whose default endianness is not the one you want.
Note: If you're reading from the file, you already have bytes. Don't create bytes using pack 'H*'.
Note: If you're reading from the file, don't forget to binmode the handle.
Regarding the example the OP added to his post:
To get 0x5245 from "\x45\x52", use unpack("v", $two_bytes).
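A minimal sketch comparing the two 16-bit byte orders on the bytes from the question, so the effect of "v" (little-endian) versus "n" (big-endian) is visible side by side:

```perl
use strict;
use warnings;

# The two bytes as received: least significant byte first.
my $two_bytes = "\x45\x52";

my $little = unpack 'v', $two_bytes;    # 16-bit little-endian
my $big    = unpack 'n', $two_bytes;    # 16-bit big-endian

printf "v: 0x%04X\n", $little;    # v: 0x5245
printf "n: 0x%04X\n", $big;       # n: 0x4552
```

The "n" result matches what unpack("H*", ...) showed in the question, because H* simply prints the bytes in the order they arrived.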
What sort of data types are these? Perl's pack has the N and V format specifiers for integers, and Perl 5.10 added the > and < modifiers so you can read shorts, floats, doubles, and quads (and some other types) in the endianness you want.
With these, you read the data in the endianness it uses in the input. After you do that, you have the data internally represented as the number you expect, and you can re-pack it any way that you like.
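As a sketch of that read-then-repack idea, the example below uses the 32-bit L format with the > and < modifiers (L is used instead of Q so it also runs on a Perl built without 64-bit integers). The input bytes are invented.

```perl
use strict;
use warnings;

# Four bytes as they might arrive on the wire (invented data).
my $wire = "\xDE\xAD\xBE\xEF";

my $as_big    = unpack 'L>', $wire;    # read as big-endian
my $as_little = unpack 'L<', $wire;    # read as little-endian

printf "L>: 0x%08X\n", $as_big;       # L>: 0xDEADBEEF
printf "L<: 0x%08X\n", $as_little;    # L<: 0xEFBEADDE

# Re-pack the big-endian value in little-endian byte order.
my $swapped = pack 'L<', $as_big;
print unpack( 'H*', $swapped ), "\n";    # efbeadde
```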
For example, the Q format doesn't have an endianness partner like the pair N and V. I'm always going to get the architecture's interpretation of the octet sequence:
my @octets = ( 0x19, 0x36 );
my $bom = pack 'C*', @octets;
my ( $short ) = unpack 'S', $bom;
my $last = $short & 0x00FF;
my $first = ( $short & 0xFF00 ) >> 8;
printf "SHORT: %x FIRST: %x LAST: %x\n", $short, $first, $last;
my $quad_format = $first == $octets[0] ? 'Q' : 'Q>';
say "QUAD_FORMAT: $quad_format";
my $data = pack 'C*', 0b11011110, 0xAD, 0xBE, 0xEF, 0xAA, 0xBB, 0xCC, 0xDD;
my $q = unpack $quad_format, $data;
printf "$quad_format: %x\n", $q;
The output shows that the packed value 0x1936 comes back as 0x3619 with the plain S format. That means that this was run on a little-endian architecture. The same thing will happen with a quad value, so I want to read the quad and tell Perl to interpret it as big-endian (the "big" part of '>' touches the Q) to get the expected internal numerical value:
SHORT: 3619 FIRST: 36 LAST: 19
QUAD_FORMAT: Q>
Q>: deadbeefaabbccdd
I write more about this in Use the > and < pack modifiers to specify the architecture.