unpacking a data structure whose first byte indicates length - perl

I am trying to unpack a TLE (Tagged Logical Element) from an IBM AFP format file.
The specification (http://www.afpcinc.org/wp-content/uploads/2017/12/MODCA-Reference-09.pdf) indicates that these are two triplets (even though there are four values) that are structured as follows (with their byte offsets):
0: Tlength | 1: Tid | 2-n: Parameter (= 2: Type + 3: Format + 4-n: EBCDIC encoded String)
Example (with two triplets, one indicating the name and one the value):
0C 02 0B 00 C3 A4 99 99 85 95 83 A8 07 36 00 00 C5 E4 D9
12 KEY UID CHAR C u r r e n c y 7 VAL RESERVED E U R
I use Perl to parse it as follows (and successfully):
if ($key eq 'Data') {
my $tle = $member->{struct}->{$key};
my $k_length = hex(unpack('H2', substr($tle, 0, 1)));
my $key = decode('cp500', substr($tle, 4, $k_length - 4));
my $v_length = hex(unpack('H2', substr($tle, $k_length, 1)));
my $value = decode('cp500', substr($tle, $k_length + 4, $v_length - 4));
print("'$key' => '$value'\n");
}
Result:
'Currency' => 'EUR'
While the above is successful, I feel that my way is a bit too cpmplicated and that there's a more efficient way to do this. E.g. do pack templates support reading the first n bytes to use as a quantifier for how many successive bytes to unpack? I read the Perl pack tutorial but can't seem to find something along those lines.

If the length field didn't include itself, you could do something like the following:
(my $record, $unparsed) = unpack("C/a a*", $unparsed);
my $key = decode("cp500", unpack("x3 a*", $record));
But the length field includes itself.
(my $length, $unparsed) = unpack("C a*", $unparsed);
(my $record, $unparsed) = unpack("a".($length-1)." a*", $unparsed);
my $key = decode("cp500", unpack("x3 a*", $record));

Please see if following demo code fulfill your requirements.
This code
defines hash decoder subroutines
reads hex representation of bytes provided by OP from DATA block
converts read data into binary representation $data utilizing pack
extracts length, key/tid, type by utilizing unpack
call decoder subroutine for this particular type
gets back hash consisting two arrays keys and vals
forms new hash %data with provided keys and vals
outputs keys and values (returned keys are used to preserve byte/field order)
NOTE: Encode 'from_to' is utilized to decode EBCDIC -- alternative
use strict;
use warnings;
use feature 'say';
use utf8;
use Encode 'from_to';
my $debug = 1;
my %decoder = (
1 => \&decode_type1,
2 => \&decode_currency,
3 => \&decode_type3,
4 => \&decode_type4,
5 => \&decode_type5
);
my $bytes = read_bytes();
my($len,$key,$type) = unpack('C3',$bytes);
my $data = $decoder{$type}($bytes);
my %data;
#data{#{$data->{keys}}} = #{$data->{vals}};
say '
Unpacked data
---------------';
printf "%-8s => %s\n", $_, $data{$_} for #{$data->{keys}};
sub read_bytes {
my $hex_bytes = <DATA>;
chomp $hex_bytes;
my $bytes = pack('H*',$hex_bytes);
return $bytes;
}
sub show_bytes {
my $data = shift;
print "Bytes: ";
printf "%02X ", $_ for unpack 'C*', $data;
print "\n";
}
sub decode_type1 {
my $bytes = shift;
return { keys => 'type1', vals => 'vals1' };
}
sub decode_currency {
my $bytes = shift;
show_bytes($bytes) if $debug;
my #keys = qw/length_1 key uid char data_1 length_2 val reserved data_2/;
my #vals = unpack('C4A8C2SA3',$bytes);
from_to($vals[4], 'cp37', 'latin1');
from_to($vals[8], 'cp37', 'latin1');
return { keys => \#keys, vals => \#vals};
}
sub decode_type3 {
my $bytes = shift;
return { keys => 'type3', vals => 'vals3' };
}
sub decode_type4 {
my $bytes = shift;
return { keys => 'type4', vals => 'vals4' };
}
sub decode_type5 {
my $bytes = shift;
return { keys => 'type5', vals => 'vals5' };
}
__DATA__
0C020B00C3A49999859583A807360000C5E4D9
Output
Bytes: 0C 02 0B 00 C3 A4 99 99 85 95 83 A8 07 36 00 00 C5 E4 D9
Unpacked data
---------------
length_1 => 12
key => 2
uid => 11
char => 0
data_1 => Currency
length_2 => 7
val => 54
reserved => 0
data_2 => EUR
Note:
It looks suspicious that val occupies one byte only what gives range of 0..255 for an amount in Euro. Perhaps reserved bytes might be a part of the val amount of Euro.

Related

Merge Fasta and Qual files with different headers order in FASTQ

I am trying to merge a fasta file and a qual file in a new fastq file having in mind the case that the two files might be provided with different order in their sequence IDs. To do that, I tried the first step of my script to be the sorting of the sequences which works perfectly when I test it as a separate script. The same with the rest, when I run separately the part where it combines the files in a fastq, it runs perfectly. But now that I am trying to combine the two methods in one script it doesn't work and I don't know what else to do! I would appreciate it if you can help me.
Here is my script as far. It creates the new fastq file but the content is messed up and not what I want. I run it from terminal like this:
$ perl script.pl reads.fasta reads.qual > reads.fq
Script :
#!/usr/bin/env perl
use strict;
use warnings;
die ("Usage: script.pl reads.fasta reads.qual > reads.fq") unless (scalar #ARGV) == 2;
open FASTA, $ARGV[0] or die "cannot open fasta: $!\n";
open QUAL, $ARGV[1] or die "cannot open qual: $!\n";
my $offset = 33;
my $count = 0;
local($/) = "\n>";
my %id2seq = ();
my $id = '';
my %idq2seq = ();
my $idq = '';
my (#sort_q, #sort_f);
while(<FASTA>){
chomp;
if($_ =~ /^>(.+)/){
$id = $1;
}else{
$id2seq{$id} .= $_;
}
}
for $id (sort keys %id2seq)
{
#sort_f = "$id\n$id2seq{$id}\n\n";
print #sort_f;
}
while(<QUAL>){
chomp;
if($_ =~ /^>(.+)/){
$idq = $1;
}else{
$idq2seq{$idq} .= $_;
}
}
for $idq (sort keys %idq2seq)
{
#sort_q = "$idq\n$idq2seq{$idq}\n\n";
print "#sort_q";
}
while (my #sort_f) {
chomp #sort_f;
my ($fid, #seq) = split "\n", #sort_f;
my $seq = join "", #seq; $seq =~ s/\s//g;
my $sortq = #sort_q;
chomp my #sortq;
my ($qid, #qual) = split "\n", #sortq;
#qual = split /\s+/, (join( " ", #qual));
# convert score to character code:
my #qual2 = map {chr($_+$offset)} #qual;
my $quals = join "", #qual2; `enter code here`
die "missmatch of fasta and qual: '$fid' ne '$qid'" if $fid ne $qid;
$fid =~ s/^\>//;
print STDOUT (join( "\n", "#".$fid, $seq, "+$fid", $quals), "\n");
$count++;
}
close FASTA;
close QUAL;
print STDERR "wrote $count entries\n";
Thank you in advance
It's been a while since I have used perl, but I would approach this using a hash of key/value pairs for both the fasta and quality input. Then write out all the pairs by looping over the fasta hash and pulling out the corresponding quality string.
I have written something in python that will do what you need, you can see it in action here:
It assumes that your input looks like this:
reads.fasta
>fa_0
GCAGCCTGGGACCCCTGTTGT
>fa_1
CCCACAAATCGCAGACACTGGTCGG
reads.qual
>fa_0
59 37 38 51 56 55 60 44 43 42 56 65 60 68 52 67 43 72 59 65 69
>fa_1
36 37 47 72 34 53 67 41 70 67 66 51 47 41 73 58 75 36 61 48 70 55 46 42 42
output
#fa_0
GCAGCCTGGGACCCCTGTTGT
+
;%&387<,+*8A<D4C+H;AE
#fa_1
CCCACAAATCGCAGACACTGGTCGG
+
$%/H"5C)FCB3/)I:K$=0F7.**
#fa_2
TCGTACAGCAGCCATTTTCATAACCGAACATGACTC
+
C?&93A#:?#F,2:'KF*20CC:I7F9J.,:E8&?F
import sys
# Check there are enough arguments
if len(sys.argv) < 3:
print('Usage: {s} reads.fasta reads.qual > reads.fq'.format(s=sys.argv[0]), file=sys.stderr)
sys.exit(1)
# Initalise dictionaries for reads and qualities
read_dict = dict()
qual_dict = dict()
fa_input = sys.argv[1]
qual_input = sys.argv[2]
# Read in fasta input
with open(fa_input, 'r') as fa:
for line in fa:
line = line.strip()
if line[0] == '>':
read_dict[line[1:]] = next(fa).strip()
else:
next(fa)
# Read in quality input
with open(qual_input, 'r') as qual:
for line in qual:
line = line.strip()
if line[0] == '>':
qual_dict[line[1:]] = next(qual).strip()
else:
next(qual)
count = 0
# Iterate over all fasta reads
for key, seq in read_dict.items():
# Check if read header is in the qualities data
if key in qual_dict.keys():
# There's both sequence and quality data so write stdout
read_str = '#{header}\n{seq}\n+\n{qual}'.format(
header=key,
seq=seq,
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
print(read_str, file=sys.stdout)
count += 1
else: # not found
# Write error to stderr
print('Error: {k} not found in qual file'.format(k=key), file=sys.stderr)
# Print count to stderr
print('{c} reads written'.format(c=count), file=sys.stderr)
If you need to use an offset for the quality score edit
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')])) to
qual=''.join([chr(int(x) + offset) for x in qual_dict[key].split(' ')])) and define an offset variable before this.

Perl: Generating an Array of hashes from a file

I'm trying to create an array of hashes that contains student names as the keys and multiple grades as the values for each student so that I can compute the average for each student, sort the averages in descending order, and print the "lastname, firstname: grade average" of each student in the sorted order.
The issue i'm having is with the generation of the array of hashes which I'm 90% sure the problem lies in the split on line 10 but I can't seem to find the solution myself.
students.txt:
chipper jones 29 80 70
hank aaron 99 85 81 75
beth allen 64 84 71 5x9 38 68 53
andruw jones 100 100 100 100 100
ty cobb 75 75 100
code:
#!/usr/bin/perl
#Program 5
my #Aoh;
open (FILEIN, "$ARGV[0]");
while(<FILEIN>) {
chomp;
push #Aoh, { split / / };
for $i (0 .. $#Aoh) {
print "{ ";
for $role (keys %{ $Aoh[$i] }) {
print " $role $Aoh[$i]{$role} ";
}
print "}\n";
}
}
output I'm getting:
0 is { chipper=jones 70= 29=80 }
1 is { 81=75 hank=aaron 99=85 }
2 is { 38=68 53= beth=allen 64=84 71=5x9 }
3 is { 100= andruw=jones }
4 is { ty=cobb 75=75 100= }
push #Aoh, { split / / };
Creates the following hash from the first line:
{ chipper => 'jones',
29 => 80,
70 => undef,
}
That's not what you wanted, right?
I'd use a hash of numbers instead of the array of hashes. You can use "lastname, firstname" directly as the hash key and you can store the averages directly as the values:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ sum };
my %average;
while (<>) {
my ($name, $surname, #grades) = split;
$average{"$surname, $name"} = sum(#grades) / #grades;
}
for my $student (sort { $average{$a} <=> $average{$b} } keys %average) {
print $student, ' ', $average{$student}, "\n";
}
Note that I'm getting a warning:
Argument "5x9" isn't numeric in subroutine entry at ./1.pl line 11, <> line 3.
How should one treat the 5x9 grade?

Perl: pack int to arbitrary length byte string

I want to encode numbers in N bit containers and send them in a UDP packet. A receiver will know N, and the receiver will grab a number from exactly N bits.(N <= 64)
Somethink like this:
sub to56BIT {
return pack("??", shift);
}
sub to24BIT {
return pack("??", shift);
}
my $n = 7;
to24BIT($n);
On the receiver's side:
int n = Get_val24(byte_stream, offset);
Is there any way to do this in Perl?
I think solution might be:
sub packIntN {
my $int = shift;
my $length = shift;
return pack("B" . $length, substr(unpack("B64", pack("Q>", $int)), 64 - $length));
}
But maybe there is more elegant way.
Input/Output example:
We have a script test.pl:
use strict;
use warnings;
sub to24BIT {
#???
}
my $n = 7;
print to24BIT($n);
I want this:
./test.pl | hexdump -C
00000000 00 00 07 |...|
00000003
Another script test2.pl:
use strict;
use warnings;
sub to40BIT {
#???
}
my $n = 7;
print to40BIT($n);
I want this:
./test.pl | hexdump -C
00000000 00 00 00 00 07 |.....|
00000005
Is N always going to be an integer factor of 8 (one of 8, 16, 24, 32, 40, 48, 56, 64)? If so, for speed, I recommend writing a packer for each size and use a dispatch table to find the right packer.
sub pack_8bit { pack('C', $_[0]) }
sub pack_16bit { pack('S>', $_[0]) }
sub pack_24bit { substr(pack('L>', $_[0]), 1) }
sub pack_32bit { pack('L>', $_[0]) }
sub pack_40bit { substr(pack('Q>', $_[0]), 3) }
sub pack_48bit { substr(pack('Q>', $_[0]), 2) }
sub pack_56bit { substr(pack('Q>', $_[0]), 1) }
sub pack_64bit { pack('Q>', $_[0]) }
{
my %packers = (
8 => \&pack_8bit, 40 => \&pack_40bit,
16 => \&pack_16bit, 48 => \&pack_48bit,
24 => \&pack_24bit, 56 => \&pack_56bit,
32 => \&pack_32bit, 64 => \&pack_64bit,
);
sub pack_num {
my $packer = $packers{$_[0]}
or die;
return $packer->($_[1]);
}
sub get_packer {
my $packer = $packers{$_[0]}
or die;
return $packer;
}
}
my $packed = pack_num(40, 7);
-or-
my $packer = get_packer(40);
my $packed = $packer->(7);
If you're planning on packing multiple numbers into one string (like pack('L>*', #nums)), I'd also use a dispatch table like this, though I'm not sure what would be the fastest implementation of pack_24bit, pack_40bit, pack_48bit and pack_56bit (other than a C solution).
Bearing in mind that you will always have a whole
number of bytes, I ended up with
substr(pack("Q>",$n<<(64-$len)),0,($len+7)/8);
and
unpack("Q>",$s.(0 x 8)) >> (64-$len);
as tried in this example:
#!/usr/bin/perl
$len = 40;
$n = 7;
$s = substr(pack("Q>",$n<<(64-$len)),0,($len+7)/8);
open(PIPE,"| hexdump -C");
print PIPE $s;
close PIPE;
$v = unpack("Q>",$s.(0 x 8)) >> (64-$len);
printf "%d\n",$v;

Backslash before a subroutine call

As I was understanding the difference between [] and \ in references,I used both on subroutine the former was fine but when I tried later I thought it should give error but the below program in perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #b;
for my $i ( 0 .. 10 ) {
$b[$i] = \somefunc($i);
}
print Dumper( \#b );
sub somefunc {
my $n = shift;
my ( #a, $k );
for my $j ( 11 .. 13 ) {
$k = $n * $j;
push( #a, $k );
}
print "a: #a \n";
return #a;
}
gives output as :
a: 0 0 0
a: 11 12 13
a: 22 24 26
a: 33 36 39
a: 44 48 52
a: 55 60 65
a: 66 72 78
a: 77 84 91
a: 88 96 104
a: 99 108 117
a: 110 120 130
$VAR1 = [
\0,
\13,
\26,
\39,
\52,
\65,
\78,
\91,
\104,
\117,
\130
];
I was unable to understand the output.Need explanation.
What is happening here is:
You are returning an array from somefunc.
But you are assigning it to a scalar. What this is effectively doing therefore, is simply putting the last value in the array, into the scalar value.
my $value = ( 110, 120, 130 );
print $value;
When you do this - $value is set to the last value in the array. This is what's actually happening in your code. See for example perldata:
List values are denoted by separating individual values by commas (and enclosing the list in parentheses where precedence requires it):
(LIST)
In a context not requiring a list value, the value of what appears to be a list literal is simply the value of the final element, as with the C comma operator. For example,
#foo = ('cc', '-E', $bar);
assigns the entire list value to array #foo, but
foo = ('cc', '-E', $bar);
assigns the value of variable $bar to the scalar variable $foo. Note that the value of an actual array in scalar context is the length of the array; the following assigns the value 3 to $foo:
#foo = ('cc', '-E', $bar);
$foo = #foo; # $foo gets 3
It's this latter case that's often the gotcha, because it's a list in a scalar context.
And in your example - the backslash prefix denotes 'reference to' - which is largely meaningless because it's a reference to a number.
But for a scalar, it might be more meaningful:
my $newvalue = "fish";
my $value = ( 110, 120, 130, \$newvalue );
print Dumper $value;
$newvalue = 'barg';
print Dumper $value;
Gives:
$VAR1 = \'fish';
$VAR1 = \'barg';
That's why you're getting the results. Prefix with the slash indicates that you're getting a reference to the result, not a reference to the sub. Reference to 130 isn't actually all that meaningful.
Normally, when doing the assignment above - you'd get a warning about Useless use of a constant (110) in void context but this doesn't apply when you've got a subroutine return.
If you wanted to insert a sub reference, you'd need to add &, but if you just want to insert the returned array by reference - you either need to:
$b[$i] = [somefunc($i)]
Or:
return \#a;

How do I parse MS-DOS time in perl?

I'm reading a binary file using perl. In the file's headers, there's 4 bytes that represents MS-DOS time. How do I read this time? I'm not familiar with this format.
I've found this for reference: http://www.vsft.com/hal/dostime.htm but I'm still not sure how to read it.
Another approach:
sub mst {
my $msdos_time = shift;
my #t = map { ord }
map { pack("b*", $_) }
map { reverse($_) }
unpack("A5 A6 A5 A5 A4 A7", unpack("b*", $msdos_time));
my %d;
#d{seconds,minutes,hours,day,month,year} = #t;
$d{seconds} *= 2;
$d{year} += 1980;
return \%d;
}
This will work if $msdos_time is represented in little-endian format which (I believe) is how it would be laid out in memory.
(Clearly the chained map-s could be coalesced - I wrote it this way to make it easier to see what was going on.)
Example:
print Dumper(mst("\x22\x12\x01\x41"));
# byte 0 | byte 1 | byte 2 | byte 3
# 76543210 | 76543210 | 76543210 | 76543210
# S...s seconds
# ..m M.. minutes
# H...h hours
# D...d day
# ..m M month
# Y.....y year
# 00100010 | 00010010 | 00000001 | 01000001
$VAR1 = {
'seconds' => 4,
'hours' => 2,
'month' => 8,
'day' => 1,
'minutes' => 17,
'year' => 2012
};
You can't use pack because it always wants to start on a byte boundary. Some of these values go across byte boundaries too, so you don't want to deal with individual bytes (although words would work). It's easier to just to mask and shift.
In this example, I set up the masks so I don't have to think too hard about it, then use those to grab the values out of the string. I don't really know anything about the DOS time format, but from what I've read, you have to multiply the seconds by 2 (notice it's only five bits):
use 5.010;
use strict;
use warnings;
use Data::Dumper;
# seconds minutes hours day month years from 1980
# 5 bits 6 5 5 4 7
my $datetime = 0b11011_000011_11111_01100_1011_0001000;
my $parsed = parse( $datetime );
print Dumper( $parsed );
sub parse {
my( $datetime ) = #_;
state $masks = make_masks();
my %this = map {
$_, ( $datetime & $masks->{$_}[0] ) >> $masks->{$_}[1]
} keys %$masks;
$this{seconds} *= 2;
$this{years} += 1980;
return \%this;
}
sub make_masks {
my %masks = (
seconds => [ 0b11111, 27 ],
minutes => [ 0b111111, 21 ],
hours => [ 0b11111, 16 ],
day => [ 0b11111, 11 ],
month => [ 0b1111, 7 ],
years => [ 0b1111111, 0 ],
);
foreach my $key ( sort { $masks{$a}[1] <=> $masks{$b}[1] } keys %masks ) {
$masks{$key}[0] <<= $masks{$key}[1];
}
return \%masks;
}
My output is just a hash:
$VAR1 = {
'seconds' => 54,
'hours' => 31,
'years' => 1988,
'month' => 11,
'minutes' => 3,
'day' => 12
};