Perl - binary unpack using pointer and index - perl

I have a binary file that contain 3 files, a PNG, a PHP and a TGA file.
Here the file to give you the idea : container.bin
the file is build this way:
first 6 bytes are a pointer to the index, in this case 211794
Then you have all 3 files stacked one after the other
and at the ofset 211794, you have the index, that tell you where the file start and end
in this example you have:
[offset start] [offset end] [random data] [offset start] [name]
6 15149 asdf 6 Capture.PNG
15149 15168 4584 15149 index.php
15168 211794 12 15168 untilted.tga
meaning that capture.png start at offset 6, finish at offset 15149, then asdf is a random data, and the start offset is repeated again.
Now what I want to do is a perl to separate the file on this binary files.
The perl need to check the first 6 offset of the file (header), then jump to the index location, and use the list to extract the file out.

A mix of seek and read can be used to achieve the task:
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl 'SEEK_SET';
sub get_files_info {
my ( $fh, $offset ) = #_;
my %file;
while (<$fh>) {
chomp;
my $split_count = my ( $offset_start, $offset_end, $random_data, $offset_start_copy,
$file_name ) = split /\s/;
next if $split_count != 5;
if ( $offset_start != $offset_start_copy ) {
warn "Start of offset mismatch: $file_name\n";
next;
}
$file{$file_name} = {
'offset_start' => $offset_start,
'offset_end' => $offset_end,
'random_data' => $random_data,
};
}
return %file;
}
sub write_file {
my ( $fh, $file_name, $file_info ) = #_;
seek $fh, $file_info->{'offset_start'}, SEEK_SET;
read $fh, my $contents,
$file_info->{'offset_end'} - $file_info->{'offset_start'};
open my $fh_out, '>', $file_name or die 'Error opening file: $!';
binmode $fh_out;
print $fh_out $contents;
print "Wrote file: $file_name\n";
}
open my $fh, '<', 'container.bin' or die "Error opening file: $!";
binmode $fh;
read $fh, my $offset, 6;
seek $fh, $offset, SEEK_SET;
my %file = get_files_info $fh, $offset;
for my $file_name ( keys %file ) {
write_file $fh, $file_name, $file{$file_name};
}

The only real difficulty here is to make sure that both input and output files are read in binary mode. This can be achieved by using the :raw PerlIO layer when the files are opened.
This program seems to do what you want. It first locates and reads the index block into a string, and then opens that string for input and reads the start and end position and name of each of the constituent files. Thereafter processing each file is simple.
Be aware that unless the formatting of the index block is more strict than you say, you can rely only on the first, second, and last whitespace-separated fields on each line since random text could contain spaces. There is also no way to specify a file name containing spaces.
The output, using Data::Dump, is there to demonstrate correct functionality and is not necessary for the functioning of the program.
use v5.10;
use warnings;
use Fcntl ':seek';
use autodie qw/ open read seek close /;
open my $fh, '<:raw', 'container.bin';
read $fh, my $index_loc, 6;
seek $fh, $index_loc, SEEK_SET;
read $fh, my ($index), 1024;
my %contents;
open my $idx, '<', \$index;
while (<$idx>) {
my #fields = split;
next unless #fields;
$contents{$fields[-1]} = [ $fields[0], $fields[1] ];
}
use Data::Dump;
dd \%contents;
for my $file (keys %contents) {
my ($start, $end) = #{ $contents{$file} };
my $size = $end - $start;
seek $fh, $start, SEEK_SET;
my $nbytes = read $fh, my ($data), $size;
die "Premature EOF" unless $nbytes == $size;
open my $out, '>:raw', $file;
print { $out } $data;
close $out;
}
output
{
"Capture.PNG" => [6, 15149],
"index.php" => [15149, 15168],
"untilted.tga" => [15168, 211794],
}

Related

Trying to improve Encode::decode warning message: Segfault in $SIG{__WARN__} handler

I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the origial message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather like to know which file is giving the problem.
I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:
#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;
use Encode ();
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $fn = 'test.txt';
write_test_file( $fn );
# Try to improve the Encode::FB_WARN fallback warning message :
#
# utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
# utf8 "\xE5" does not map to Unicode at line xx of file <filename>.
my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) };
$str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";
sub my_warn_handler {
my ( $fn, $msg ) = #_;
if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
recover_line_number_and_char_pos( $fn, $msg );
}
else {
warn $msg;
}
}
sub recover_line_number_and_char_pos {
my ( $fn, $err_msg ) = #_;
chomp $err_msg;
$err_msg =~ s/(line \d+)\.$/$1/; # Remove period at end of sentence.
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s;
my $line_no = $str =~ tr/\n//;
++$line_no;
my $pos = ( length $last_line ) + 1;
warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}
sub write_test_file {
my ( $fn ) = #_;
my $bytes = "Hello\nA\x{E5}\x{61}"; # 2 lines ending in iso 8859-1: åa
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
}
Output:
utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)
Here is another way to locate where the warning fires, with un-buffered sysread
use warnings;
use strict;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";
$SIG{__WARN__} = sub { print "\t==> WARN: #_" };
my $char_cnt = 0;
my $char;
while (sysread($fh, $char, 1)) {
++$char_cnt;
print "$char ($char_cnt)\n";
}
The file test.txt was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked with $char =~ /\n/.
The sysread returns undef on error. It can be moved into the body of while (1) to allow reads to continue and catch all warnings, breaking out on 0 (returned on EOF).
This prints
H (1)
e (2)
l (3)
l (4)
o (5)
(6)
A (7)
å (8)
a (9)
==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
(10)
While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.
However, Perl is utf8 internally and I am not sure that sysread needs Encode.
Note. The page for sysread supports its use on data with encoding layers
Note that if the filehandle has been marked as :utf8 Unicode
characters are read instead of bytes (the LENGTH, OFFSET, and the
return value of sysread are in Unicode characters). The
:encoding(...) layer implicitly introduces the :utf8 layer.
See binmode, open, and the open pragma.
Note Apparently, things have moved on and after a certain version sysread does not support encoding layers. The link above, while for an older version (v5.10 for one) indeed shows what is quoted, with a newer version tells us that there'll be an exception.

How to compare two text files and removing the matching contents and pass to output in perl?

I have two text files text1.txt and text2.txt like below
text1
ac
abc
abcd
abcde
text2
ab
abc
acd
abcd
output
ac
abcde
I need to compare the two files and remove the content from text1 when there is a match in the second file.
I want the code in Perl. Currently I am trying the below code.
#!usr/bin/perl
use strict;
use warnings;
open (GEN, "text1.txt") || die ("cannot open general.txt");
open (SEA, "text2.txt") || die ("cannot open search.txt");
open (OUT,">> output.txt") || die ("cannot open intflist.txt");
open (LOG, ">> logfile.txt");
undef $/;
foreach (<GEN>) {
my $gen = $_;
chomp ($gen);
print LOG $gen;
foreach (<SEA>) {
my $sea = $_;
chomp($sea);
print LOG $sea;
if($gen ne $sea) {
print OUT $gen;
}
}
}
In this I am getting all content from text1, not the unmatched content. Please help me out.
I think you should read the text2 in an array and then in the second foreach on that array use the array.
#b = <SEA>;
Or else in the second loop the file pointer would be at the end already
One way:
#!/usr/bin/perl
use strict;
use warnings;
$\="\n";
open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $out, '>', 'file3' or die $!;
chomp(my #arr1=<$fh1>);
chomp(my #arr2=<$fh2>);
foreach my $x (#arr1){
print $out $x if (!grep (/^\Q$x\E$/,#arr2));
}
close $fh1;
close $fh2;
close $out;
After executing the above, the file 'file3' contains:
$ cat file3
ac
abcde
This is my plan:
Read the contents of first file in a hash, with a counter of occurrences. For example, working with your data you get:
%lines = ( 'ac' => 1,
'abc' => 1,
'abcd' => 1,
'abcde' => 1);
Read the second file, deleting the previous hash %lines if key exists.
Print the keys %lines to the desired file.
Example:
use strict;
open my $fh1, '<', 'text1' or die $!;
open my $fh2, '<', 'text2' or die $!;
open my $out, '>', 'output' or die $!;
my %lines = ();
while( my $key = <$fh1> ) {
chomp $key;
$lines{$key} = 1;
}
while( my $key = <$fh2> ) {
chomp $key;
delete $lines{$key};
}
foreach my $key(keys %lines){
print $out $key, "\n";
}
close $fh1;
close $fh2;
close $out;
Your main problem is that you have undefined the input record separator $/. That means the whole file will be read as a single string, and all you can do is say that the two files are different.
Remove undef $/ and things will work a whole lot better. However the inner for loop will read and print all the lines in file2 that don't match the first line of file1. The second time this loop is encountered all the data has been read from the file so the body of the loop won't be executed at all. You must either open file2 inside the outer loop or read the file into an array and loop over that instead.
Then again, do you really want to print all lines from file2 that aren't equal to each line in file1?
Update
As I wrote in my comment, it sounds like you want to output the lines in text1 that don't appear anywhere in text2. That is easily achieved using a hash:
use strict;
use warnings;
my %exclude;
open my $fh, '<', 'text2.txt' or die $!;
while (<$fh>) {
chomp;
$exclude{$_}++;
}
open $fh, '<', 'text1.txt' or die $!;
while (<$fh>) {
chomp;
print "$_\n" unless $exclude{$_};
}
With the data you show in your question, that produces this output
ac
abcde
I would like to view your problem like this:
You have a set S of strings in file.txt.
You have a set F of forbidden strings in forbidden.txt.
You want the strings that are allowed, so S \ F (setminus).
There is a data structure in Perl that implements a set of strings: The hash. (It can also map to scalars, but that is secondary here).
So first we create the set of the lines we have. We let all the strings in that file map to undef, as we don't need that value:
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
my %Set = map {$_ => undef} <$FILE>;
We create the forbidden set the same way:
open my $FORBIDDEN, "<", "forbidden.txt" or die "Can't open forbidden.txt: $!";
my %Forbidden = map {$_ => undef} <$FORBIDDEN>;
The set minus works like either of these ways:
For each element x in S, x is in the result set R iff x isn't in F.
my %Result = map {$_ => $Set{$_}} grep {not exists $Forbidden{$_}} keys %Set;
The result set R initially is S. For each element in F, we delete that item from R:
my %Result = %Set; # make a copy
delete $Result{$_} for keys %Forbidden;
(the keys function accesses the elements in the set of strings)
We can then print out all the keys: print keys %Result.
But what if we want to preserve the order? Entries in a hash can also carry an associated value, so why not the line number? We create the set S like this:
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
my $line_no = 1;
my %Set = map {$_ => $line_no++} <$FILE>;
Now, this value is carried around with the string, and we can access it at the end. Specifically, we sort the keys in the hash after their line number:
my #sorted_keys = sort { $Result{$a} <=> $Result{$b} } keys %Result;
print #sorted_keys;
Note: All of this assumes that the files are terminated by newline. Else, you would have to chomp.

byte swap with perl

I am reading a file, containing integers using the "33441122" byte order. How can I convert the file to the "11223344" (big endian) byte order? I have tried a few things, but I am really lost.
I have read a lot about Perl, but when it comes to swapping bytes, I'm in the dark. How can I convert this:
33 44 11 22
into this:
11 22 33 44
using Perl.
Any input would be greatly appreciated :)
You can read 4 bytes at a time, split it into individual bytes, swap them and write them out again
#! /usr/bin/perl
use strict;
use warnings;
open(my $fin, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
binmode($fin);
open(my $fout, '>', $ARGV[1]) or die "Cannot create $ARGV[1]: $!";
binmode($fout);
my $hexin;
my $n;
while (($n = read($fin, $bytes_in, 4)) == 4) {
my #c = split('', $bytes_in);
my $bytes_out = join('', $c[2], $c[3], $c[0], $c[1]);
print $fout $bytes_out;
}
if ($n > 0) {
print $fout $bytes_in;
}
close($fout);
close($fin);
This will be called on the command line as
perl script.pl infile.bin outfile.bin
outfile.bin will be overwritten.
I think the best way is to read two bytes at a time and dwap them before outputting them.
This program creates a data file test.bin, and then reads it in, swapping the bytes as described.
use strict;
use warnings;
use autodie;
open my $fh, '>:raw', 'test.bin';
print $fh "\x34\x12\x78\x56";
open my $out, '>:raw', 'new.bin';
open $fh, '<:raw', 'test.bin';
while (my $n = read $fh, my $buff, 2) {
$buff = reverse $buff if $n == 2;
print $out $buff;
}

Why does this corrupt my yaml file?

When I run the below script I get a corrupt yaml file like so
---
1:
name1: abc
name2: abc
---
me3: abc
---
Question
Can anyone see that I am doing wrong?
#!/usr/bin/perl
use strict;
use YAML::Syck;
use Fcntl ':flock', 'SEEK_SET';
use warnings;
use Data::Dumper;
my $acc;
my $acc_fh;
$acc->{1}{name1} = "abc";
unlink 'test.yaml';
# write initial
open F, '>', 'test.yaml';
print F YAML::Syck::Dump($acc);
close F;
($acc, $acc_fh) = read_yaml_with_lock('test.yaml');
$acc->{1}{name2} = "abc";
$acc->{1}{name3} = "abc";
write_yaml_with_lock($acc, $acc_fh);
($acc, $acc_fh) = read_yaml_with_lock('test.yaml');
delete $acc->{1}{name3};
write_yaml_with_lock($acc, $acc_fh);
sub read_yaml_with_lock {
my ($file) = #_;
open my $fh, '+<', $file or die $!;
flock($fh, LOCK_EX) or die $!;
my $obj = YAML::Syck::LoadFile($fh); # this dies on failure
return ($obj, $fh);
}
sub write_yaml_with_lock {
my ($obj, $fh) = #_;
my $yaml = YAML::Syck::Dump($obj);
$YAML::Syck::ImplicitUnicode = 1;
seek $fh, 0, SEEK_SET; # seek back to the beginning of file
print $fh $yaml . "---\n";
close $fh;
}
You write to the same file twice. During the second time the YAML code you're writing is shorter than the first time because you delete that hash key inbetween the calls. However, you neither unlink the file after the first time nor do you truncate it after writing to it the second time. So what you see as corruption is the part of the file that has been written the first time but that hasn't been overwritten the second time.
The "me3" part is what is left of " name3", which gets partially overwritten by "---\n" (4 characters). When you write the first time, you have more data. Then you rewind the file handle position and write a shorter data, which does not overwrite all of the old.
I think your solution "should" be to skip this passing a file handle around and rewinding it and instead use the appropriate open for each subroutine. E.g.:
sub read_yaml {
my $file = shift;
open my $fh, '<', $file or die $!;
...
close $fh;
}
sub write_yaml {
my ($file, $obj) = #_;
open my $fh, '>', $file or die $!;
...
close $fh;
}
Keeping the file handle open in between operations not really that useful or efficient, and it introduces some difficulties.

Perl while loops and reading lines

Each record has 4 lines:
Like the following:
#NCYC361­11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+
!)))))****(((***%%((((*(((+,**(((+**+,­
There are two files in which 1 file corresponded to the other
there are an array of seqeunces A1
So read 1 record at a time from file 1. read record from file 2. if the sequence in record 1 file 1 (line 2) matches the seuqnece in the array A1, i print the record from file 2 to an output file so on...but the point is i need to read a record at a time.... how would i break out of the inner loop so that i can read the next record from the file 1 and then compare it to the next record in file 2
If you ask about controlling nested loops you can do that with labels.
Example:
OUTER:
while(<>){
for(#something){
last OUTER;
}
}
See last for example.
In case only lines with same number could ever match, you don't really need more than one loop. You can call reading operation (<>, read, sysread) wherever you want. It only usually placed directly in loop because it conveniently returns undef and breaks it when work is done.
while(defined(my $first_line = <FIRST>)){
my $second_line = <SECOND>;
if($first_line eq $second_line){
print "match\n";
} else {
print "no match\n";
}
}
From your sentence I need to check if the sequence matches any with the sequence from the second I gather that you want to check whether any lines in the two files match?
If you need to read a file several times then you can use seek to rewind to the start of it without reopening it.
This program shows the idea.
use strict;
use warnings;
open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $out, '>', 'matches' or die $!;
while (my $line1 = <$fh1>) {
seek $fh2, 0, 0;
while (my $line2 = <$fh2>) {
if ($line1 eq $line2) {
print $out $line1;
last;
}
}
}
Edit
Your comment has changed the problem. Both files have four-line records and you want to compare the second line in corresponding records across the two files.
use strict;
use warnings;
open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $match, '>', 'matches' or die $!;
open my $nomatch, '>', 'nomatch' or die $!;
while (1) {
my (#data1, #data2);
for (1 .. 4) {
my $line;
$line = <$fh1>;
push #data1, $line if defined $line;
$line = <$fh2>;
push #data2, $line if defined $line;
}
last unless #data1 == 4 and #data2 == 4;
if ($data1[1] eq $data2[1]) {
print $match #data2;
}
else {
print $nomatch #data2;
}
}
A full example :
#!/usr/bin/env perl
use strict;
use warnings;
open F1, "<", "/path/1";
open F2, "<", "/path/2";
#a1 = <F1>;
#a2 = <F2>;
for (0..$#a1) {
if ($a1[$_] eq $a2[$_]) {
print "MATCH line [$_]\n";
} else {
print "DOESN'T MATCH line [$_]\n";
}
}