Trying to improve Encode::decode warning message: Segfault in $SIG{__WARN__} handler - perl

I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the origial message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather like to know which file is giving the problem.
I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:
#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;
use Encode ();
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $fn = 'test.txt';
write_test_file( $fn );
# Try to improve the Encode::FB_WARN fallback warning message :
#
# utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
# utf8 "\xE5" does not map to Unicode at line xx of file <filename>.
my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) };
$str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";
sub my_warn_handler {
my ( $fn, $msg ) = #_;
if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
recover_line_number_and_char_pos( $fn, $msg );
}
else {
warn $msg;
}
}
sub recover_line_number_and_char_pos {
my ( $fn, $err_msg ) = #_;
chomp $err_msg;
$err_msg =~ s/(line \d+)\.$/$1/; # Remove period at end of sentence.
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s;
my $line_no = $str =~ tr/\n//;
++$line_no;
my $pos = ( length $last_line ) + 1;
warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}
sub write_test_file {
my ( $fn ) = #_;
my $bytes = "Hello\nA\x{E5}\x{61}"; # 2 lines ending in iso 8859-1: åa
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
}
Output:
utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)

Here is another way to locate where the warning fires, with un-buffered sysread
use warnings;
use strict;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";
$SIG{__WARN__} = sub { print "\t==> WARN: #_" };
my $char_cnt = 0;
my $char;
while (sysread($fh, $char, 1)) {
++$char_cnt;
print "$char ($char_cnt)\n";
}
The file test.txt was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked with $char =~ /\n/.
The sysread returns undef on error. It can be moved into the body of while (1) to allow reads to continue and catch all warnings, breaking out on 0 (returned on EOF).
This prints
H (1)
e (2)
l (3)
l (4)
o (5)
(6)
A (7)
å (8)
a (9)
==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
(10)
While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.
However, Perl is utf8 internally and I am not sure that sysread needs Encode.
Note. The page for sysread supports its use on data with encoding layers
Note that if the filehandle has been marked as :utf8 Unicode
characters are read instead of bytes (the LENGTH, OFFSET, and the
return value of sysread are in Unicode characters). The
:encoding(...) layer implicitly introduces the :utf8 layer.
See binmode, open, and the open pragma.
Note Apparently, things have moved on and after a certain version sysread does not support encoding layers. The link above, while for an older version (v5.10 for one) indeed shows what is quoted, with a newer version tells us that there'll be an exception.

Related

Can't call method "print" on an undefined value

Can't call method print on an undefined value in line 40 line 2.
Here is the code. I use FileHandle to settle files:
#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
die unless (#ARGV ==4|| #ARGV ==5);
my #input =();
$input[0]=$ARGV[3];
$input[1]=$ARGV[4] if ($#ARGV==4);
chomp #input;
$input[0] =~ /([^\/]+)$/;
my $out = "$1.insert";
my $lane= "$1";
my %fh=();
open (Info,">$ARGV[1]") || die "$!";
open (AA,"<$ARGV[0]") || die "$!";
while(<AA>){
chomp;
my #inf=split;
my $iden=$inf[0];
my $outputfile="$ARGV[2]/$iden";
$fh{$iden}=FileHandle->new(">$outputfile");
}
close AA;
foreach my $input (#input) {
open (IN, "<$input" ) or die "$!" ;
my #path=split (/\//,$input);
print Info "#$path[-1]\n";
while (<IN>) {
my $line1 = $_;
my ($id1,$iden1) = (split "\t", $line1)[6,7];
my $line2 = <IN> ;
my ($id2,$iden2) = (split "\t", $line2)[6,7];
if ($id1 eq '+' && $id2 eq '-') {
my #inf=split(/\t/,$line1);
$fh{$iden1}->print($line1);
$fh{$iden2}->print($line2);
}
}
close IN;
}
I’ve tried multiple variations of this, but none of them seem to work. Any ideas?
Please remember that the primary worth of a Stack Overflow post is not to fix your particular problem, but to help the thousands of others who may be stuck in the same way. With that in mind, "I fixed it, thanks, bye" is more than a little selfish
As I said in my comment, using open directly on a hash element is much preferable to involving FileHandle. Perl will autovivify the hash element and create a file handle for you, and most people at all familiar with Perl will thank you for not making them read up again on the FileHandle documentation
I rewrote your code like this, which is much more Perlish and relies less on "magic numbers" to access #ARGV. You should really assign #ARGV to a list of named scalars, or - better still - use Getopt::Long so that they are named anyway
You should open your file handles as late as possible, and close the output handles early. This is effected most easily by using lexical file handles and limiting their scope to a block. Perl will implicitly close lexical handles for you when they go out of scope
There is no need to chomp the contents of #ARGVunless you could be be called under strange and errant circumstances, in which case you need to do a hell of a lot more to verify the input
You never use the result of $input[0] =~ /([^\/]+)$/ or the variables $out and $lane, so I removed them
#!/usr/bin/perl
use strict;
use warnings 'all';
# $ARGV[0] -- input file
# $ARGV[1] -- output log file
# $ARGV[2] -- directory for outputs per ident
# $ARGV[3] -- 1, $input[0]
# $ARGV[4] -- 2, $input[1] or undef
die "Fix the parameters" unless #ARGV == 4 or #ARGV == 5;
my #input = #ARGV[3,4];
my %fh;
{
open my $fh, '<', $ARGV[0] or die $!;
while ( <$fh> ) {
my $id = ( split )[0];
my $outputfile = "$ARGV[2]/$id";
open $fh{$id}, '>', $outputfile or die qq{Unable to open "$outputfile" for output: $!};
}
}
open my $log_fh, '>', $ARGV[1] or die qq{Unable to open "$ARGV[1]" for output: $!};
for my $input ( #input ) {
next unless $input; # skip unspecified parameters
my #path = split qr|/|, $input; # Really should be done by File::Spec
print $log_fh "#$path[-1]\n"; # Or File::Basename
open my $fh, '<', $input or die qq{Unable to open "$input" for input: $!};
while ( my $line0 = <$fh> ) {
chomp $line0;
my $line1 = <$fh>;
chomp $line1;
my ($id0, $iden0) = (split /\t/, $line0)[6,7];
my ($id1, $iden1) = (split /\t/, $line1)[6,7];
if ( $id0 eq '+' and $id1 eq '-' ) {
$fh{$_} or die qq{No output file for "$_"} for $iden0, $iden1;
print { $fh{$iden0} } $line0;
print { $fh{$iden1} } $line1;
}
}
}
while ( my ($iden, $fh) = each %fh ) {
close $fh or die qq{Unable to close file handle for "$iden": $!};
}
You don't have any error handling on this line:
$fh{$iden}=FileHandle->new(">$outputfile");
It's possible that opening a filehandle is silently failing, and only producing an error when you try to print to it. For example, if you have specified an invalid filename.
Also, you never check if $iden1 and $iden2 are names of open filehandles that actually exist. It's possible one of them does not exist.
In particular, you aren't removing a newline from $line1, so if $iden1 and $iden2 happen to be the last values on the line, this will be included in the name you are trying to use, and it will fail.
In your first while loop, you set up a hash of filehandles that you will write to later. The keys in this hash are the "iden" strings from the first file passed to the program.
Later, you parse another file and use the "iden" values in that file to choose which filehandle to write data to. But one (or more) of the "iden" values in the second file is missing from the first file. So that filehandle can't be found in the %fh hash. Because you don't check for that, you get `undef back from the hash and you can't print to an undefined filehandle.
To fix it, put a check before trying to use one of the filehandles from the %fh hash.
die "Unknown fh identifier '$iden1'" unless exists $fh{$iden1};
die "Unknown fh identifier '$iden2'" unless exists $fh{$iden2};
$fh{$iden1}->print($line1);
$fh{$iden2}->print($line2);

How to detect malformed UTF-8 at the end of the file?

I am trying to print a warning message when reading a file (that is supposed to contain valid UTF-8) contains invalid UTF-8. However, if the invalid data is at the end of the file I am not able to output any warnings. The following MVCE creates a file containing invalid UTF-8 data (creation of the file is not relevant to the the general question, it was just added here to produce a MVCE):
use feature qw(say);
use strict;
use warnings;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $bytes = "\x{61}\x{E5}\x{61}"; # 3 bytes in iso 8859-1: aåa
test_read_invalid( $bytes );
$bytes = "\x{61}\x{E5}"; # 2 bytes in iso 8859-1: aå
test_read_invalid( $bytes );
sub test_read_invalid {
my ( $bytes ) = #_;
say "Running test case..";
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'\n";
}
The output is:
Running test case..
utf8 "\xE5" does not map to Unicode at ./p.pl line 22.
Read string: 'a\xE5a'
Running test case..
Read string: 'a'
In the last test case, the invalid byte at the end of the file seems to be silently ignored by the PerlIO layer :encoding(utf-8).
Essentially what you're seeing is the perlIO system attempting to deal with a block read ending in the middle of a utf-8 sequence. So the raw byte buffer still has the invalid byte you want, but the encoded buffer does not yet have that content because it doesn't decode properly yet and it's hoping to find another character later. You can check for this by popping the encoding layer off and doing another read and checking the length.
binmode $fh, ':pop';
my $remainder = do { local $/; <$fh>};
die "Unread Characters" if length $remainder;
I'm not sure, you may want to have your open encoding start with :raw or do binmode $fh, ':raw' instead, I've never paid much attention to the layers themselves since it usually just works. I do know that this code block works for your test case :)
I'm not sure what you are asking. To detect encoding errors in a string, you can simply attempt to decode the string. As for getting an error from writing to the file, maybe close returns an error, or you can use chomp($_); print($fh "$_\n"); (seeing as unix text files should always end with a newline anyway).
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
#the end of the file need a single space to find a invalid UTF-8 characters.
print $fh "$bytes ";
Output:
Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '
Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '

Read newline delimited file in Perl

I am trying to read a newline-delimited file into an array in Perl. I do NOT want the newlines to be part of the array, because the elements are filenames to read later. That is, each element should be "foo" and not "foo\n". I have done this successfully in the past using the methods advocated in Stack Overflow question Read a file into an array using Perl and Newline Delimited Input.
My code is:
open(IN, "< test") or die ("Couldn't open");
#arr = <IN>;
print("$arr[0] $arr[1]")
And my file 'test' is:
a
b
c
d
e
My expected output would be:
a b
My actual output is:
a
b
I really don't see what I'm doing wrong. How do I read these files into arrays?
Here is how I generically read from files.
open (my $in, "<", "test") or die $!;
my #arr;
while (my $line = <$in>) {
chomp $line;
push #arr, $line;
}
close ($in);
chomp will remove newlines from the line read. You should also use the three-argument version of open.
Put the file path in its own variable so that it can be easily
changed.
Use the 3-argument open.
Test all opens, prints, and closes for success, and if not, print the error and the file name.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
# conditional compile DEBUGging statements
# See http://lookatperl.blogspot.ca/2013/07/a-look-at-conditional-compiling-of.html
use constant DEBUG => $ENV{DEBUG};
# --------------------------------------
# put file path in a variable so it can be easily changed
my $file = 'test';
open my $in_fh, '<', $file or die "could not open $file: $OS_ERROR\n";
chomp( my #arr = <$in_fh> );
close $in_fh or die "could not close $file: $OS_ERROR\n";
print "#arr[ 0 .. 1 ]\n";
A less verbose option is to use File::Slurp::read_file
my $array_ref = read_file 'test', chomp => 1, array_ref => 1;
if, and only if, you need to save the list of file names anyway.
Otherwise,
my $filename = 'test';
open (my $fh, "<", $filename) or die "Cannot open '$filename': $!";
while (my $next_file = <$fh>) {
chomp $next_file;
do_something($next_file);
}
close ($fh);
would save memory by not having to keep the list of files around.
Also, you might be better off using $next_file =~ s/\s+\z// rather than chomp unless your use case really requires allowing trailing whitespace in file names.

Perl - binary unpack using pointer and index

I have a binary file that contain 3 files, a PNG, a PHP and a TGA file.
Here the file to give you the idea : container.bin
the file is build this way:
first 6 bytes are a pointer to the index, in this case 211794
Then you have all 3 files stacked one after the other
and at the ofset 211794, you have the index, that tell you where the file start and end
in this example you have:
[offset start] [offset end] [random data] [offset start] [name]
6 15149 asdf 6 Capture.PNG
15149 15168 4584 15149 index.php
15168 211794 12 15168 untilted.tga
meaning that capture.png start at offset 6, finish at offset 15149, then asdf is a random data, and the start offset is repeated again.
Now what I want to do is a perl to separate the file on this binary files.
The perl need to check the first 6 offset of the file (header), then jump to the index location, and use the list to extract the file out.
A mix of seek and read can be used to achieve the task:
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl 'SEEK_SET';
sub get_files_info {
my ( $fh, $offset ) = #_;
my %file;
while (<$fh>) {
chomp;
my $split_count = my ( $offset_start, $offset_end, $random_data, $offset_start_copy,
$file_name ) = split /\s/;
next if $split_count != 5;
if ( $offset_start != $offset_start_copy ) {
warn "Start of offset mismatch: $file_name\n";
next;
}
$file{$file_name} = {
'offset_start' => $offset_start,
'offset_end' => $offset_end,
'random_data' => $random_data,
};
}
return %file;
}
sub write_file {
my ( $fh, $file_name, $file_info ) = #_;
seek $fh, $file_info->{'offset_start'}, SEEK_SET;
read $fh, my $contents,
$file_info->{'offset_end'} - $file_info->{'offset_start'};
open my $fh_out, '>', $file_name or die 'Error opening file: $!';
binmode $fh_out;
print $fh_out $contents;
print "Wrote file: $file_name\n";
}
open my $fh, '<', 'container.bin' or die "Error opening file: $!";
binmode $fh;
read $fh, my $offset, 6;
seek $fh, $offset, SEEK_SET;
my %file = get_files_info $fh, $offset;
for my $file_name ( keys %file ) {
write_file $fh, $file_name, $file{$file_name};
}
The only real difficulty here is to make sure that both input and output files are read in binary mode. This can be achieved by using the :raw PerlIO layer when the files are opened.
This program seems to do what you want. It first locates and reads the index block into a string, and then opens that string for input and reads the start and end position and name of each of the constituent files. Thereafter processing each file is simple.
Be aware that unless the formatting of the index block is more strict than you say, you can rely only on the first, second, and last whitespace-separated fields on each line since random text could contain spaces. There is also no way to specify a file name containing spaces.
The output, using Data::Dump, is there to demonstrate correct functionality and is not necessary for the functioning of the program.
use v5.10;
use warnings;
use Fcntl ':seek';
use autodie qw/ open read seek close /;
open my $fh, '<:raw', 'container.bin';
read $fh, my $index_loc, 6;
seek $fh, $index_loc, SEEK_SET;
read $fh, my ($index), 1024;
my %contents;
open my $idx, '<', \$index;
while (<$idx>) {
my #fields = split;
next unless #fields;
$contents{$fields[-1]} = [ $fields[0], $fields[1] ];
}
use Data::Dump;
dd \%contents;
for my $file (keys %contents) {
my ($start, $end) = #{ $contents{$file} };
my $size = $end - $start;
seek $fh, $start, SEEK_SET;
my $nbytes = read $fh, my ($data), $size;
die "Premature EOF" unless $nbytes == $size;
open my $out, '>:raw', $file;
print { $out } $data;
close $out;
}
output
{
"Capture.PNG" => [6, 15149],
"index.php" => [15149, 15168],
"untilted.tga" => [15168, 211794],
}

Read and Write in the same file with different process

I have written the two program. One program is write the content to the text file simultaneously. Another program is read that content simultaneously.
But both the program should run at the same time. For me the program is write the file is correctly. But another program is not read the file.
I know that once the write process is completed than only the data will be stored in the hard disk. Then another process can read the data.
But I want both read and write same time with different process in the single file. How can I do that?
Please help me.
The following code write the content in the file
sub generate_random_string
{
my $length_of_randomstring=shift;# the length of
# the random string to generate
my #chars=('a'..'z','A'..'Z','0'..'9','_');
my $random_string;
foreach (1..$length_of_randomstring)
{
# rand #chars will generate a random
# number between 0 and scalar #chars
$random_string.=$chars[rand #chars];
}
return $random_string;
}
#Generate the random string
open (FH,">>file.txt")or die "Can't Open";
while(1)
{
my $random_string=&generate_random_string(20);
sleep(1);
#print $random_string."\n";
print FH $random_string."\n";
}
The following code is read the content. This is another process
open (FH,"<file.txt") or die "Can't Open";
print "Open the file Successfully\n\n";
while(<FH>)
{
print "$_\n";
}
You might use an elaborate cooperation protocol such as in the following. Both ends, reader and writer, use common code in the TakeTurns module that handles fussy details such as locking and where the lock file lives. The clients need only specify what they want to do when they have exclusive access to the file.
reader
#! /usr/bin/perl
use warnings;
use strict;
use TakeTurns;
my $runs = 0;
reader "file.txt" =>
sub {
my($fh) = #_;
my #lines = <$fh>;
print map "got: $_", #lines;
++$runs <= 10;
};
writer
#! /usr/bin/perl
use warnings;
use strict;
use TakeTurns;
my $n = 10;
my #chars = ('a'..'z','A'..'Z','0'..'9','_');
writer "file.txt" =>
sub { my($fh) = #_;
print $fh join("" => map $chars[rand #chars], 1..$n), "\n"
or warn "$0: print: $!";
};
The TakeTurns module is execute-around at work:
package TakeTurns;
use warnings;
use strict;
use Exporter 'import';
use Fcntl qw/ :DEFAULT :flock /;
our #EXPORT = qw/ reader writer /;
my $LOCKFILE = "/tmp/taketurns.lock";
sub _loop ($&) {
my($path,$action) = #_;
while (1) {
sysopen my $lock, $LOCKFILE, O_RDWR|O_CREAT
or die "sysopen: $!";
flock $lock, LOCK_EX or die "flock: $!";
my $continue = $action->();
close $lock or die "close: $!";
return unless $continue;
sleep 0;
}
}
sub writer {
my($path,$w) = #_;
_loop $path =>
sub {
open my $fh, ">", $path or die "open $path: $!";
my $continue = $w->($fh);
close $fh or die "close $path: $!";
$continue;
};
}
sub reader {
my($path,$r) = #_;
_loop $path =>
sub {
open my $fh, "<", $path or die "open $path: $!";
my $continue = $r->($fh);
close $fh or die "close $path: $!";
$continue;
};
}
1;
Sample output:
got: 1Upem0iSfY
got: qAALqegWS5
got: 88RayL3XZw
got: NRB7POLdu6
got: IfqC8XeWN6
got: mgeA6sNEpY
got: 2TeiF5sDqy
got: S2ksYEkXsJ
got: zToPYkGPJ5
got: 6VXu6ut1Tq
got: ex0wYvp9Y8
Even though you went to so much trouble, there are still issues. The protocol is unreliable, so reader has no guarantee of seeing all messages that writer sends. With no writer active, reader is content to read the same message over and over.
You could add all this, but a more sensible approach would be using abstractions the operating system provides already.
For example, Unix named pipes seem to be a pretty close match to what you want, and note how simple the code is:
pread
#! /usr/bin/perl
use warnings;
use strict;
my $pipe = "/tmp/mypipe";
system "mknod $pipe p 2>/dev/null";
open my $fh, "<", $pipe or die "$0: open $pipe: $!";
while (<$fh>) {
print "got: $_";
sleep 0;
}
pwrite
#! /usr/bin/perl
use warnings;
use strict;
my $pipe = "/tmp/mypipe";
system "mknod $pipe p 2>/dev/null";
open my $fh, ">", $pipe or die "$0: open $pipe: $!";
my $n = 10;
my #chars = ('a'..'z','A'..'Z','0'..'9','_');
while (1) {
print $fh join("" => map $chars[rand #chars], 1..$n), "\n"
or warn "$0: print: $!";
}
Both ends attempt to create the pipe using mknod because they have no other method of synchronization. At least one will fail, but we don't care as long as the pipe exists.
As you can see, all the waiting machinery is handled by the system, so you do what you care about: reading and writing messages.
This works.
The writer:
use IO::File ();
sub generate_random_string {...}; # same as above
my $file_name = 'file.txt';
my $handle = IO::File->new($file_name, 'a');
die "Could not append to $file_name: $!" unless $handle;
$handle->autoflush(1);
while (1) {
$handle->say(generate_random_string(20));
}
The reader:
use IO::File qw();
my $file_name = 'file.txt';
my $handle = IO::File->new($file_name, 'r');
die "Could not read $file_name: $!" unless $handle;
STDOUT->autoflush(1);
while (defined (my $line = $handle->getline)) {
STDOUT->print($line);
}
are you on windows or *nix? you might be able to string something like this together on *nix by using tail to get the output as it is written to the file. On windows you can call CreateFile() with FILE_SHARE_READ and/or FILE_SHARE_WRITE in order to allow others to access the file while you have it opened for read/write. you may have to periodically check to see if the file size has changed in order to know when to read (i'm not 100% certain here.)
another option is a memory mapped file.