Understanding lexical scoping of "use open ..." of Perl - perl

use open qw( :encoding(UTF-8) :std );
Above statement seems to be effective in its lexical scope only and should not affect outside of it's scope. But I have observed the following.
$ cat data
€
#1
$ perl -e '
open (my $fh, "<encoding(UTF-8)", "data");
print($_) while <$fh>;'
Wide character in print at -e line 1, <$fh> line 1.
€
The Wide character ... warning is perfect here. But
#2
$ perl
my ($fh, $row);
{
use open qw( :encoding(UTF-8) :std );
open ($fh, "<", "data");
}
$row = <$fh>;
chomp($row);
printf("%s (0x%X)", $row, ord($row));
€ (0x20AC)
Does not show the wide character warning!! Here is whats going on here imo
We are using open pragma to set the IO streams to UTF-8, including STDOUT.
Opening the file inside the same scope. It reads the character as multibyte char.
But printing outside the scope. The print statement should show "Wide character" warning, but it is not. Why?
Now look at the following, a little variation
#3
my ($fh, $row);
{
use open qw( :encoding(UTF-8) :std );
}
open ($fh, "<", "data");
$row = <$fh>;
chomp($row);
printf("%s (0x%X)", $row, ord($row));
⬠(0xE2)
Now this time since the open statement is out of the lexical scope, the open opened the file in non utf-8 mode.
Does this mean use open qw( :encoding(UTF-8) :std ); statement changes the STDOUT globally but STDIN within lexical scope?

You aren't using STDIN. You're opening a file with an explicit encoding (except for your last example) and reading from that.
The use open qw(:std ...) affects the standard file handles, but you're only using standard output. When you don't use that and print UTF-8 data to standard output, you get the warning.
In your last example, you don't read the data with an explicit encoding, so when you print it to standard output, it's already corrupted.
That's the trick of encodings no matter what they are. Every part of the process has to be correct.
If you want use open to affect all file handles, you have to import it differently. There are several examples in the top of the documentation.

Unfortunately, the open qw(:std) pragma does not seem to behave as a lexical pragma since it changes the IO layers associated with the standard handles STDIN, STDOUT and STDERR globally. Even code earlier in source file is affected since the use statement happens at compile time. So the following
say join ":", PerlIO::get_layers(\*STDIN);
{
use open qw( :encoding(UTF-8) :std );
}
prints ( on my linux platform ) :
unix:perlio:encoding(utf-8-strict):utf8
whereas without the use open qw( :encoding(UTF-8) :std ) it would just print
unix:perlio.
A way to not affect the global STDOUT for example is to duplicate the handle within a lexical scope and then add IO layers to the duplicate handle within that scope:
use feature qw(say);
use strict;
use warnings;
use utf8;
my $str = "€";
say join ":", PerlIO::get_layers(\*STDOUT);
{
open ( my $out, '>&STDOUT' ) or die "Could not duplicate stdout: $!";
binmode $out, ':encoding(UTF-8)';
say $out $str;
}
say join ":", PerlIO::get_layers(\*STDOUT);
say $str;
with output:
unix:perlio
€
unix:perlio
Wide character in say at ./p.pl line 16.
€

Related

Why does File::Slurp get UTF8 characters wrong when I use open ':std', ':encoding(UTF-8)';?

I have a Perl 5.30.0 program on Ubuntu where the combination of File::Slurp and open ':std', ':encoding(UTF-8)' results in UTF8 not getting read correctly:
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use File::Slurp;
my $text = File::Slurp::slurp('input.txt');
print "$text\n";
with "input.txt" being an UTF8 encoded text file with this content (no BOM):
ö
When I run this, the ö gets displayed as ö. Only when I remove the use open... line, it works as expected and the ö is printed as an ö.
When I manually read the file like below, everything works as expected and I do get the ö:
$text = '';
open my $F, '<', "input.txt" or die "Cannot open file: $!";
while (<$F>) {
$text .= $_;
}
close $F;
print "$text\n";
Why is that and what is the best way to go here? Is the open pragma outdated or am I missing something else?
As with many pragmas,[1] the effect of use open is lexically-scoped.[2] This means it only affects the remainder of the block or file in which it's found. Such a pragma doesn't affect code in functions outside of its scope, even if they are called from which its scope.
You need to communicate the desire to decode the stream to File::Slurp. This can't be done using slurp, but it can be done using read_file via its binmode parameter.
use open ':std', ':encoding(UTF-8)'; # Still want for effect on STDOUT.
use File::Slurp qw( read_file );
my $text = read_file('input.txt', { binmode => ':encoding(UTF-8)' });
A better module is File::Slurper.
use open ':std', ':encoding(UTF-8)'; # Still want for effect on STDOUT.
use File::Slurper qw( read_text );
my $text = read_text('input.txt');
File::Slurper's read_text defaults to decoding using UTF-8.
Without modules, you could use
use open ':std', ':encoding(UTF-8)';
my $text = do {
my $qfn = "input.txt";
open(my $F, '<', $qfn)
or die("Can't open file \"$file\": $!\n");
local $/;
<$fh>
};
Of course, that's not as clear as the earlier solutions.
Other notable examples include use VERSION, use strict, use warnings, use feature and use utf8.
The effect on STDIN, STDOUT and STDERR from :std is global.
Not really an answer to your question, but my favourite file I/O module these days is Path::Tiny.
use Path::Tiny;
my $text = path('input.txt')->slurp_utf8;

Perl UTF8 output to a variable

I have the following Perl code, in which I am opening a handle to a scalar variable and writing some utf8 text to it:
use warnings;
use strict;
use 5.010;
use utf8;
use open qw( :std :encoding(utf8) );
my $output;
open my $oh, ">", \$output;
say $oh "Žluťoučký kůň.";
close $oh;
say "Žluťoučký kůň.";
print $output;
and when I run it I get the following output:
Žluťoučký kůň.
ŽluÅ¥ouÄký kůÅ.
(without any warnings or errors). So, obviously, writing an utf8 string into a variable via a handle does not work correctly here as the string seems to be double-encoded. I have tried opening $oh with >:raw, >:bytes, >:encoding(ascii), but none of it helped.
I might be doing something stupid but I cannot figure how to fix this. Any ideas?
First of all, :encoding(utf8) should be :encoding(utf-8).
UTF-8 is the well known encoding standard.
utf8 is a Perl-specific extension to UTF-8.
Reference
(Encoding names are case-insensitive.)
use open qw( :std :encoding(utf8) ); has two effects:
It adds :encoding(utf8) to STDIN, STDOUT and STDERR.
It sets the default layer for open in the lexical scope of the use to :encoding(utf8).
So,
use utf8;
use open qw( :std :encoding(UTF-8) );
# String of decoded text aka string of Unicode Code Points, thanks to `use utf8`.
my $text_ucp = "Žluťoučký kůň.";
# $output will contain text encoded using UTF-8 thanks to `use open`.
open my $oh, ">", \my $text_utf8;
say $oh $text_ucp;
close $oh;
# ok. Will encode the decoded text using UTF-8 thanks to `use open`.
say $text_ucp;
# XXX. Will encode the already-encoded text using UTF-8 thanks to `use open`.
print $text_utf8;
You tried to override the second effect of use open to obtain a file of Unicode Code Points, but that's futile since files can only contain bytes. Some kind of encoding or failure must occur if you try to store something other than bytes in a file.
So live with it, and decode the "file" before using it.
use utf8;
use open qw( :std :encoding(UTF-8) );
use Encode qw( decode_utf8 );
my $text_ucp = "Žluťoučký kůň.";
open my $oh, ">", \my $text_utf8;
say $oh $text_ucp;
close $oh;
my $text2_ucp = decode_utf8($text_utf8);
... Do stuff with $text_ucp and/or $text2_ucp ...
say $text_ucp;
say $text2_ucp;
It is possible to avoid the decode by working directly with UTF-8 in the second half of the program.
use utf8;
BEGIN { binmode(STDERR, ":encoding(UTF-8)"); } # We'll handle STDOUT manually.
use open qw( :encoding(UTF-8) );
use Encode qw( encode_utf8 );
my $text_ucp = "Žluťoučký kůň.";
open my $oh, ">", \my $text_utf8;
say $oh $text_ucp;
close $oh;
say encode_utf8($text_ucp);
say $text_utf8;
Of course, that means you can't use $text_utf8 anywhere that expects decoded text.

Perl : Name "main::IN" used only once, but it is actually used

I writing a short perl script that reads in a file. See tmp.txt:
1 gene_id "XLOC_000001"; gene_name "DDX11L1"; oId
1 gene_id "XLOC_000001"; gene_name "DDX11L1"; oId
1 gene_id "XLOC_000001"; gene_name "DDX11L1"; oId
1 gene_id "XLOC_000001"; gene_name "DDX11L1"; oId
My perl program, convert.pl is :
use warnings;
use strict;
use autodie; # die if io problem with file
my $line;
my ($xloc, $gene, $ens);
open (IN, "tmp.txt")
or die ("open 'tmp.txt' failed, $!\n");
while ($line = <IN>) {
($xloc, $gene) = ($line =~ /gene_id "([^"]+)".*gene_name "([^"]+)"/);
print("$xloc $gene\n");
}
close (IN)
or warn $! ? "ERROR 1" : "ERROR 2";
It outputs:
Name "main::IN" used only once: possible typo at ./convert.pl line 8.
XLOC_000001 DDX11L1
XLOC_000001 DDX11L1
XLOC_000001 DDX11L1
XLOC_000001 DDX11L1
I used IN, so I don't understand the Name "main::IN" used... warning. Why is it complaining?
This is mentioned under BUGS section of autodie
"Used only once" warnings can be generated when autodie or Fatal is used with package filehandles (eg, FILE). Scalar filehandles are strongly recommended instead.
use diagnostics; says:
Name "main::IN" used only once: possible typo at test.pl line 9 (#1)
(W once) Typographical errors often show up as unique variable names.
If you had a good reason for having a unique name, then just mention
it again somehow to suppress the message. The our declaration is also
provided for this purpose.
NOTE: This warning detects package symbols that have been used only
once. This means lexical variables will never trigger this warning.
It also means that all of the package variables $c, #c, %c, as well as
*c, &c, sub c{}, c(), and c (the filehandle or format) are considered the same; if a program uses $c only once but also uses any of the
others it will not trigger this warning. Symbols beginning with an
underscore and symbols using special identifiers (q.v. perldata) are
exempt from this warning.
So if you use lexical filehandle then it will not warn.
use warnings;
use strict;
use autodie; # die if io problem with file
use diagnostics;
my $line;
my ($xloc, $gene, $ens);
open (my $in, "<", "tmp.txt")
or die ("open 'tmp.txt' failed, $!\n");
while ($line = <$in>) {
($xloc, $gene) = ($line =~ /gene_id "([^"]+)".*gene_name "([^"]+)"/);
print("$xloc $gene\n");
}
close ($in)
or warn $! ? "ERROR 1" : "ERROR 2";
I'm pretty sure this is because of autodie.
I don't know exactly why, but if you remove it, it goes away.
If you read perldoc autodie you'll see:
BUGS ^
"Used only once" warnings can be generated when autodie or Fatal is used with package filehandles (eg, FILE). Scalar filehandles are strongly recommended instead.
I'd suggest that's because of how the or die is being handled, compared to autodie trying to handle it.
However I'd also suggest it would be much better style to use a 3 argument open:
open ( my $input, '<', 'tmp.txt');
And either autodie or or die. I must confess, I'm not really sure which way around the two would be applied if your process did fail the open.

Sort file using pipe in perl

I am looking for a way to sort a file using pipe. I have checked different examples online but I am still confused
Let's say I have a file called "perlRocks.txt" with different names I want to sort.
This is what I have so far:
open(SORT, "| sort>perlRocks.txt") or die "Can't sort";
close (SORT);
What am I missing?
This isn't using perl to sort. To do this in perl, you would want to:
open ( my $input_fh, "<", "perlRocks.txt" ) or die $!;
my #lines = <$input_fh>;
print sort #lines;
What you're doing is trying to call the command sort.
There is no need to use a pipe, use system instead:
system("sort perlRocks.txt");
This will invoke the system command sort and give it perlRocks.txt as parameter. You will see the output of sort in the shell from which you invoked your script.
Of course, with just this command, the sorted content will be shown and then be forgotten. This might or might not what you have in mind. If you want to permanently store the sorted lines you need to redirect the output into another file.
Of course, perl comes with its own sort operator, so that you don't have to use an external sort command: sort #lines. In order to get the content of your file into #lines, you might want to use the module File::Slurp:
use warnings;
use strict;
use File::Slurp;
my #lines = read_file('perlRocks.txt');
print sort #lines;
You are opening pipe for writing. If you already have this file you probably want read content sorted. See the example with $in below. If you want to write something from your script instead see example with $out below. See open documentation for more variants. For sorting existing file, you have to write to a new file and then rename. It's best to use some shell for this task.
use strict;
use warnings;
use autodie;
use constant FILE_NAME_IN => 'perlRocks.in';
use constant FILE_NAME_OUT => 'perlRocks.out';
open my $in, '-|', 'sort', FILE_NAME_IN;
while (<$in>) {print};
open my $out, '|-', "sort >'#{[FILE_NAME_OUT]}'";
print $out $_, "\n" for qw(foo bar baz quux);
There is safer version for output pipe which avoids problems with shell interpretation of FILE_NAME_OUT content (You can escape this content but ... no.)
open my $out, '|-' or do {
close STDOUT;
open STDOUT, '>', FILE_NAME_OUT;
exec 'sort' or die $!;
};
If you insist you don't want to use shell, you can use Perl.
use strict;
use warnings;
use autodie;
use constant FILE_NAME_IN => 'perlRocks.txt';
use constant FILE_NAME_OUT => 'perlRocks.txt';
# don't bother with fork if you don't want continue with Perl process
my $pid = fork;
die "cannot fork" unless defined $pid;
unless ($pid) { # use just this code inside in this case
close STDIN;
close STDOUT;
open STDIN, '<', FILE_NAME_IN;
unlink FILE_NAME_IN if FILE_NAME_IN eq FILE_NAME_OUT;
open STDOUT, '>', FILE_NAME_OUT;
exec 'sort' or die $!;
}
waitpid( $pid, 0 );
Note FILE_NAME_IN and FILE_NAME_OUT can be same, but it is not in-place sorting anyway. There are both versions of the file at the disc in some time even one can be hidden and inaccessible. There is also good IPC::Run module for this sort of tasks.

Perl script to parse a text file and match a string

I'm editing my question to add more details
The script executes the command and redirects the output to a text file.
The script then parses the text file to match the following string " Standard 1.1.1.1"
The output in the text file is :
Host Configuration
------------------
Profile Hostname
-------- ---------
standard 1.1.1.1
standard 1.1.1.2
The code works if i search for either 1.1.1.1 or standard . When i search for standard 1.1.1.1 together the below script fails.
this is the error that i get "Unable to find string: standard 172.25.44.241 at testtest.pl
#!/usr/bin/perl
use Net::SSH::Expect;
use strict;
use warnings;
use autodie;
open (HOSTRULES, ">hostrules.txt") || die "could not open output file";
my $hos = $ssh->exec(" I typed the command here ");
print HOSTRULES ($hos);
close(HOSTRULES);
sub find_string
{
my ($file, $string) = #_;
open my $fh, '<', $file;
while (<$fh>) {
return 1 if /\Q$string/;
}
die "Unable to find string: $string";
}
find_string('hostrules.txt', 'standard 1.1.1.1');
Perhaps write a function:
use strict;
use warnings;
use autodie;
sub find_string {
my ($file, $string) = #_;
open my $fh, '<', $file;
while (<$fh>) {
return 1 if /\Q$string/;
}
die "Unable to find string: $string";
}
find_string('output.txt', 'object-cache enabled');
Or just slurp the entire file:
use strict;
use warnings;
use autodie;
my $data = do {
open my $fh, '<', 'output.txt';
local $/;
<$fh>;
};
die "Unable to find string" if $data !~ /object-cache enabled/;
You're scanning a file for a particular string. If that string is not found in that file, you want an error thrown. Sounds like a job for grep.
use strict;
use warnings;
use features qw(say);
use autodie;
use constant {
OUTPUT_FILE => 'output.txt',
NEEDED_STRING => "object-cache enabled",
};
open my $out_fh, "<", OUTPUT_FILE;
my #output_lines = <$out_fh>;
close $out_fh;
chomp #output_lines;
grep { /#{[NEEDED_STRING]}/ } #output_lines or
die qq(ERROR! ERROR! ERROR!); #Or whatever you want
The die command will end the program and exit with a non-zero exit code. The error will be printed on STDERR.
I don't know why, but using qr(object-cache enabled), and then grep { NEEDED_STRING } didn't seem to work. Using #{[...]} allows you to interpolate constants.
Instead of constants, you might want to be able to pass in the error string and the name of the file using GetOptions.
I used the old fashion <...> file handling instead of IO::File, but that's because I'm an old fogy who learned Perl back in the 20th century before it was cool. You can use IO::File which is probably better and more modern.
ADDENDUM
Any reason for slurping the entire file in memory? - Leonardo Herrera
As long as the file is reasonably sized (say 100,000 lines or so), reading the entire file into memory shouldn't be that bad. However, you could use a loop:
use strict;
use warnings;
use features qw(say);
use autodie;
use constant {
OUTPUT_FILE => 'output.txt',
NEEDED_STRING => qr(object-cache enabled),
};
open my $out_fh, "<", OUTPUT_FILE;
my $output_string_found; # Flag to see if output string is found
while ( my $line = <$out_fh> ) {
if ( $line =~ NEEDED_STRING ){
$output_string_found = "Yup!"
last; # We found the string. No more looping.
}
}
die qq(ERROR, ERROR, ERROR) unless $output_string_found;
This will work with the constant NEEDED_STRING defined as a quoted regexp.
perl -ne '/object-cache enabled/ and $found++; END{ print "Object cache disabled\n" unless $found}' < input_file
This just reads the file a line at a time; if we find the key phrase, we increment $found. At the end, after we've read the whole file, we print the message unless we found the phrase.
If the message is insufficient, you can exit 1 unless $found instead.
I suggest this because there are two things to learn from this:
Perl provides good tools for doing basic filtering and data munging right at the command line.
Sometimes a simpler approach gets a solution out better and faster.
This absolutely isn't the perfect solution for every possible data extraction problem, but for this particular one, it's just what you need.
The -ne option flags tell Perl to set up a while loop to read all of standard input a line at a time, and to take any code following it and run it into the middle of that loop, resulting in a 'run this pattern match on each line in the file' program in a single command line.
END blocks can occur anywhere and are always run at the end of the program only, so defining it inside the while loop generated by -n is perfectly fine. When the program runs out of lines, we fall out the bottom of the while loop and run out of program, so Perl ends the program, triggering the execution of the END block to print (or not) the warning.
If the file you are searching contained a string that indicated the cache was disabled (the condition you want to catch), you could go even shorter:
perl -ne '/object-cache disabled/ and die "Object cache disabled\n"' < input_file
The program would scan the file only until it saw the indication that the cache was disabled, and would exit abnormally at that point.
First, why are you using Net::SSH::Expect? Are you executing a remote command? If not, all you need to execute a program and wait for its completion is system.
system("cmd > file.txt") or die "Couldn't execute: $!";
Second, it appears that what fails is your regular expression. You are searching for the literal expression standard 1.1.1.1 but in your sample text it appears that the wanted string contains either tabs or several spaces instead of a single space. Try changing your call to your find_string function:
find_string('hostrules.txt', 'standard\s+1.1.1.1'); # note '\s+' here