With Strawberry Perl v5.28.1 on Windows 10, I am trying to get the same result as on Linux: a UTF-8 encoded file with Unix line endings.
Here is my Perl script:
#!perl -w
use strict;
use utf8;
use Encode qw(encode_utf8);
use Digest::MD5 qw(md5_hex);
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
my %words;
while(<>) {
    # change yo to ye
    tr/ёЁ/еЕ/;
    # extract russian word and its optional explanation
    next unless /^([А-Я]{2,})\|?([А-Я ,-]*)/i;
    my ($word, $expl) = (uc $1, $2);
    if (length($word) <= 3) {
        print $word;
        # if explanation is missing, omit the pipe
        print (length($expl) > 3 ? "|$expl\x0A" : "\x0A");
    } else {
        # print the md5 hash and omit the pipe and explanation
        print md5_hex(encode_utf8('my secret' . $word)) . "\x0A";
    }
}
Here is my input file:
ААК|Плоскодонное речное судно
ААРОНОВЕЦ|
ААРОНОВЩИНА|
ААТ|Драгоценный красный камень в Японии
АБА|Толстое и редкое белое сукно
АБАЖУР|
АБАЖУРОДЕРЖАТЕЛЬ|
АБАЗ|Грузинская серебряная монета
АБАЗА|
Here is how I run it (I use type instead of < because I have numerous input files in my real use case):
type input.txt | perl encode-words-ru.pl > output.txt
Regardless of what I try in the above Perl source code, the lines in output.txt are terminated by \x0D\x0A
Please help me to stop perl from "helping" me!
There is probably a better way, but you could make STDOUT a :raw file handle and then encode the output there yourself.
binmode STDOUT; # or binmode STDOUT, ":raw";
...
print (length($expl) > 3 ? encode_utf8("|$expl\n") : "\n"); # $expl is already decoded
...
print md5_hex(encode_utf8('my secret' . $word)) . "\n";
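As an alternative to encoding each string by hand, a layer stack can do both jobs at once. This is a minimal sketch, assuming the `:raw:encoding(UTF-8)` layer string behaves on Strawberry Perl as described in the PerlIO documentation (`:raw` drops the `:crlf` layer, `:encoding(UTF-8)` then encodes characters). The helper name `format_entry` and the sample words are made up for illustration.

```perl
#!perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: builds one output record as a character string
# that always ends in a bare LF.
sub format_entry {
    my ($word, $expl) = @_;
    # Long words are replaced by their salted MD5 hash.
    return md5_hex(encode_utf8('my secret' . $word)) . "\n"
        if length($word) > 3;
    # Short words keep the explanation, if there is one.
    return length($expl) > 3 ? "$word|$expl\n" : "$word\n";
}

# :raw removes the :crlf layer that Windows perls push by default;
# :encoding(UTF-8) then turns characters into UTF-8 bytes.
binmode(STDOUT, ':raw:encoding(UTF-8)');

print format_entry('ААК', 'Плоскодонное речное судно');
print format_entry('АБАЖУР', '');
```

With this arrangement the strings stay decoded until they reach the filehandle, so no print statement needs its own encode_utf8 call.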
I started using Term::Readline recently, but now I realized cat text | ./script.pl doesn't work (no output).
script.pl snippet before (working ok):
#!/usr/bin/perl
use strict;
use warnings;
$| = 1;
while (<>) {
print $_;
}
script.pl snippet after (working only interactively):
#!/usr/bin/perl
use strict;
use warnings;
use Term::ReadLine;
$| = 1;
my $term = Term::ReadLine->new('name');
my $input;
while (defined ($input = $term->readline('')) ) {
print $input;
}
Is there anything I can do to preserve this behavior (to have the lines printed) ?
You need to set it up to use the input and output filehandles that you want. The docs don't spell it out, but the constructor takes either a string (to serve as a name), or that string and globs for input and output filehandles (need both).
use warnings;
use strict;
use Term::ReadLine;
my $term = Term::ReadLine->new('name', \*STDIN, \*STDOUT);
while (my $line = $term->readline()) {
print $line, "\n";
}
Now
echo "hello\nthere" | script.pl
prints the two lines with hello and there, while script.pl < input.txt prints out the lines of the file input.txt. After this, the module's $term will use the normal STDIN and STDOUT for all future I/O. Note that the module has methods for retrieving the input and output filehandles ($term->IN and $term->OUT), so you can change later where your I/O goes.
The Term::ReadLine documentation itself doesn't have much detail, but the module is a front end for other modules, which are listed on its page. Their pages have far more information. Also, I believe that uses of this are covered elsewhere, for example in the good old Cookbook.
I have a CSV file, say win.csv, whose text is encoded in windows-1252. First I use iconv to convert it to UTF-8.
$iconv -o test.csv -f windows-1252 -t utf-8 win.csv
Then I read the converted CSV file with the following Perl script (utfcsv.pl).
#!/usr/bin/perl
use utf8;
use Text::CSV;
use Encode::Detect::Detector;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';',});
open my $fh, "<encoding(utf8)", "test.csv";
while (my $row = $csv->getline($fh)) {
    my $line = join " ", @$row;
    my $enc = Encode::Detect::Detector::detect($line);
    print "($enc) $line\n";
}
$csv->eof || $csv->error_diag();
close $fh;
$csv->eol("\r\n");
exit;
Then the output is like the following.
(UTF-8) .........
() .....
Namely, the encoding of every line is detected as UTF-8 (or ASCII). But the actual output does not seem to be UTF-8. In fact, if I save the output to a file
$./utfcsv.pl > output.txt
then the encoding of output.txt is detected as windows-1252.
Question: How can I get the output text in UTF-8?
Notes:
Environment: openSUSE 13.2 x86_64, perl 5.20.1
I do not use Text::CSV::Encoded because the installation fails. (Besides, test.csv has already been converted to UTF-8, so using Text::CSV::Encoded would be strange.)
I use the following script to check the encoding. (I also use it to find out the encoding of the initial CSV file win.csv.)
#!/usr/bin/perl
use Encode::Detect::Detector;
open my $in, "<", $ARGV[0] or die "open failed: $!";
while (my $line = <$in>) {
    my $enc = Encode::Detect::Detector::detect($line);
    chomp $enc;
    if ($enc) {
        print "$enc\n";
    }
}
You have set the encoding of the input file handle (which, by the way, should be <:encoding(utf8) -- note the colon) but you haven't specified the encoding of the output channel, so Perl will send unencoded character values to the output.
The Unicode code points for characters that fit in a single byte -- Basic Latin (ASCII) between 0 and 0x7F, and Latin-1 Supplement between 0x80 and 0xFF -- are very similar to Windows code page 1252. In particular, a small letter u with a diaeresis is 0xFC in both Unicode and CP1252, so the text will look like CP1252 if it is output unencoded, instead of as the two-byte sequence 0xC3 0xBC, which is the same code point encoded in UTF-8.
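The difference is easy to see by encoding the character by hand. This is a small sketch using the core Encode module. The helper name `hex_bytes` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: shows the bytes of a string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $u_diaeresis = "\x{FC}";    # U+00FC, small letter u with diaeresis

# One byte in CP1252, identical to the code point value.
print 'cp1252: ', hex_bytes(encode('cp1252', $u_diaeresis)), "\n";
# Two bytes in UTF-8.
print 'UTF-8:  ', hex_bytes(encode('UTF-8',  $u_diaeresis)), "\n";
```

The first line shows FC and the second shows C3 BC, which is why unencoded output is mistaken for CP1252 by an encoding detector.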
If you use binmode on STDOUT to set the encoding then the data will be output correctly, but it is simplest to use the open pragma like this
use open qw/ :std :encoding(utf-8) /;
which will set the encoding for STDIN, STDOUT and STDERR, as well as any newly-opened file handles. That means you don't have to specify it when you open the CSV file, and your code will look like the program below.
Note that I have also added use strict and use warnings, which are essential in any Perl program. I have also used autodie to remove the need for checks on the status of all I/O operations, and I have taken advantage of the way Perl interpolates arrays inside double quotes, putting a space between the elements, which avoids the need for a join call.
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf-8) /;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';' });
open my $fh, '<', 'test.csv';
while ( my $row = $csv->getline($fh) ) {
    print "@$row\n";
}
close $fh;
I'd like advice about Perl.
I have text files I want to process with Perl. Those text files are encoded in cp932, but for some reasons they may contain malformed characters.
My program is like:
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
while ( my $line = <$in> ) {
# my process comes here
print $line;
}
If workfile.txt includes malformed characters, Perl complains:
cp932 "\x81" does not map to Unicode at ./my_program.pl line 8, <$in> line 1234.
Perl knows if its input contains malformed characters. So I want to rewrite to see if my input is good or bad and act accordingly, say print all good lines (lines that do not contain malformed characters) to output filehandle A, and print lines that do contain malformed characters to output filehandle B.
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
use English;
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
open my $output_good, ">:encoding(utf8)", "good.txt";
open my $output_bad, ">:encoding(utf8)", "bad.txt";
select $output_good; # in most cases workfile.txt lines are good
while ( my $line = <$in> ) {
if ( $line contains malformed characters ) {
select $output_bad;
}
print "$INPUT_LINE_NUMBER: $line";
select $output_good;
}
My question is how I can write this "if ($line contains malformed characters)" part. How can I check whether the input is good or bad?
Thanks in advance.
#! /usr/bin/perl -w
use strict;
use utf8; # Source encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # STD* is UTF-8;
# UTF-8 is default encoding for open.
use Encode qw( decode );
open my $fh_in, "<:raw", "workfile.txt"
    or die $!;
open my $fh_good, ">", "good.txt"
    or die $!;
open my $fh_bad, ">:raw", "bad.txt"
    or die $!;
while ( my $line = <$fh_in> ) {
    my $decoded_line =
        eval { decode('cp932', $line, Encode::FB_CROAK|Encode::LEAVE_SRC) };
    if (defined($decoded_line)) {
        print($fh_good "$. $decoded_line");
    } else {
        print($fh_bad "$. $line");
    }
}
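The core of the program above is the decode call wrapped in eval. This is a minimal sketch of that check in isolation. The helper name `try_decode` and the sample byte strings are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef when the
# bytes are not valid in the given encoding.
sub try_decode {
    my ($encoding, $bytes) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # decode from modifying $bytes in place.
    return eval {
        decode($encoding, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
}

my $good = "\x82\xA0\n";    # HIRAGANA LETTER A in cp932
my $bad  = "\x81\n";        # lead byte followed by an invalid trail byte

print defined(try_decode('cp932', $good)) ? "good\n" : "bad\n";
print defined(try_decode('cp932', $bad))  ? "good\n" : "bad\n";
```

Reading the file with the :raw layer matters here: the bytes reach decode untouched, so a bad line can be written to bad.txt exactly as it was read.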
I have a file containing 1000 lines in the following format:
abc def ghi gkl
How can I write a Perl script to print only the first and the third fields?
abc ghi
perl -lane 'print "@F[0,2]"' file
If no answer is good for you yet, I'll try to get the bounty ;-)
#!/usr/bin/perl
# Lines beginning with a hash (#) denote optional comments,
# except the first line, which is required,
# see http://en.wikipedia.org/wiki/Shebang_(Unix)
use strict; # http://perldoc.perl.org/strict.html
use warnings; # http://perldoc.perl.org/warnings.html
# http://perldoc.perl.org/perlsyn.html#Compound-Statements
# http://perldoc.perl.org/functions/defined.html
# http://perldoc.perl.org/functions/my.html
# http://perldoc.perl.org/perldata.html
# http://perldoc.perl.org/perlop.html#I%2fO-Operators
while (defined(my $line = <>)) {
    # http://perldoc.perl.org/functions/split.html
    my @chunks = split ' ', $line;
    # http://perldoc.perl.org/functions/print.html
    # http://perldoc.perl.org/perlop.html#Quote-Like-Operators
    print "$chunks[0] $chunks[2]\n";
}
To run this script, given that its name is script.pl, invoke it as
perl script.pl FILE
where FILE is the file that you want to parse. See also http://perldoc.perl.org/perlrun.html. Good luck! ;-)
That's really kind of a waste for something as powerful as perl, since you can do the same thing in one trivial line of awk.
awk '{ print $1, $3 }'
while ( <> ) {
    my @fields = split;
    print "@fields[0,2]\n";
}
and just for variety, on Windows:
C:\Temp> perl -pale "$_=qq{@F[0,2]}"
and on Unix
$ perl -pale '$_="@F[0,2]"'
As perl one-liner:
perl -ane 'print "@F[0,2]\n"' file
Or as executable script:
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', 'file' or die "Can't open file: $!\n";
while (<$fh>) {
    my @fields = split;
    print "@fields[0,2]\n";
}
Execute the script like this:
perl script.pl
or
chmod 755 script.pl
./script.pl
I'm sure I shouldn't get the bounty since the question asks for the result to be given in perl, but anyway:
In bash/ksh/ash/etc:
cut -d " " -f 1,3 "file"
In Windows/DOS:
for /f "tokens=1-4 delims= " %i in (file) do (echo %i %k)
Advantages: as others said, there is no need to learn Perl or Awk, just a few common tools. The output of either command can be saved to disk with the ">" and ">>" operators.
while(<>){
    chomp;
    @s = split ;
    print "$s[0] $s[2]\n";
}
Please start going through the documentation as well.
#!/usr/bin/env perl
open my$F, "<", "file" or die;
print join(" ",(split)[0,2])."\n" while(<$F>);
close $F
One easy way is:
(split)[0,2]
Example:
$_ = 'abc def ghi gkl';
print( (split)[0,2] , "\n");
print( join(" ", (split)[0,2] ),"\n");
Command line:
perl -e '$_="abc def ghi gkl";print(join(" ",(split)[0,2]),"\n")'
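For use inside a larger program, the list-slice idiom can be wrapped in a small function. This is a sketch with a hypothetical helper name, `pick_fields`. It also guards against lines with fewer than three fields, which the one-liners above do not.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first and third whitespace-separated
# fields of a line, joined by a single space.
sub pick_fields {
    my ($line) = @_;
    # split ' ' skips leading whitespace and splits on runs of whitespace.
    my @fields = split ' ', $line;
    # Drop undefined slots so short lines do not trigger warnings.
    return join ' ', grep { defined } @fields[0, 2];
}

print pick_fields("abc def ghi gkl\n"), "\n";
```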
So I'm trying to read in a config file in Perl. The config file uses a trailing backslash to indicate a line continuation. For instance, the file might look like this:
=== somefile ===
foo=bar
x=this\
is\
a\
multiline statement.
I have code that reads in the file, and then processes the trailing backslash(es) to concatenate the lines. However, it looks like Perl already did it for me. For instance, the code:
open(fh, 'somefile');
@data = <fh>;
print join('', @data);
prints:
foo=bar
x=thisisamultiline statement
Lo and behold, the '@data = <fh>;' statement appears to have already handled the trailing backslash!
Is this defined behavior in Perl?
I have no idea what you are seeing, but that is not valid Perl code and that is not a behavior in Perl. Here is some Perl code that does what you want:
#!/usr/bin/perl
use strict;
use warnings;
while (my $line = <DATA>) {
    # collapse lines that end with \
    while ($line =~ s/\\\n//) {
        $line .= <DATA>;
    }
    print $line;
}
__DATA__
foo=bar
x=this\
is\
a\
multiline statement.
Note: If you are typing the file in on the commandline like this:
perl -ple 1 <<!
foo\
bar
baz
!
Then you are seeing the effect of your shell, not Perl. Consider the following counterexample:
printf 'foo\\\nbar\nbaz\n' | perl -ple 1
My ConfigReader::Simple module supports continuation lines in config files, and should handle your config if it's the format in your question.
If you want to see how to do it yourself, check out the source for that module. It's not a lot of code.
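For comparison, here is a minimal sketch of the do-it-yourself approach. It is not taken from the ConfigReader::Simple source. The function name `read_logical_lines` is hypothetical, and the sketch assumes that a backslash immediately before the newline marks a continuation.

```perl
use strict;
use warnings;

# Hypothetical helper: reads a filehandle and returns logical lines,
# joining any line that ends in a backslash with the line after it.
sub read_logical_lines {
    my ($fh) = @_;
    my @lines;
    while (defined(my $line = <$fh>)) {
        # Drop the backslash-newline pair and append the next physical line.
        while ($line =~ s/\\\n\z//) {
            my $next = <$fh>;
            last unless defined $next;    # dangling backslash at end of file
            $line .= $next;
        }
        push @lines, $line;
    }
    return @lines;
}

# Demonstrate with an in-memory filehandle instead of a real file.
my $config = "foo=bar\nx=this\\\nis\\\na\\\nmultiline statement.\n";
open my $fh, '<', \$config or die "Can't open in-memory file: $!";
print read_logical_lines($fh);
close $fh;
```

Returning a list of logical lines keeps the joining step separate from whatever key=value parsing comes next.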
I don't know what exactly you are doing, but the code you gave us doesn't even run:
=> cat z.pl
#!/usr/bin/perl
fh = open('somefile', 'r');
@data = <fh>;
print join('', @data);
=> perl z.pl
Can't modify constant item in scalar assignment at z.pl line 2, near ");"
Execution of z.pl aborted due to compilation errors.
And if I change the snippet to be actual perl:
=> cat z.pl
#!/usr/bin/perl
open my $fh, '<', 'somefile';
my @data = <$fh>;
print join('', @data);
it clearly doesn't mangle the data:
=> perl z.pl
foo=bar
x=this\
is\
a\
multiline statement.