I want to write a Perl script that captures binary data from a pipe, reads that binary data inside Perl, and processes the received content through a file handle.
I am able to receive the binary content from the pipe, but the problem is that the binary format is not preserved correctly when Perl reads the data: NUL characters are not preserved and appear to be converted to newlines in the Perl environment. Following is the command-line invocation and a sample:
>more D:\Sample_binary.zip| perl readpipe.pl D:\sample_output.txt
readpipe.pl
use Archive::Zip;

local $/;                       # slurp mode
my $lines = <STDIN>;            # read the binary data from the pipe
open my $IN, '+<', \$lines;     # open the content as an in-memory file handle
my $zip = Archive::Zip->new;
$zip->readFromFileHandle($IN);  # read the ZIP archive from the received binary data
When dealing with binary data, use binmode(STDIN). It prevents CRLF⇔LF conversions and disables any encoding layer (added by use open or the like).
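For instance, here is a minimal sketch of readpipe.pl with the binary layers applied; it keeps the Archive::Zip flow from the question and assumes nothing beyond it:

use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

binmode STDIN;                   # raw bytes: no CRLF translation, no encoding layer

local $/;                        # slurp the whole stream
my $data = <STDIN>;

open my $in, '<', \$data or die "Cannot open in-memory handle: $!";
binmode $in;                     # keep the in-memory handle binary as well

my $zip = Archive::Zip->new;
$zip->readFromFileHandle($in) == AZ_OK
    or die "Not a valid ZIP stream";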
NUL characters are not preserved and appear to be converted to newlines in the Perl environment.
No, Perl is not doing that. Perhaps more is? Use
perl readpipe.pl <D:\Sample_binary.zip
I have a folder of several hundred text files. Each file has the same format; for instance, the file named ATextFile1.txt reads
ATextFile1.txt 09 Oct 2013
1
2
3
4
...
I have a simplified Perl script that is supposed to read the file and print it back out in the terminal window:
#!/usr/bin/perl
use warnings;
use strict;
my $fileName = shift(@ARGV);
open(my $INFILE, "<:encoding(UTF-8)", $fileName) || die("Cannot open $fileName: $!.\n");
foreach (<$INFILE>) {
    print("$_");    # uses the newline character from the file
}
When I use this script on files generated by the Windows version of the program that generates ATextFile1.txt, my output is exactly as I'd expect (the content of the text file). However, when I run it on files generated by the Mac version of the file-generating program, the output looks like the following:
2016tFile1.txt 09 Oct 2013
After some testing, it seems that it is only printing the first line of the text where the first 4 characters are overwritten by what can be expressed in RegEx as /[0-9][0-9]16/. If in my Perl script, I replace the output statement with print("\t$_");, I get the following line printed to STDOUT:
2016 ATextFile1.txt 09 Oct 2013
Each of these files can be read normally using any standard text editor but for some reason, my Perl script can't seem to properly read and write from the file. Any help would be greatly appreciated (I'm hoping it's something obvious that I'm missing). Thanks in advance!
Note that if you are printing UTF-8 characters to STDOUT you will need to use
binmode STDOUT, ':encoding(utf8)';
beforehand.
It looks as if your Mac files have just CR as the line ending. I understood that recent versions of Macintosh systems use LF as the line ending (the same as Linux), but Mac OS 9 uses just CR, while Windows uses the two characters CR LF inside the file, which the PerlIO layer converts to just LF when perl is running on a Windows platform.
If there are no linefeeds in the file, then Perl will read the entire file as a single record, and printing it will overlay all lines on top of one another.
As long as the files are relatively small, the easiest way to read either file format with the same Perl code is to read the whole file and split it on either CR or LF. Anything else will need different code according to the source of the input files.
Try this version of your code.
use strict;
use warnings;
my @contents = do {
    open my $fh, '<:encoding(UTF-8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;
    my $contents = <$fh>;
    split /[\r\n]+/, $contents;
};

print "$_\n" for @contents;
Update
One alternative you might try is to use the PerlIO::eol module, which provides a PerlIO layer that translates any line ending to LF when the record is read. I'm not certain that it plays nice with UTF-8, but as long as you add it after the encoding layer it should be fine.
It is not a core module so you will probably need to install it, but after that the program becomes just
use strict;
use warnings;
open my $fh, '<:encoding(UTF-8):eol(LF)', $ARGV[0];
binmode STDOUT, ':encoding(utf8)';
print while <$fh>;
I have created Windows, Linux, and Mac-style text files and this program works fine with all of them, but I have been unable to check whether a UTF-8 character that has 0x0D or 0x0A as part of its encoding is passed through properly, so be careful.
Update 2
After thinking briefly about this, of course there are no UTF-8 encodings that contain CR or LF apart from those characters themselves. All characters outside the ASCII range contain only bytes with the top bit set, so they are over 0x80 and can never be 0x0D or 0x0A.
I am scraping a German-language site and trying to store the content of the site in a CSV file using Perl, but I am getting garbage values in the CSV. The code I use is:
open my $fh, '>> :encoding(UTF-8)', 'output.csv';
print {$fh} qq|"$title"\n|;
close $fh;
For example, I expect Weiß, Römersandalen, but I get double-encoded garbage such as WeiÃŸ, RÃ¶mersandalen instead.
Update :
Code
use strict;
use warnings;
use utf8;
use WWW::Mechanize::Firefox;
use autodie qw(:all);
my $m = WWW::Mechanize::Firefox->new();
print "\n\n *******Program Begins********\n\n";
my $url = '...';    # actual product URL omitted in the question
$m->get($url) or die "unable to get $url";
my $Home_Con=$m->content;
my $title='';
if ($Home_Con =~ m/<span id="btAsinTitle">([^<]*?)<\/span>/is) {
    $title = $1;
    print "title ::$1\n";
}
open my $fh, '>> :encoding(UTF-8)', 's.txt'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
open $fh, '>> :encoding(UTF-8)', 's1.csv'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
print "\n\n *******Program ends********";
<>;    # wait for Enter before the console window closes
This is the relevant part of the code. The method works fine for text files, but not for the CSV.
You've shown us the code where you're encoding the data correctly as you write it to the file.
What we also need to see is how the data gets into your program. Are you decoding it correctly at that point?
Update:
If the code was really just my $title='Weiß ,Römersandalen' as you say in the comments, then the solution would be as simple as adding use utf8 to your code.
The point is that Perl needs to know how to interpret the stream of bytes that it's dealing with. Outside your program, data exists as bytes in various encodings. You need to decode that data as it enters your program (decoding turns a stream of bytes into a string of characters) and encode it again as it leaves your program. You're doing the encoding step correctly, but not the decoding step.
The reason that use utf8 fixes that in the simple example you've given is that use utf8 tells Perl that your source code should be interpreted as a stream of bytes encoded as utf8. It then converts that stream of bytes into a string of characters containing the correct characters for 'Weiß ,Römersandalen'. It can then successfully encode those characters into bytes representing those characters encoded as utf8 as they are written to the file.
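As a minimal sketch of that simple case (the hard-coded title is the one quoted from the comments; nothing else is assumed):

use strict;
use warnings;
use utf8;    # the source file itself is saved as UTF-8 encoded bytes

# Because of `use utf8`, this literal becomes a string of characters, not raw bytes.
my $title = 'Weiß ,Römersandalen';

# The encoding layer turns those characters back into UTF-8 bytes on the way out.
open my $fh, '>>:encoding(UTF-8)', 'output.csv' or die "Cannot open output.csv: $!";
print {$fh} qq|"$title"\n|;
close $fh;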
Your data is actually coming from a web page. I assume you're using LWP::Simple or something like that. That data might be encoded as utf8 (I doubt it, given the problems you're having) but it might also be encoded as ISO-8859-1 or ISO-8859-9 or CP1252 or any number of other encodings. Unless you know what the encoding is and correctly decode the incoming data, you will see the results that you are getting.
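For illustration, here is a hedged sketch with plain LWP::UserAgent rather than the WWW::Mechanize::Firefox setup from the update (the URL is hypothetical); decoded_content performs the decoding step using the charset the server reports:

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://www.example.com/some-german-product-page');   # hypothetical URL
die $res->status_line unless $res->is_success;

# decoded_content() inspects the Content-Type charset (and, for HTML, meta tags)
# and returns a string of Perl characters rather than raw bytes.
my $html = $res->decoded_content;

# A title extracted from $html is now a character string, so printing it through
# a '>>:encoding(UTF-8)' handle produces correct UTF-8 output.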
Check whether there are any stray characters (such as a byte-order mark) at the start of the file, or anywhere else in it, using commands like head, tail, or hexdump.
I'm entering large amounts of data into a PostgreSQL database using Perl and the Perl DBI, and I have been getting errors because my file is improperly encoded. I have the PostgreSQL encoding set to 'utf8', and the Debian 'file' command reports that my file contains "Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators". When I run my program, the DBI fails with an "invalid byte sequence" error. I have already added a line to my Perl program to substitute away the '\r' carriage returns, but how can I convert my files to 'utf8', or get PostgreSQL to accept my file's encoding? Thanks.
When you connect to PostgreSQL using DBI->connect(..., { pg_enable_utf8 => 1}) then the data used in all modifying DBI calls (SQL INSERT, UPDATE, DELETE, everywhere you use placeholders in queries etc) has to be encoded in Perl's internal encoding so that DBI itself can convert to the wire protocol correctly.
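A minimal sketch of such a connection (the DSN, credentials, and attributes other than pg_enable_utf8 are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'dbi:Pg:dbname=mydb;host=localhost',    # placeholder DSN
    'username', 'password',
    { pg_enable_utf8 => 1, RaiseError => 1, AutoCommit => 1 },
) or die $DBI::errstr;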
There are tons of ways how you can achieve that, and they all depend on how you read the file in the first place. The most basic one is if you use open (or one of the methods based directly on it like IO::File->open). You can then use Perl's I/O layers (see the open link above) and let Perl do that for you. Assuming your file is encoded in UTF-8 already you'll get away with:
open(my $fh, "<:encoding(UTF-8)", "filename");
while (my $line = <$fh>) {
# process query
}
This is basically equivalent to opening the file without an encoding layer and converting manually using Encode::decode, e.g. like this:
open(my $fh, "<", "filename");
while (my $line = <$fh>) {
$line = Encode::decode('UTF-8', $line);
# process query
}
A lot of other modules that receive data from external sources and return it (think of HTTP downloads with LWP, for example) return values that have already been converted into Perl's internal encoding.
So what you have to do is (a combined sketch follows the list):
1. Figure out which encoding your file actually uses (try using iconv on the shell for that)
2. Tell DBI to enable UTF-8
3. Open the file with the correct encoding
4. Read line(s), process the query, repeat
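Here is a hedged sketch tying those steps together; the table and column names, the DSN, and the assumption that the file is CP1252-encoded are all made up for illustration:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'dbi:Pg:dbname=mydb', 'username', 'password',
    { pg_enable_utf8 => 1, RaiseError => 1, AutoCommit => 1 },
);

# Open the file with the encoding it actually uses (assumed CP1252 here).
open my $fh, '<:encoding(cp1252)', 'import.txt' or die "Cannot open import.txt: $!";

my $sth = $dbh->prepare('INSERT INTO items (description) VALUES (?)');

while (my $line = <$fh>) {
    $line =~ s/\r?\n\z//;      # strip the CRLF line terminator
    # The placeholder value is a decoded character string, so DBI/DBD::Pg
    # can encode it correctly for the UTF-8 database.
    $sth->execute($line);
}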
I'm reading a text file via CGI in Perl, and I notice that when the file is saved in Mac's TextEdit the line separators are recognized, but when I upload a CSV that is exported straight from Excel, they are not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator to be, if I didn't want the one Perl looks for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
local $/ = undef;
open F, $YOUR_FILE or die;
@lines = split /\R/, <F>;
close F;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc:
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and exhaustive explanation of \R in Brian D Foy's article, The \R generic line ending, which even has a couple of fun videos.
I already know how to convert the non-UTF-8-encoded content of a file to UTF-8 line by line, using something like the following code:
# outfile.txt is in GB-2312 encoding
use Encode;

open my $filter, '<', 'c:/outfile.txt';
while (<$filter>) {
    # convert each line of outfile.txt to UTF-8 encoding
    $_ = Encode::decode("gb2312", $_);
    ...
}
But I think Perl can directly encode the whole input file to UTF-8 format, so I've tried something like
#outfile.txt is in GB-2312 encode
open my $filter,"<:utf8",'c:/outfile.txt';
(Perl says something like "utf8 "\xD4" does not map to Unicode" )
and
open my $filter,"<",'c:/outfile.txt';
$filter = Encode::decode("gb2312", $filter);
(Perl says "readline() on unopened filehandle!")
They don't work. But is there some way to directly convert the input file to UTF-8 encode?
Update:
Looks like things are not as simple as I thought. I can now convert the input file to UTF-8 in a roundabout way: I first open the input file, then encode its content to UTF-8, then output that to a new file, and then open the new file for further processing. This is the code:
open my $filter, '<:encoding(gb2312)', 'c:/outfile.txt';
open my $filter_new, '+>:utf8', 'c:/outfile_new.txt';
print $filter_new $_ while <$filter>;
seek $filter_new, 0, 0;    # rewind so the new file can be read back
while (<$filter_new>) {
    ...
}
But this is too much work, and it is even more troublesome than simply converting the content of $filter line by line.
I think I misunderstood your question. I think what you want to do is read a file in a non-UTF-8 encoding, then play with the data as UTF-8 in your program. That's something much easier. After you read the data with the right encoding, Perl represents it internally as UTF-8. So, just do what you have to do.
When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
old answer
The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.
You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
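As a hedged sketch of the from_to() approach (the file names are reused from the question; GB-2312 never uses 0x0A inside a multi-byte character, so line-by-line reading of the raw bytes is safe):

use strict;
use warnings;
use Encode qw(from_to);

open my $in,  '<:raw', 'c:/outfile.txt'     or die "Cannot open input: $!";
open my $out, '>:raw', 'c:/outfile_new.txt' or die "Cannot open output: $!";

while (my $line = <$in>) {
    from_to($line, 'gb2312', 'UTF-8');    # converts the octets in place
    print {$out} $line;
}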
If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.
The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.
But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)
To do that, all you need is to apply an additional layer to your open:
open my $foo, "<:encoding(gb2312):bytes", ...;
Note that the output of the following will be the same:
perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'
but in one case, perl knows that data read is utf8 (and so length($bar) will report the number of utf8 characters) and has to be explicitly told (by -CO) that STDOUT will accept utf8, and in the other, perl makes no assumptions about the data (and so length($bar) will report the number of bytes), and just prints it out as is.