Setting the BOM for a UTF-8 file in Perl

This question is similar to others that have been posted before; however, I have tried every combination and nothing works.
I need my Excel file to be read as Unicode (UTF-8), so I am attempting to set the BOM:
my $csv = Text::CSV->new ({binary=>1, eol =>$/})
or die "cannot use CSV: ".Text::CSV->error_diag ();
open my $csvFile, ">:encoding(UTF-8)", "teht.csv" or die "teht.csv: $!";
print($csvFile "\x{FEBBBF}");
However, this raises an error saying that "0xFEBBBF is not Unicode..."
All the information I have found indicates that the code for UTF-8 should read
print($csvFile "\N{U+FEBBBF}") or ... "\xFE\xBB\xBF" or similar.
"Is it possible to force Excel recognize UTF-8 CSV files automatically?" is one source that says this many times.
https://stackoverflow.com/a/22711105/6557829 is another source.
So far I have actually been able to get UTF-16 to work with the same print statement, print($csvFile "\N{U+FEFF}");, but that takes more space than I want to use.
Thanks in advance for any help you can give me.

The BOM is U+FEFF, not U+FEBBBF. Replace
"\x{FEBBBF}"
with any one of the following:
chr(0xFEFF)
"\x{FEFF}"
"\N{U+FEFF}"
"\N{BOM}"
This will create a string with a single character (FEFF), which print will encode using UTF-8 as requested (EF BB BF).
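To see the encoding step in isolation, here is a minimal sketch that writes the BOM through a UTF-8 layer into an in-memory buffer instead of a real file. The buffer and the sample CSV row are assumptions made up for the illustration; only the print of "\x{FEFF}" comes from the answer above.

```perl
use strict;
use warnings;

# Write through a UTF-8 encoding layer into an in-memory buffer
# (a stand-in for teht.csv, so nothing touches the disk).
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer
    or die "in-memory open failed: $!";
print {$fh} "\x{FEFF}";        # one character: U+FEFF
print {$fh} "name,city\n";     # hypothetical CSV row
close $fh;

# Show the first three bytes that the layer produced for the BOM.
my $bom_hex = join ' ',
    map { sprintf('%02X', ord($_)) } split //, substr($buffer, 0, 3);
print "$bom_hex\n";            # prints "EF BB BF"
```

The single character U+FEFF becomes the three bytes EF BB BF only because the handle has the :encoding(UTF-8) layer; the string itself never contains those bytes.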

Related

How to fix UTF-8 encoding error with Russian words

My Perl script reads from a text file which contains mainly English ANSI words.
But sometimes there are Russian words, which I cannot convert back to UTF-8.
See some examples (the words in brackets are the English translations):
Êîìïîíåíò (Component)
Àâòîð (Author)
Ãýíäàëüô (Gandalf)
Äàòà ñîçäàíèÿ (Create date): 20-ìàé(may)-2003
Äàòà êîððåêöèè (Last correction Date): 25-ìàð(mar)-2003
Âåðñèÿ (Version): 0.92
Áëàãîäàðíîñòè (Thanks):
Íîâîå â (New in):
Ïîääåðæêà (Support)
Î÷åíü ìíîãî (Very much)
I've read the UTF-8 Encoding Debugging Chart and also tried the following:
$s='Àâòîð';
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
my $s = Encode::decode( 'iso-8859-5', 'Àâòîð' );
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
I've also tried cp1252 instead of iso-8859-5, but nothing helps.
I've also tried Encode::Guess, but the result is not helpful: iso-8859-5 or cp1251 or koi8-r or iso-8859-1.
Any idea how to convert 'Àâòîð' back to the Cyrillic text 'автор'?
After some tries, I get the expected output Автор when switching the (Windows) console code page to 65001 (UTF-8) and decoding the input data from Windows-1251:
perl -MEncode -wle "print encode('UTF-8',decode('Windows-1251',shift))" "Àâòîð"
This suggests that the input data is encoded as Windows-1251 and that decoding from that should give you the Cyrillic letters in Unicode. To output the data to a file, make sure you either set the encoding when opening the file (easiest) or encode each string to the target encoding on output (hard to keep track of):
use Encode qw(decode);
my $octets = <$input_file>;
my $data = decode('Windows-1251', $octets);   # bytes -> characters
open my $fh, '>:encoding(UTF-8)', $filename
or die "Couldn't write to $filename: $!";
print $fh $data;   # already decoded; the layer encodes it as UTF-8
Your byte sequence is 0xc0 0xe2 0xf2 0xee 0xf0. This is the Russian word for 'author' in cp1251. The representation you posted is what you get when an application assumes the bytes are cp1252. The question now is which code page you want to end up with, or which code page your application needs.
To read a cp1251 file correctly, use a construction like this:
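The claim about the byte sequence can be checked with a short sketch. It uses only the five bytes quoted above and the core Encode module; the variable names are made up for the illustration, and the last step (re-encoding the garbled text as cp1252 to get the original bytes back) is one possible repair, assuming the text was garbled by exactly that misreading.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes from the input file.
my $octets = "\xC0\xE2\xF2\xEE\xF0";

# The same bytes read under two different code pages.
my $as_cp1251 = decode('cp1251', $octets);   # the Cyrillic word
my $as_cp1252 = decode('cp1252', $octets);   # the garbled Latin letters

# Undo the wrong guess: back to bytes via cp1252, then decode as cp1251.
my $repaired = decode('cp1251', encode('cp1252', $as_cp1252));

binmode STDOUT, ':encoding(UTF-8)';
print "cp1251: $as_cp1251\n";
print "cp1252: $as_cp1252\n";
print "repair ", ($repaired eq $as_cp1251 ? "ok" : "failed"), "\n";
```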
open (my $tmp_h,"<:encoding(cp-1251)", $ARGV[0]) or die $!;
This tells Perl which code page the file uses. When you then read from the file into a string, Perl converts the data from cp1251 into its internal character form (UTF-8 based), and you can use those strings as you like without problems.
For strings in the internal form Perl sets the UTF8 flag, which you can check with the Devel::Peek module.
I think that working in the internal form also lets every string operation behave correctly and helps avoid mistakes.
I would recommend using the "use utf8" pragma in your source code. With it, all literals in the source code are treated as UTF-8 and automatically converted into the internal form correctly. That way we know that our source code is in UTF-8 (ideally saved with a BOM, because detecting a BOM is usually the first thing IDEs and editors do). Later, we can open other files in any encoding using the "<:encoding(....)" construction, or get data from the web or from databases, and again make sure the data was converted into the internal form correctly by checking the UTF8 flag. With all this in place, we can work with all of the data in one manner: compare strings correctly, use regular expressions, and so on.
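A small sketch of what the pragma changes for a literal, assuming the script file itself is saved as UTF-8. The two variable names are made up for the illustration; the pragma is lexically scoped, so each do block sees its own setting.

```perl
use strict;
use warnings;

# With the pragma, the literal is a string of 5 characters.
my $with_pragma = do { use utf8; length 'Автор' };

# Without it, Perl sees the raw UTF-8 bytes of the source file:
# two bytes per Cyrillic letter.
my $without_pragma = do { no utf8; length 'Автор' };

print "with use utf8:    $with_pragma\n";
print "without use utf8: $without_pragma\n";
```

Only the first form compares and matches correctly against text that was decoded while reading a file.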

making sure that my handling of utf8 is correct

I am using Perl for a module that involves processing a lot of Unicode documents. I started getting nervous because I'm not opening and closing files with the utf8 layers like open (OUT, '>:utf8', $textfile). However, I have been thoroughly testing and the output was still as expected. So I want to better understand why.
In a nutshell, my Perl module passes a document to an external service and gets a response. The response will be in Utf8. It uses LWP::UserAgent for this. When it gets the response it just writes it to a file:
my $fh;
open($fh, '>', $outputpath) or die "Could not open file '$outputpath' $!";
print $fh $response->content;
close $fh;
I have diffed these files against Unicode files representing the "expected" output and it is fine. And yet, you can see in my open command that I was not using the utf8 layer. So why is that?
What if I just returned $response->content to some other process, instead of printing it? Would it still be proper Unicode then?
I also have a separate process that I would like to ask about, very similar question. In this case I am trying to build a new service which replaces an old one. The old one read from a file like open(my $fh, '<:utf8', $inputfile) and wrote to a new file like open(my $fh, '>:utf8', $outputfile). The new service will still read the same way, but will not write to the output file anymore. It will send the string to another server using HTTP, and on that server it will be printed to a file using open(my $fh, '>', $outputfile) so no utf8 layer. I can't change that code immediately.
I want the file contents to be the exact same as they would otherwise have been (none of the other processing rules are changing). Should I be nervous about losing the layer?
I think maybe it would help if I understood better what these layers are doing.
There is no "handling of utf8" in the main question and that in itself isn't right.
The whole thing works, because the server is sending utf8 as you say, in the following way.
The content method used on $response is from HTTP::Message:
The content() method sets the raw content if an argument is given. If no argument is given the content is not touched. In either case the original raw content is returned.
Since you don't specify layers† in open the default is used, likely :unix:perlio for Unix, with no encoding (see PerlIO). So you are dumping the original bytes to the disk, unchanged.
Looking further down the page, at decoded_content( %options ), we see the default
default_charset
This override the default charset guessed by content_charset() or if that fails "ISO-8859-1".
You can establish what you are getting by printing it:
say 'Content type: ', $response->content_charset;
where you should get Content type: UTF-8. But when you receive a different encoding from the server then that will wind up in the file and any code that expects utf8 will break.
One should always decode all input and encode all output. Then we know exactly what is going on. As input is decoded the program carries on with character strings (not bytes in whatever encoding was sent). In the end encode suitably for output. This Effective Perler article should be useful. Here you'd use decoded_content and write files opened with :encoding(UTF-8).
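The difference between the two approaches can be sketched without a network call. Here $raw is an assumed stand-in for what $response->content would return (the UTF-8 bytes of a sample word), and in-memory buffers stand in for the output files; none of these names come from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for $response->content: raw UTF-8 bytes (assumed sample data).
my $raw = "Wei\xC3\x9F";

# What the question does: no layer, so the bytes pass through unchanged.
my $raw_copy = '';
open my $plain, '>', \$raw_copy or die "in-memory open failed: $!";
print {$plain} $raw;
close $plain;

# Decode on input, encode on output through a layer.
my $text    = decode('UTF-8', $raw);
my $encoded = '';
open my $layered, '>:encoding(UTF-8)', \$encoded
    or die "in-memory open failed: $!";
print {$layered} $text;
close $layered;

printf "bytes: %d, characters: %d\n", length($raw), length($text);
print "same bytes written either way\n"
    if $raw_copy eq $raw && $encoded eq $raw;
```

For UTF-8 input both routes write identical bytes, which is why the diff in the question came out clean. Only the decoded string behaves as text inside the program, and only the second route stays correct if the server ever sends another charset and it is decoded accordingly.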
With use open ":std", ":encoding(UTF-8)"; the standard streams, and files opened in the lexical scope of this pragma without explicit layers, will be handled as utf8. (This can be overridden for other specific uses, say by specifying layers in the three-argument open.)
See open pragma.
As for the other question, you need to properly encode what you intend to "send to another server." How to do that depends on how you are "sending" it.
†   With PerlIO the I/O "layers" can be set so that encoding of input and output is done as needed behind the scenes, as data is read or written. The work is done by Encode. For a nice explanation of the process see Encode::PerlIO.
Also see perlunitut, perlunifaq, and perluniintro.

I am not getting Kannada text when I run the Perl script on a file

I have the following code for extracting the text from HTML files and writing it to a text file. The HTML contains Kannada text (UTF-8). When the program runs I get a text file with text in it, but it is not in the proper format: the text is unreadable.
use utf8;
use HTML::FormatText;
my $string = HTML::FormatText->format_file(
'a.html',
leftmargin => 0, rightmargin => 50
);
open mm,">t1.txt";
print mm "$string";
So please help me: how should I handle the file formats while processing the files?
If I understand you correctly, you want the output file to be UTF-8 encoded so that the characters from the Kannada language are encoded in the output correctly. Your code is probably trying (and failing) to encode incorrectly into ISO-8859-1 instead.
If so, then what you can do is make sure your file is opened with a UTF-8 encoding filter.
use HTML::FormatText;
open my $htmlfh, '<:encoding(UTF-8)', 'a.html' or die "cannot open a.html: $!";
my $content = do { local $/; <$htmlfh> }; # read all content from file
close $htmlfh;
my $string = HTML::FormatText->format_string(
$content,
leftmargin => 0, rightmargin => 50
);
open my $mm, '>:encoding(UTF-8)', 't1.txt' or die "cannot open t1.txt: $!";
print $mm $string;
For further reading, I recommend checking out these docs:
perlunitut
perlunifaq
perlunicode
A few other notes:
The use utf8 line only means that your Perl script/library itself may contain UTF-8 encoded text. It does not change how you read or write files.
Avoid using two-argument forms of open() like in your example. It may allow a malicious user to compromise your system in certain cases. (Though your usage in this example happens to be safe.)
When opening a file, you need to add an or die afterwards or failures to read or write the file will be silently ignored.
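As a rough illustration of the last two notes, the sketch below uses the three-argument open with an explicit layer and an or die on every call. The temporary directory and the sample word (written as escapes for the five code points of the word "Kannada" in Kannada script) are assumptions for the example, not part of the original program.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch location so the example does not touch real files.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/t1.txt";

# Five Kannada code points (assumed sample text).
my $sample = "\x{0C95}\x{0CA8}\x{0CCD}\x{0CA8}\x{0CA1}";

# Three-argument open: mode and layer are separate from the file name.
open my $out, '>:encoding(UTF-8)', $path or die "cannot open $path: $!";
print {$out} "$sample\n";
close $out or die "cannot close $path: $!";

# Read it back through the same layer.
open my $in, '<:encoding(UTF-8)', $path or die "cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

print "characters read back: ", length($line), "\n";
```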
Update 3/12: I changed it to read the file in UTF-8 and send that to HTML::FormatText. If your a.html file is saved with a BOM character at the start, it may have done the right thing anyway, but this should make it always assume UTF-8 for the incoming file.

Remove or completely suppress null character \0

I have a script, MM.pl, which is the “workhorse”, and a simple “patchfile” that it reads from. In this case, the patch file is targeting an .ini file for search and replace. Simple enough. It took me 5 days to realize the ini is encoded with null (\0) characters between each letter. Since then, I have tried every option I could find both in code snippets, use:: functions, and regular expressions.
The only reason I found it was that I used use Data::Printer; to dump several values. In Notepad++, the ini appears to be encoded as UCS-2 LE. It is important that MM.pl handles the task instead of asking the user to “fix” the issue.
Update: This may provide a clue: \xFF\xFE are the first 2 characters in the ini file. They appear after processing. The swap is not actually changing anything else like it is supposed to, but "reveals" 2 hidden characters.
As you noticed, those nulls aren't just junk to be stripped; they're part of the file's character encoding. So decode it:
open my $fh, '<:encoding(UTF-16LE)', 'file.ini' or die "file.ini: $!";   # little-endian; in Encode, plain "UCS-2" means big-endian
Write it back out the same way once you're done.
When you read the file, set the encoding:
my $fh = IO::File->new( "something.ini", "<" ) or die "something.ini: $!";
binmode( $fh, ":encoding(UTF-16LE)" );
And when you output, you can write back in whichever encoding you like, e.g.
my $out = IO::File->new( "something-new.ini", ">" ) or die "something-new.ini: $!";
binmode( $out, ":encoding(UTF-8)" );
Or even if you're dumping to the terminal
binmode( STDOUT, ":encoding(UTF-8)" );
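To make the decode, patch, encode cycle concrete, here is a sketch that works on an in-memory byte string instead of the real ini file. The key name and values are hypothetical; the only facts taken from the question are the \xFF\xFE prefix and the NUL after every letter.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical ini content as it sits on disk: BOM FF FE, then UTF-16LE.
my $octets = "\xFF\xFE" . encode('UTF-16LE', "Key=old\r\n");

# "UTF-16" without a suffix reads the BOM to pick the byte order
# and removes it from the decoded text.
my $text = decode('UTF-16', $octets);

# The search and replace now works on ordinary characters.
(my $patched = $text) =~ s/^Key=old/Key=new/m;

# Write back in the original form: BOM plus little-endian code units.
my $out_octets = "\xFF\xFE" . encode('UTF-16LE', $patched);

printf "NULs in raw bytes:    %d\n", ($octets =~ tr/\0//);
printf "NULs in decoded text: %d\n", ($text   =~ tr/\0//);
```

The NULs are the high bytes of each 16-bit code unit, so they vanish on decoding and come back on encoding without any manual stripping.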
To be honest, this really is not a solution but a cop-out. After 4 weeks of trying and retrying methods, and reading and reading and reading, I have put it in park and switched to Python to build the app. Several references in the perldocs mention that UTF-16 is "problematic" and also mention situations in which it is treated differently.

Print other language character in csv using perl file handling

I am scraping a German-language site and trying to store its content in a CSV file using Perl, but I am getting garbage values in the CSV. The code I use is:
open my $fh, '>> :encoding(UTF-8)', 'output.csv';
print {$fh} qq|"$title"\n|;
close $fh;
For example :I expect Weiß ,Römersandalen , but i get Weiß, Römersandalen
Update :
Code
use strict;
use warnings;
use utf8;
use WWW::Mechanize::Firefox;
use autodie qw(:all);
my $m = WWW::Mechanize::Firefox->new();
print "\n\n *******Program Begins********\n\n";
$m->get($url) or die "unable to get $url";
my $Home_Con=$m->content;
my $title='';
if($Home_Con=~m/<span id="btAsinTitle">([^<]*?)<\/span>/is){
$title=$1;
print "title ::$1\n";
}
open my $fh, '>> :encoding(UTF-8)', 's.txt'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
open $fh, '>> :encoding(UTF-8)', 's1.csv'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
print "\n\n *******Program ends********";
<>;
This is part of the code. The method works fine for text files, but not for CSV.
You've shown us the code where you're encoding the data correctly as you write it to the file.
What we also need to see is how the data gets into your program. Are you decoding it correctly at that point?
Update:
If the code was really just my $title='Weiß ,Römersandalen' as you say in the comments, then the solution would be as simple as adding use utf8 to your code.
The point is that Perl needs to know how to interpret the stream of bytes that it's dealing with. Outside your program, data exists as bytes in various encodings. You need to decode that data as it enters your program (decoding turns a stream of bytes into a string of characters) and encode it again as it leaves your program. You're doing the encoding step correctly, but not the decoding step.
The reason that use utf8 fixes that in the simple example you've given is that use utf8 tells Perl that your source code should be interpreted as a stream of bytes encoded as utf8. It then converts that stream of bytes into a string of characters containing the correct characters for 'Weiß ,Römersandalen'. It can then successfully encode those characters into bytes representing those characters encoded as utf8 as they are written to the file.
Your data is actually coming from a web page. I assume you're using LWP::Simple or something like that. That data might be encoded as utf8 (I doubt it, given the problems you're having) but it might also be encoded as ISO-8859-1 or ISO-8859-9 or CP1252 or any number of other encodings. Unless you know what the encoding is and correctly decode the incoming data, you will see the results that you are getting.
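The effect of skipping the decoding step can be reproduced in a few lines. This is a sketch under the assumption that the page arrived as UTF-8 bytes; $raw and the write_utf8 helper are made up for the illustration, and an in-memory buffer stands in for the CSV file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: write a string through a UTF-8 layer and
# return the bytes that would land in the file.
sub write_utf8 {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "in-memory open failed: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

sub hex_dump { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

# Undecoded input: the UTF-8 bytes of "Weiß" (assumed sample data).
my $raw = "Wei\xC3\x9F";

my $double_encoded = write_utf8($raw);                   # bytes treated as characters
my $correct        = write_utf8(decode('UTF-8', $raw));  # decoded first

print "without decoding: ", hex_dump($double_encoded), "\n";
print "with decoding:    ", hex_dump($correct), "\n";
```

Without decoding, each input byte is treated as a separate character and encoded again, so the two bytes of ß turn into four; that is the garbage that shows up when the file is opened.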
Check whether there are any weird characters at the start or anywhere else in the file, using commands like head or tail.