“ characters are shown in the CSV - perl

I am parsing a site and writing the content to a CSV file using Perl, but I see junk values like †,“ in the content of the CSV.
use utf8;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, eol => "\n" } )   # should set binary attribute.
    or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, ">>:encoding(utf8)", "Test.csv" or die "Test.csv: $!";
$csv->print($fh, [$title, $content]);
$csv->eol();
The site is encoded as UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
How can I solve this issue?
Update:
@ikegami: Thanks, the output of your code gives the same characters:
\x{201c}HexTab\x{201d}
Update 2:
Thanks
If I use ">>:encoding(cp1252)", it solves the quote-character issue, but it throws some warnings:
"\x{03bc}" does not map to cp1252 at c:/Perl/lib/IO/Handle.pm line 417
"\x{ff1c}" does not map to cp1252 at c:/Perl/lib/IO/Handle.pm line 417

I take it you expect to see the following:
“HexTab”
And you see the following instead:
“HexTab�
You're saving the file as UTF-8, but the program reading the file is decoding it using cp1252. These two have to match!
Two options:
Encode the text using cp1252 (:encoding(cp1252)) if the reader is going to continue decoding it using cp1252.
Have the reader decode the file using UTF-8 if you're going to encode it as UTF-8 (:encoding(UTF-8)).
Generally speaking, the latter is the better option as it allows the file to contain any Unicode character rather than an abysmally small subset.
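For the second option, here is a minimal sketch of a reader that decodes the file the same way the writer above encodes it (assuming the same Test.csv):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

# Decode with the same encoding that was used when writing the file.
open my $fh, "<:encoding(UTF-8)", "Test.csv" or die "Test.csv: $!";
while (my $row = $csv->getline($fh)) {
    my ($title, $content) = @$row;
    # $title and $content are now decoded Perl strings.
}
close $fh;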

There is a program called iconv on most Unix systems that can re-encode files from one encoding to another. You need to determine the original encoding of your file.
You would run iconv as:
$ iconv -f cp1252 -t utf8 $file_name.csv > $new_file_name.csv
This would take a file written in Windows using the default Code Page 1252 and convert it into UTF-8 encoding. I would first try cp1252 and see if that works. If not, try cp1250, latin1, and macintosh (it could have been a file created with MacRoman).
See if iconv can get rid of the issue.
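If you would rather stay in Perl instead of shelling out to iconv, a rough equivalent using PerlIO layers looks like this (the file names are placeholders, and the source encoding is an assumption you would have to confirm):
use strict;
use warnings;

# Read bytes as cp1252, write them back out as UTF-8.
open my $in,  "<:encoding(cp1252)", "input.csv"  or die "input.csv: $!";
open my $out, ">:encoding(UTF-8)",  "output.csv" or die "output.csv: $!";
print {$out} $_ while <$in>;
close $in;
close $out;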

Related

How do I find a 4 digit unicode character using this perl one liner?

I have a file with this Unicode character: ỗ
The file is saved in Notepad as UTF-8.
I tried this line:
C:\blah>perl -wln -e "/\x{1ed7}/ and print;" blah.txt
But it's not picking it up. If the file has a letter like 'a' (Unicode hex 61), then \x{61} picks it up. But for a 4-digit Unicode character, I have an issue picking up the character.
You had the right idea with using /\x{1ed7}/. The problem is that your regex wants to match characters but you're giving it bytes. You'll need to tell Perl to decode the bytes from UTF-8 when it reads them and then encode them to UTF-8 when it writes them:
perl -CiO -ne "/\x{1ed7}/ and print" blah.txt
The -C option controls how Unicode semantics are applied to input and output filehandles. So for example -CO (capital 'o' for 'output') is equivalent to adding this before the start of your script:
binmode(STDOUT, ":utf8")
Similarly, -CI is equivalent to:
binmode(STDIN, ":utf8")
But in your case, you're not using STDIN. Instead, the -n is wrapping a loop around your code that opens each file listed on the command-line. So you can instead use -Ci to add the ':utf8' I/O layer to each file Perl opens for input. You can combine the -Ci and the -CO as: -CiO
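Spelled out as a full script rather than a one-liner, the same idea looks roughly like this (a sketch; the file name is the one from the question):
use strict;
use warnings;

# Equivalent of -CiO: decode input from UTF-8, encode output as UTF-8.
binmode(STDOUT, ":encoding(UTF-8)");
open my $fh, "<:encoding(UTF-8)", "blah.txt" or die "blah.txt: $!";
while (my $line = <$fh>) {
    print $line if $line =~ /\x{1ed7}/;
}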
Your script works fine. The problem is the Unicode value you're using for searching. Since your file is UTF-8, your search needs to target the byte sequence E1 BB 97. Check the file encodings below and how they change the search criteria.
UTF-8 Encoding: 0xE1 0xBB 0x97
UTF-16 Encoding: 0x1ED7
UTF-32 Encoding: 0x00001ED7
Resource https://www.compart.com/en/unicode/U+1ED7
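If you do want to search the raw, undecoded bytes instead of decoding the input, something along these lines should also match (same file as above):
perl -ne "print if /\xE1\xBB\x97/" blah.txt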

Perl - copy to clipboard in cp1251

I'm trying to copy text in cp1251 to the clipboard.
#!/usr/bin/perl -w
use Clipboard;
use Encode;
my $ClipboardOut = "A bunch of cyrillic characters - а-б-в-г \n";
Encode::from_to($ClipboardOut, 'utf-8', 'cp1251');
Clipboard->copy($ClipboardOut);
Instead of Cyrillic letters, "?" characters are pasted into any Windows app. If I remove the line with Encode, the Cyrillic letters produce "a"s with different modifiers:
A bunch of cyrillic characters à-á-â-ã
I guess I'm missing something extra simple, but I'm stuck on it. Can somebody help me, please?
In Windows, Clipboard expects text encoded using the system's Active Code Page. That's because Clipboard is just a wrapper for Win32::Clipboard. And while Win32::Clipboard allows you to receive arbitrary Unicode text from the clipboard, it doesn't allow you to place arbitrary Unicode text on the clipboard. So using that module directly doesn't help.
This is limiting. For example, my machine's ACP is cp1252, so I wouldn't be able to place Cyrillic characters on the clipboard using this module.
Assuming your system's ACP supports the Cyrillic characters in question, here are two solutions: (I used Win32::Clipboard directly, but you could use Clipboard the same way.)
Source code encoded using UTF-8 (This is normally ideal)
use utf8;
use Encode qw( encode );
use Win32 qw( );
use Win32::Clipboard qw( );
# String of decoded text aka Unicode Code Points because of `use utf8;`
my $text_ucp = "а-б-в-г\n";
my $acp = "cp" . Win32::GetACP();
my $clip = Win32::Clipboard();
$clip->Set(encode($acp, $text_ucp));
Source code encoded as per Active Code Page
Perl expects source code to be encoded using ASCII (no utf8;, the default) or UTF-8 (with use utf8;). However, string and regex literals are "8-bit clean" when no utf8; is in effect (the default), meaning that any byte that doesn't correspond to an ASCII character will result in a character with the same value as the byte.
use Win32::Clipboard qw( );
# Text encoded using source's encoding (because of lack of `use utf8`),
# which is expected to be the Active Code Page.
my $text_acp = "а-б-в-г\n";
my $clip = Win32::Clipboard();
$clip->Set($text_acp);
Found a temporary solution: the script generates a .bat file with echo blah-blah-blah | clip, runs it, and then deletes it.
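A rough sketch of the same workaround without the temporary .bat file, assuming clip.exe is on the PATH and the text is representable in the Active Code Page:
use strict;
use warnings;
use utf8;
use Encode qw( encode );
use Win32 qw( );

my $text = "а-б-в-г\n";

# Pipe the ACP-encoded bytes straight into clip.exe.
my $acp = "cp" . Win32::GetACP();
open my $clip, "|-", "clip" or die "clip: $!";
print {$clip} encode($acp, $text);
close $clip;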

How to convert ASCII format into UTF8 in Perl

eg: &eacute; into é
Sometimes the user gets the ASCII-style character set rather than the French character set... So can anyone assist me: is there any function in Perl that can convert ASCII to UTF-8?
It sounds like you want to convert HTML entities into UTF-8. To do this, use HTML::Entities and the decode_entities function.
This will give you a Perl string with no specific encoding attached. To output the string in UTF-8 encoding:
print Encode::encode_utf8(decode_entities($html_string));
Alternatively, set the UTF-8 PerlIO layer on STDOUT and Perl will encode everything in UTF-8 for you - useful if outputting multiple strings.
binmode STDOUT, ':utf8';
print decode_entities($html_string);
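Putting that together, a minimal self-contained sketch (the sample string here is made up):
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Encode everything printed to STDOUT as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my $html_string = 'caf&eacute; &amp; cr&egrave;me';
print decode_entities($html_string), "\n";   # prints: café & crème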
This is best handled by Perl's built-in Encode module. Here is a simple example of how to convert a string:
my $standard_string = decode("ascii", $ascii_string);
($standard_string will then be in Perl's internal string format. In other words, you shouldn't have to worry about the encoding from that point on.)
The linked documentation gives many other examples of things you can do--such as setting the encoding of an input file. A related useful module is Encode::Guess, which helps you determine the character encoding if it is unknown.
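A short sketch of Encode::Guess in use, assuming the bytes come from a hypothetical input.txt and that cp1252 is a plausible suspect:
use strict;
use warnings;
use Encode::Guess;

# Read the raw bytes first (no :encoding layer).
open my $fh, '<:raw', 'input.txt' or die "input.txt: $!";
my $octets = do { local $/; <$fh> };
close $fh;

# guess_encoding() returns an Encode::Encoding object on success,
# or an error string if it cannot decide.
my $enc = guess_encoding($octets, qw( cp1252 ));
die "Cannot guess encoding: $enc" unless ref $enc;
my $string = $enc->decode($octets);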

Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl

Consider the following problem:
A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in the processing I want to provide a "best effort result" rather than throwing an error.
My current attempt looks like this:
$junk = force_utf8($junk);
sub force_utf8 {
    my $input  = shift;
    my $output = '';
    foreach my $line (split(/\n/, $input)) {
        if (utf8::valid($line)) {
            utf8::decode($line);
        }
        $output .= "$line\n";
    }
    return $output;
}
Obviously the conversion will never be perfect since we're lacking information about the original encoding of each line. But is this the "best effort result" we can get?
How would you improve the heuristics/functionality of the force_utf8(...) sub?
I have no useful advice to offer except that I would have tried using Encode::Guess first.
You might be able to fix it up using a bit of domain knowledge. For example, Ã© is not a likely character combination in ISO-8859-1; it is much more likely to be the UTF-8 encoding of é.
If your input is limited to a restricted pool of characters, you can also use a heuristic such as assuming à will never occur in your input stream.
Without this kind of domain knowledge, your problem is in general intractable.
Just by looking at a character it will be hard to tell if it is ISO-8859-1 or UTF-8 encoded. The problem is that both are 8-bit encodings, so simply looking at the MSb is not sufficient. For every line, then, I would transcode the line assuming it is UTF-8. When an invalid UTF-8 encoding is found re-transcode the line assuming that the line is really ISO-8859-1. The problem with this heuristic is that you might transcode ISO-8859-1 lines that are also well-formed UTF-8 lines; however without external information about $junk there is no way to tell which is appropriate.
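That try-UTF-8-first heuristic might look roughly like this inside a force_utf8() replacement (a sketch using Encode, not the only way to do it):
use Encode qw( decode FB_CROAK LEAVE_SRC );

sub force_utf8 {
    my $input  = shift;
    my $output = '';
    foreach my $line (split(/\n/, $input)) {
        # Try a strict UTF-8 decode; fall back to ISO-8859-1 on failure.
        my $decoded = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC) };
        $decoded = decode('ISO-8859-1', $line) unless defined $decoded;
        $output .= "$decoded\n";
    }
    return $output;
}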
Take a look at this article. UTF-8 is optimised to represent Western language characters in 8 bits but it's not limited to 8-bits-per-character. The multibyte characters use common bit patterns to indicate if they are multibyte, and how many bytes the character uses. If you can safely assume only the two encodings in your string, the rest should be simple.
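Those bit patterns can also be checked directly; here is a sketch of a well-formed-UTF-8 test on a byte string, based on the commonly cited W3C pattern:
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # 3-byte, excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # 4-byte, plane 16
    )*\z/x;
}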
In short, I opted to solve my problem with "file -bi" and "iconv -f ISO-8859-1 -t UTF-8".
I recently ran across a similar problem in trying to normalize the encoding of file names. I had a mixture of ISO-8859-1, UTF-8, and ASCII. As I realized while processing the files, there were added complications caused by the directory name being in one encoding that was different from the file name's encoding.
I originally tried to use Perl, but it could not properly differentiate between UTF-8 and ISO-8859-1, resulting in garbled UTF-8.
In my case it was a one-time conversion of a reasonable number of files, so I opted for a slow method that I knew about and that worked with no errors for me (mostly because only 1-2 non-adjacent characters per line used special ISO-8859-1 codes):
Option #1 converts ISO-8859-1 to UTF-8
cat mixed_text.txt |
while read i; do
    type=$(echo "$i" | file -bi -)
    type=${type#*=}
    if [[ $type == 'iso-8859-1' ]]; then
        echo "$i" | iconv -f ISO-8859-1 -t UTF-8
    else
        echo "$i"
    fi
done > utf8_text.txt
Option #2 converts ISO-8859-1 to ASCII
cat mixed_text.txt |
while read i; do
    type=$(echo "$i" | file -bi -)
    type=${type#*=}
    if [[ $type == 'iso-8859-1' ]]; then
        echo "$i" | iconv -f ISO-8859-1 -t ASCII//TRANSLIT
    else
        echo "$i"
    fi
done > ascii_text.txt

Windows-1252 to UTF-8 encoding

I've copied certain files from a Windows machine to a Linux machine. So all the Windows encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?
Example usage of recode:
recode windows-1252.. myfile.txt
This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.
iconv -f WINDOWS-1252 -t UTF-8 filename.txt
How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.
Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.
One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.
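A sketch of that programmatic check in Perl rather than C#: attempt a strict UTF-8 decode, and only recode the file from Windows-1252 if that fails (this overwrites the file in place, which is an assumption about what you want):
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

my $file = shift or die "usage: $0 file\n";

# Slurp the raw bytes.
open my $fh, '<:raw', $file or die "$file: $!";
my $octets = do { local $/; <$fh> };
close $fh;

my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
if (defined $text) {
    print "$file already looks like valid UTF-8, leaving it alone\n";
}
else {
    # Not valid UTF-8: assume Windows-1252 and rewrite as UTF-8.
    $text = decode('cp1252', $octets);
    open my $out, '>:encoding(UTF-8)', $file or die "$file: $!";
    print {$out} $text;
    close $out;
}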
Here's a transcription of another answer I gave to a similar question:
If you apply utf8_encode() to an already-UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO-8859-1), Windows-1252, or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
Usage:
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
Update:
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
There's no general way to tell if a file is encoded with a specific encoding. Remember that an encoding is nothing more than an "agreement" on how the bits in a file should be mapped to characters.
If you don't know which of your files are actually already encoded in UTF-8 and which ones are encoded in Windows-1252, you will have to inspect all files and find out yourself. In the worst case that could mean that you have to open every single one of them in either of the two encodings and see whether it "looks" correct, i.e., all characters are displayed correctly. Of course, you may use tool support to do that; for instance, if you know for sure that the files contain certain characters that have different mappings in Windows-1252 vs. UTF-8, you could grep for them after running the files through iconv, as mentioned by Seva Alekseyev.
Another lucky case would be if you know that the files actually contain only characters that are encoded identically in both UTF-8 and Windows-1252. In that case, of course, you're done already.
If you want to convert multiple files in a single command ‒ let's say you want to convert all *.txt files ‒ here is the command:
find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;
Use the iconv command.
To make sure the file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests current encoding as the default; if it's Windows-1252 (or any 1-byte codepage, for that matter), it would say "ANSI".
You can change the encoding of a file with an editor such as notepad++. Just go to Encoding and select what you want.
I always prefer Windows-1252.
If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.
While UTF-8 data can be read as Windows-1252, the reverse is not true: Windows-1252 text is generally NOT valid UTF-8. So:
recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt
This will spit out errors for all cp1252 files and then proceed to convert them to UTF-8.
I would wrap this into a cleaner bash script, keeping a backup of every converted file.
Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.
This script worked for me on Windows 10 / PowerShell 5.1 for converting CP1250 to UTF-8:
Get-ChildItem -Include *.php -Recurse | ForEach-Object {
    $file = $_.FullName
    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered. The content is discarded;
    # the read is only a validity check.
    try
    {
        $null = [IO.File]::ReadAllText($file, [Text.Utf8Encoding]::new($false, $true))
    }
    catch [System.Text.DecoderFallbackException]
    {
        # Fall back to Windows-1250
        $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1250))
        $mustReWrite = $true
    }
    # Rewrite as UTF-8 without BOM (the .NET framework's default)
    if ($mustReWrite)
    {
        Write "Converting from 1250 to UTF-8"
        [IO.File]::WriteAllText($file, $content)
    }
    else
    {
        Write "Already UTF-8-encoded"
    }
}
As said, you can't reliably determine whether a file is Windows-1252, because Windows-1252 maps almost all bytes to a valid code point. However, if the files are only in Windows-1252 and UTF-8 and no other encodings, then you can try to parse a file as UTF-8; if it contains invalid bytes, it's a Windows-1252 file:
if iconv -f UTF-8 -t UTF-16 "$FILE" 1>/dev/null 2>&1; then
    # Conversion succeeded
    echo "$FILE is in UTF-8"
else
    # iconv returns an error if there are invalid characters in the byte stream
    echo "$FILE is in Windows-1252. Converting to UTF-8"
    iconv -f WINDOWS-1252 -t UTF-8 -o "${FILE}_utf8.txt" "$FILE"
fi
This is similar to many other answers that try to treat the file as UTF-8 and check if there are errors. It works 99% of the time because most Windows-1252 texts will be invalid in UTF-8, but there will still be rare cases when it won't work. It's heuristic after all!
There are also various libraries and tools to detect the character set, such as chardet
$ chardet utf8.txt windows1252.txt iso-8859-1.txt
utf8.txt: utf-8 with confidence 0.99
windows1252.txt: Windows-1252 with confidence 0.73
iso-8859-1.txt: ISO-8859-1 with confidence 0.73
It can't be completely reliable due to its heuristic nature, so it outputs a confidence value for people to judge. The more human text in the file, the more confident it'll be. If you have very specific texts, then more training of the library will be needed. For more information, read How do browsers determine the encoding used?
Found this documentation for the TYPE command:
Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:
For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G
CHCP 1252 >NUL
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL
CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt
CHCP %_codepage%
The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.
UTF-8 does not need a BOM, as a BOM there is both superfluous and discouraged. Where a BOM is helpful is in UTF-16, which may be byte-swapped, as in the case of Microsoft. UTF-16 is for internal representation in a memory buffer; use UTF-8 for interchange. By default, UTF-8, anything else derived from US-ASCII, and UTF-16 are in natural/network byte order. Microsoft's UTF-16 requires a BOM because it is byte-swapped.
To convert Windows-1252 to ISO8859-15, I first convert ISO8859-1 codes with similar glyphs to US-ASCII. I then convert Windows-1252 up to ISO8859-15, mapping the remaining non-ISO8859-15 glyphs to multiple US-ASCII characters.