Forcing UTF-8 over cp1252 (Python 3) encoding

I've written some code that uses the Biopython Entrez wrapper. The code worked fine on my previous Win10 laptop (Python 3.5.1), but I've just ported it to a new Win10 laptop with the same versions of Python and every package installed, and I'm now getting a decode error.
The traceback leads to a function that fetches text: it attempts to decode the text using cp1252 when it should be using UTF-8. I know that similar questions have been asked, but none have dealt with this happening inside a package (Biopython in my case). Copying the UTF-8 encoding file in Python/lib and renaming it to cp1252.py masks the problem, but that is obviously not a long-term solution.
File "C:\Users\arjun\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 21715: character maps to <undefined>

Use the io module for reading if you're using Python 3.x (https://docs.python.org/3/library/io.html#io.open).
By default, it uses the platform's default encoding, as reported by locale.getpreferredencoding(). You can also specify your own encoding, as explained in the docs.
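For illustration, a minimal sketch of reading a fetched file with an explicit encoding (the filename entrez_result.xml is just a placeholder):
import io

# Ask for UTF-8 explicitly instead of relying on the platform default
# (cp1252 on Western-locale Windows systems).
with io.open("entrez_result.xml", encoding="utf-8") as handle:
    text = handle.read()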

Related

Getting the following error when generating a language scorer for DeepSpeech

File "generate_scorer_package", line 1
SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Before answering this question, I am going to make some assumptions:
Firstly, I believe you are following the DeepSpeech Playbook and are at the step of generating a kenlm.scorer file, as documented here
Secondly, I am going to assume that you are using a Python editor of some description, like PyCharm.
The error SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details is not related to DeepSpeech; it is related to the Python encoding of the file that is being executed.
Python 3 assumes that the encoding of the .py file is UTF-8; however, some editors - particularly editors in other locales - can override this setting.
To force the file to UTF-8 encoding, add the following code to the top of the generate_scorer_package.py file:
# coding: utf8
NOTE: Per PEP 263, it MUST appear on the first or second line of the file
Alternatively, identify where in your editor the encoding is set, and change it.
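For reference, a hedged sketch of the top of such a file (PEP 263 accepts either this spelling or the bare # coding: utf8 form, and the declaration must sit on line 1 or 2):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-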
See also these Stack Overflow questions that are similar:
SyntaxError: Non-UTF-8 code starting with '\x92' in file D:\AIAssistant\build\gui.py on line 92, but no encoding declared;
SyntaxError: Non-UTF-8 code starting with '\x82'

How to save to file non-ascii output of program in Powershell?

I want to run a program in PowerShell and write its output to a file with UTF-8 encoding.
However, I can't write non-ASCII characters properly.
I have already read many similar questions on Stack Overflow, but I still can't find an answer.
I tried both Windows PowerShell 5.1.19041.1023 and PowerShell Core 7.1.3; they encode the output file differently, but the content is broken in the same way.
I tried simple programs in Python and Golang:
(Please assume that I can't change source code of programs)
Python
print('Hello ąćęłńóśźż world')
Results:
python hello.py
Hello ąćęłńóśźż world
python hello.py > file1.txt
Hello ╣Šŕ│˝ˇťč┐ world
python hello.py | out-file -encoding utf8 file2.ext
Hello ╣Šŕ│˝ˇťč┐ world
On cmd:
python hello.py > file3.txt
Hello ����󜟿 world
Golang
package main

import "fmt"

func main() {
    fmt.Printf("Hello ąćęłńóśźż world\n")
}
Results:
go run hello.go:
Hello ąćęłńóśźż world
go run hello.go > file4.txt
Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
go run hello.go | out-file -encoding utf8 file5.txt
Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
On cmd it works ok:
go run hello.go > file6.txt
Hello ąćęłńóśźż world
You should set the OutputEncoding property of the console first.
In PowerShell, enter this line before running your programs:
[Console]::OutputEncoding = [Text.Encoding]::Utf8
You can then use Out-File with your encoding type:
py hello.py | Out-File -Encoding UTF8 file2.ext
go run hello.go | Out-File -Encoding UTF8 file5.txt
Note: These character-encoding problems only plague PowerShell on Windows, in both editions. On Unix-like platforms, UTF-8 is consistently used.[1]
Quicksilver's answer is fundamentally correct:
It is the character encoding stored in [Console]::OutputEncoding that determines how PowerShell decodes text received from external programs[2] - and note that it invariably interprets such output as text (strings).
[Console]::OutputEncoding by default reflects a console's active code page, which itself defaults to the system's active OEM code page, such as 437 (CP437) on US-English systems.
The standard chcp program also reports the active OEM code page, and while it can in principle also be used to change it for the active console (e.g., chcp 65001), this does not work from inside PowerShell, due to .NET caching the encodings.
Therefore, you may have to (temporarily) set [Console]::OutputEncoding to match the actual character encoding used by a given external console program:
While many console programs respect the active console code page (in which case no workarounds are required), some do not, typically in order to provide full Unicode support. Note that you may not notice a problem until you programmatically process such a program's output (that is: capture it in a variable, send it through the pipeline to another command, or redirect it to a file), because such a program may detect when its stdout is directly connected to the console and only then selectively use full Unicode support for display.
Notable CLIs that do not respect the active console code page:
Python exhibits nonstandard behavior in that it uses the active ANSI code page by default, i.e. the code page normally only used by non-Unicode GUI-subsystem applications.
However, you can set $env:PYTHONUTF8=1 before invoking Python scripts to instruct Python to use UTF-8 instead (which then applies to all Python calls made from the same process); in v3.7+, you can alternatively pass the command-line option -X utf8 (case-sensitive) as a per-call opt-in - see the sketch after this list.
Go and also Node.js invariably use UTF-8 encoding.
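A quick way to see Python's behavior from the inside (a minimal sketch; the script name check_enc.py is illustrative):
import sys

# When stdout is piped or redirected on Windows, Python 3 defaults to the
# ANSI code page (e.g. 'cp1252'); with PYTHONUTF8=1 or -X utf8 it reports 'utf-8'.
print(sys.stdout.encoding)
Running python check_enc.py | Write-Output once with and once without $env:PYTHONUTF8 = 1 shows the difference.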
The following snippet shows how to set [Console]::OutputEncoding temporarily as needed:
# Save the original encoding.
$orig = [Console]::OutputEncoding
# Work with console programs that use UTF-8 encoding,
# such as Go and Node.js
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
# Piping to Write-Output is a dummy operation that forces
# decoding of the external program's output, so that any encoding problems show up.
go run hello.go | Write-Output
# Work with console programs that use ANSI encoding, such as Python.
# As noted, the alternative is to configure Python to use UTF-8.
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))
python hello.py | Write-Output
# Restore the original encoding.
[Console]::OutputEncoding = $orig
Your own answer provides an effective alternative, but it comes with caveats:
Activating the Use Unicode UTF-8 for worldwide language support feature via Control Panel (or the equivalent registry settings) changes the code pages system-wide, which affects not only all console windows and console applications, but also legacy (non-Unicode) GUI-subsystem applications, given that both the OEM and the ANSI code pages are being set.
Notable side effects include:
Windows PowerShell's default behavior changes, because it uses the ANSI code page both to read source code and as the default encoding for the Get-Content and Set-Content cmdlets.
For instance, existing Windows PowerShell scripts that contain non-ASCII range characters such as é will then misbehave, unless they were saved as UTF-8 with a BOM (or as "Unicode", UTF-16LE, which always has a BOM).
By contrast, PowerShell (Core) v6+ consistently uses (BOM-less) UTF-8 to begin with.
Old console applications may break with 65001 (UTF-8) as the active OEM code page, as they may not be able to handle the variable-length encoding aspect of UTF-8 (a single character can be encoded by up to 4 bytes).
See this answer for more information.
[1] The cross-platform PowerShell (Core) v6+ edition uses (BOM-less) UTF-8 consistently. While it is possible to configure Unix terminals and thereby console (terminal) applications to use a character encoding other than UTF-8, doing so is rare these days - UTF-8 is almost universally used.
[2] By contrast, it is the $OutputEncoding preference variable that determines the encoding used for sending text to external programs, via the pipeline.
The solution is to enable Beta: Use Unicode UTF-8 for worldwide language support, as described in What does "Beta: Use Unicode UTF-8 for worldwide language support" actually do?
Note: this solution may cause problems with legacy programs. Please read the answer by mklement0 and the answer by Quicksilver for details and alternative solutions.
I also found the explanation written by Ghisler helpful (source):
If you check this option, Windows will use codepage 65001 (Unicode UTF-8) instead of the local codepage like 1252 (Western Latin1) for all plain text files. The advantage is that text files created in e.g. a Russian locale can also be read in other locales like Western or Central Europe. The downside is that ANSI-only programs (most older programs) will show garbage instead of accented characters.
Also, PowerShell before version 7.1 has a bug when this option is enabled. If you enable it, you may want to upgrade to version 7.1 or later.
I like this solution because it's enough to set it once, and it works. It brings consistent, Unix-like UTF-8 behaviour to Windows. I hope I won't see any issues.
How to enable it:
Win+R → intl.cpl
Administrative tab
Click the Change system locale button
Enable Beta: Use Unicode UTF-8 for worldwide language support
Reboot
or alternatively via reg file:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"

The combination PostScript and UTF-8 is not supported

I am new to Linux and antiword.
I want to convert Word documents to PostScript.
I am able to create the file, but the PS file is not opening.
At the time of conversion it gives an error like:
"The combination PostScript and UTF-8 is not supported"
This utility, antiword, the latest version of which I found to be 0.37 (21 Oct 2005), ships with a few Readme files.
It explains itself in its own History file as such:
The name comes from: "The antidote against people who send Microsoft(R) Word files to everybody, because they believe that everybody runs Windows(R) and therefore runs Word".
The same file says about its current version:
Beta release, for evaluation by the public.
And it also says, under the header Known Limitations:
[....]
2) Antiword doesn't show all the images included in a Word document.
[....]
7) PostScript output will not work in combination with UTF-8. It only works in combination with character sets ISO-8859-1, ISO-8859-2 and ISO-8859-5.
8) Antiword's error messages are not very helpful.
So here you are. The behavior you are experiencing is exactly as is documented. :-)
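Given limitation 7, a practical workaround is to ask antiword for one of the supported character sets instead of UTF-8. A sketch, assuming the 8859-1.txt mapping file ships with your antiword 0.37 installation, and with document.doc as a placeholder name:
antiword -p a4 -m 8859-1.txt document.doc > document.ps
The resulting PostScript is then limited to ISO-8859-1 characters.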
@Tushar: It's also recommended to read the included FAQ document.

python 3.0, how to make print() output unicode?

I'm working in WinXP 5.1.2600, writing a Python application involving Chinese pinyin, which has involved me in endless Unicode problems. Switching to Python 3.0 has solved many of them. But the print() function for console output is not Unicode-aware for some odd reason. Here's a teeny program.
print('sys.stdout encoding is "' + sys.stdout.encoding + '"')
str1 = 'lüelā'
print(str1)
Output is (changing angle brackets to square brackets for readability):
sys.stdout encoding is "cp1252"
Traceback (most recent call last):
File "TestPrintEncoding.py", line 22, in [module]
print(str1)
File "C:\Python30\lib\io.py", line 1491, in write
b = encoder.encode(s)
File "C:\Python30\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0101'
in position 4: character maps to [undefined]
Note that ü = \xfc = 252 gives no problem since it fits in 8 bits ("upper ASCII"). But ā = \u0101 is beyond 8 bits.
Anyone have an idea how to change the encoding of sys.stdout to 'utf-8'? Bear in mind that Python 3.0 no longer uses the codecs module, if I understand the documentation right.
Apologies, I gave you the program without the preamble. Before the 3 lines given, it starts like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
Unfortunately, the coding specified by the "coding:" line is the coding of the source code, not of the console output. But thank you for your thoughts!
The Windows command prompt (cmd.exe) cannot display the Unicode characters you are using, even though Python handles them correctly internally. You need to use IDLE, Cygwin, or another program that can display Unicode correctly.
See this thread for a full explanation:
http://www.nabble.com/unable-to-print-Unicode-characters-in-Python-3-td21670662.html
You may want to try setting the environment variable PYTHONIOENCODING to "utf_8". I have written a page on my ordeal with this problem.
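On newer interpreters there is also an in-code equivalent (a sketch; sys.stdout.reconfigure() only exists in Python 3.7+, so it postdates the Python 3.0 setup in the question):
import sys

# Rewrap stdout as UTF-8 at runtime, much like setting PYTHONIOENCODING.
sys.stdout.reconfigure(encoding="utf-8")
print("lüelā")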
Check out the question and answer here; I think they have some valuable clues. Specifically, note setdefaultencoding in the sys module, but also the fact that you probably shouldn't use it.
Here's a dirty hack:
# works
import os
os.system("chcp 65001 &")
print("юникод")
However, almost anything breaks it:
simply muting the first line already breaks it:
# doesn't work
import os
os.system("chcp 65001 >nul &")
print("юникод")
checking for OS type breaks it:
# doesn't work
import os
if os.name == "nt":
    os.system("chcp 65001 &")
print("юникод")
it doesn't even work with the print moved inside the if block:
# doesn't work
import os
if os.name == "nt":
    os.system("chcp 65001 &")
    print("юникод")
But one can print with cmd's echo:
# works
import os
os.system("chcp 65001 & echo {0}".format("юникод"))
and here's a simple way to make this cross-platform:
# works
import os
def simple_cross_platrofm_print(obj):
    if os.name == "nt":
        os.system("chcp 65001 >nul & echo {0}".format(obj))
    else:
        print(obj)

simple_cross_platrofm_print("юникод")
but the trailing empty line that Windows' echo produces can't be suppressed.
The problem of displaying Unicode characters in Python on Windows is a known one. There is no official solution yet. The right thing to do is to use the winapi function WriteConsoleW. It is nontrivial to build a working solution, as there are other related issues. However, I have developed a package which tries to fix Python regarding this issue. See https://github.com/Drekin/win-unicode-console. You can also read there a deeper explanation of the problem. The package is also on PyPI (https://pypi.python.org/pypi/win_unicode_console) and can be installed using pip.
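Typical usage, per the project's README (a sketch; install the package first with pip install win_unicode_console):
import win_unicode_console

# Replaces the standard streams with console-aware objects that use the
# WriteConsoleW/ReadConsoleW API under the hood.
win_unicode_console.enable()
print("юникод")  # now displays correctly in cmd.exe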

Convert GB2312 to UTF-8

I have a text file that contains localized language strings that is currently encoded in GB2312 (simplified Chinese), but all of my other language files are in UTF-8. I am finding it very difficult to work with this file, as none of my text editors will work properly with it and keep corrupting it. Are there any tools to convert this to UTF-8, and are there any downsides to doing this? Would it be better to just keep it as GB2312 and use a different editor (if so, can you recommend one)?
Update: I'm using Windows XP (English install).
Update #2: I've tried using Notepad++ and Notepad2 to edit the GB2312 files, but both are unable to read the files and corrupt them.
You can try this online service that uses the Open Source iconv utility.
You can also install Charco, a command-line version of it on your machine.
For GB2312, you can use CP936 as the encoding.
If you are a .Net developer you can make a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.
All you need is something like this (I tested it and it works):
In C#
using System.IO;
using System.Text;

class Converter {
    // Converts a GB2312 (code page 936) text file to UTF-8.
    static void Main(string[] args) {
        string infile = args[0];
        string outfile = args[1];
        using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936))) {
            using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
                sw.Write(sr.ReadToEnd());
            }
        }
    }
}
In VB.Net
Imports System.IO
Imports System.Text

Class Converter
    ' Converts a GB2312 (code page 936) text file to UTF-8.
    Private Shared Sub Main(ByVal args() As String)
        Dim infile As String = args(0)
        Dim outfile As String = args(1)
        Dim sr As StreamReader = New StreamReader(infile, Encoding.GetEncoding(936))
        Dim sw As StreamWriter = New StreamWriter(outfile, False, Encoding.UTF8)
        sw.Write(sr.ReadToEnd())
        sw.Close()
        sr.Close()
    End Sub
End Class
I might be thinking a bit too simple here, but if it's just this one plain text file, you could try the following:
Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
Prepend the following to the text file:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>
Open the file in your favorite browser
Select and copy all text
Paste it in Notepad and save as UTF-8.
You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.
Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.
GB 2312 is mostly compatible with GB 18030, so any tool able to deal with the latter should treat GB 2312 correctly as well. There are many tools for converting GB 18030 to UTF-8 (or some other Unicode encoding form), but I can't recommend any specific one for Windows, because I work on Unix. If you want to write a bit of code, the iconv library or ICU springs to mind: you'll find all the conversion data readily available in these libraries.
Conversion from GB 2312 to UTF-8 is completely safe and lossless, you shouldn't worry about it.
I agree with the currently accepted answer in that it is "actually simple to solve from a programmatic point of view", especially when your source file contains sensitive information that you do not want to expose to an unknown third-party online service.
And nowadays Python is available out of the box in most Linux environments, and it is also easy to install on Windows (easier than installing a C# stack, IMHO). So, without further ado, this is the two-line Python script that can convert GB2312 to UTF-8. I tested it; it works.
# Usage: python this_script.py your_input.txt your_output.txt
import io, sys
io.open(sys.argv[2], "w", encoding="utf-8").write(io.open(sys.argv[1], encoding="gb2312").read())
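Since GB 2312 is mostly compatible with GB 18030 (as another answer here notes), decoding with the superset codec is a slightly more forgiving variant:
# gb18030 is a superset of gb2312, so this also copes with characters
# outside the original GB 2312 repertoire.
io.open(sys.argv[2], "w", encoding="utf-8").write(io.open(sys.argv[1], encoding="gb18030").read())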
If the command-line tool iconv is available on your OS, you can achieve this by running a one-line script:
# From GB18030
iconv -f gb18030 -t utf8 -o output.txt input.txt
# From GB2312
iconv -f gb2312 -t utf8 -o output.txt input.txt
Check whether your OS has iconv:
$ iconv --version
iconv (Debian GLIBC 2.31-13+deb11u3) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.