Getting the following error when generating a language scorer with DeepSpeech - mozilla-deepspeech

File "generate_scorer_package", line 1
SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Before answering this question, I am going to make some assumptions:
Firstly, I believe you are following the DeepSpeech Playbook and are at the step of generating a kenlm.scorer file, as documented here
Secondly, I am going to assume that you are using a Python editor of some description, like PyCharm.
The error SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details is not related to DeepSpeech; it is related to the encoding of the file that Python is executing.
Python 3 assumes that a source file is encoded as UTF-8 unless the file declares otherwise; however, some editors, particularly editors set up for other locales, save files in a different encoding by default.
To declare the file's encoding as UTF-8, add the following line to the top of the generate_scorer_package.py file:
# coding: utf8
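NOTE: It MUST be on the first or second line of the file (see PEP 263). The declaration only tells Python how to decode the file; it does not convert the bytes, so the file must actually be saved in the declared encoding.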
Alternatively, identify where in your editor the encoding is set, and change it.
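As a minimal sketch of what the declaration does (this is not part of DeepSpeech; the sample source text and the file name demo.py are made up for illustration), the snippet below compiles the same Latin-1 encoded source twice, once without a PEP 263 declaration and once with one that names the encoding the bytes are really in.

```python
# Hypothetical module body, as if an editor had saved it in Latin-1 instead of UTF-8.
BODY = "greeting = 'café'\n"


def try_compile(source_bytes):
    """Return the module namespace, or None if Python rejects the bytes."""
    namespace = {}
    try:
        # Given bytes, compile() honours a PEP 263 declaration on line 1 or 2.
        exec(compile(source_bytes, "demo.py", "exec"), namespace)
    except (SyntaxError, UnicodeError):
        return None
    return namespace


# Same bytes for the assignment; only the first line differs.
without_cookie = try_compile(("# no declaration\n" + BODY).encode("latin-1"))
with_cookie = try_compile(("# coding: latin-1\n" + BODY).encode("latin-1"))

print(without_cookie is None)
print(len(with_cookie["greeting"]))
```

The design point is that the declaration has to match how the file was saved: declaring utf8 on a file that was written in another encoding would not make the bytes valid UTF-8.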
See also these Stack Overflow questions that are similar:
SyntaxError: Non-UTF-8 code starting with '\x92' in file D:\AIAssistant\build\gui.py on line 92, but no encoding declared;
SyntaxError: Non-UTF-8 code starting with '\x82'

Related

Writing accented characters from user input to a text file Python 3.7

Hello I have the following code snippet:
while True:
    try:
        entry = input("Input element: ")
        print(entry)
        with open(fileName, 'a', encoding='UTF-8') as thisFile:
            thisFile.write(entry)
    except KeyboardInterrupt:
        break
This one basically continuously gets input and writes it to a file until manually interrupted. However, when the user inputs something like Ñ, it outputs:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
I explicitly set the utf-8 encoding and even tried latin-1, but still got the same error. I have also put # -*- coding: utf-8 -*- at the top of my code and tried thisFile.write(entry.encode('utf-8')), but it still gives me the error.
Setting the following environment variables fixed it for me.
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
Another method is to run the script with the I/O encoding set explicitly:
PYTHONIOENCODING="UTF-8" python3 writetest.py
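A plausible explanation, offered as an assumption because the question does not show the shell's locale: when Python decodes standard input with an ASCII-based codec and the surrogateescape error handler, each byte of a UTF-8 encoded Ñ becomes a lone surrogate, and lone surrogates cannot be written out as UTF-8. The sketch below reproduces that effect without touching stdin; the variable names are illustrative.

```python
# Bytes a UTF-8 terminal sends for the character "Ñ".
typed_bytes = "Ñ".encode("utf-8")

# Decoding them as ASCII with surrogateescape (assumed here to be what
# happens to stdin under a C/POSIX locale) yields two lone surrogates.
entry = typed_bytes.decode("ascii", errors="surrogateescape")

try:
    entry.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Reversing the mis-decoding recovers the character the user typed.
recovered = entry.encode("ascii", errors="surrogateescape").decode("utf-8")

print(len(entry))
print(error_message)
print(recovered == "Ñ")
```

If this is what is happening, it would explain why changing the encoding passed to open() made no difference: the string is already damaged when input() returns, so the fix has to be applied to how stdin is decoded.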

Forcing UTF-8 over cp1252 (Python3)

I've written some code that makes use of the Biopython Entrez wrapper. Code was working fine on my previous Win10 laptop (Python 3.5.1), but I've just ported the code to a new Win10 laptop with the same versions of every package and Python installed and I'm now getting a decode error.
The traceback error leads to a function that fetches text - it's attempting to decode the text using cp1252 when it should be using UTF-8. I know that similar questions have been asked, but none have dealt with this problem happening inside a package (Biopython in my case). Copying the UTF-8 encoding file in Python/lib and renaming it to cp1252.py solves the problem, but this obviously is not a long term solution.
  File "C:\Users\arjun\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 21715: character maps to <undefined>
Use io.open for reading (https://docs.python.org/2/library/io.html#io.open); in Python 3.x the built-in open is the same function.
By default, it uses the preferred encoding of the platform it is running on, which is cp1252 on many Windows installations. You can also specify your own encoding, for example encoding='utf-8', as explained in the docs.
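A minimal sketch of passing the encoding explicitly, assuming the text being read is UTF-8; the temporary file and its one-character content are made up for illustration. The character ā is used because its UTF-8 form contains the byte 0x81, which Python's cp1252 codec leaves undefined.

```python
import os
import tempfile

# Write a small UTF-8 file whose bytes include 0x81 ("ā" encodes to C4 81).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as handle:
    handle.write("ā")

# Reading it as cp1252 triggers the same kind of 'charmap' decode error.
try:
    with open(path, encoding="cp1252") as handle:
        handle.read()
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Naming the encoding makes the result independent of the platform default.
with open(path, encoding="utf-8") as handle:
    text = handle.read()

os.remove(path)
print(type(cp1252_error).__name__)
print(text == "ā")
```

This only helps where you control the call that opens or decodes the data; when the decoding happens inside a third-party package, the same idea applies but the encoding has to be supplied wherever that package exposes it.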

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump, enwiki-20121101-pages-articles.xml.bz2. I replaced "use Parse::MediaWikiDump;" with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then I ran
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2
I got an error
enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^
I guess there are some non-utf8 characters contained in the dump. So I ran
iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2
And indeed, I got some errors
BZh91AY&SYiconv: illegal input sequence at position 10
So, my question is: what is the encoding of the wiki dump, and if I wish to convert it to utf-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?
Many thanks
-- [solved] I should unzip the file first.
You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
(Posting borrible's answer so that this resolved question is not listed as unanswered.)
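To illustrate why the parser saw BZh91AY&SY, here is a small sketch using Python's standard bz2 module on a made-up stand-in for the dump (the sample XML and the temporary file are assumptions, not the real enwiki data). The compressed bytes begin with the bzip2 signature and are not text in any encoding; the XML only appears after decompression.

```python
import bz2
import os
import tempfile

sample_xml = "<mediawiki><page><title>Ñandú</title></page></mediawiki>"

# Create a tiny .xml.bz2 file standing in for the real dump.
fd, path = tempfile.mkstemp(suffix=".xml.bz2")
os.close(fd)
with bz2.open(path, "wt", encoding="utf-8") as handle:
    handle.write(sample_xml)

# The raw file starts with the bzip2 magic bytes, not with XML.
with open(path, "rb") as handle:
    magic = handle.read(3)

# Decompressing while reading gives back the UTF-8 text.
with bz2.open(path, "rt", encoding="utf-8") as handle:
    restored = handle.read()

os.remove(path)
print(magic)
print(restored == sample_xml)
```

Any encoding check, whether with iconv or an XML parser, only makes sense on the decompressed stream.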

How to debug Postgres copy command failure

I have around 75k records that I am loading into a Postgres table using the copy command, and the load is failing. I get an exception:
ERROR: invalid byte sequence for encoding "UTF8": 0xbd
Now I need to find which line contains this byte. Is there any way to do this? I am thinking along the lines of enabling some Postgres logging that might help, but any other solution is welcome.
Note: I am getting the issue with only one particular file. Other files load without issues.
I always seem to get a line-number in my error, no matter whether I use COPY or \copy and feed a file via redirection or -f.
ERROR: invalid byte sequence for encoding "UTF8": 0xa3
CONTEXT: COPY z, line 3
If there are only a couple of bad chars and you just want to strip them you can use iconv (assuming you're on a unix-like system).
iconv -c --from=utf8 --to=utf8 /tmp/badchars.txt > /tmp/stripped.txt
You could always run diff against the before + after versions if you wanted to see what was stripped out.
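If the server message does not include a line number, one option is to scan the file yourself before loading it. The sketch below is an assumption about how that could be done in Python, not a Postgres feature; the helper name find_invalid_utf8_lines and the sample data are made up. It counts physical lines split on newline bytes.

```python
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Return (line_number, byte_offset_in_line, bad_byte) for each bad line."""
    problems = []
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first undecodable byte.
                problems.append((line_number, exc.start, raw_line[exc.start]))
    return problems


# Made-up sample: the third line holds a Latin-1 pound sign (byte 0xa3).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"ok\nalso ok\nprice \xa3 5\n")

problems = find_invalid_utf8_lines(sample_path)
os.remove(sample_path)

for line_number, offset, bad_byte in problems:
    print(line_number, offset, hex(bad_byte))
```

Only the first bad byte on each line is reported; that is usually enough to locate the record and decide whether to strip the byte or re-encode the file.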

Emacs opening and saving encoding

I have a Perl source file in utf-8 encoding, LF ending. It contains English and Chinese characters. The questions are:
1. When I open the file, Emacs detects the encoding as windows-1251-unix. I have to run these commands:
Alt-x revert-buffer-with-coding-system
> Coding system for visited file (default nil):
utf-8-auto-unix
> Revert buffer from file file_name.pl?
y
How can I make Emacs open it in utf-8-auto-unix automatically?
2. When I edit the file and try to save it, Emacs asks me:
> Select coding system (default raw-text):
utf-8-auto-unix
How can I make Emacs save the file in utf-8-auto-unix automatically, without asking?
You could add this comment to the top of the file:
# -*- coding: utf-8 -*-
Use describe-variable (C-h v) to examine the variable current-language-environment; follow the customize link and set it to "UTF-8".