Writing accented characters from user input to a text file in Python 3.7

Hello, I have the following code snippet:
while True:
    try:
        entry = input("Input element: ")
        print(entry)
        with open(fileName, 'a', encoding='UTF-8') as thisFile:
            thisFile.write(entry)
    except KeyboardInterrupt:
        break
This basically keeps getting input and writing it to a file until manually interrupted. However, when the user inputs something like Ñ, it outputs: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed. I explicitly set the UTF-8 encoding and even tried latin-1, but I still get the same error. I have also put # -*- coding: utf-8 -*- at the top of my code and tried thisFile.write(entry.encode('utf-8')), but it still gives me the error.

Setting the following environment variables fixed it for me.
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
Another method is running it via:
PYTHONIOENCODING="UTF-8" python3 writetest.py
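To confirm the fix from inside the script, a small check along these lines may help (a sketch of my own, assuming Python 3.7+ standard text streams for the reconfigure() calls):

import sys

# After setting LANG/LC_ALL or PYTHONIOENCODING, both streams should
# report UTF-8, and input() stops handing back surrogate escapes.
print(sys.stdin.encoding, sys.stdout.encoding)   # expect something like: utf-8 utf-8

# On Python 3.7+ the streams can also be reconfigured from within the script:
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')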

Related

Getting the following error when generating a language scorer on DeepSpeech

File "generate_scorer_package", line 1
SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Before answering this question, I am going to make some assumptions:
Firstly, I believe you are following the DeepSpeech Playbook and are at the step of generating a kenlm.scorer file, as documented here.
Secondly, I am going to assume that you are using a Python editor of some description, like PyCharm.
The error SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details is not related to DeepSpeech; it relates to the encoding of the Python file that is being executed.
Python 3 assumes that the encoding of the .py file is UTF-8; however, some editors - particularly editors in other locales - can override this setting.
To force the file to UTF-8 encoding, add the following code to the top of the generate_scorer_package.py file:
# coding: utf8
NOTE: It MUST be at the top of the file
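For example, the first lines of the file could look like this (a minimal illustration, not from the original answer; PEP 263 allows the declaration on the first or second line, so a shebang may precede it):

#!/usr/bin/env python3
# coding: utf8
# ... the rest of generate_scorer_package.py follows unchanged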
Alternatively, identify where in your editor the encoding is set, and change it.
See also these Stack Overflow questions that are similar:
SyntaxError: Non-UTF-8 code starting with '\x92' in file D:\AIAssistant\build\gui.py on line 92, but no encoding declared;
SyntaxError: Non-UTF-8 code starting with '\x82'

Cannot COPY UTF-8 data to ScyllaDB with cqlsh

I'm trying to copy a large data set from Postgresql to ScyllaDB, which is supposed to be compatible with Cassandra.
This is what I'm trying:
psql <db_name> -c "COPY (SELECT row_number() OVER () as id, * FROM ds.my_data_set LIMIT 20) TO stdout WITH (FORMAT csv, HEADER, DELIMITER ';');" \
| \
CQLSH_HOST=172.17.0.3 cqlsh -e 'COPY test.mytable (id, "Ist Einpöster", [....]) FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
I get an obscure error without a stack trace:
:1:'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
My data and column names, including the ones already in the created table in ScyllaDB, contain values with German text. It's not ASCII, but I haven't found anywhere to set the encoding, and everywhere I looked it seemed to be using UTF-8 already. I also tried a suggested edit around line 1135 of cqlsh and changed it in my local copy (using vim $(which cqlsh)), but it had no effect.
I'm using cqlsh 5.0.1, installed using pip (weirdly, the install command was pip install cqlsh==5.0.4).
I also tried the cqlsh from the docker image that I used to install ScyllaDB, and it has the exact same error.
<Update>
As suggested, I piped the data to a file:
psql <db_name> -c "COPY (SELECT row_number() OVER (), * FROM ds.my_data_set ds) TO stdout WITH (FORMAT csv, HEADER);" | head -n 1 > test.csv
I thinned it down to the first row (the CSV header). Piping it to cqlsh made it cry with the same error. Then, using the python3.5 interactive shell, I did this:
>>> with open('test.csv', 'rb') as fp:
... data = fp.read()
>>> data
b'row_number,..... Ist Einp\xc3\xb6ster ........'
So there we are, \xc3 in the flesh. Is it UTF-8?
>>> data.decode('utf-8')
'row_number,....... Ist Einpöster ........'
Yes, it's utf-8. So how does the error happen?
>>> data.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 336: ordinal not in range(128)
Same error text, so it's probably Python as well, but without a stack trace, I have no idea where this is happening, and default encodings are utf-8. I tried overriding the default with utf-8 but nothing changed. Still, somewhere, something is trying to decode a stream using ASCII.
This is the locale on the server/client:
LANG=
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Someone on Slack suggested this answer: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
Once I added the last 2 lines from that answer at the beginning of cqlsh.py, it got past the decoding issue, but the same column was then reported as invalid with another error:
:1:Invalid column name Ist Einpöster
side note:
I lost interest in this test at this point, and I'm just trying not to leave an unanswered question, so please excuse the wait time. As I was trying ScyllaDB out as an analytical engine, coupled with Spark, as a data source for Tableau, I found "better" alternatives, like Vertica and ClickHouse. "Better" in quotes, because both of them have limitations of their own.
</Update>
How can I complete this import?
What was it?
The query passed in as an argument contained the column list, which included that column with a non-ASCII character. At some point, cqlsh parsed those as ASCII and not UTF-8, which led to this error.
How was it fixed?
The first attempt was to add these 2 lines at the top of cqlsh:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still left the script unable to work with that column.
The second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh understands stdin as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
I don't see that you ever got an answer to this. UTF-8 should be the default.
Did you try --encoding?
Docs: https://docs.scylladb.com/getting-started/cqlsh/
If you didn't get an answer here, would you like to ask it on our Slack channel?
I would try to eliminate all the extra complexity you have in there first. Try to dump a few rows into a CSV, and then load it into Scylla using COPY.
Update: utf8: Print invalid UTF-8 character position
Add a new validate_with_error_position function, which returns -1 if the data is a valid UTF-8 string, or otherwise the byte position of the first invalid character. The position is added to the exception messages of all UTF-8 parsing errors in Scylla. validate_with_error_position is done in two passes in order to preserve the same performance in the common case when the string is valid.
https://github.com/scylladb/scylla/commit/ffd8c8c505b92a71df7e34d5196c7545f11cb12f

Forcing UTF-8 over cp1252 (Python3)

I've written some code that makes use of the Biopython Entrez wrapper. The code was working fine on my previous Win10 laptop (Python 3.5.1), but I've just ported it to a new Win10 laptop with the same versions of every package and of Python installed, and I'm now getting a decode error.
The traceback leads to a function that fetches text: it's attempting to decode the text using cp1252 when it should be using UTF-8. I know that similar questions have been asked, but none have dealt with this problem happening inside a package (Biopython in my case). Copying the UTF-8 encoding file in Python/lib and renaming it to cp1252.py solves the problem, but this obviously is not a long-term solution.
File "C:\Users\arjun\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 21715: character maps to <undefined>
Use the io module for reading if you're using Python 3.x (https://docs.python.org/2/library/io.html#io.open).
By default, io.open() will use the preferred encoding of the platform it's running on. You can also specify your own encoding, as explained in the docs.
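As a hedged sketch of that suggestion (the filename and variable names here are placeholders, not anything Biopython itself produces), opening the fetched text with an explicit encoding looks like this:

import io

# Open the saved Entrez output with an explicit encoding instead of relying
# on the platform default (cp1252 on many Windows setups).
# "entrez_result.xml" is a placeholder filename.
with io.open("entrez_result.xml", "r", encoding="utf-8") as handle:
    text = handle.read()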

Emacs opening and saving encoding

I have a Perl source file in utf-8 encoding, LF ending. It contains English and Chinese characters. The questions are:
1. When I open the file, the encoding is windows-1251-unix. I have to run these commands:
Alt-x revert-buffer-with-coding-system
> Coding system for visited file (default nil):
utf-8-auto-unix
> Revert buffer from file file_name.pl?
y
How can I automatically open it in utf-8-auto-unix?
2. When I edit the file and try to save it, Emacs asks me a question:
> Select coding system (default raw-text):
utf-8-auto-unix
How can I automatically save the file in utf-8-auto-unix and get rid of the question?
You could add this comment to the top of the file:
# -*- coding: utf-8 -*-
Use describe-variable (C-h v) to examine the variable current-language-environment; follow the customize link and set it to "UTF-8".

python 3.0, how to make print() output unicode?

I'm working in WinXP 5.1.2600, writing a Python application involving Chinese pinyin, which has involved me in endless Unicode problems. Switching to Python 3.0 has solved many of them. But the print() function for console output is not Unicode-aware for some odd reason. Here's a teeny program.
print('sys.stdout encoding is "' + sys.stdout.encoding + '"')
str1 = 'lüelā'
print(str1)
Output is (changing angle brackets to square brackets for readability):
sys.stdout encoding is "cp1252"
Traceback (most recent call last):
File "TestPrintEncoding.py", line 22, in [module]
print(str1)
File "C:\Python30\lib\io.py", line 1491, in write
b = encoder.encode(s)
File "C:\Python30\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0101'
in position 4: character maps to [undefined]
Note that ü = \xfc = 252 gives no problem since it's upper ASCII. But ā = \u0101 is beyond 8-bits.
Anyone have an idea how to change the encoding of sys.stdout to 'utf-8'? Bear in mind that Python 3.0 no longer uses the codecs module, if I understand the documentation right.
Apologies, I gave you the program without the preamble. Before the 3 lines given, it starts like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
Unfortunately, the coding specified by the "coding:" line is the coding of the source code, not of the console output. But thank you for your thoughts!
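As a side note, a commonly cited workaround (not taken from the answers below; it only changes the stream's encoding, and whether cmd.exe can display the glyphs is a separate matter) is to rewrap sys.stdout around its underlying byte buffer:

import io
import sys

# Rebind sys.stdout to a UTF-8 text wrapper over the raw byte stream.
# This avoids the cp1252 UnicodeEncodeError; whether the console can
# actually render the characters is a separate issue.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
print('lüelā')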
The Windows command prompt (cmd.exe) cannot display the Unicode characters you are using, even though Python is handling it in a correct manner internally. You need to use IDLE, Cygwin, or another program that can display Unicode correctly.
See this thread for a full explanation:
http://www.nabble.com/unable-to-print-Unicode-characters-in-Python-3-td21670662.html
You may want to try changing the environment variable PYTHONIOENCODING to "utf_8". I have written a page on my ordeal with this problem.
Check out the question and answer here, I think they have some valuable clues. Specifically, note the setdefaultencoding in the sys module, but also the fact that you probably shouldn't use it.
Here's a dirty hack:
# works
import os
os.system("chcp 65001 &")
print("юникод")
However, almost anything breaks it:
simply muting the first line's output already breaks it:
# doesn't work
import os
os.system("chcp 65001 >nul &")
print("юникод")
checking for OS type breaks it:
# doesn't work
import os
if os.name == "nt":
    os.system("chcp 65001 &")
print("юникод")
it doesn't even work with the print under the if block:
# doesn't work
import os
if os.name == "nt":
    os.system("chcp 65001 &")
    print("юникод")
But one can print with cmd's echo:
# works
import os
os.system("chcp 65001 & echo {0}".format("юникод"))
and here's a simple way to make this cross-platform:
# works
import os

def simple_cross_platform_print(obj):
    if os.name == "nt":
        os.system("chcp 65001 >nul & echo {0}".format(obj))
    else:
        print(obj)

simple_cross_platform_print("юникод")
but the trailing empty line from Windows' echo can't be suppressed.
The problem of displaying Unicode characters in Python on Windows is well known. There is no official solution yet. The right thing to do is to use the winapi function WriteConsoleW. It is nontrivial to build a working solution, as there are other related issues. However, I have developed a package which tries to fix Python regarding this issue. See https://github.com/Drekin/win-unicode-console. You can also read there a deeper explanation of the problem. The package is also on pypi (https://pypi.python.org/pypi/win_unicode_console) and can be installed using pip.
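Based on the package's README, typical usage looks something like this (a sketch; exact API details may differ between versions):

# pip install win_unicode_console
import win_unicode_console

# Enabling it swaps the standard stream objects for ones that talk to a
# Windows console via WriteConsoleW/ReadConsoleW.
win_unicode_console.enable()
print('lüelā')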