UnicodeDecodeError: ordinal not in range

I am trying to convert the values in a CSV file to Unicode so that a program I am using can read them. Here is my code, in Python 2.7, which I HAVE to use:
import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        encoded_tweet = row["Tweet"].encode('utf-8')
        TEST_SENTENCES.append(encoded_tweet)
I keep getting the same error message and have not been able to find anything that works. Here is the error message. I am sure someone out there has a really easy solution.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 127: ordinal not in range(128)
As an update, I changed encode to decode, and the program now says it is running the predictions. It has been running for a while, but it would not have started if the input were not Unicode, so let us pray.
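For anyone landing here later, a minimal sketch of why the update works, assuming the file is UTF-8 encoded. In Python 2 the csv module hands back byte strings, and calling .encode() on a byte string makes Python first decode it implicitly as ASCII, which is where the error comes from. The sample value and the to_text helper are made up for illustration:

```python
# Hypothetical cell value as Python 2's csv module would return it: raw UTF-8 bytes.
raw_tweet = b"it\xe2\x80\x99s fine"  # contains U+2019, a curly apostrophe

def to_text(raw, encoding="utf-8"):
    """Decode a byte string to Unicode text (illustrative helper, not part of csv)."""
    return raw.decode(encoding)

tweet = to_text(raw_tweet)

# Roughly what Python 2 attempts behind the scenes for raw_tweet.encode('utf-8'):
try:
    raw_tweet.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc  # byte 0xe2 is outside the ASCII range
```

In the loop from the question, this corresponds to row["Tweet"].decode('utf-8') instead of .encode('utf-8').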

Related

Error decoding ciphertext with RSA using pycryptodome

I'm using pycryptodome to encrypt a text in this way:
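from Crypto.Cipher import PKCS1_OAEP
from Crypto.PublicKey import RSA

def encrypt_with_rsa(plain_text):
    # First, set up public-key encryption
    cipher_pub_obj = PKCS1_OAEP.new(RSA.importKey(my_public_key))
    # Encrypt the UTF-8 bytes of the plain text
    _secret_byte_obj = cipher_pub_obj.encrypt(plain_text.encode())
    return _secret_byte_obj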
After that, I want to split the whole string into 8 parts and take a character from each one. What I receive has this form:
b"gi?\xf4\xa8{\xe8\x1b\xec8\xd5\x96*,t\xad\xb8D=\rCGq\xc5\xed........"
So I tried to decode that as UTF-8, which throws an error:
'utf-8' codec can't decode byte 0x81 in position 5: invalid start byte
I tried UTF-16 and UTF-32 too, but neither works. I also tried latin1, but it seems to ignore some parts and I don't want that. The encrypted text above is changed to:
"gi?ô¨{è\x1bì8Õ\x96*,t\xad¸D=\rCGqÅíh\x1b\x84\x0eí ó=#ÉîKô4B......"
Some parts are the same. I don't know what type of encoding it uses. What can I do to use the encrypted text?
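One common approach, sketched here under assumptions: ciphertext is arbitrary binary data rather than text in any character encoding, so instead of decoding it you can represent it as Base64, which is plain ASCII and can be split and rejoined safely. The bytes below are just a stand-in copied from the truncated sample in the question, not real RSA output, and the 8-way split is my reading of what the asker wants:

```python
import base64

# Stand-in for the bytes returned by cipher_pub_obj.encrypt(); not real RSA output.
ciphertext = b"gi?\xf4\xa8{\xe8\x1b\xec8\xd5\x96*,t\xad\xb8D=\rCGq\xc5\xed"

# Base64 turns arbitrary bytes into printable ASCII text.
encoded = base64.b64encode(ciphertext).decode("ascii")

# Split the text into at most 8 roughly equal parts (ceiling division for the size).
chunk = -(-len(encoded) // 8)
parts = [encoded[i:i + chunk] for i in range(0, len(encoded), chunk)]

# Joining the parts and decoding gives back the exact original bytes.
restored = base64.b64decode("".join(parts))
```

The restored bytes are what you would pass to decrypt() later. Decoding with latin1 also round-trips every byte value, but the result contains control characters that are awkward to print or store.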

Unable to open HTML file with Chinese characters

Hi everyone, I ran into trouble when trying to open an HTML file containing Chinese characters. Here is the code:
import wget

# Problem with Chinese characters
file = wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file, encoding='utf-8') as f:
    html = f.read()
print(html)
However, I get the following error:
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 535: invalid continuation byte
I searched for a while and saw some similar issues, but the solutions seem to use latin-1, which is obviously not the case here. I'm not sure which encoding to use.
Any suggestions? Thanks.
The page you are referring to is not encoded in UTF-8, but in GBK. You can tell by looking at the header:
<meta charset="GBK">
If you specify encoding='gbk' it'll work.
On another note, I would avoid wget unless you have to use it, and go with urllib instead, which comes with the Python Standard Library. It also saves the disk write, and the code is simpler:
import urllib.request

with urllib.request.urlopen("http://nba.stats.qq.com/player/list.htm") as file:
    html = file.read()
print(html.decode('gbk'))
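If you would rather not hard-code the encoding, here is a rough sketch of reading the charset from the meta tag before decoding. The sniff_meta_charset helper is hypothetical and only handles the simple cases. It is not a full HTML encoding detector, and the small page below stands in for the bytes returned by file.read():

```python
import re

def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in an HTML <meta> tag, or `default` if none is found.

    Illustrative helper: looks only at the first 2048 bytes with a simple regex.
    """
    match = re.search(br'<meta[^>]+charset=["\']?([\w-]+)', raw[:2048], re.I)
    return match.group(1).decode("ascii").lower() if match else default

# Small stand-in page encoded as GBK; the real bytes would come from urlopen(...).read().
page = '<html><head><meta charset="GBK"><title>球员</title></head></html>'.encode("gbk")

# Decode using whatever charset the page declares.
html = page.decode(sniff_meta_charset(page))
```

This works because the meta tag itself is plain ASCII, so it can be found in the raw bytes before the encoding is known.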

How to save non-ASCII characters in MongoDB

This question has been asked before, but I cannot find an answer that fits my context. I am trying to save Aéropostale as a string in MongoDB:
name='Aéropostale'
obj=Mongo_Object()
obj.name=name
obj.save()
When I save the object, I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
How can I save the string in its original form and retrieve it in the same form?
As you are using Python 2.7, you need to do a few things:
Specify the file encoding by adding a comment similar to this at the top of your file:
#coding: utf8
Use a unicode string, as your string is not ASCII, and specify the encoding. I am using utf8 here, which can represent é:
name = unicode('Aéropostale', 'utf8')

Python 3 CGI: how to output raw bytes

I decided to use Python 3 for making my website, but I encountered a problem with Unicode output.
It seems like a plain print(html) (where html is a str) should work, but it doesn't. I get UnicodeEncodeError: 'ascii' codec can't encode characters[...]: ordinal not in range(128). This must be because the web server doesn't support Unicode output.
The next thing I tried was print(html.encode('utf-8')), but I got something like the repr output of the byte string: it is wrapped in b'...' and all the escape sequences appear in raw form (e.g. \n and \xd0\x9c).
Please show me the correct way to output a Unicode (str) string as a raw UTF-8 encoded byte string in Python 3.1.
The problem here is that your stdout isn't attached to an actual terminal and will use the ASCII encoding by default. Therefore you need to write to sys.stdout.buffer, which is the "raw" binary output of sys.stdout. This can be done in various ways; the most common one seems to be:
import codecs, sys
writer = codecs.getwriter('utf8')(sys.stdout.buffer)
And then use writer. In a CGI script you may be able to replace sys.stdout with the writer, so:
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
might actually work, so you can print normally. Try that!
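Another way to do the same thing, as a small sketch: encode the string yourself and write the bytes straight to the binary stream. The write_utf8 helper is a name I made up. The demo writes to an in-memory stream so it can be run anywhere, but by default it targets sys.stdout.buffer:

```python
import io
import sys

def write_utf8(text, stream=None):
    """Encode `text` as UTF-8 and write the raw bytes to a binary stream.

    Illustrative helper; `stream` defaults to sys.stdout.buffer.
    """
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(text.encode("utf-8"))
    stream.flush()

# Demonstration against an in-memory binary stream instead of the real stdout.
fake_stdout = io.BytesIO()
write_utf8("Content-Type: text/html; charset=utf-8\r\n\r\n<p>Москва</p>\n", fake_stdout)
```

Unlike print(html.encode('utf-8')), which prints the repr of the bytes object, this sends the bytes themselves, so the browser receives real UTF-8.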

unicode().decode('utf-8', 'ignore') raising UnicodeEncodeError

Here is the code:
>>> z = u'\u2022'.decode('utf-8', 'ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2022' in position 0: ordinal not in range(256)
Why is UnicodeEncodeError raised when I am using .decode?
Why is any error raised when I am using 'ignore'?
When I first started messing around with Python strings and Unicode, it took me a while to understand the jargon of decode and encode too, so here's an earlier post of mine that may help:
Think of decoding as what you do to go from a regular bytestring to unicode and encoding as what you do to get back from unicode. In other words:
You de-code a str to produce a unicode string (in Python 2)
and en-code a unicode string to produce a str (in Python 2)
So:
unicode_char = u'\xb0'
encodedchar = unicode_char.encode('utf-8')
encodedchar will contain your unicode character, displayed in the selected encoding (in this case, utf-8).
The same principle applies to Python 3. You de-code a bytes object to produce a str object. And you en-code a str object to produce a bytes object.
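To make the Python 3 half concrete, here is a short sketch of the round trip using the bullet character from the question. The variable names are just for illustration:

```python
text = "\u2022"                  # str: a single code point, BULLET
data = text.encode("utf-8")      # encode: str -> bytes
back = data.decode("utf-8")      # decode: bytes -> str

# In Python 3 a str has no .decode() at all, so the mistake from the
# question cannot be written; only bytes objects can be decoded.
str_has_decode = hasattr(text, "decode")
```

Here data holds the three UTF-8 bytes of the bullet, and back compares equal to text.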
From http://wiki.python.org/moin/UnicodeEncodeError
Paradoxically, a UnicodeEncodeError may happen when
decoding. The cause of it seems to be the
coding-specific decode() functions that normally expect
a parameter of type str. It appears that on seeing a
unicode parameter, the decode() functions "down-convert"
it into str, then decode the result assuming it to be of
their own coding. It also appears that the
"down-conversion" is performed using the ASCII encoder.
Hence an encoding failure inside a decoder.
You're trying to decode a unicode object. The implicit encoding that Python 2 performs first, to get a str it can decode, is what's failing.