Here is the code:
>>> z = u'\u2022'.decode('utf-8', 'ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2022' in position 0: ordinal not in range(256)
Why is UnicodeEncodeError raised when I am using .decode?
Why is any error raised when I am using 'ignore'?
When I first started messing around with Python strings and Unicode, it took me a while to understand the jargon of decode and encode too, so here's an explanation that may help:
Think of decoding as what you do to go from a regular bytestring to unicode and encoding as what you do to get back from unicode. In other words:
You de-code a str to produce a unicode string (in Python 2)
and en-code a unicode string to produce a str (in Python 2)
So:
unicode_char = u'\xb0'
encodedchar = unicode_char.encode('utf-8')
encodedchar will contain the bytes that represent your Unicode character in the selected encoding (in this case UTF-8, giving the two bytes '\xc2\xb0').
The same principle applies to Python 3. You de-code a bytes object to produce a str object. And you en-code a str object to produce a bytes object.
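For instance, a quick sketch of the same round trip in Python 3 (the bullet character is just an arbitrary example):
s = '\u2022'              # a str holding the bullet character
b = s.encode('utf-8')     # en-code str -> bytes: b'\xe2\x80\xa2'
s2 = b.decode('utf-8')    # de-code bytes -> str, giving back '\u2022'
assert s == s2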
From http://wiki.python.org/moin/UnicodeEncodeError
Paradoxically, a UnicodeEncodeError may happen when decoding. The cause of it seems to be the coding-specific decode() functions that normally expect a parameter of type str. It appears that on seeing a unicode parameter, the decode() functions "down-convert" it into str, then decode the result assuming it to be of their own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.
You're trying to decode a unicode object. The implicit encoding that has to happen first, to give the decoder a str to work on, is what's failing.
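As a rough sketch of that implicit step (not the exact C-level behaviour), on a stock Python 2 install, where the default encoding is ASCII, the failing call effectively does something like this:
>>> u'\u2022'.encode('ascii').decode('utf-8', 'ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 0: ordinal not in range(128)
Note that 'ignore' only applies to the explicit decode step, which is never reached; the implicit encode raises first, which is why the error handler seems to have no effect.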
I am trying to make the values in a CSV file Unicode so that a program I am using is able to read them. Here is my code, in Python 2.7, which I HAVE to use:
import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        encoded_tweet = row["Tweet"].encode('utf-8')
        TEST_SENTENCES.append(encoded_tweet)
I continue to receive the same error message and have not been able to find anything that works. Here is the error message; I am sure someone out there can offer a really easy solution.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 127: ordinal not in range(128)
As an update, I changed encode to decode, and the program says it is running the predictions. It has been a while, but it would not have started if the values were not Unicode, so let us pray.
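For reference, a minimal sketch of the corrected loop (same file and column name as in the question): csv in Python 2 hands you byte strings, so the values need to be de-coded from UTF-8 to produce unicode objects, not encoded again:
import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # row["Tweet"] is a UTF-8 byte string; decode it into a unicode object
        decoded_tweet = row["Tweet"].decode('utf-8')
        TEST_SENTENCES.append(decoded_tweet)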
Assuming I have a binary
Message = <<"string containing emoji">>.
How do I properly encode it in Unicode? I tried doing:
Encoded = <<Message/utf16>>.
I get this warning when compiling the file:
Warning: binary construction will fail with a 'badarg' exception
(invalid Unicode code point in a utf8/utf16/utf32 segment)
I tried this with /utf8 as well. Same warning.
Assuming that the binary you start with is encoded according to UTF-8, and you need to encode it as little-endian UTF-16, this should work:
unicode:characters_to_binary(<<"string containing emoji">>, utf8, {utf16, little})
See the documentation for the Unicode module for more information.
The reason why <<Message/utf16>> fails is that the utf8, utf16 and utf32 specifiers in bit syntax encode a single codepoint, not an entire string. So to encode the character U+1F64C, you could use:
2> <<16#1f64c/utf8>>.
<<240,159,153,140>>
3> <<16#1f64c/utf16>>.
<<"\330=\336L">>
4> <<16#1f64c/utf32>>.
<<0,1,246,76>>
You may need to add a coding comment such as %% -*- coding: utf-8 -*- as the first line of your module, and use /utf8.
My guess is that you are using Erlang/OTP < 17, where source files are assumed to be latin-1 unless specified otherwise.
Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø".
Is there a relatively simple UTF-8 parsing algorithm I could use on the string byte per byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
    return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 encoded code point is either a single byte in the range 0 to 127, or a leading byte in the range 194 to 244 followed by one or more continuation bytes in the range 128 to 191.
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
Lua 5.3 provides a UTF-8 library.
You can use utf8.codes to get each code point, and then use utf8.char to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
    print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
    print(w)
end
where utf8.charpattern is simply the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches one UTF-8 byte sequence.
Here's my problem, I have a variable wrongly encoded that I want to fix. Long story short, I end up with:
myVar=u'\xc3\xa9'
which is wrong because it holds the UTF-8 bytes of the character 'é' (\u00e9) inside a unicode object, rather than the character itself.
None of the combinations of encode/decode I tried seem to solve the problem. I looked towards the bytearray object, but you must provide an encoding, and obviously none of them fits.
Basically I need to reinterpret the byte array into the correct encoding. Any ideas on how to do that?
Thanks.
What you should have done.
>>> b='\xc3\xa9'
>>> b
'\xc3\xa9'
>>> b.decode("UTF-8")
u'\xe9'
Since you didn't show the broken code that caused the problem, all we can do is make a complex problem more complex.
This appears to be what you're seeing.
>>> c
u'\xc3\xa9'
>>> c.decode("UTF-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Here's a workaround.
>>> [ chr(ord(x)) for x in c ]
['\xc3', '\xa9']
>>> ''.join(_)
'\xc3\xa9'
>>> _.decode("UTF-8")
u'\xe9'
Fix the code that produced the wrong stuff to begin with.
The hacky solution: pull out the codepoints with ord, then build characters (length-one strings) out of these with chr, then paste the lot back together and decode.
>>> u = u'\xc3\xa9'
>>> s = ''.join(chr(ord(c)) for c in u)
>>> unicode(s, encoding='utf-8')
u'\xe9'
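A shorter equivalent, assuming every code point in the broken value is below 256 (as it is here): encoding with latin-1 maps code points 0-255 straight back to the original bytes, which can then be decoded as UTF-8.
>>> u = u'\xc3\xa9'
>>> u.encode('latin-1')
'\xc3\xa9'
>>> u.encode('latin-1').decode('utf-8')
u'\xe9'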
I decided to use Python 3 for making my website, but I encountered a problem with Unicode output.
It seems like plain print(html) (where html is a str) should work, but it doesn't. I get UnicodeEncodeError: 'ascii' codec can't encode characters[...]: ordinal not in range(128). This must be because the web server doesn't support Unicode output.
The next thing I tried was print(html.encode('utf-8')), but I got something like the repr output of the byte string: it is wrapped in b'...' and all the escape characters appear in raw form (e.g. \n and \xd0\x9c).
Please show me the correct way to output a Unicode (str) string as a raw UTF-8 encoded byte string in Python 3.1.
The problem here is that your stdout isn't attached to an actual terminal and will use the ASCII encoding by default. Therefore you need to write to sys.stdout.buffer, which is the "raw" binary output underlying sys.stdout. This can be done in various ways; the most common one seems to be:
import codecs, sys
writer = codecs.getwriter('utf8')(sys.stdout.buffer)
And then use writer. In a CGI script you may be able to replace sys.stdout with the writer, so that:
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
might actually work, letting you print normally. Try that!
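If replacing sys.stdout feels too invasive, an alternative sketch is to bypass the text layer and write the encoded bytes yourself at the few places that need it (html here stands for whatever str you want to emit):
import sys

html = '<p>\u041c\u0438\u0440</p>'   # example str containing non-ASCII (Cyrillic) characters
sys.stdout.buffer.write(html.encode('utf-8'))
sys.stdout.buffer.flush()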