python encoding conversion - unicode

Here's my problem: I have a wrongly encoded variable that I want to fix. Long story short, I end up with:
myVar=u'\xc3\xa9'
which is wrong because it holds the UTF-8 bytes of the character 'é' (\u00e9) stored in a unicode string, rather than the decoded character itself.
None of the combinations of encode/decode I tried seems to solve the problem. I looked at the bytearray object, but you must provide an encoding, and obviously none of them fits.
Basically I need to reinterpret the bytes using the correct encoding. Any ideas on how to do that?
Thanks.

What you should have done.
>>> b='\xc3\xa9'
>>> b
'\xc3\xa9'
>>> b.decode("UTF-8")
u'\xe9'
Since you didn't show the broken code that caused the problem, all we can do is make a complex problem more complex.
This appears to be what you're seeing.
>>> c
u'\xc3\xa9'
>>> c.decode("UTF-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Here's a workaround.
>>> [ chr(ord(x)) for x in c ]
['\xc3', '\xa9']
>>> ''.join(_)
'\xc3\xa9'
>>> _.decode("UTF-8")
u'\xe9'
Fix the code that produced the wrong stuff to begin with.

The hacky solution: pull out the codepoints with ord, then build characters (length-one strings) out of these with chr, then paste the lot back together and decode.
>>> u = u'\xc3\xa9'
>>> s = ''.join(chr(ord(c)) for c in u)
>>> unicode(s, encoding='utf-8')
u'\xe9'
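Another way to get the same effect, assuming Python 2 and a string whose code points are all below 256, is to round-trip through Latin-1: encoding with latin-1 turns each code point back into the byte with the same value, after which a normal UTF-8 decode applies.
>>> u = u'\xc3\xa9'
>>> u.encode('latin-1')
'\xc3\xa9'
>>> u.encode('latin-1').decode('utf-8')
u'\xe9'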

Related

Uploading csv to postgresql using psycopg2

Hi, I am trying to upload a CSV file to a PostgreSQL database using Python.
A table called "userspk" has already been created in the database called "DVD".
Below is the code:
import pandas as pd
import psycopg2 as pg2

conn = pg2.connect(database='DVD', user=xxx, password=xxx)
cur = conn.cursor()

def upload_data():
    with open('/Users/Downloads/DVDlist.csv', 'r') as f:
        next(f)  # skips the header row
        cur.copy_from(f, 'userspk', sep=',')
    conn.commit()

upload_data()
I keep getting the error below. I would have thought this should be fairly straightforward. Is something wrong with my code?
/Users/pk/.conda/envs/Pk/bin/python /Users/pk/PycharmProjects/Pk/SQL_upload_file.py
Traceback (most recent call last):
File "/Users/pk/PycharmProjects/Pk/SQL_upload_file.py", line 44, in <module>
upload_data()
File "/Users/pk/PycharmProjects/Pk/SQL_upload_file.py", line 37, in upload_data
next(f) # Skip the header row.
File "/Users/pk/.conda/envs/Pk/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 5718: invalid start byte
The error is coming from the next(f) call, so it has nothing to do with psycopg2 or PostgreSQL. It looks like your file contains bytes that Python considers invalid as UTF-8.
The file is probably encoded in Latin-1; byte 0xa3 is the British pound sterling sign (£) in Latin-1.
You might be able to fix it by specifying the encoding when you open the file:
open('/Users/Downloads/DVDlist.csv', 'r', encoding="latin-1")
But the rows after the header might also have some issues.
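For completeness, here is a minimal sketch of how the upload function from the question could look with the encoding specified, assuming the file really is Latin-1 (the path, table name, and connection objects are taken from the question):
def upload_data():
    # Open the CSV with an explicit encoding so that reading it does not
    # try to interpret the 0xa3 byte (£) as UTF-8.
    with open('/Users/Downloads/DVDlist.csv', 'r', encoding='latin-1') as f:
        next(f)  # skip the header row
        cur.copy_from(f, 'userspk', sep=',')
    conn.commit()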

Trying to upload specific characters in Python 3 using Windows Powershell

I'm running this code in Windows PowerShell. It reads a file called languages.txt, and I'm trying to convert between bytes and strings:
Here is languages.txt:
Afrikaans
አማርኛ
Аҧсшәа
العربية
Aragonés
Arpetan
Azərbaycanca
Bamanankan
বাংলা
Bân-lâm-gú
Беларуская
Български
Boarisch
Bosanski
Буряад
Català
Чӑвашла
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
فارسی
Français
Frysk
Gaelg
Gàidhlig
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Ido
Interlingua
Italiano
עברית
ಕನ್ನಡ
Kapampangan
ქართული
Қазақша
Kreyòl ayisyen
Latgaļu
Latina
Latviešu
Lëtzebuergesch
Lietuvių
Magyar
Македонски
Malti
मराठी
მარგალური
مازِرونی
Bahasa Melayu
Монгол
Nederlands
नेपाल भाषा
日本語
Norsk bokmål
Nouormand
Occitan
Oʻzbekcha/ўзбекча
ਪੰਜਾਬੀ
پنجابی
پښتو
Plattdüütsch
Polski
Português
Română
Romani
Русский
Seeltersk
Shqip
Simple English
Slovenčina
کوردیی ناوەندی
Српски / srpski
Suomi
Svenska
Tagalog
தமிழ்
ภาษาไทย
Taqbaylit
Татарча/tatarça
తెలుగు
Тоҷикӣ
Türkçe
Українська
اردو
Tiếng Việt
Võro
文言
吴语
ייִדיש
中文
Then, here is the code I used:
import sys
script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):
    line = language_file.readline()
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    print(raw_bytes, "<===>", cooked_string)

languages = open("languages.txt", encoding="utf-8")
main(languages, input_encoding, error)
Here's the output (credit: Learn Python 3 the Hard Way by Zed A. Shaw).
I don't know why it doesn't display the characters and shows question-mark boxes instead. Can anyone help me?
The first string that fails is አማርኛ. Its first character, አ, is U+12A0, which in UTF-8 is b'\xe1\x8a\xa0'. So that part is obviously fine; the file really is UTF-8.
Printing did not raise an exception, so your output encoding can handle all of the characters. Everything is fine.
The only remaining reason I see for it to fail is that the font used in the console does not support all of the characters.
If it is just for play, you should not worry about it. Consider it working correctly.
On the other hand, I would suggest changing some things in your code:
You are running main recursively for each line. There is absolutely no need for that, and it would hit the recursion depth limit on a longer file. Use a for loop instead:
for line in lines:
    print_line(line, encoding, errors)
You are opening the file as UTF-8, so reading from it automatically decodes the UTF-8 bytes into a string; you then encode it back into raw_bytes and decode it again into cooked_string, which is the same as line. It would be a better exercise to read the file as raw binary, split it on newlines, and then decode. Then you'd have a clearer picture of what is going on.
with open("languages.txt", 'rb') as f:
raw_file_contents = f.read()
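Continuing from that snippet, a minimal sketch of the rest of the exercise; the splitlines loop is my own addition rather than part of the original answer:
# raw_file_contents is a bytes object; split it on newlines and decode
# each line explicitly, so the bytes-to-str step is visible.
for raw_bytes in raw_file_contents.splitlines():
    cooked_string = raw_bytes.decode("utf-8")
    print(raw_bytes, "<===>", cooked_string)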

Unicode Decode Error: ordinal not in range

So I am trying to make the values in a CSV file Unicode so that a program I am using is able to read them. Here is my code, in Python 2.7, which I HAVE to use:
import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        encoded_tweet = row["Tweet"].encode('utf-8')
        TEST_SENTENCES.append(encoded_tweet)
I continue to receive the same error message and have not been able to find anything that works. Here is the error message. I am sure someone out there has a really easy solution.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 127: ordinal not in range(128)
As an update, I changed encode to decode, and the program says it is running the predictions. It has been a while, but it would not have started if the values were not properly decoded, so let us pray.
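For anyone hitting the same error, a minimal Python 2 sketch of that change, assuming the CSV is UTF-8 encoded (the file name and the "Tweet" column come from the question):
import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # row["Tweet"] is a byte string in Python 2. Calling .encode() on it
        # triggers an implicit ASCII decode first, which is what raised the
        # UnicodeDecodeError on byte 0xe2. Decoding it explicitly avoids that.
        decoded_tweet = row["Tweet"].decode('utf-8')
        TEST_SENTENCES.append(decoded_tweet)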

Extract the first letter of a UTF-8 string with Lua

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø".
Is there a relatively simple UTF-8 parsing algorithm I could use on the string byte per byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
    return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 encoded code point either begins with a byte from 0 to 127, or with a byte from 194 to 244 followed by one or several bytes from 128 to 191.
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
Lua 5.3 provides a UTF-8 library.
You can use utf8.codes to get each code point, and then use utf8.char to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
    print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
    print(w)
end
where utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches one UTF-8 byte sequence.

unicode().decode('utf-8', 'ignore') raising UnicodeEncodeError

Here is the code:
>>> z = u'\u2022'.decode('utf-8', 'ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2022' in position 0: ordinal not in range(256)
Why is UnicodeEncodeError raised when I am using .decode?
Why is any error raised when I am using 'ignore'?
When I first started messing around with Python strings and Unicode, it took me a while to understand the jargon of decode and encode too, so here's an explanation that may help:
Think of decoding as what you do to go from a regular bytestring to unicode and encoding as what you do to get back from unicode. In other words:
You de-code a str to produce a unicode string (in Python 2)
and en-code a unicode string to produce a str (in Python 2)
So:
unicode_char = u'\xb0'
encodedchar = unicode_char.encode('utf-8')
encodedchar will contain your Unicode character, encoded as a byte string in the selected encoding (in this case, UTF-8).
The same principle applies to Python 3. You de-code a bytes object to produce a str object. And you en-code a str object to produce a bytes object.
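A quick Python 3 illustration of that round trip, using the degree sign from the example above:
>>> raw = '\xb0'.encode('utf-8')   # str -> bytes
>>> raw
b'\xc2\xb0'
>>> raw.decode('utf-8')            # bytes -> str
'°'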
From http://wiki.python.org/moin/UnicodeEncodeError
Paradoxically, a UnicodeEncodeError may happen when decoding. The cause of it seems to be the coding-specific decode() functions that normally expect a parameter of type str. It appears that on seeing a unicode parameter, the decode() functions "down-convert" it into str, then decode the result assuming it to be of their own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.
You're trying to decode a unicode string. The implicit conversion to str that has to happen before the decode can run is what's failing.
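In other words, with the bullet character from the question (a minimal Python 2 sketch; the 'ignore' argument never gets a chance to act because the failure happens in the implicit encode, not in the decode itself):
z = u'\u2022'                     # already unicode; there is nothing to decode

utf8_bytes = z.encode('utf-8')    # encoding is the operation that applies here
# utf8_bytes == '\xe2\x80\xa2'

back_again = utf8_bytes.decode('utf-8')   # decoding only makes sense on a byte string
# back_again == u'\u2022'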