Unicode confusion #3423435 - unicode

Once again I enter that goddamn unicode-hell ... sigh =(
There are two files:
$ file *
kreise_tmp.geojson: ASCII text
pandas_tmp.csv: UTF-8 Unicode text
I read the first file like this:
with open('kreise_tmp.geojson') as f:
jdata = json.loads(f.read())
I read the second file like this:
pandas_data = pd.read_csv(r'pandas_tmp.csv', sep=";")
Now check out what's inside the strings:
>>> jdata['features'][0]['properties']['name']
u'Kreis Euskirchen' # a unicode string?
>>> pandas_data['kreis'][0]
'Kreis D\xc3\xbcren' # not a unicode string?
Why are the strings from the "UTF-8 Unicode text" file just normal strings and the strings from the "ASCII text" file unicode strings?

JSON strings are always Unicode.
~$ python2
>>> import json
>>> json.loads('"\xc3\xbc"')
u'\xfc'
But they are often serialized with \u escapes, so file will only see ASCII.
>>> json.dumps(_)
'"\\u00fc"'

add encoding='utf-8' to the opening of files to decode them with utf-8
pandas_data = pd.read_csv(r'pandas_tmp.csv', sep=";", encoding='utf8')
you can also do the same with the JSON
with open('kreise_tmp.geojson', encoding='utf8') as f:
jdata = json.loads(f.read())
Also in Python 2.7, you can add this to the top of the file..
#!/usr/bin/env python
# -*- coding: utf-8 -*-

Related

How to convert utf8 to text in dart

I want to convert special character utf8 value to it's text.
For example, if input is %20, the output will be whitespace
if input is %23, the output will be #
void main() {
var raw = 'Hello%20Bebop%23yahoo';
var parsed = Uri.decodeComponent(raw);
print(parsed);
}
Result:
Hello Bebop#yahoo
Looks like you converting to an ASCCI like encoding. What function you are using for that result?
Try finding the encoding table to your %20 and %23 output, so you will see where you heading at the moment.

Decode a string with both Unicode and Utf-8 codes in Python 2.x

Say we have a string:
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
Somehow two symbols, '—', whose Unicode is \u2014 was not correctly encoded as '\xe2\x80\x94' in UTF-8. Is there an easy way to decode this string? It should be decoded as 寒假——厦门
Manually using the replace function is OK:
t = u'\u2014'
s.replace('\u2014', t.encode('utf-8')
print s
However, it is not automatic. If we extract the Unicode,
index = s.find('\u')
t = s[index : index+6]
then t = '\\u2014'. How to convert it to UTF-8 code?
You're missing extra slashes in your replace()
It should be:
s.replace("\\u2014", u'\u2014'.encode("utf-8") )
Check my warning in the comments of the question. You should not end up in this situation.

unicode error preventing creation of text file

What is causing this error and how can I fix it?
(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have also tried reading different files in the same directory an get this same unicode error as well.
file1 = open("C:\Users\Cameron\Desktop\newtextdocument.txt", "w")
for i in range(1000000):
file1.write(str(i) + "\n")
You should escape backslashes inside the string literal. Compare:
>>> print("\U00000023") # single character
#
>>> print(r"\U00000023") # raw-string literal with
\U00000023
>>> print("\\U00000023") # 10 characters
\U00000023
>>> print("a\nb") # three characters (literal newline)
a
b
>>> print(r"a\nb") # four characters (note: `r""` prefix)
a\nb
\U is being treated as the start of a Unicode literal. Use a raw string (a preceding r) to prevent this translation:
>>> 'C:\Users'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
>>> r'C:\Users'
'C:\\Users'

UnicodeEncodeErrors while using DictWriter for utf-8

I am trying to write a dictionary containing utf-8 strings to a CSV. I'm following the instructions from here. However, despite meticulously encoding and decoding these utf-8 strings, I am getting a UnicodeEncodeErrors involving 'ascii' sets.
I have a list of dictionaries which contain strings and ints as values related to changes to Wikipedia articles. The list below corresponds to this change, for example:
edgelist = [{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
The problem is list[1]['editorName']. It has type 'str' and el[1]['editorName'].decode('utf-8') is u'Eep\xb2'
The code I am attempting is:
_ENCODING = 'utf-8'
def dictToCSV(edgelist,output_file):
with codecs.open(output_file,'wb',encoding=_ENCODING) as f:
w = csv.DictWriter(f,sorted(edgelist[0].keys()))
w.writeheader()
for d in edgelist:
for k,v in d.items():
if type(v) == int:
d[k]=str(v).encode(_ENCODING)
w.writerow({k:v.decode(_ENCODING) for k,v in d.items()})
This returns:
dictToCSV(edgelist,'test2.csv')
File "csv_to_charts.py", line 129, in dictToCSV
w.writerow({k:v.decode(_ENCODING,'ignore') for k,v in d.items()})
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 3: ordinal not in range(128)
Other permutations such as swapping decode for encode or nothing in the final problematic line also return errors:
w.writerow({k:v.encode(_ENCODING) for k,v in d.items()}) returns 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
w.writerow({k:v for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
Following this, I changed with codecs.open(output_file,'wb',encoding=_ENCODING) as f: to with open(output_file,'wb') as f: and still receive the same error.
Excluding the list element(s) or the keys containing this problematic string, the script works fine otherwise.
I just edited your code as follows and the csv was written successfully.
from django.utils.encoding import smart_str
import csv
def dictToCSV(edgelist, output_file):
f = open(output_file, 'wb')
w = csv.DictWriter(f, fieldnames=sorted(edgelist[0].keys()))
w.writeheader()
for d in edgelist:
w.writerow(dict(k=smart_str(v)) for k, v in d.items())
f.close()
Copy the Django code and customize it to your need.
A strict interpretation of ASCII encoding only allows ordinals 0-127. Any value outside that range is not ASCII by definition. Since both \xc2 & \xb2 have ordinals higher than 127, they cannot be interpreted as ASCII.
I'm not a Python user, the RFC for CSV mentions ASCII as a common usage but defines an optional 'charset' parameter for the MIME type; I wonder if the writer you're using also might have an 'encoding' setting?
Your strings are already in UTF-8, and DictWriter doesn't work with codecs.open. Following that example:
# coding: utf-8
import csv
edgelist = [
{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
with open('out.csv','wb') as f:
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
w = csv.DictWriter(f,sorted(edgelist[0].keys()))
w.writeheader()
for d in edgelist:
w.writerow(d)
Output:
articleName,bytesAdded,editorName,revID
Barack Obama,183,Schonbrunn,121844749
Barack Obama,107,Eep²,121862749
Note, you can use 'editorName': 'Eep²' directly instead of 'editorName': 'Eep\xc2\xb2'. The byte string will be UTF-8-encoded per the # coding: utf-8 and if you save the source file in UTF-8.

Base64ing Unicode characters

Can Unicode characters be encoded and decoded with Base64?
I have attempted to encode the string 'الله', but when I decoded it all I got was '????'.
Base64 converts binary to text. If you want to convert text to a base64 format, you'll need to convert the text to binary using some appropriate encoding (e.g. UTF-8, UTF-16) first.
Of course they can. It depends on how your language or Base64 routine handles Unicode input. For example, Python's b64 routines expect an encoded string (as Base64 encodes binary to text, not Unicode codepoints to text).
Python 2.5.1 (r251:54863, Jul 31 2008, 22:53:39)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 'ûñö'
>>> import base64
>>> base64.b64encode(a)
'w7vDscO2'
>>> base64.b64decode('w7vDscO2')
'\xc3\xbb\xc3\xb1\xc3\xb6'
>>> print '\xc3\xbb\xc3\xb1\xc3\xb6'
ûñö
>>>
>>> u'üñô'
u'\xfc\xf1\xf4'
>>> base64.b64encode(u'\xfc\xf1\xf4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/base64.py", line 53, in b64encode
encoded = binascii.b2a_base64(s)[:-1]
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)
>>> base64.b64encode(u'\xfc\xf1\xf4'.encode('utf-8'))
'w7zDscO0'
>>> base64.b64decode('w7zDscO0')
'\xc3\xbc\xc3\xb1\xc3\xb4'
>>> print base64.b64decode('w7zDscO0')
üñô
>>> a = 'الله'
>>> a
'\xd8\xa7\xd9\x84\xd9\x84\xd9\x87'
>>> base64.b64encode(a)
'2KfZhNmE2Yc='
>>> b = base64.b64encode(a)
>>> print base64.b64decode(b)
الله
You didn't specify which language(s) you're using, but try converting the string to a byte array (however that's done in your language of choice) and then base64 encoding that byte array.
In .NET you can try this (encode):
byte[] encbuf;
encbuf = System.Text.Encoding.Unicode.GetBytes(input);
string encoded = Convert.ToBase64String(encbuf);
...and to decode:
byte[] decbuff;
decbuff = Convert.FromBase64String(this.ToString());
string decoded = System.Text.Encoding.Unicode.GetString(decbuff);