I am parsing multiple worksheets of unicode data and creating a dictionary for specific cells in each sheet, but I am having trouble decoding the unicode data. A small snippet of the code is below:
for key in shtDict:
    sht = wb[key]
    for row in sht.iter_rows('A:A', row_offset=1):
        for cell in row:
            if isinstance(cell.value, unicode):
                if "INC" in cell.value:
                    shtDict[key] = cell.value
The output of this section is:
{'60071508': u'\ufeffReason: INC8595939', '60074426': u'\ufeffReason. Ref INC8610481', '60071539': u'\ufeffReason: INC8603621'}
I tried to properly decode the data, following u'\ufeff' in Python string, by changing the last line to:
shtDict[key] = cell.value.decode('utf-8-sig')
But I get the following error:
Traceback (most recent call last):
File "", line 55, in <module>
shtDict[key] = cell.value.decode('utf-8-sig')
File "C:\Python27\lib\encodings\utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
Not sure what the issue is, I have also tried decoding with 'utf-16', but I get the same error. Can anyone help with this?
Just make it simpler: that u'\ufeff' is a BOM (byte order mark), which you can safely ignore, so just strip it:
shtDict[key] = cell.value.replace(u'\ufeff', '', 1)
Note: cell.value is already of type unicode (you just checked that with isinstance), so you cannot decode it again.
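For illustration, a minimal Python 2 sketch of the failure and the fix (the sample value is taken from the output above): calling .decode() on a unicode object makes Python 2 implicitly encode it with the default 'ascii' codec first, which is exactly the UnicodeEncodeError in your traceback.
value = u'\ufeffReason: INC8595939'
# value.decode('utf-8-sig')  # raises UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff'
print repr(value.replace(u'\ufeff', '', 1))  # u'Reason: INC8595939'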
Once again I enter that goddamn unicode-hell ... sigh =(
There are two files:
$ file *
kreise_tmp.geojson: ASCII text
pandas_tmp.csv: UTF-8 Unicode text
I read the first file like this:
with open('kreise_tmp.geojson') as f:
    jdata = json.loads(f.read())
I read the second file like this:
pandas_data = pd.read_csv(r'pandas_tmp.csv', sep=";")
Now check out what's inside the strings:
>>> jdata['features'][0]['properties']['name']
u'Kreis Euskirchen' # a unicode string?
>>> pandas_data['kreis'][0]
'Kreis D\xc3\xbcren' # not a unicode string?
Why are the strings from the "UTF-8 Unicode text" file just normal strings and the strings from the "ASCII text" file unicode strings?
JSON strings are always Unicode.
~$ python2
>>> import json
>>> json.loads('"\xc3\xbc"')
u'\xfc'
But they are often serialized with \u escapes, so the file utility only sees ASCII.
>>> json.dumps(_)
'"\\u00fc"'
Add encoding='utf-8' when opening the files so they are decoded as UTF-8:
pandas_data = pd.read_csv(r'pandas_tmp.csv', sep=";", encoding='utf8')
You can do the same with the JSON file. Note that in Python 2 the built-in open() has no encoding parameter, so use io.open:
import io

with io.open('kreise_tmp.geojson', encoding='utf8') as f:
    jdata = json.loads(f.read())
Also, in Python 2.7 you can add this to the top of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
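That declaration only tells the interpreter how to decode the source file itself, so non-ASCII string literals work; a minimal sketch (assuming the file is saved as UTF-8):
# -*- coding: utf-8 -*-
s = 'Kreis Düren'    # byte string 'Kreis D\xc3\xbcren'
u = u'Kreis Düren'   # unicode string u'Kreis D\xfcren'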
Say we have a string:
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
Somehow the two '—' symbols, whose code point is \u2014, were not correctly encoded as '\xe2\x80\x94' in UTF-8. Is there an easy way to decode this string? It should decode to 寒假——厦门.
Manually using the replace function is OK:
t = u'\u2014'
s = s.replace('\u2014', t.encode('utf-8'))
print s
However, it is not automatic. If we extract the Unicode,
index = s.find('\u')
t = s[index : index+6]
then t == '\\u2014'. How can I convert it to its UTF-8 encoding?
You're missing extra slashes in your replace()
It should be:
s.replace("\\u2014", u'\u2014'.encode("utf-8") )
Check my warning in the comments of the question. You should not end up in this situation.
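If you have to repair such strings automatically, as the question asks, here is a minimal Python 2 sketch; it assumes every stray escape has exactly the form \uXXXX:
import re

def fix_escapes(s):
    # Replace each literal '\uXXXX' sequence with the UTF-8 encoding
    # of the corresponding character.
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: unichr(int(m.group(1), 16)).encode('utf-8'),
                  s)

s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
print fix_escapes(s).decode('utf-8')  # 寒假——厦门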
What is causing this error and how can I fix it?
(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have also tried reading different files in the same directory and get this same unicode error as well.
file1 = open("C:\Users\Cameron\Desktop\newtextdocument.txt", "w")
for i in range(1000000):
    file1.write(str(i) + "\n")
You should escape backslashes inside the string literal. Compare:
>>> print("\U00000023") # single character
#
>>> print(r"\U00000023") # raw-string literal with
\U00000023
>>> print("\\U00000023") # 10 characters
\U00000023
>>> print("a\nb") # three characters (literal newline)
a
b
>>> print(r"a\nb") # four characters (note: `r""` prefix)
a\nb
\U is being treated as the start of an eight-digit Unicode escape (\UXXXXXXXX). Use a raw string (a preceding r) to prevent this translation:
>>> 'C:\Users'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
>>> r'C:\Users'
'C:\\Users'
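Any of the following spellings of the path from the question avoids the problem (Windows file APIs also accept forward slashes):
open(r"C:\Users\Cameron\Desktop\newtextdocument.txt", "w")     # raw string
open("C:\\Users\\Cameron\\Desktop\\newtextdocument.txt", "w")  # escaped backslashes
open("C:/Users/Cameron/Desktop/newtextdocument.txt", "w")      # forward slashes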
I am trying to write a dictionary containing utf-8 strings to a CSV. I'm following the instructions from here. However, despite meticulously encoding and decoding these utf-8 strings, I am getting a UnicodeEncodeError involving the 'ascii' codec.
I have a list of dictionaries which contain strings and ints as values related to changes to Wikipedia articles. The list below corresponds to this change, for example:
edgelist = [{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
            {'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
The problem is edgelist[1]['editorName']. It has type str, and edgelist[1]['editorName'].decode('utf-8') is u'Eep\xb2'.
The code I am attempting is:
import codecs
import csv

_ENCODING = 'utf-8'

def dictToCSV(edgelist, output_file):
    with codecs.open(output_file, 'wb', encoding=_ENCODING) as f:
        w = csv.DictWriter(f, sorted(edgelist[0].keys()))
        w.writeheader()
        for d in edgelist:
            for k, v in d.items():
                if type(v) == int:
                    d[k] = str(v).encode(_ENCODING)
            w.writerow({k: v.decode(_ENCODING) for k, v in d.items()})
This returns:
dictToCSV(edgelist,'test2.csv')
File "csv_to_charts.py", line 129, in dictToCSV
w.writerow({k:v.decode(_ENCODING,'ignore') for k,v in d.items()})
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 3: ordinal not in range(128)
Other permutations, such as swapping decode for encode or removing the call entirely in the final problematic line, also return errors:
w.writerow({k: v.encode(_ENCODING) for k, v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
w.writerow({k: v for k, v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
Following this, I changed with codecs.open(output_file, 'wb', encoding=_ENCODING) as f: to with open(output_file, 'wb') as f: and still receive the same error.
If I exclude the list element(s) or the keys containing the problematic string, the script works fine.
I just edited your code as follows and the CSV was written successfully.
from django.utils.encoding import smart_str
import csv

def dictToCSV(edgelist, output_file):
    f = open(output_file, 'wb')
    w = csv.DictWriter(f, fieldnames=sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        w.writerow(dict((k, smart_str(v)) for k, v in d.items()))
    f.close()
Copy the Django code and customize it to your needs.
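For reference, smart_str converts unicode objects to UTF-8 byte strings and stringifies other values, which is why the plain csv module can then write them; a quick check in a Python 2 shell (assuming Django is installed):
>>> from django.utils.encoding import smart_str
>>> smart_str(u'Eep\xb2')
'Eep\xc2\xb2'
>>> smart_str(183)
'183'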
A strict interpretation of ASCII encoding only allows ordinals 0-127. Any value outside that range is not ASCII by definition. Since both \xc2 & \xb2 have ordinals higher than 127, they cannot be interpreted as ASCII.
I'm not a Python user, but the RFC for CSV mentions ASCII as common usage and defines an optional 'charset' parameter for the MIME type; I wonder if the writer you're using might also have an 'encoding' setting?
Your strings are already in UTF-8, and DictWriter doesn't work with codecs.open. Following that example:
# coding: utf-8
import csv

edgelist = [
    {'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
    {'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]

with open('out.csv', 'wb') as f:
    f.write(u'\ufeff'.encode('utf8'))  # BOM (optional... Excel needs it to open a UTF-8 file properly)
    w = csv.DictWriter(f, sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        w.writerow(d)
Output:
articleName,bytesAdded,editorName,revID
Barack Obama,183,Schonbrunn,121844749
Barack Obama,107,Eep²,121862749
Note, you can use 'editorName': 'Eep²' directly instead of 'editorName': 'Eep\xc2\xb2'. The byte string will be UTF-8-encoded per the # coding: utf-8 declaration, as long as you save the source file in UTF-8.
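If some of your values are unicode objects rather than UTF-8 byte strings, encode them just before writing; a small sketch of that variation:
for d in edgelist:
    w.writerow({k: (v.encode('utf8') if isinstance(v, unicode) else v)
                for k, v in d.items()})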