Openpyxl Unicode decode error cannot remove \ufeff from cell value - unicode

I am parsing multiple worksheets of unicode data and creating a dictionary for specific cells in each sheet but I am having trouble decoding the unicode data. The small snippet of the code is below
for key in shtDict:
    sht = wb[key]
    for row in sht.iter_rows('A:A', row_offset=1):
        for cell in row:
            if isinstance(cell.value, unicode):
                if "INC" in cell.value:
                    shtDict[key] = cell.value
The output of this section is:
{'60071508': u'\ufeffReason: INC8595939', '60074426': u'\ufeffReason. Ref INC8610481', '60071539': u'\ufeffReason: INC8603621'}
I tried to properly decode the data based on u'\ufeff' in Python string, by changing the last line to:
shtDict[key] = cell.value.decode('utf-8-sig')
But I get the following error:
Traceback (most recent call last):
File "", line 55, in <module>
shtDict[key] = cell.value.decode('utf-8-sig')
File "C:\Python27\lib\encodings\utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
I am not sure what the issue is. I have also tried decoding with 'utf-16', but I get the same error. Can anyone help with this?

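Just make it simpler: the BOM (byte order mark) carries no information here, so just strip it.
shtDict[key] = cell.value.replace(u'\ufeff', u'', 1)
Note: cell.value is already unicode type (you just checked it), so you cannot decode it again. In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which is why the traceback shows a UnicodeEncodeError.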
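As a small illustration of the same idea, the sketch below strips only a leading BOM and leaves every other character alone. The helper name strip_bom and the sample dictionary are illustrative assumptions, not part of the original code; the values are copied from the output shown in the question.

```python
BOM = u'\ufeff'  # the byte order mark as a single text character


def strip_bom(value):
    """Return value without a leading byte order mark, if it has one."""
    if value.startswith(BOM):
        return value[len(BOM):]
    return value


# Sample values taken from the output shown in the question.
sht_dict = {
    '60071508': u'\ufeffReason: INC8595939',
    '60074426': u'\ufeffReason. Ref INC8610481',
}

cleaned = {key: strip_bom(text) for key, text in sht_dict.items()}
print(cleaned)
```

Checking with startswith avoids touching a U+FEFF that appears later in the text, which is the same guarantee the count argument of replace() gives only when the BOM is actually the first character.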

Related

Reading and writing error pajek file in Networkx

I am receiving error when I write to a pajek file and then read back the same file using Networkx library python
>>> G=nx.read_pajek("eatRS.net")
>>> nx.write_pajek(G,"temp.net")
>>> G1=nx.read_pajek("temp.net")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in read_pajek
File "/usr/local/lib/python2.7/dist-packages/networkx/utils/decorators.py", line 193, in _open_file
result = func(*new_args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 132, in read_pajek
return parse_pajek(lines)
File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 168, in parse_pajek
splitline=shlex.split(str(next(lines)))
File "/usr/lib/python2.7/shlex.py", line 279, in split
return list(lex)
File "/usr/lib/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
Creating a graph within networkx, writing in pajek format and then back again works fine for me. E.g. with gnm_random_graph:
import networkx as nx
n = 10
m = 20
G = nx.gnm_random_graph(n,m)
nx.write_pajek(G, "temp.net")
G1 = nx.read_pajek("temp.net")
Only if I edit the intermediate graph to have, say,
"vertex one 0.3456 0.1234 box ic White fos 20
do I get the ValueError: No closing quotation error you have. Node labels can be numeric or string, but if they include spaces, the name must be quoted. From the Pajek manual:
label - if label starts with character A..Z or 0..9 first blank determines end of the label
(example: vertex1), labels consisting of more words must be enclosed in pair of special
characters (example: "vertex 1")
Thus, I suggest that you inspect your input file "eatRS.net". Perhaps there is an issue with character encoding, mismatched quotes (e.g. opening with " and closing with '), or a line break within the node label?
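Since the traceback shows each line being passed to shlex.split, one way to inspect the file is to run the same tokenizer over every line and report the ones it rejects. This is only a sketch using the standard library; the function name find_unbalanced_lines and the two sample lines are assumptions made for illustration.

```python
import shlex


def find_unbalanced_lines(lines):
    """Return (line_number, text) pairs that shlex cannot tokenize."""
    bad = []
    for number, text in enumerate(lines, start=1):
        try:
            shlex.split(text)
        except ValueError:
            # shlex raises ValueError for an unclosed quotation mark
            bad.append((number, text))
    return bad


good_line = '1 "vertex one" 0.3456 0.1234 box ic White fos 20'
bad_line = '1 "vertex one 0.3456 0.1234 box ic White fos 20'

print(find_unbalanced_lines([good_line, bad_line]))
```

To check a real file, the lines could be read with open("eatRS.net") and passed to the same function; the encoding to use for that file is not known from the question.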

Error while writing unicode data to file

I am trying to write unicode data (the actual data contains German characters) to a file, but I am getting an error:
Traceback (most recent call last):
File "C:\Python27\extract_osm_road_nw.py", line 76, in <module>
file.write(str(list_way_id[index][2][i][1]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 12: ordinal not in range(128)
The code is:
## writing the data in a file
## The data format is:
## A list of: [tuple(way ref id,list[tuple(node id, lat, long)],list[tuple(key,value)]),.....]
## For example: [(u'72439830', [(u'298094414', u'52.4626304', u'10.5579578'), (u'860126050', u'52.4626762', u'10.5576574')], [(u'name', u'General-BeckStra\xdfe')]),.....]
with codecs.open("extracted_osm_file.csv", "w", encoding="utf-8") as file:
    for index in range(len(list_way_id)):
        file.write("Way ID Ref No: ")
        file.write(str(list_way_id[index][0]))  ## points to the way id ref no
        file.write("\n")
        file.write("Node reference id, latitude, longitude: ")
        file.write("\n")
        for i in range(len(list_way_id[index][1])):
            file.write(str(list_way_id[index][1][i][0]))
            file.write(",")
            file.write(str(list_way_id[index][1][i][1]))
            file.write(",")
            file.write(str(list_way_id[index][1][i][2]))
            file.write("\n")
        for i in range(len(list_way_id[index][2])):
            file.write(str(list_way_id[index][2][i][0]))
            file.write(",")
            file.write(str(list_way_id[index][2][i][1]))
            file.write("\n")
file.close()
Remove the calls to str(). These try to convert the unicode data you have to byte strings, using the default ASCII encoding, and that fails for any non-ASCII character such as u'\xdf'.
You probably want to use the csv module here and save yourself the grief of having to write all those commas, but if you cannot, at least use loops properly:
with codecs.open("extracted_osm_file.csv", "w", encoding="utf-8") as fileobj:
    for line in list_way_id:
        fileobj.write(u'Way ID Ref No: {}\n'.format(line[0]))
        fileobj.write(u'Node reference id, latitude, longitude: \n')
        fileobj.write(u'\n'.join([u','.join(subentry)
                                  for entry in line[1:] for subentry in entry]))
which, for your partial example, would write:
Way ID Ref No: 72439830
Node reference id, latitude, longitude:
298094414,52.4626304,10.5579578
860126050,52.4626762,10.5576574
name,General-BeckStra\xc3\x9fe
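The csv module suggestion could look roughly like the sketch below. It assumes Python 3, where the csv module writes text directly and no str() or manual encoding is needed; the function name write_ways, the row layout, and the in-memory buffer are illustrative choices, not part of the original code.

```python
import csv
import io


def write_ways(list_way_id, fileobj):
    """Write each way id, then its node rows, then its key/value rows."""
    writer = csv.writer(fileobj, lineterminator="\n")
    for way_id, nodes, tags in list_way_id:
        writer.writerow(["Way ID Ref No", way_id])
        writer.writerows(nodes)  # one row per (node id, lat, long)
        writer.writerows(tags)   # one row per (key, value)


# The partial example from the question.
sample = [
    ('72439830',
     [('298094414', '52.4626304', '10.5579578'),
      ('860126050', '52.4626762', '10.5576574')],
     [('name', 'General-BeckStra\xdfe')]),
]

buffer = io.StringIO()
write_ways(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, the buffer would be replaced by something like open("extracted_osm_file.csv", "w", encoding="utf-8", newline=""), which lets the file object handle the UTF-8 encoding.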

Python with Gtk3 not setting unicode properly

I have some simple code that isn't working as expected. First, the docs say that Gtk.Clipboard.get(Gdk.SELECTION_PRIMARY).set_text() should be able to accept only one argument with the length argument option, but it doesn't work (see below). Finally, pasting a unicode ° symbol breaks setting the text when trying to retrieve it from the clipboard (and won't paste into other programs). It gives this warning:
Gdk-WARNING **: Error converting selection from UTF8_STRING
>>> from gi.repository.Gtk import Clipboard
>>> from gi.repository.Gdk import SELECTION_PRIMARY
>>> d='\u00B0'
>>> print(d)
°
>>> cb=Clipboard
Clipboard
>>> cb=Clipboard.get(SELECTION_PRIMARY)
>>> cb.set_text(d) #this should work
Traceback (most recent call last):
File "<ipython-input-6-b563adc3e800>", line 1, in <module>
cb.set_text(d)
File "/usr/lib/python3/dist-packages/gi/types.py", line 43, in function
return info.invoke(*args, **kwargs)
TypeError: set_text() takes exactly 3 arguments (2 given)
>>> cb.set_text(d, len(d))
>>> cb.wait_for_text()
(.:13153): Gdk-WARNING **: Error converting selection from UTF8_STRING
'\\Uffffffff\\Uffffffff'
From the documentation for Gtk.Clipboard
It looks like the method set_text needs a second argument. The first is the text, the second is the length of the text. Or if you don't want to provide the length, you can use -1 to let it calculate the length itself.
gtk.Clipboard.set_text
def set_text(text, len=-1)
text : a string.
len : the length of text, in bytes, or -1, to calculate the length.
I've tested it on Python 3 and it works with cb.set_text(d, -1).
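One plausible reason the original cb.set_text(d, len(d)) call produced garbage is that the documented length is counted in bytes, while len() on a Python 3 string counts characters. The short sketch below only demonstrates that difference for the degree sign; it does not touch Gtk, and the variable names are illustrative.

```python
d = '\u00B0'  # the degree sign from the question

char_count = len(d)                  # number of characters
byte_count = len(d.encode('utf-8'))  # number of bytes in UTF-8

print(char_count, byte_count)
```

If the length is taken as a byte count, passing the character count would cut the two-byte UTF-8 sequence in half, which would fit the "Error converting selection from UTF8_STRING" warning. Passing -1 avoids the question entirely.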
Since GTK version 3.16 there is an easier way of getting the clipboard. You can get it with the get_default() method:
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk, Gdk, GLib, Gio
display = Gdk.Display.get_default()
clipboard = Gtk.Clipboard.get_default(display)
clipboard.set_text(string, -1)
It also worked for me without calling
clipboard.store()
Reference: https://lazka.github.io/pgi-docs/Gtk-3.0/classes/Clipboard.html#Gtk.Clipboard.get_default
In Python 3.4, the second parameter is only needed for GtkEntryBuffer. With GtkTextBuffer, set_text works without it.
example1 works as usual:
settinginfo = 'serveradres = ' + server + '\n poortnummer = ' + poort
GtkTextBuffer2.set_text(settinginfo)
example2 needs extra parameter len:
ErrorTextDate = 'choose earlier date'
GtkEntryBuffer1.set_text(ErrorTextDate, -1)

UnicodeEncodeErrors while using DictWriter for utf-8

I am trying to write a dictionary containing utf-8 strings to a CSV. I'm following the instructions from here. However, despite meticulously encoding and decoding these utf-8 strings, I am getting UnicodeEncodeErrors involving the 'ascii' codec.
I have a list of dictionaries which contain strings and ints as values related to changes to Wikipedia articles. The list below corresponds to this change, for example:
edgelist = [{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
The problem is edgelist[1]['editorName']. It has type 'str', and edgelist[1]['editorName'].decode('utf-8') is u'Eep\xb2'.
The code I am attempting is:
_ENCODING = 'utf-8'
def dictToCSV(edgelist,output_file):
    with codecs.open(output_file,'wb',encoding=_ENCODING) as f:
        w = csv.DictWriter(f,sorted(edgelist[0].keys()))
        w.writeheader()
        for d in edgelist:
            for k,v in d.items():
                if type(v) == int:
                    d[k]=str(v).encode(_ENCODING)
            w.writerow({k:v.decode(_ENCODING) for k,v in d.items()})
This returns:
dictToCSV(edgelist,'test2.csv')
File "csv_to_charts.py", line 129, in dictToCSV
w.writerow({k:v.decode(_ENCODING,'ignore') for k,v in d.items()})
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 3: ordinal not in range(128)
Other permutations such as swapping decode for encode or nothing in the final problematic line also return errors:
w.writerow({k:v.encode(_ENCODING) for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
w.writerow({k:v for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
Following this, I changed with codecs.open(output_file,'wb',encoding=_ENCODING) as f: to with open(output_file,'wb') as f: and still receive the same error.
Excluding the list element(s) or the keys containing this problematic string, the script works fine otherwise.
I just edited your code as follows and the csv was written successfully.
from django.utils.encoding import smart_str
import csv
def dictToCSV(edgelist, output_file):
    f = open(output_file, 'wb')
    w = csv.DictWriter(f, fieldnames=sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        # Build one row dict, converting every value to a UTF-8 byte string
        w.writerow(dict((k, smart_str(v)) for k, v in d.items()))
    f.close()
Copy the Django code and customize it to your need.
A strict interpretation of ASCII encoding only allows ordinals 0-127. Any value outside that range is not ASCII by definition. Since both \xc2 & \xb2 have ordinals higher than 127, they cannot be interpreted as ASCII.
I'm not a Python user, but the RFC for CSV mentions ASCII as a common usage and defines an optional 'charset' parameter for the MIME type. I wonder if the writer you're using also has an 'encoding' setting?
Your strings are already in UTF-8, and DictWriter doesn't work with codecs.open. Following that example:
# coding: utf-8
import csv
edgelist = [
    {'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
    {'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
with open('out.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
    w = csv.DictWriter(f,sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        w.writerow(d)
Output:
articleName,bytesAdded,editorName,revID
Barack Obama,183,Schonbrunn,121844749
Barack Obama,107,Eep²,121862749
Note, you can use 'editorName': 'Eep²' directly instead of 'editorName': 'Eep\xc2\xb2'. The byte string will be UTF-8-encoded per the # coding: utf-8 and if you save the source file in UTF-8.
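For comparison, here is a rough Python 3 counterpart, offered as a sketch rather than a drop-in replacement for the Python 2 code above. It assumes the values are text strings rather than UTF-8 byte strings; the function name dict_to_csv and the in-memory buffer are illustrative. The 'utf-8-sig' codec adds the same optional BOM that the answer writes by hand.

```python
import csv
import io

edgelist = [
    {'articleName': 'Barack Obama', 'editorName': 'Schonbrunn',
     'revID': '121844749', 'bytesAdded': '183'},
    {'articleName': 'Barack Obama', 'editorName': 'Eep\u00b2',
     'revID': '121862749', 'bytesAdded': '107'},
]


def dict_to_csv(rows, fileobj):
    """Write rows (a list of dicts) to an already opened text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=sorted(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
dict_to_csv(edgelist, buffer)
csv_text = buffer.getvalue()

# Encoding with 'utf-8-sig' prepends the UTF-8 BOM bytes.
encoded = csv_text.encode('utf-8-sig')
print(csv_text)
```

To write to disk instead of a buffer, the file could be opened with open('out.csv', 'w', encoding='utf-8-sig', newline='') and passed to dict_to_csv.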

Python 3.2 lxml fill and submit form, select multiple, how to do it? value not working

This is a great page. I come from the Perl world and, after several years away from programming, I have started again (this site didn't exist back then; how things change). Now, after two full days of searching, I am playing my last card and asking here for help.
Working under mac environment, with python 3.2 and lxml 2.3 (installed following www.jtmoon.com/?p=21), what I am trying to do:
web: http://biodbnet.abcc.ncifcrf.gov/db/db2db.php
to fill the form that you find there
to submit it
My code. I put several attempts and the output code.
from lxml.html import parse, submit_form, tostring
page = parse('http://biodbnet.abcc.ncifcrf.gov/db/db2db.php').getroot()
page.forms[0].fields['input'] = 'GI Number'
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
page.forms[0].fields['hasComma'] = 'no'
page.forms[0].fields['removeDupValues'] = 'yes'
page.forms[0].fields['request'] = 'db2db'
page.forms[0].action = 'http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php'
page.forms[0].fields['idList'] = '86439006'
submit_form(page.forms[0])
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1058, in _value__set
"You must pass in a sequence")
TypeError: You must pass in a sequence
So, since that element is a multi-select element, I understand that I have to give a list
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1059, in _value__set
self.value.clear()
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/_setmixin.py", line 115, in clear
self.remove(item)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1159, in remove
"The option %r is not currently selected" % item)
ValueError: The option 'Affy ID' is not currently selected
'Affy ID' is the first option value of the list, and it is not selected. But what's the problem with it?
Surprisingly, if I instead put
page.forms[0].inputs['outputs[]'].multiple = list('Gene ID')
#page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Then somehow lxml accepts it and moves on. However, the multiple attribute should be a boolean (it actually is if I print the value), I shouldn't touch it, and the "value" of the item should actually point to the selected items, according to the lxml docs.
The new output
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 87, in <module>
submit_form(page.forms[0])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 856, in submit_form
return open_http(form.method, url, values)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 876, in open_http_urllib
return urlopen(url, data)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 364, in open
req = meth(req)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 1052, in do_request_
raise TypeError("POST data should be bytes"
TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
So, what can be done? I am sure that with Python 2.6 I could use mechanize, and perhaps lxml would work there too. But I really don't want to code in a sort-of deprecated version. I am enjoying Python a lot, but I am starting to consider going back to Perl. Would that be a smart move?
Any help will be hugely appreciated
Gerard
Reading in this forum, I find pythonpaste.org, could it be a replacement for lxml?
Passing a sequence to list() will generate a list from that sequence. 'Gene ID' is a sequence (namely a sequence of characters). So list('Gene ID') will generate a list of characters, like so:
>>> list('Gene ID')
['G', 'e', 'n', 'e', ' ', 'I', 'D']
That's not what you want. Try this:
>>> ['Gene ID']
['Gene ID']
In other words:
page.forms[0].inputs['outputs[]'].value = ['Gene ID']
That should take you a bit forward.
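The last traceback in the question ("POST data should be bytes") is a separate problem: in Python 3, urlopen expects the POST body as bytes. The sketch below uses only the standard library to show the encoding step. The function names encode_form_values and open_http_bytes are hypothetical; the second one follows the (method, url, values) call seen in the traceback and could presumably be passed to submit_form through its open_http argument, but that has not been tested against lxml 2.3 here.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def encode_form_values(values):
    """URL-encode (name, value) pairs and return them as bytes."""
    return urlencode(values).encode('ascii')


def open_http_bytes(method, url, values):
    """Hypothetical opener: send form values, using bytes for POST data."""
    if method.upper() == 'GET':
        separator = '&' if '?' in url else '?'
        return urlopen(url + separator + urlencode(values))
    return urlopen(Request(url, data=encode_form_values(values)))


# Two of the field values from the question, encoded for a POST body.
post_data = encode_form_values([('input', 'GI Number'),
                                ('outputs[]', 'Gene ID')])
print(post_data)
```

Nothing is sent over the network when this runs; only the encoding helper is exercised.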