Error while writing unicode data to file - unicode

I am trying to write unicode data (the actual data contains German characters) to a file, but I am getting an error:
Traceback (most recent call last):
  File "C:\Python27\extract_osm_road_nw.py", line 76, in <module>
    file.write(str(list_way_id[index][2][i][1]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 12: ordinal not in range(128)
The code is:
## writing the data in a file
## The data format is:
## A list of: [tuple(way ref id,list[tuple(node id, lat, long)],list[tuple(key,value)]),.....]
## For example: [(u'72439830', [(u'298094414', u'52.4626304', u'10.5579578'), (u'860126050', u'52.4626762', u'10.5576574')], [(u'name', u'General-BeckStra\xdfe')]),.....]
with codecs.open("extracted_osm_file.csv", "w", encoding="utf-8") as file:
    for index in range(len(list_way_id)):
        file.write("Way ID Ref No: ")
        file.write(str(list_way_id[index][0]))## points to the way id ref no
        file.write("\n")
        file.write("Node reference id, latitude, longitude: ")
        file.write("\n")
        for i in range(len(list_way_id[index][1])):
            file.write(str(list_way_id[index][1][i][0]))
            file.write(",")
            file.write(str(list_way_id[index][1][i][1]))
            file.write(",")
            file.write(str(list_way_id[index][1][i][2]))
            file.write("\n")
        for i in range(len(list_way_id[index][2])):
            file.write(str(list_way_id[index][2][i][0]))
            file.write(",")
            file.write(str(list_way_id[index][2][i][1]))
            file.write("\n")
file.close()

Remove the calls to str(). In Python 2 these try to convert the unicode data you have to a byte string, using the default ASCII encoding, and that fails on any non-ASCII character such as u'\xdf'.
You probably want to use the csv module here and save yourself the grief of having to write all those commas, but if you cannot, at least use loops properly:
with codecs.open("extracted_osm_file.csv", "w", encoding="utf-8") as fileobj:
    for line in list_way_id:
        fileobj.write(u'Way ID Ref No: {}\n'.format(line[0]))
        fileobj.write(u'Node reference id, latitude, longitude: \n')
        fileobj.write(u'\n'.join([u','.join(subentry)
                                  for entry in line[1:] for subentry in entry]))
        fileobj.write(u'\n')  # end the last row so the next way starts on a new line
which, for your partial example, would write:
Way ID Ref No: 72439830
Node reference id, latitude, longitude:
298094414,52.4626304,10.5579578
860126050,52.4626762,10.5576574
name,General-BeckStra\xc3\x9fe
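As a minimal sketch of the same idea, here is a version that unpacks each record instead of indexing into it. It assumes io.open in place of codecs.open (both encode unicode text on write), and the function name write_ways and the temporary output path are made up for illustration:

```python
import io
import os
import tempfile

# Sample record in the format from the question:
# (way id, [(node id, lat, lon), ...], [(key, value), ...])
list_way_id = [
    (u'72439830',
     [(u'298094414', u'52.4626304', u'10.5579578'),
      (u'860126050', u'52.4626762', u'10.5576574')],
     [(u'name', u'General-BeckStra\xdfe')]),
]


def write_ways(path, ways):
    """Write each way as text; the file object does the UTF-8 encoding."""
    with io.open(path, "w", encoding="utf-8") as fileobj:
        for way_id, nodes, tags in ways:
            fileobj.write(u"Way ID Ref No: {}\n".format(way_id))
            fileobj.write(u"Node reference id, latitude, longitude:\n")
            for row in nodes + tags:
                # Join the unicode values directly; there is no str() call.
                fileobj.write(u",".join(row) + u"\n")


# Write to a temporary directory (an assumption, to keep the sketch tidy)
# and read the file back to check that the German character survived.
path = os.path.join(tempfile.mkdtemp(), "extracted_osm_file.csv")
write_ways(path, list_way_id)
with io.open(path, encoding="utf-8") as fileobj:
    text = fileobj.read()
```

Reading the file back as UTF-8 should give u'General-BeckStra\xdfe' again, while the bytes on disk contain \xc3\x9f for that character.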

Related

Openpyxl Unicode decode error cannot remove \ufeff from cell value

I am parsing multiple worksheets of unicode data and creating a dictionary for specific cells in each sheet, but I am having trouble decoding the unicode data. A small snippet of the code is below:
for key in shtDict:
    sht = wb[key]
    for row in sht.iter_rows('A:A',row_offset = 1):
        for cell in row:
            if isinstance(cell.value,unicode):
                if "INC" in cell.value:
                    shtDict[key] = cell.value
The output of this section is:
{'60071508': u'\ufeffReason: INC8595939', '60074426': u'\ufeffReason. Ref INC8610481', '60071539': u'\ufeffReason: INC8603621'}
Following the question "u'\ufeff' in Python string", I tried to decode the data properly by changing the last line to:
shtDict[key] = cell.value.decode('utf-8-sig')
But I get the following error:
Traceback (most recent call last):
  File "", line 55, in <module>
    shtDict[key] = cell.value.decode('utf-8-sig')
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
I am not sure what the issue is. I have also tried decoding with 'utf-16', but I get the same error. Can anyone help with this?
Just make it simpler: the BOM (byte order mark, u'\ufeff') carries no information here, so simply strip it.
shtDict[key] = cell.value.replace(u'\ufeff', '', 1)
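Note: cell.value is already of type unicode (you just checked it), so you cannot decode it again. In Python 2, calling decode() on a unicode object first encodes it to bytes with the default ASCII codec, which is why you see a UnicodeEncodeError rather than a decode error.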
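A small sketch of both points, assuming you only want to drop a BOM at the very start of the value. The helper name strip_bom and the sample dictionary are illustrative, with values copied from the question's output:

```python
def strip_bom(text):
    """Return text without a leading U+FEFF; everything else is untouched."""
    if text.startswith(u'\ufeff'):
        return text[1:]
    return text


# Values as they came out of the worksheet: already decoded text.
sht_dict = {
    '60071508': u'\ufeffReason: INC8595939',
    '60074426': u'\ufeffReason. Ref INC8610481',
}
cleaned = {key: strip_bom(value) for key, value in sht_dict.items()}

# 'utf-8-sig' is for raw bytes that have not been decoded yet:
# it decodes UTF-8 and drops the BOM in one step.
raw = b'\xef\xbb\xbfReason: INC8595939'
decoded = raw.decode('utf-8-sig')
```

Here cleaned holds the same strings without the leading u'\ufeff', and decoded shows where 'utf-8-sig' does apply: on bytes read from a file, not on text that openpyxl has already decoded.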

How to read a string containing a comma and a hash sign (#) with textread?

My prototype data line looks like this:
(1) 11 July England 0-0 Uruguay # Wembley Stadium, London
Currently I'm using this:
[no,dd,mm,t1,p1,p2,t2,loc]=textread('1966.txt','(%d) %d %s %s %d-%d %s # %[%s \n]');
But it gives me the following error:
Error using dataread
Trouble reading string from file (row 1, field 12) ==> Wembley Stadium, London\n
Error in textread (line 174)
[varargout{1:nlhs}]=dataread('file',varargin{:}); %#ok<REMFF1>
So it seems to have trouble reading a string that contains a comma, or it's the # sign that causes trouble. I read the documentation thoroughly, but nowhere does it mention what to do when you have special characters such as #, or when you want to read a string that contains a delimiter that should not be treated as a delimiter.
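You want %[^\n] as the last field, which reads everything up to the end of the line, commas and spaces included: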
[no,dd,mm,t1,p1,p2,t2,loc] = ...
textread('1966.txt','(%d) %d %s %s %d-%d %s # %[^\n]');

Reading and writing error pajek file in Networkx

I get an error when I write a graph to a Pajek file and then read the same file back using the NetworkX library in Python:
>>> G=nx.read_pajek("eatRS.net")
>>> nx.write_pajek(G,"temp.net")
>>> G1=nx.read_pajek("temp.net")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 2, in read_pajek
  File "/usr/local/lib/python2.7/dist-packages/networkx/utils/decorators.py", line 193, in _open_file
    result = func(*new_args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 132, in read_pajek
    return parse_pajek(lines)
  File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 168, in parse_pajek
    splitline=shlex.split(str(next(lines)))
  File "/usr/lib/python2.7/shlex.py", line 279, in split
    return list(lex)
  File "/usr/lib/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
Creating a graph within networkx, writing in pajek format and then back again works fine for me. E.g. with gnm_random_graph:
import networkx as nx
n = 10
m = 20
G = nx.gnm_random_graph(n,m)
nx.write_pajek(G, "temp.net")
G1 = nx.read_pajek("temp.net")
Only if I edit the intermediate file temp.net to contain, say,
"vertex one 0.3456 0.1234 box ic White fos 20
do I get the ValueError: No closing quotation error you have. Node labels can be numeric or string, but if they include spaces, the name must be quoted. From the Pajek manual:
label - if label starts with character A..Z or 0..9 first blank determines end of the label
(example: vertex1), labels consisting of more words must be enclosed in pair of special
characters (example: "vertex 1")
Thus, I suggest that you inspect your input file "eatRS.net". Perhaps there is an issue with character encoding, mismatched quotes (e.g. opening with " and closing with '), or a line break within the node label?
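To narrow down where the file breaks, here is a rough sketch that runs shlex.split over every line, the same call that appears in the traceback, and reports the lines it rejects. The helper name find_unparsable_lines and the sample lines are made up for illustration, and this checks quoting only, not the rest of the Pajek format:

```python
import shlex


def find_unparsable_lines(lines):
    """Return (line number, error message, line) for lines shlex cannot split."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            shlex.split(line)
        except ValueError as exc:
            problems.append((number, str(exc), line.rstrip("\n")))
    return problems


# Hypothetical file content: the third line has an opening quote only.
sample = [
    '*Vertices 2\n',
    '1 "vertex one" 0.3456 0.1234 box ic White fos 20\n',
    '2 "vertex two 0.3456 0.1234 box ic White fos 20\n',
]
problems = find_unparsable_lines(sample)
for number, message, line in problems:
    print("line {}: {}: {}".format(number, message, line))
```

On the real file you could pass the open file object instead of the sample list, e.g. find_unparsable_lines(open("eatRS.net")). Note that shlex also treats a single quote as a quote character, so an apostrophe inside an unquoted label would be reported the same way.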

Matlab Read Text File List Exclude first 34 characters

I am trying to read values from a text file. I want the value after ': '.
Here is a sample of the text file. All lines are formatted the same way.
There are 34 characters before the start of the data.
File Name : IMG_1184.JPG
File Size : 2.1 MB
File Modification Date/Time : 2012:07:14 11:53:18-05:00
File Permissions : rw-rw-rw-
File Type : JPEG
MIME Type : image/jpeg
Exif Byte Order : Big-endian (Motorola, MM)
I tried to use this code:
fileID = fopen('Exif.txt');
Exif1 = textscan(fileID, '%s %s','delimiter', ':');
This worked on most of the data but some data also used ':' so that didn't work.
I tried to use this code:
fileID = fopen('Exif.txt');
Exif1 = textscan(fileID, '%s %s','delimiter', ': ');
This returned a mess. Not sure why. Everything was fragmented.
Can anyone explain how to just get the 35th value to the end of every string and put it into an array?
MATLAB has the function strtrim(string), which strips leading and trailing whitespace for you. Try reading the file one line at a time and applying strtrim before passing each line to textscan.
Read the whole line into a variable then get the 35th and subsequent characters like this:
whole_line(35:end)

Python 3.2 lxml fill and submit form, select multiple, how to do it? value not working

This is a great site. I come from the Perl world and, after several years of not programming at all, I have started again (this site didn't exist back then; how things change). Now, after two full days of searching, I am playing my last card and asking here for help.
Working under mac environment, with python 3.2 and lxml 2.3 (installed following www.jtmoon.com/?p=21), what I am trying to do:
web: http://biodbnet.abcc.ncifcrf.gov/db/db2db.php
to fill the form that you find there
to submit it
My code follows, with several attempts and the output of each.
from lxml.html import parse, submit_form, tostring
page = parse('http://biodbnet.abcc.ncifcrf.gov/db/db2db.php').getroot()
page.forms[0].fields['input'] = 'GI Number'
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
page.forms[0].fields['hasComma'] = 'no'
page.forms[0].fields['removeDupValues'] = 'yes'
page.forms[0].fields['request'] = 'db2db'
page.forms[0].action = 'http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php'
page.forms[0].fields['idList'] = '86439006'
submit_form(page.forms[0])
Output:
  File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
    page.forms[0].inputs['outputs[]'].value = 'Gene ID'
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1058, in _value__set
    "You must pass in a sequence")
TypeError: You must pass in a sequence
So, since that element is a multi-select element, I understand that I have to give it a list:
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Output:
  File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
    page.forms[0].inputs['outputs[]'].value = list('Gene ID')
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1059, in _value__set
    self.value.clear()
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/_setmixin.py", line 115, in clear
    self.remove(item)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1159, in remove
    "The option %r is not currently selected" % item)
ValueError: The option 'Affy ID' is not currently selected
'Affy ID' is the first option value of the list, and it is not selected. But what's the problem with it?
Surprisingly, if I instead put
page.forms[0].inputs['outputs[]'].multiple = list('Gene ID')
#page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Then, somehow, lxml accepts it and moves on. However, the multiple attribute should be a boolean (and it is, if I print its value), so I shouldn't touch it; according to the lxml docs, the "value" of the element is what should hold the selected items.
The new output:
  File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 87, in <module>
    submit_form(page.forms[0])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 856, in submit_form
    return open_http(form.method, url, values)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 876, in open_http_urllib
    return urlopen(url, data)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 364, in open
    req = meth(req)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 1052, in do_request_
    raise TypeError("POST data should be bytes"
TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
So, what can be done? I am sure that with Python 2.6 I could use mechanize, or that perhaps lxml would work there. But I really don't want to code in a more or less deprecated version. I am enjoying Python a lot, but I am starting to consider going back to Perl. Would that be the smart move?
Any help will be hugely appreciated
Gerard
Reading in this forum, I find pythonpaste.org, could it be a replacement for lxml?
Passing a sequence to list() generates a list from that sequence. 'Gene ID' is a sequence (namely, a sequence of characters), so list('Gene ID') generates a list of characters, like so:
>>> list('Gene ID')
['G', 'e', 'n', 'e', ' ', 'I', 'D']
That's not what you want. Try this:
>>> ['Gene ID']
['Gene ID']
In other words:
page.forms[0].inputs['outputs[]'].value = ['Gene ID']
That should take you a bit forward.
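For the later "POST data should be bytes" error, here is a rough sketch of one possible workaround, not something I have run against that site. It assumes your lxml version lets you pass your own opener to submit_form through an open_http argument (the traceback shows submit_form calling open_http(form.method, url, values)). The names encode_form_values and open_http_bytes are made up for illustration:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def encode_form_values(values):
    """URL-encode (name, value) pairs and return ASCII bytes for a POST body."""
    return urlencode(values).encode("ascii")


def open_http_bytes(method, url, values):
    """Hypothetical replacement opener that sends bytes when the form is POSTed."""
    if method.upper() == "GET":
        # GET requests carry the values in the query string, not in a body.
        separator = "&" if "?" in url else "?"
        return urlopen(url + separator + urlencode(values))
    return urlopen(url, encode_form_values(values))


# Only the encoding step is exercised here; nothing is sent over the network.
body = encode_form_values([("input", "GI Number"), ("outputs[]", "Gene ID")])
```

If that assumption holds, the call would look like submit_form(page.forms[0], open_http=open_http_bytes). The key point is that Python 3's urlopen wants the POST body as bytes, so the URL-encoded string has to be encoded before it is sent.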