I am scraping a site and getting info from links on that site; however, many of the links contain accents/French characters. I am unable to get the links for these pages and therefore cannot scrape them.
This is the part of the code that gets URLs from the start pages:
def parse(self, response):
    subURLs = []
    partialURLs = response.css('.directory_name::attr(href)').extract()
    for i in partialURLs:
        yield response.follow('https://wheelsonline.ca/' + str(i), self.parse_dealers)
And this is the error that I am getting in the log:
yield response.follow('https://wheelsonline.ca/' + str(i), self.parse_dealers)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 58: ordinal not in range(128)
Any help is appreciated! Thank you!
Don't use str() to convert that value; on Python 2 it tries to encode the Unicode string as ASCII, which fails on characters like é. Read more about that here: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
However, there is a better way to build URLs like that using Scrapy's built-in response.urljoin():
yield response.follow(response.urljoin(i), self.parse_dealers)
This will automatically create the full URL based on the current URL plus the relative path.
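For reference, a minimal sketch of the whole fixed callback under that suggestion (the parse_dealers callback is assumed from your snippet):

def parse(self, response):
    partialURLs = response.css('.directory_name::attr(href)').extract()
    for i in partialURLs:
        # urljoin resolves the relative href against response.url and
        # returns the URL as-is, with no ASCII-only str() conversion
        yield response.follow(response.urljoin(i), self.parse_dealers)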
I'm trying to parse a web page that has non-printable characters on it and write that to a file in Python. I'm using Python 2.7 with requests and Beautiful Soup.
I get the page with requests and parse it with the following:
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents
    print tag[0]
When testing, print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between "pre" tags. After running it through Beautiful Soup, though, it interprets some of the non-printable text as "br" tags, so tag ends up as a list of two items instead of all the text between the pre tags. Is there a way to either just get the text between the pre tags with requests, or do something differently with Beautiful Soup so it does not misinterpret those characters?
I've read through the following trying to figure it out, plus the requests and Beautiful Soup docs, but have had no luck so far:
Joel on Software - Character Sets
SO utf-8 vs unicode
SO Getting text between tags
I overthought the problem. I just base64-encoded the data before transfer with certutil on Windows, removed the first and last line, and then decoded it on the far side.
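A rough sketch of the receiving side under those assumptions (certutil -encode wraps the base64 payload in BEGIN/END header and footer lines, which is why the first and last lines are dropped; the file names here are hypothetical):

import base64

with open('payload.b64') as f:
    lines = f.read().splitlines()

# drop certutil's "-----BEGIN CERTIFICATE-----" / "-----END CERTIFICATE-----" wrapper lines
decoded = base64.b64decode(''.join(lines[1:-1]))

with open('payload.bin', 'wb') as f:
    f.write(decoded)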
I have done a bit of research on this error and can't really get my head around what's going on. As far as I understand I am basically having problems because I am converting from one type of encoding to another.
def write_table_to_file(table, connection):
    db_table = io.StringIO()
    cur = connection.cursor()
    #pdb.set_trace()
    cur.copy_to(db_table, table)
    cur.close()
    return db_table
This is the method that is giving me headaches. The error below is output when I run it:
[u350932#config5290vm0 python3]$ python3 datamain.py
Traceback (most recent call last):
File "datamain.py", line 48, in <module>
sys.exit(main())
File "datamain.py", line 40, in main
t = write_table_to_file("cms_jobdef", con_tctmsv64)
File "datamain.py", line 19, in write_table_to_file
cur.copy_to(db_table, table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 40: ordinal not in range(128)
The client encoding on the database I'm retrieving the table from is:
tctmsv64=> SHOW CLIENT_ENCODING;
client_encoding
-----------------
sql_ascii
(1 row)
The database encoding is LATIN1
The encoding for the database I am putting the data into is:
S104838=# SHOW CLIENT_ENCODING;
client_encoding
-----------------
WIN1252
(1 row)
The database encoding is UTF8
The threads I have found recommend changing the encoding:
To correct your function, you'll have to know what encoding the byte
string is in, and convert it to unicode using the decode() method,
and compare that result to the unicode string.
http://www.thecodingforums.com/threads/unicodedecodeerror-ascii-codec-cant-decode-byte-0xa0-in-position-10-ordinal-not-in-range-128.336691/
The problem is that when I try to use the decode methods, I get complaints that it's not a file type. I have had a look at the Python 3.4 documentation for class io.StringIO(initial_value='', newline='\n') but could not find anything on changing the encoding.
I also found this page which outlined the problem but I couldn't figure out what I needed to do to solve it
https://wiki.python.org/moin/UnicodeDecodeError
Basically I'm quite confused as to what is going on and not sure how to fix it. Any help would be greatly appreciated.
Cheers
Python 3 changed file I/O behaviour around text encodings - massively for the better, IMO. You may find Processing Text Files in Python 3 informative.
It looks like psycopg2 is seeing that you passed a raw file object and is trying to encode the strings it's working with into byte sequences for writing to the file, with the assumption (since you didn't specify anything else) that you want to use the ascii encoding for the file.
I'd use an io.BytesIO object instead of StringIO, and specify the source encoding when you do a copy_from into the new database.
I'll be surprised if you don't have problems due to invalid, mixed, or otherwise borked text from your SQL_ASCII source database, though.
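A minimal sketch of that suggestion, assuming psycopg2 and the LATIN1 source database from the question:

import io

def write_table_to_file(table, connection):
    # BytesIO accepts the raw bytes psycopg2 writes, so nothing is
    # implicitly encoded to ascii on the way into the buffer
    db_table = io.BytesIO()
    cur = connection.cursor()
    cur.copy_to(db_table, table)
    cur.close()
    db_table.seek(0)
    return db_table

# decode explicitly with the source database's encoding when text is needed
text = write_table_to_file("cms_jobdef", con_tctmsv64).read().decode('latin-1')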
First off, thanks Craig for your response. It was very helpful in making me realise that I needed to find a good way of doing this, otherwise the data in my new database would be corrupt. Not something we want! After a bit more googling, this link was very useful:
https://docs.python.org/3/howto/unicode.html
I ended up using the codecs.StreamRecoder class and it works very well. Below is a snippet of my working code:
import codecs
import io

def write_table_to_file(table, connection):
    db_table = io.BytesIO()
    cur = connection.cursor()
    # recode on the fly: latin-1 bytes from the source DB become utf-8 in the buffer
    cur.copy_to(codecs.StreamRecoder(db_table,
                                     codecs.getencoder('utf-8'),
                                     codecs.getdecoder('latin-1'),
                                     codecs.getreader('utf-8'),
                                     codecs.getwriter('utf-8')), table)
    cur.close()
    return db_table
Long story short, I convert from latin-1 to utf-8 on the fly, it all works, and my data looks good. Thanks again for the feedback, Craig :)
I have an XML document that I am converting into PDF using Apache FOP 1.0. It is able to convert it to PDF, but it does not convert some special characters like Δ, μ, ρ, α and shows different output instead.
Expected Output = Δ INTRODUCTION length constant (λ)
Actual Output = # INTRODUCTION length constant (#)
It outputs a # symbol for the unknown characters; when I try other tools, I don't get these errors.
What is the problem with Apache FOP?
You may wish to try using the character in escaped format in your template e.g.
&#x03b1;
for the Greek alpha. Refer here: http://www.fileformat.info/info/unicode/char/03b1/index.htm
I am trying to read a graphml file of my Facebook network into NetworkX. However, because some of my friends' names contain unusual characters, such as accents, their names cannot be read into NetworkX.
I ran the command:
g = nx.read_graphml("/Users/juliehui/Desktop/MyGraph.graphml")
I then get the error:
TypeError: int() argument must be a string or a number, not 'NoneType'
I looked at the graphml file in Sublime Text, and it seems to have trouble with names, such as Andrés
I then looked at the graphml file in Gephi to see what it looked like. The name, Andrés, in Gephi looks like:
Andrés
When I export the data without making any edits into a separate graphml file, and try to read that file in, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
When I delete the problem names in Gephi, then the file reads fine.
I am not sure if there is some way to edit my original graphml file to fix the names with unusual characters.
I have looked at this page: Graphml parse error
But I could not figure out whether my graphml file is in UTF-8, needs to be in UTF-8, or needs to be in ASCII.
I have also tried:
data="/Users/juliehui/Desktop/MyGraph.graphml"
udata=data.decode("utf-8")
asciidata=udata.encode("ascii","ignore")
g = nx.read_graphml(asciidata)
But, this gave the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
How do I resolve this error?
This worked for me in Python 2.7. You have to specify the node type as unicode.
nx.read_graphml('/path/to/my/file.graphml', unicode)
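The same call spelled with the keyword argument, which may read more clearly (Python 2 only, since the unicode name does not exist in Python 3):

g = nx.read_graphml('/Users/juliehui/Desktop/MyGraph.graphml', node_type=unicode)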
I would suggest using unidecode to replace all non-ASCII characters in the file:
from unidecode import unidecode

data_in = "/Users/juliehui/Desktop/MyGraph.graphml"
data_ascii = "/Users/juliehui/Desktop/MyGraph_ASCII.graphml"

f_in = open(data_in, 'rb')
f_out = open(data_ascii, 'wb')
for line in f_in:
    # decode the raw bytes first; unidecode then transliterates the
    # Unicode text to its closest ASCII approximation
    f_out.write(unidecode(line.decode('utf-8')))
f_in.close()
f_out.close()
Then you can hopefully use:
g = nx.read_graphml(data_ascii)
I have a problem related to Unicode in this line:
strToCompare = str(self.modelProxy.data(cellIndex, Qt.DisplayRole).toString()).lower()
The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)
This is because the data is retrieved from a database field which may contain Unicode characters. Even though I added the unicode() function to convert to Unicode, the error is still there.
I found my solution: I just get the string from the model instead of using the data() function. This way I don't have to convert the QVariant into a string.
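A rough sketch of that workaround, assuming PyQt4 with a QSortFilterProxyModel over a QStandardItemModel (mapToSource, sourceModel, and itemFromIndex are standard Qt API, but how they map onto this codebase is my assumption):

# map the proxy index back to the source model and read the item's text
# directly, bypassing the QVariant-to-str() conversion that raised the error
sourceIndex = self.modelProxy.mapToSource(cellIndex)
item = self.modelProxy.sourceModel().itemFromIndex(sourceIndex)
strToCompare = unicode(item.text()).lower()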