everyone, i run into a trouble when trying to open a HTML file containing Chinese characters, here is the code
#problem with chinese character
file =wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file,encoding ='utf-8') as f:
html = f.read()
print(html)
However in the output I get error as follows
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 535: invalid continuation byte
I searched for a while , and i saw some similar issues, but the solutions seems to use latin-1, which is obvious not the case here, I'm not sure how which encoding to use?
any suggestions? thanks ~
The page you are referring to is not encoded in UTF-8 encoding, but in GBK. You can tell by looking at the header:
<meta charset="GBK">
If you specify encoding='gbk' it'll work.
On another note, I would opt for not using wget unless you have to, and instead going with urllib which comes with the Python Standard Library. It also saved the disk write, and the code is simpler:
import urllib.request
with urllib.request.urlopen("http://nba.stats.qq.com/player/list.htm") as file:
html = file.read()
print(html.decode('gbk'))
Related
I'm running this code in Windows Powershell and it includes this file called languages.txt where I'm trying to convert between bytes to strings:
Here is languages.txt:
Afrikaans
አማርኛ
Аҧсшәа
العربية
Aragonés
Arpetan
Azərbaycanca
Bamanankan
বাংলা
Bân-lâm-gú
Беларуская
Български
Boarisch
Bosanski
Буряад
Català
Чӑвашла
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
فارسی
Français
Frysk
Gaelg
Gàidhlig
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Ido
Interlingua
Italiano
עברית
ಕನ್ನಡ
Kapampangan
ქართული
Қазақша
Kreyòl ayisyen
Latgaļu
Latina
Latviešu
Lëtzebuergesch
Lietuvių
Magyar
Македонски
Malti
मराठी
მარგალური
مازِرونی
Bahasa Melayu
Монгол
Nederlands
नेपाल भाषा
日本語
Norsk bokmål
Nouormand
Occitan
Oʻzbekcha/ўзбекча
ਪੰਜਾਬੀ
پنجابی
پښتو
Plattdüütsch
Polski
Português
Română
Romani
Русский
Seeltersk
Shqip
Simple English
Slovenčina
کوردیی ناوەندی
Српски / srpski
Suomi
Svenska
Tagalog
தமிழ்
ภาษาไทย
Taqbaylit
Татарча/tatarça
తెలుగు
Тоҷикӣ
Türkçe
Українська
اردو
Tiếng Việt
Võro
文言
吴语
ייִדיש
中文
Then, here is the code I used:
import sys
script, input_encoding, error = sys.argv
def main(language_file, encoding, errors):
line = language_file.readline()
if line:
print_line(line, encoding, errors)
return main(language_file, encoding, errors)
def print_line(line, encoding, errors):
next_lang = line.strip()
raw_bytes = next_lang.encode(encoding, errors=errors)
cooked_string = raw_bytes.decode(encoding, errors=errors)
print(raw_bytes, "<===>", cooked_string)
languages = open("languages.txt", encoding="utf-8")
main(languages, input_encoding, error)
Here's the output (credit: Learn Python 3 the Hard Way by Zed A. Shaw):
I don't know why it doesn't upload the characters and shows question blocks instead. Can anyone help me?
The first string which fails is አማርኛ. The first character, አ is in unicode 12A0 (see here). In UTF-8, that is b'\xe1\x8a\xa0'. So, that part is obviously fine. The file really is UTF-8.
Printing did not raise an exception, so your output encoding can handle all of the characters. Everything is fine.
The only remaining reason I see for it to fail is that the font used in the console does not support all of the characters.
If it is just for play, you should not worry about it. Consider it working correctly.
On the other hand, I would suggest changing some things in your code:
You are running main recursively for each line. There is absolutely no need for that and it would run into recursion depth limit on a longer file. User a for loop instead.
for line in lines:
print_line(line, encoding, errors)
You are opening the file as UTF-8, so reading from it automatically decodes UTF-8 into Unicode, then you encode it back into row_bytes and then encode again into cooked_string, which is the same as line. It would be a better exercise to read the file as raw binary, split it on newlines and then decode. Then you'd have a clearer picture of what is going on.
with open("languages.txt", 'rb') as f:
raw_file_contents = f.read()
so I am trying to make the values in a csv file Unicode so a program I am using is able to read them. Here is my code, in Python 2.7, which I HAVE to use:
TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
encoded_tweet = row["Tweet"].encode('utf-8')
TEST_SENTENCES.append(encoded_tweet)
I continue to receive the same error message, and have not been able to find anything that works. Here is the error message. I am sure someone out there can make a really easy solution.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 127: ordinal not in range(128)
As an update, I changed encode to decode, and the program says it is running the predictions, given it has been a while but it would not have started if it was not encoded in Unicode so let us pray.
Using requests to query the DarkSky API says it returns UTF-8 encoded document, but string is defaulting to ASCII with error. If I explicitly encode as UTF-8, there are no errors, but string contains extra characters and raw unicode. What's going on? I've set my py file to use UTF-8 encoding in Sublime.
# Fetch weather data from DarkSky, parse resulting JSON
try:
url = "https://api.darksky.net/forecast/" + API_KEY + "/" + LAT + "," + LONG + "?exclude=[minutely,hourly,alerts,flags]&units=us"
response = requests.get(url)
data = response.json()
print(response.headers['content-type'])
print(response.encoding)
which returns:
application/json; charset=utf-8
d_summary = data['daily']['summary']
print("Daily Summary: ", d_summary.encode('utf-8'))
which returns: Daily Summary: b'No precipitation throughout the week, with temperatures rising to 82\xc2\xb0F on Tuesday.'
What's going on with the extra characters in front and quoted substring with unicode text?
I don't see any problem here. Decoding the JSON doesn't cause an error, and encoding to UTF-8 produces a byte string literal repr b'...' as expected. Top-bit-set bytes are expected to look like \xXX in byte string literals.
string is defaulting to ASCII with error
What do you mean by that? Please show us the actual problem.
My guess is you are trying to print non-ASCII characters to the terminal on Windows and getting UnicodeEncodeError. If so that's because the Windows Console is broken and can't print Unicode properly. PEP 528 works around the problem in Python 3.6.
I use luasocket to GET a web page which contains Chinese characters "开奖结果" (the page itself is encoded in charset="gb2312"), as below:
require "socket"
host = '61.129.89.226'
fileformat = '/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=%s'
function getlottery(num)
c = assert(socket.connect(host, 80))
c:send('GET ' .. string.format(fileformat, num) .. " HTTP/1.0\r\n\r\n")
content = c:receive('*l')
while content do
if content and content:find('开奖结果') then -- failed
print(content)
end
content = c:receive('*l')
end
c:close()
end
--http://61.129.89.226/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=2012138
getlottery('2012138')
Unfortunately, it fails to match the expected characters:
content:find('开奖结果') -- failed
I know Lua is capable of finding unicode characters:
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> if string.find("This is 开奖结果", "开奖结果") then print("found!") end
found!
Then I guess it might be caused by how luasocket retrieves data from the web. Could anyone shed some lights on this?
Thanks.
If the page is encoded in GB2312, and your script (the file itself) is encoded in utf-8, there's no way the match will work. Because .find() will look for utf-8 codepoints, and it will just slide over the characters you're looking for, because they're not encoded the same way...
开 奖 结 果
GB bfaa bdb1 bde1 b9fb
UTF-16 5f00 5956 7ed3 679c
UTF-8 e5bc80 e5a596 e7bb93 e69e9c
I decided to use Python 3 for making my website, but I encountered a problem with Unicode output.
It seems like plain print(html) #html is astr should be working, but it's not. I get UnicodeEncodeError: 'ascii' codec can't encode characters[...]: ordinal not in range(128). This must be because the webserver doesn't support unicode output.
The next thing I tried was print(html.encode('utf-8')), but I got something like repr output of the byte string: it is placed inside b'...' and all the escape characters are in raw form (e.g. \n and \xd0\x9c)
Please show me the correct way to output a Unicode (str) string as a raw UTF-8 encoded bytes string in Python 3.1
The problem here is that you stdout isn't attached to an actual terminal and will use the ASCII encoding by default. Therefore you need to write to sys.stdout.buffer, which is the "raw" binary output of sys.stdout. This can be done in various ways, the most common one seems to be:
import codecs, sys
writer = codecs.getwriter('utf8')(sys.stdout.buffer)
And the use writer. In a CGI script you may be able to replace sys.stdout with the writer so:
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
Might actually work so you can print normally. Try that!