Flask - handling unicode text with werkzeug?

So I am trying to have a browser download a file with a certain name, which is stored in a database. To prevent filename conflicts the file is saved on disk with a GUID, and when it comes time to actually download it, the filename from the database is supplied for the browser. The name is written in Japanese, and when I display it on the page it comes out fine, so it is stored OK in the database. When I try to actually have the browser download it under that name:
return send_from_directory(app.config['FILE_FOLDER'], name_on_disk,
                           as_attachment=True, attachment_filename=filename)
Flask throws an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-20:
ordinal not in range(128)
The error seems to originate not from my code, but from part of Werkzeug:
/werkzeug/http.py", line 150, in quote_header_value
value = str(value)
Why is this happening? According to their docs, Flask is "100% Unicode"
I actually hit this problem before I rewrote my code, and fixed it back then by modifying numerous things inside Werkzeug itself, but I really do not want to do that for the deployed app, because patching a third-party library is a pain and bad practice.
Python 2.7.6 (default, Nov 26 2013, 12:52:49)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = "[얼티메이트] [131225] TVアニメ「キルラキル」オリジナルサウンドトラック (FLAC).zip"
>>> print repr(filename)
'[\xec\x96\xbc\xed\x8b\xb0\xeb\xa9\x94\xec\x9d\xb4\xed\x8a\xb8] [131225] TV\xe3\x82\xa2\xe3\x83\x8b\xe3\x83\xa1\xe3\x80\x8c\xe3\x82\xad\xe3\x83\xab\xe3\x83\xa9\xe3\x82\xad\xe3\x83\xab\xe3\x80\x8d\xe3\x82\xaa\xe3\x83\xaa\xe3\x82\xb8\xe3\x83\x8a\xe3\x83\xab\xe3\x82\xb5\xe3\x82\xa6\xe3\x83\xb3\xe3\x83\x89\xe3\x83\x88\xe3\x83\xa9\xe3\x83\x83\xe3\x82\xaf (FLAC).zip'
>>>

You should explicitly pass Unicode strings (type unicode, not str) when dealing with non-ASCII data. In Flask, bytestrings are generally assumed to be ASCII-encoded, which is why Werkzeug's str(value) call in quote_header_value blows up on your UTF-8 bytestring.
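In practice that means decoding the stored value before handing it to send_from_directory. A minimal sketch with hypothetical names, written for Python 3's bytes/str split (under Python 2, substitute str/unicode):

```python
# Sketch with hypothetical names: make sure the filename is a text
# (unicode) string, not a UTF-8 bytestring, before Flask/Werkzeug sees it.
def ensure_text(value, encoding='utf-8'):
    """Decode bytes to text; pass text values through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# e.g. a raw UTF-8 value as it might come back from a database driver
filename = ensure_text(b'[131225] TV\xe3\x82\xa2\xe3\x83\x8b\xe3\x83\xa1.zip')
print(filename)  # [131225] TVアニメ.zip
```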

I had a similar problem. I originally had this to send the file as attachment:
return send_file(dl_fd,
                 mimetype='application/pdf',
                 as_attachment=True,
                 attachment_filename=filename)
where dl_fd is a file descriptor for my file.
A plain Unicode filename didn't work because HTTP headers are effectively ASCII-only. Instead, based on information from this Flask issue and these test cases for RFC 2231, I rewrote the above to percent-encode the filename:
response = make_response(send_file(dl_fd, mimetype='application/pdf'))
response.headers["Content-Disposition"] = \
    "attachment; " \
    "filename*=UTF-8''{quoted_filename}".format(
        quoted_filename=urllib.quote(filename.encode('utf8')))
return response
Based on the test cases, the above doesn't work with IE8 but works with the other browsers listed. (I personally tested Firefox, Safari and Chrome on Mac)
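The header value this produces can be checked in isolation. A quick sketch using Python 3's urllib.parse.quote (the equivalent of the Python 2 urllib.quote call above; the filename here is just an illustration):

```python
from urllib.parse import quote

# RFC 2231/5987 style: percent-encode the UTF-8 bytes of the filename
# and ship them in the extended filename* parameter.
filename = 'résumé.pdf'
header = "attachment; filename*=UTF-8''{}".format(quote(filename))
print(header)  # attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.pdf
```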

You should use something like:
@app.route('/attachment/<int:attachment_id>/<filename>', methods=['GET'])
def attachment(attachment_id, filename):
    attachment_meta = AttachmentMeta(attachment_id, filename)
    if not attachment_meta:
        flask.abort(404)
    return flask.send_from_directory(
        directory=flask.current_app.config['UPLOAD_FOLDER'],
        filename=attachment_meta.filepath,
    )
This way url_for('attachment', attachment_id=1, filename=u'Москва 北京 תֵּל־אָבִיב.pdf') would generate:
/attachment/1/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%20%E5%8C%97%E4%BA%AC%20%D7%AA%D6%B5%D6%BC%D7%9C%D6%BE%D7%90%D6%B8%D7%91%D6%B4%D7%99%D7%91.pdf
Browsers will then display or save this file under the correct Unicode name. Don't use as_attachment=True, as that code path is exactly what fails on non-ASCII names.
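The percent-encoding that url_for applies to the path segment can be reproduced with the standard library (Python 3 shown; under Python 2 the equivalent is urllib.quote on the UTF-8 bytes):

```python
from urllib.parse import quote

# url_for percent-encodes non-ASCII path segments as their UTF-8 bytes.
segment = quote('Москва 北京')
print(segment)  # %D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%20%E5%8C%97%E4%BA%AC
```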

Related

Special/accented UTF-8 characters not recognised by PhantomJS

I am currently having an issue with PhantomJS (version 2.1.1/Windows 7) not recognising UTF-8 characters. Prior to asking this question, I have found the following two articles useful to configuring the command prompt:
PhantomJS arguments.js example UTF-8 not working
Unicode characters in Windows command line - how?
As suggested by the second article, I used the command
chcp 65001
to change the code page to UTF-8. I then also set the command prompt's default font to Lucida console.
To test this had worked, I created the following UTF-8 text file
---------------------------------------------------------
San José
Cañon City
Przecław Lanckoroński
François Gérard Hollande
El Niño
vis-à-vis
---------------------------------------------------------
and then ran the following command to demonstrate whether characters were being recognised and correctly displayed by the command prompt:
type utf8Test.txt
After this worked, I turned my attention to PhantomJS. Following the instructions here, I created the settings JSON file below to ensure that UTF-8 is the input and output character encoding (though this appears to be the default according to the official documentation).
{
    "outputEncoding": "utf8",
    "scriptEncoding": "utf8"
}
I then ran the following JavaScript through PhantomJS using the aforementioned json settings file in the same command prompt window:
console.log("---------------------------------------------------------");
console.log("San José");
console.log("Cañon City");
console.log("Przecław Lanckoroński");
console.log("François Gérard Hollande");
console.log("El Niño");
console.log("vis-à-vis");
console.log("---------------------------------------------------------");
page = require('webpage').create();

// Display the initial requested URL
page.onResourceRequested = function(requestData, request) {
    if (requestData.id === 1) {
        console.log(requestData.url);
    }
};

// Display any initial requested URL response error
page.onResourceError = function(resourceError) {
    if (resourceError.id === 1) {
        console.log(resourceError.status + " : " + resourceError.statusText);
    }
};

page.open("https://en.wikipedia.org/wiki/San_José", function(status) {
    console.log("---------------------------------------------------------");
    phantom.exit();
});
The output from running this script (originally attached as a screenshot) showed that PhantomJS is not able to understand the UTF-8 special characters; furthermore, it passes the Unicode replacement character to websites whenever it is given a special or accented character, as below:
URL passed to PhantomJS:
https://en.wikipedia.org/wiki/San_José
URL passed to remote host:
https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD
----------------------------------------------
%EF%BF%BD
�
instead of:
%C3%A9
é
This causes websites to respond with '400 : Bad Request' errors, and in the case of Wikipedia specifically, requesting the URL https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD results in an error message of:
Bad title - The requested page title contains an invalid UTF-8 sequence.
So, with all this being said, does anyone know how to remedy this? There are many websites these days that use UTF-8 special/accented characters in their page urls, and it would be great if PhantomJS could be used to access them.
I really appreciate any help or suggestions you can provide me with.
var url = 'https://en.wikipedia.org/wiki/San_José';
page.open(encodeURI(url), function(status) {
    console.log("---------------------------------------------------------");
    console.log(page.evaluate(function() { return document.title; }));
    phantom.exit();
});
Yes, PhantomJS still garbles those symbols in its console output on Windows (on Linux it works beautifully), but at least you will be able to open pages and process them.
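The URLs above also confirm the diagnosis: %EF%BF%BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which PhantomJS substituted for the mis-read é, while a correctly encoded é arrives as %C3%A9. A quick scratchpad check (Python used here purely for illustration):

```python
from urllib.parse import quote, unquote

# %EF%BF%BD decodes to U+FFFD, the replacement character substituted
# for the mis-read 'é'; the correct escape for 'é' is %C3%A9.
print(unquote('%EF%BF%BD'))  # \ufffd (replacement character)
print(quote('é'))            # %C3%A9
```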

Polish name (Wężarów) returned from json service as W\u0119\u017car\u00f3w, renders as WÄ™Å¼arÃ³w. Can't figure out encoding/charset.

I'm using DB-IP.com to get city names from IP addresses. Many of these are international cities, with special characters in the names.
As an example, one of these cities is Wężarów in Poland. Checking the JSON return in the console or opening the request URL directly, it's being returned from DB-IP as "W\u0119\u017car\u00f3w" with a Content-Type of text/javascript;charset=UTF-8. This is rendered in the browser as WÄ™Å¼arÃ³w - it is also saved in my mysql database as WÄ™Å¼arÃ³w (which I've tried with both utf8 and latin1 encoding).
I'm ok with saving it in the DB as another format, as long as I can convert it back to Wężarów for display in browser. I've tried encoding and decoding to/from several formats, even just to display directly on the screen (ignoring the DB entirely). I'm completely confused on what I need to do here to get it in readable format.
I'm working with Perl; however, if I can figure out what I need to do with the encoding/decoding/charset (as I'm currently clueless), I'm sure I can work out the rest from there.
It looks like the UTF-8 encoded string was interpreted by the browser as if it were Windows-1252. Here's how I deduced it:
% python3
>>> s = "W\u0119\u017car\u00f3w"
>>> b = bytes(s, encoding='utf-8')
>>> b
b'W\xc4\x99\xc5\xbcar\xc3\xb3w'
>>> str(b, encoding='utf-8')
'Wężarów'
>>> str(b, encoding='latin-1')
'WÄ\x99Å¼arÃ³w'
>>> str(b, encoding='windows-1252')
'WÄ™Å¼arÃ³w'
If you're not good with Python, what I'm doing here is encoding the string "W\u0119\u017car\u00f3w" into UTF-8, yielding the byte sequence 'W\xc4\x99\xc5\xbcar\xc3\xb3w'. Decoding that with UTF-8 yielded 'Wężarów', confirming that this is the correct UTF-8 encoding of the string you want. So I took a guess that the browser is using the wrong encoding to render it, and decoded it using Latin-1. That gave me something very close, so I looked up Latin-1 and noticed that it's named as the basis for Windows-1252. Decoding again as Windows-1252 gives the result you saw.
What's gone wrong here is that the browser can't tell what encoding to use to render the page, and it's guessing wrong. You need to fix this by telling it explicitly to use UTF-8. Here's a page by the W3C that describes how to do that. Essentially what you need to do is add an HTML <meta> element to the document head. If you also set an HTTP header with the encoding name in it, make sure they are consistent.
(In Firefox, while you're debugging, you can go to View -> Character Encoding to set the encoding on a page-by-page basis. I assume other browsers have the same feature.)
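The whole failure collapses into one line: encode the correct string as UTF-8, then decode those bytes as Windows-1252, which is what the answer deduces the browser was doing:

```python
# Correct text, encoded as UTF-8 but decoded as Windows-1252,
# reproduces the mojibake seen in the browser.
mojibake = 'Wężarów'.encode('utf-8').decode('windows-1252')
print(mojibake)  # WÄ™Å¼arÃ³w
```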

Fixing file encoding

Today I received translations I had ordered into 7 different languages. 4 of them look great, but when I opened the other 3, namely Greek, Russian, and Korean, the text wasn't recognizable as any language at all. It looked like a bunch of garbage characters, the kind you get when a file is opened with the wrong encoding.
For instance, here is part of the output of the Korean translation:
½Ì±ÛÇ÷¹À̾î
¸ÖƼÇ÷¹À̾î
¿É¼Ç
I may not even speak a hint of Korean, but I can tell you with all certainty that is not Korean.
I assume this is a file encoding issue, and when I open the file in Notepad, the encoding is listed as ANSI, which is clearly a problem; the same can be said for the other two languages.
Does anyone have any ideas on how to fix the encoding of these 3 files; I requested the translators reupload in UTF-8, but in the meantime, I thought I might try to fix it myself.
If anyone is interested in seeing the actual files, you can get them from my Dropbox.
If you look at the byte stream as pairs of bytes, they look vaguely Korean, but I cannot tell if they are what you would expect or not.
bash$ python3.4
Python 3.4.3 (v3.4.3:b4cbecbc0781, May 30 2015, 15:45:01)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> buf = '½Ì±ÛÇ÷¹À̾î'
>>> [hex(ord(b)) for b in buf]
['0xbd', '0xcc', '0xb1', '0xdb', '0xc7', '0xc3', '0xb7', '0xb9', '0xc0', '0xcc', '0xbe', '0xee']
>>> u'\uBDCC\uB1DB\uC7C3\uB7B9\uC0CC\uBEEE'
'뷌뇛쟃랹샌뻮'
Your best bet is to wait for the translator to upload UTF-8 versions, or have them tell you the encoding of the file. I wouldn't assume the bytes are simply 16-bit characters.
Update
I passed this through the chardet module and it detected the character set as EUC-KR.
>>> import chardet
>>> chardet.detect(b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE')
{'confidence': 0.833333333333334, 'encoding': 'EUC-KR'}
>>> b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE'.decode('EUC-KR')
'싱글플레이어'
According to Google translate, the first line is "Single Player". Try opening it with Notepad and using EUC-KR as the encoding.
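If the garbled view came from reading EUC-KR bytes as Windows "ANSI" (Windows-1252 on a Western-locale machine; an assumption here, since a Korean-locale machine would use CP949), the damage is reversible as long as every byte survived: re-encode the mojibake as Windows-1252, then decode as EUC-KR.

```python
# Undo the mojibake: the garbled text is EUC-KR bytes mis-decoded as
# Windows-1252 (assumed), so reverse both steps.
garbled = '½Ì±ÛÇÃ·¹ÀÌ¾î'
fixed = garbled.encode('windows-1252').decode('euc-kr')
print(fixed)  # 싱글플레이어 ("single player")
```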

Error with French characters [duplicate]

Assume that I need to insert the following document:
{
title: 'Péter'
}
(note the é)
It gives me an error when I use the following PHP-code ... :
$db->collection->insert(array("title" => "Péter"));
... because it needs to be utf-8.
So I should use this line of code:
$db->collection->insert(array("title" => utf8_encode("Péter")));
Now, when I request the document, I still have to decode it ... :
$document = $db->collection->findOne(array("_id" => new MongoId("__someID__")));
$title = utf8_decode($document['title']);
Is there some way to automate this process? Can I change the character encoding of MongoDB? (I'm migrating a MySQL database that uses cp1252 West Europe (latin1).)
I already considered changing the Content-Type-header, problem is that all static strings (hardcoded) aren't utf8...
Thanks in advance!
Tim
JSON and BSON can only encode/decode valid UTF-8 strings; if your data (including input) is not UTF-8, you need to convert it before passing it to any JSON-dependent system, like this:
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string); // or
$string = iconv('UTF-8', 'UTF-8//TRANSLIT', $string); // or even
$string = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $string); // not sure how this behaves
Personally I prefer the first option, see the iconv() manual page. Other alternatives include:
mb_convert_encoding()
utf8_encode(utf8_decode($string))
You should always make sure your strings are UTF-8 encoded, even the user-submitted ones, however since you mentioned that you're migrating from MySQL to MongoDB, have you tried exporting your current database to CSV and using the import scripts that come with Mongo? They should handle this...
EDIT: I mentioned that BSON can only handle UTF-8, but I'm not sure if this is exactly true, I have a vague idea that BSON uses UTF-16 or UTF-32 to encode / decode data, but I can't check now.
As @gates said, all string data in BSON is encoded as UTF-8. MongoDB assumes this.
Another key point which neither answer addresses: PHP is not Unicode aware. As of 5.3, anyway. PHP 6 will supposedly be Unicode-aware. What this means is you have to know what encoding is used by your operating system by default and what encoding PHP is using.
Let's get back to your original question: "Is there some way to automate this process?" ... my suggestion is to make sure you are always using UTF-8 throughout your application. Configuration, input, data storage, presentation, everything. Then the "automated" part is that most of your PHP code will be simpler since it always assumes UTF-8. No conversions necessary. Heck, nobody said automation was cheap. :)
Here's kind of an aside. If you created a little PHP script to test that insert() code, figure out what encoding your file is, then convert to UTF-8 before inserting. For example, if you know the file is ISO-8859-1, try this:
$title = mb_convert_encoding("Péter", "UTF-8", "ISO-8859-1");
$db->collection->insert(array("title" => $title));
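For comparison, the same conversion expressed in Python (hypothetical bytes; 0xE9 is 'é' in ISO-8859-1):

```python
# Latin-1 bytes for "Péter" converted to UTF-8, mirroring the PHP call
# mb_convert_encoding("Péter", "UTF-8", "ISO-8859-1").
latin1_bytes = b'P\xe9ter'
utf8_bytes = latin1_bytes.decode('iso-8859-1').encode('utf-8')
print(utf8_bytes)  # b'P\xc3\xa9ter'
```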
See also
http://www.phpwact.org/php/i18n/utf-8
http://www.ibm.com/developerworks/library/os-php-unicode/
http://htmlpurifier.org/docs/enduser-utf8.html
Can I change the character-encoding of MongoDB...
No. Data is stored as BSON, and according to the BSON spec, all strings are UTF-8.
Now, when I request the document, I still have to decode it ... :
Is there some way to automate this process?
It sounds like you are trying to output the data to web page. Needing to "decode" text that was already encoded seems incorrect.
Could this output problem be a configuration issue with Apache+PHP? UTF-8 handling in PHP is not automatic; a quick online search brings up several tutorials on the topic.

Getting a raw string from a unicode string in python

I have a Unicode string I'm retrieving from a web service in python.
I need to access a URL I've parsed from this string, that includes various diacritics.
However, if I pass the unicode string to urllib2, it raises a UnicodeEncodeError. The exact same string, as a "raw" string r"some string", works properly.
How can I get the raw binary representation of a unicode string in python, without converting it to the system locale?
I've been through the python docs, and every thing seems to come back to the codecs module. However, the documentation for the codecs module is sparse at best, and the whole thing seems to be extremely file oriented.
I'm on windows, if it's important.
You need to encode the URL from unicode to a bytestring. u'' and r'' produce two different kinds of objects: a unicode string and a bytestring.
You can encode a unicode string to a bytestring with the .encode() method, but you need to know what encoding to use. Usually, for URLs, UTF-8 is great, but you do need to escape the bytes to fit the URL scheme as well:
import urlparse, urllib

parts = list(urlparse.urlsplit(url))
parts[2] = urllib.quote(parts[2].encode('utf8'))  # percent-encode the path
url = urlparse.urlunsplit(parts)
The above example is based on an educated guess that the problem you are facing is due to non-ASCII characters in the path part of the URL, but without further details from you it has to remain a guess.
For domain names, you need to apply the IDNA RFC3490 encoding:
parts = list(urlparse.urlsplit(url))
parts[1] = parts[1].encode('idna')  # IDNA-encode the hostname
parts = [p.encode('utf8') if isinstance(p, unicode) else p for p in parts]
url = urlparse.urlunsplit(parts)
See the Python Unicode HOWTO for more information. I also strongly recommend you read the Joel on Software Unicode article as a good primer on the subject of encodings.
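For completeness, a Python 3 version of the same recipe (the functions moved into urllib.parse, and the 'idna' codec still applies to the host):

```python
from urllib.parse import urlsplit, urlunsplit, quote

url = 'https://en.wikipedia.org/wiki/San_José'
parts = list(urlsplit(url))
parts[1] = parts[1].encode('idna').decode('ascii')  # IDNA-encode the host
parts[2] = quote(parts[2])                          # percent-encode the path
safe_url = urlunsplit(parts)
print(safe_url)  # https://en.wikipedia.org/wiki/San_Jos%C3%A9
```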