Assume that I need to insert the following document:
{
title: 'Péter'
}
(note the é)
It gives me an error when I use the following PHP code ... :
$db->collection->insert(array("title" => "Péter"));
... because it needs to be UTF-8.
So I should use this line of code:
$db->collection->insert(array("title" => utf8_encode("Péter")));
Now, when I request the document, I still have to decode it ... :
$document = $db->collection->findOne(array("_id" => new MongoId("__someID__")));
$title = utf8_decode($document['title']);
Is there some way to automate this process? Can I change the character encoding of MongoDB? (I'm migrating a MySQL database that uses cp1252 West Europe (Latin-1).)
I already considered changing the Content-Type header; the problem is that all the static (hardcoded) strings aren't UTF-8...
Thanks in advance!
Tim
JSON and BSON can only encode/decode valid UTF-8 strings. If your data (including input) is not UTF-8, you need to convert it before passing it to any JSON-dependent system, like this:
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);           // drop invalid sequences, or
$string = iconv('UTF-8', 'UTF-8//TRANSLIT', $string);         // transliterate them, or even
$string = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $string); // both (not sure how this behaves)
Personally I prefer the first option, see the iconv() manual page. Other alternatives include:
mb_convert_encoding()
utf8_encode(utf8_decode($string))
You should always make sure your strings are UTF-8 encoded, even the user-submitted ones. However, since you mentioned that you're migrating from MySQL to MongoDB, have you tried exporting your current database to CSV and using the import scripts that come with Mongo? They should handle this...
EDIT: I mentioned that BSON can only handle UTF-8, but I'm not sure if this is exactly true. I have a vague idea that BSON uses UTF-16 or UTF-32 to encode/decode data, but I can't check right now.
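For what it's worth, here is the same convert-before-inserting idea sketched in Python with PyMongo (which also shows up later in this thread) rather than PHP; the database and collection names are made up, and it assumes the source bytes really are cp1252 as in the question:
from pymongo import MongoClient

collection = MongoClient().mydb.people          # hypothetical database/collection names

raw = b"P\xe9ter"                               # 'Péter' as the cp1252 bytes coming out of MySQL
title = raw.decode("cp1252")                    # convert to a proper Unicode string first
collection.insert_one({"title": title})         # the driver then stores it as UTF-8 in BSON
The point is the same as the iconv() advice above: do the conversion once, at the boundary where the legacy bytes enter, rather than sprinkling encode/decode calls around every query.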
As @gates said, all string data in BSON is encoded as UTF-8. MongoDB assumes this.
Another key point, which neither answer addresses: PHP is not Unicode-aware - as of 5.3, anyway (PHP 6 will supposedly be Unicode-aware). What this means is that you have to know what encoding your operating system uses by default and what encoding PHP is using.
Let's get back to your original question: "Is there some way to automate this process?" ... my suggestion is to make sure you are always using UTF-8 throughout your application. Configuration, input, data storage, presentation, everything. Then the "automated" part is that most of your PHP code will be simpler since it always assumes UTF-8. No conversions necessary. Heck, nobody said automation was cheap. :)
Here's kind of an aside: if you created a little PHP script to test that insert() code, figure out what encoding your script file is saved in, and convert to UTF-8 before inserting. For example, if you know the file is ISO-8859-1, try this:
$title = mb_convert_encoding("Péter", "UTF-8", "ISO-8859-1");
$db->collection->insert(array("title" => $title));
See also
http://www.phpwact.org/php/i18n/utf-8
http://www.ibm.com/developerworks/library/os-php-unicode/
http://htmlpurifier.org/docs/enduser-utf8.html
Can I change the character-encoding of MongoDB...
No. Data is stored in BSON, and according to the BSON spec, all strings are UTF-8.
Now, when I request the document, I still have to decode it ... :
Is there some way to automate this process?
It sounds like you are trying to output the data to a web page. Needing to "decode" text that was already encoded seems incorrect.
Could this output problem be a configuration issue with Apache+PHP? UTF-8 with PHP is not automatic; a quick online search brought up several tutorials on this topic.
Related
I'm using Windows 7 64-bit, Python 3, MongoDB, and PyMongo. I know that in Python 3, all strings are Unicode. I also know that MongoDB stores all strings as Unicode. So I don't understand why, when I pull a document from my database where the value of a particular field is "C:\Some Folder\E=mc².xyz", Python treats that string as "C:\Some Folder\E=mcÂ².xyz". It doesn't just print that way; os.path.exists() returns False. Now, as if that wasn't confusing enough, if I save the string to a text file and then open it with the encoding explicitly set to "utf-8", the string appears correctly, and os.path.exists() returns True. What's going wrong, and how do I fix it?
Edit:
Here's some code I just wrote to demonstrate my problem:
from pymongo import MongoClient
db = MongoClient().test_db
orig_doc = {'string': 'E=mc²'}
_id = db.test_col.insert(orig_doc)
new_doc = db.test_col.find_one(_id)
print(new_doc['string'])
>>> E=mc²
As you can see, it works exactly as it should! Thus I now realize that I must've messed up when I migrated from PostgreSQL. Now I just need to fix the strings. I know that it's possible, but there's got to be a better way than writing the strings to a text file and then reading them back. I could do that, just as I did in my previous testing, but it just doesn't seem like the right way.
You can't store "Unicode" as such; it is a concept. MongoDB has to be using a specific encoding of Unicode, and it looks like UTF-8. Python 3 Unicode strings are stored internally in one of a number of representations depending on the content of the string. What you have is a string that was decoded to Unicode with the wrong encoding:
>>> s='"C:\Some Folder\E=mc².xyz"' # The invalid decoding.
>>> print(s)
"C:\Some Folder\E=mc².xyz"
>>> print(s.encode('latin1').decode('utf8')) # Undo the wrong decoding, and apply the right one.
"C:\Some Folder\E=mc².xyz"
There's not enough information to tell you how to read MongoDB correctly, but this should help you along.
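If the bad decoding crept in during the PostgreSQL migration, one option is to repair the stored documents in place by re-applying the correct decoding. A rough sketch, using the hypothetical test_db/test_col names from the question and assuming every broken value round-trips through latin1 the same way:
from pymongo import MongoClient

col = MongoClient().test_db.test_col

for doc in col.find():
    val = doc.get('string')
    if not isinstance(val, str):
        continue
    try:
        fixed = val.encode('latin1').decode('utf8')    # undo the wrong decoding
    except (UnicodeEncodeError, UnicodeDecodeError):
        continue                                       # not this kind of mojibake; leave it alone
    if fixed != val:
        col.update_one({'_id': doc['_id']}, {'$set': {'string': fixed}})
Run it against a copy of the data first; strings that were already correct either fail the round trip (and are skipped) or come back unchanged.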
I've been having problems with "gremlins" from different encodings getting mixed into form input and data from a database within a Perl program. At first, I wasn't decoding, and smart quotes and similar things would turn into multiple gibberish characters; but blindly decoding everything as UTF-8 caused older Windows-1252 content to be filled with question marks.
So, I've used Encode::Detect::Detector and the decode() function to detect and decode all POST and GET input, along with data from a SQL database (the decoding process probably occurs on 10-20 strings of text each time a page is generated now). This seems to clean things up so UTF-8, ASCII and Windows-1252 content all display properly as UTF-8 output (as I've designated in the HTML headers):
my $encoding_name = Encode::Detect::Detector::detect($value);   # guess the encoding from the bytes
eval { $value = decode($encoding_name, $value) };               # decode; the eval swallows any failure
My question is this: how resource heavy is this process? I haven't noticed a slowdown, so I think I'm happy with how this works, but if there's a more efficient way of doing this, I'd be happy to hear it.
The answer is highly application-dependent, so the acceptability of the 'expense' accrued is your call.
The best way to quantify the overhead is through profiling your code. You may want to give Devel::NYTProf a spin.
Tim Bunce's YAPC::EU presentation provides more details about the module.
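If you want a ballpark number before reaching for the profiler, you can also time the detect-plus-decode step in isolation. Here is that approach sketched in Python rather than Perl, with chardet as a stand-in detector, purely to illustrate the measure-it technique - the absolute figures won't carry over to Encode::Detect, but the method of estimating per-page cost does:
import timeit
import chardet                                   # stand-in detector for this illustration

sample = "\u201csmart quotes\u201d and caf\u00e9".encode("windows-1252")

def detect_and_decode(raw):
    guess = chardet.detect(raw)["encoding"] or "utf-8"      # fall back if detection gives up
    return raw.decode(guess, errors="replace")

# Roughly one page's worth of strings (20), averaged over 100 runs.
per_page = timeit.timeit(lambda: [detect_and_decode(sample) for _ in range(20)], number=100) / 100
print("detect+decode for 20 strings: %.2f ms" % (per_page * 1000))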
I need to get a string from <STDIN>, written in mixed Latin and Russian encodings, and turn it into a URL:
$search_url = "http://searchengine.com/search?text=" . uri_escape($query);
But this process goes wrong and produces mojibake (a mix of weird letters). What can I do in Perl to solve it?
Before you can get started, there's a few things you need to know.
You'll need to know the encoding of your input. "Latin" and "Russian" aren't (character) encodings.
If you're dealing with multiple encodings, you'll need to know what is encoded using which encoding. "It's a mix" isn't good enough.
You'll need to know the encoding the site expects the query to use. This should be the same encoding as the page that contains the search form.
Then, it's just a matter of decoding the input using the correct encoding, and encoding the query using the correct encoding. That's the easy part. Encode provides functions decode and encode to do just that.
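To make those two steps concrete, here is the pipeline sketched in Python rather than Perl (the idea is identical): decode the raw input bytes with the encoding they actually use, then percent-encode the query in the encoding the search page expects. The specific encodings below (KOI8-R in, UTF-8 out) are assumptions for the example only:
import sys
from urllib.parse import quote

raw = sys.stdin.buffer.readline().rstrip(b"\n")     # read raw bytes, not already-decoded text
query = raw.decode("koi8-r")                        # assumption: the terminal sends KOI8-R

# Percent-encode using the encoding the target site expects (assumed UTF-8 here).
search_url = "http://searchengine.com/search?text=" + quote(query, safe="", encoding="utf-8")
print(search_url)
In Perl the same shape is Encode's decode() on the way in and uri_escape_utf8() (or encoding the string yourself before uri_escape) on the way out.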
Until now, the project I work on has used only ASCII in the source code. Due to several upcoming changes in the I18N area, and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).
Since the code is in ASCII now, I don't expect any trouble with the code itself. However, I'm not quite aware of the side effects we might get, though I think it's quite probable that there will be some, considering our environment (perl 5.8.8, Apache2, mod_perl, MSSQL Server with the FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as external bits expect it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
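The filehandle advice above is usually where the first surprises show up: be explicit about the encoding at every I/O boundary instead of trusting the locale default. A minimal sketch of that habit, shown in Python for brevity (the linked question covers the Perl open() layers), with made-up filenames:
# Be explicit at the boundary: decode on read, encode on write.
with open("names.txt", "r", encoding="utf-8") as fh:        # hypothetical input file
    names = [line.rstrip("\n") for line in fh]

with open("names_out.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(names) + "\n")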
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=decode_utf8($r->uri()))
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not characters (see the short sketch at the end of this answer)
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
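To illustrate the byte-versus-character point from the list above (the one the Content-Length item warns about), here is a minimal sketch in Python rather than Perl; the distinction is the same in any Unicode-aware language:
body = "Péter"                       # 5 characters
encoded = body.encode("utf-8")       # 6 bytes - the 'é' takes two bytes in UTF-8

print(len(body))                     # 5 (characters)
print(len(encoded))                  # 6 (bytes - this is what a Content-Length must count)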
I support a website written in Tcl which displays data in Traditional Chinese (big5). We then have a Java servlet, using the translation code from mandarintools.com, to translate a page request into Simplified Chinese. The conversion as specified to the translation code is from UTF-8 to UTF-8S; Java is apparently correctly translating the data to UTF-8 as it comes in.
The Java translation code works but is slow, and since the website is written in Tcl, someone on another list suggested I try using that. Unfortunately, Tcl doesn't support UTF-8S and I have been unable to figure out what encoding to use in its place. I've tried gb2312, gb2312-raw, gb1988, euc-cn... all result in gibberish. My assumption is that Tcl is also translating to UTF-8 as it comes in, though I have tried converting from big5 first and it doesn't help.
My test code looks like this:
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body
I have also tried
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 [encoding convertfrom big5 $page_body]]
ns_write $translated_page_body
But it didn't change anything.
Does anyone out there have enough experience with this to help me figure it out?
FYI for completeness' sake, I've been told by Tcl experts that you can't do the conversion this way, it has to be done via character replacement.
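That restriction makes sense once you separate the two layers: converting big5 to gb2312 only changes how the same characters are encoded as bytes, it does not map Traditional characters onto their Simplified equivalents, and most Traditional forms have no GB2312 code point at all. A quick illustration in Python (assuming, as I believe is the case, that the Traditional form U+6F22 is outside GB2312 while the Simplified form U+6C49 is inside):
traditional = "\u6f22"               # Traditional 'han'
simplified = "\u6c49"                # Simplified 'han'

print(simplified.encode("gb2312"))   # encodes fine
try:
    traditional.encode("gb2312")     # the Traditional character has no GB2312 code point
except UnicodeEncodeError as err:
    print("not representable in GB2312:", err)
So a character-level Traditional-to-Simplified mapping (which is what the mandarintools code does) has to happen before, or instead of, any encoding conversion.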
By any chance, are you grabbing your data from Oracle?
If so, see if you can use the CONVERT function to convert from "utf8" to "al32utf8", which is the true UTF-8 standard and which Tcl should work with seamlessly.
If not, well, I guess I'll wait for your comment(s).