I have a chat server that stores messages in MongoDB... emoticons (emoji specifically) are giving me grief.
Apparently emoticons/emoji need the utf8mb4 encoding... can MongoDB store data in this encoding? If it can't store utf8mb4 directly, is there some kind of workaround?
MongoDB stores strings as UTF-8, which covers all Unicode characters (variable byte length, up to 4 bytes per character).
MySQL's original "utf8" character set only handles characters up to 3 bytes long (like many other implementations). The MySQL character set called "utf8mb4" allows the full 4 bytes, as the UTF-8 specification (RFC 3629) defines.
So utf8mb4 in MySQL is equivalent to UTF-8 in MongoDB.
What I saw in my tests: Robomongo does not handle 4-byte Chinese characters, while MongoVUE, for example, has no problems with them.
Interesting article about the utf8 max byte size: https://stijndewitt.wordpress.com/2014/08/09/max-bytes-in-a-utf-8-char/
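To double-check this against a running server, here is a minimal PyMongo sketch (not from the original answer; the database and collection names are arbitrary) that round-trips a 4-byte emoji:

# Round-trip a 4-byte UTF-8 character (emoji) through MongoDB.
# Assumes a local mongod and the PyMongo driver; names below are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
messages = client.test_db.messages

text = "thumbs up \U0001F44D"        # U+1F44D is outside the BMP (4 bytes in UTF-8)
messages.insert_one({"body": text})
stored = messages.find_one({"body": text})
assert stored["body"] == text        # comes back unchanged
print(stored["body"])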
I'm trying to choose the best NoSQL database service for my data.
Most NoSQL databases out there only support UTF-8, and there's no way to enforce an encoding, unlike relational databases. The problem is that UTF-8 uses a single byte only for the first 128 characters (U+0000-U+007F), but two bytes for the rest of the Latin-1 range, and those are the characters that make up 80% of my data (don't ask me why I have more of these squiggles than the actual English alphabet, it's a long-winded answer).
I'll have to perform lots of queries with regular expressions on those strings, which look like "àñÿÝtçh" and consist mostly of characters 128-255 in ISO-8859-1, ISO-8859-15 or Windows-1252. I know I will never need to store characters outside that range in my database, so it's safe to work with only 256 characters and miss out on the gazillion characters UTF-8 supports. I am also aware that ISO-8859-1 will create lots of compatibility problems with JSON objects and the like. However, we will be running lots of regex queries, some of them quite complex, and the extra cost of doubling the bytes just because I have no choice but to use UTF-8 may have a negative impact on performance.
I understand that NoSQL databases tend to be schema-less, and fields are not normally defined with data types and encodings, but NoSQL will suit our project much better than SQL. Cassandra stores strings in 1-byte US-ASCII for the 0-127 lot, and not 1-byte UTF-8. Is there any NoSQL out there that defaults to ISO-8859-1 or ISO-8859-15 for the 0-255 lot?
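For what it's worth, the size difference described above is easy to check with a quick Python sketch (illustrative only, not tied to any particular database):

# Byte counts for a Latin-1-range string under different encodings.
s = "àñÿÝtçh"
print(len(s))                        # 7 characters
print(len(s.encode("utf-8")))        # 12 bytes: each accented letter takes 2 bytes
print(len(s.encode("iso-8859-1")))   # 7 bytes: one byte per character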
Currently I use:
The utf8mb4 database character set.
The utf8mb4_unicode_520_ci database collation.
I understand that utf8mb4 supports up to four bytes per character. I also understand that Unicode is a standard that continues to get updates. In the past I thought utf8 was sufficient until I had some test data get corrupted, lesson learned. However I'm having difficulty understanding the upgrade path for both the character set and collations.
The utf8mb4_unicode_520_ci database collation is based on Unicode Collation Algorithm version 5.2.0. If you navigate to the parent directory of that UCA release you'll see versions up to 14.0 listed at the time of writing. Those are the Unicode standards; then there are the character sets and collations that MariaDB supports.
Offhand I'm not sure whether a newer Unicode version ever pushes the requirement past four bytes per character to eight or even sixteen, so it's not as simple as just updating the database collation. Additionally, I'm not seeing anything newer than version 5.2.0 in MariaDB's documentation.
So in short my three highly related questions are:
Are the newer collations, such as those for Unicode version 14, still fully compatible with four-byte characters, or have they exhausted all combinations and now require up to eight or 16 bytes per character?
What is the latest Unicode collation version that MariaDB supports?
If, per the second question, MariaDB supports a version newer than 5.2.0, is utf8mb4 still sufficient as a character set or not?
I am not bound to or care about MySQL compatibility.
You can inspect the collations currently supported by your MariaDB instance:
SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4';
As far as I know, MariaDB does not support any Unicode collation newer than the 5.2.0-based ones (utf8mb4_unicode_520_ci). If you try to use a '900' collation, for example when importing metadata from MySQL 8.0 into MariaDB, you get errors.
There is no such thing as an 8-byte or 16-byte UTF-8 encoding. UTF-8 is an encoding that uses between 1 and 4 bytes per character, no more than that.
MariaDB also supports utf16 and utf32, but neither of these uses more than 4 bytes per character. UTF-16 is variable-length, using one or two 16-bit code units per character. UTF-32 is fixed-width, always using 32 bits (4 bytes) per character.
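To make those byte widths concrete, here is a small Python sketch (illustrative, not MariaDB-specific) comparing a BMP character with a supplementary-plane character:

# Encoded sizes of a BMP character vs. a character outside the BMP.
for ch in ("é", "\U0001F600"):              # U+00E9 and U+1F600
    print(hex(ord(ch)),
          len(ch.encode("utf-8")),          # 2 and 4 bytes
          len(ch.encode("utf-16-le")),      # 2 and 4 bytes (one or two 16-bit code units)
          len(ch.encode("utf-32-le")))      # always 4 bytes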
Here I want to convert my string to Unicode. I am using PostgreSQL 9.3. In SQL Server it's much easier:
Example:
sql = N'select * from tabletest'; /* For nvarchar/nchar/ntext */
OR
sql = U'select * from tabletest'; /* For varchar/char/text */
Question: How can I do the above conversion in PostgreSQL?
PostgreSQL databases have a single native character encoding, the "server encoding". It is usually UTF-8.
All text is stored in this encoding. Mixed-encoding text is not supported, except when stored as bytea (i.e. as opaque byte sequences).
You can't store "unicode" vs "non-unicode" strings, and PostgreSQL has no concept of "varchar" vs "nvarchar". With UTF-8, characters that fall in the 7-bit ASCII range (and some others) are stored as a single byte, and wider characters require more storage, so it's all automatic. UTF-8 requires more storage than UCS-2 or UTF-16 for text that is all "wide" characters, but less for text that's a mixture.
PostgreSQL automatically converts to/from the client's text encoding, using the client_encoding setting. There is no need to convert explicitly.
If your client is "Unicode" (which Microsoft products tend to say when they mean UCS-2 or UTF-16), then most client drivers take care of any utf-8 <--> utf-16 conversion for you.
So you should not need to care, as long as your client does I/O with the correct charset options and sets a client_encoding that matches the data it actually sends on the wire. (This is automatic with most client drivers like PgJDBC, Npgsql, or the Unicode psqlODBC driver.)
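As an illustration (a sketch assuming psycopg2, a UTF-8 server encoding, and made-up connection details), no N'...' prefix or explicit conversion is involved:

# Round-trip a non-ASCII string with psycopg2; no N'' prefix or manual conversion needed.
# Connection parameters and the table name are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=test user=test")
conn.set_client_encoding("UTF8")            # usually already correct from the environment
with conn, conn.cursor() as cur:
    cur.execute("SELECT current_setting('server_encoding')")
    print(cur.fetchone()[0])                # e.g. UTF8
    cur.execute("CREATE TEMP TABLE tabletest (name text)")
    cur.execute("INSERT INTO tabletest (name) VALUES (%s)", ("Grüße, 世界",))
    cur.execute("SELECT name FROM tabletest")
    print(cur.fetchone()[0])                # comes back unchanged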
See:
character set support
In MongoDB 2.0.6, when attempting to store or query documents that contain string fields whose values include characters outside the BMP, I get a raft of errors like "Not proper UTF-16: 55357" or "buffer too small".
What settings, changes, or recommendations are there to permit storage and query of multi-lingual strings in Mongo, particularly ones that include these characters above 0xFFFF?
Thanks.
There are several issues here:
1) Please be aware that MongoDB stores all documents using the BSON format. Also note that the BSON spec refers to a UTF-8 string encoding, not a UTF-16 encoding.
Ref: http://bsonspec.org/#/specification
2) All of the drivers, including the JavaScript driver in the mongo shell, should properly handle strings that are encoded as UTF-8. (If they don't then it's a bug!) Many of the drivers happen to handle UTF-16 properly, as well, although as far as I know, UTF-16 isn't officially supported.
3) When I tested this with the Python driver, MongoDB could successfully load and return a string value that contained a broken UTF-16 surrogate pair. However, I couldn't load a broken pair using the mongo shell, nor could I store a string containing a broken pair into a JavaScript variable in the shell. (A small sketch after this list illustrates the well-formed vs. broken case.)
4) mapReduce() runs correctly on string data containing a well-formed UTF-16 surrogate pair, but it generates an error when run on string data containing a broken pair.
It appears that mapReduce() fails when MongoDB tries to convert the BSON into a JavaScript variable for use by the JavaScript engine.
5) I've filed Jira issue SERVER-6747 for this issue. Feel free to follow it and vote it up.
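To illustrate points 2 and 3, here is a PyMongo sketch (database and collection names are arbitrary): a well-formed surrogate pair, i.e. a character above 0xFFFF, round-trips fine, while a lone surrogate can't even be encoded to UTF-8 on the client side:

# A valid non-BMP character round-trips; a lone (broken) surrogate is rejected by the
# driver because it is not valid UTF-8. Names below are illustrative.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017").test_db.utf8_test

ok = "outside the BMP: \U0001F600"          # a proper UTF-16 surrogate pair in JavaScript terms
coll.insert_one({"s": ok})
print(coll.find_one({"s": ok})["s"])        # round-trips unchanged

broken = "\ud83d"                           # lone high surrogate (a "broken code pair")
try:
    coll.insert_one({"s": broken})
except Exception as exc:                    # the driver refuses to encode it
    print("rejected:", exc)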
I'm using JDBC (jt400) to insert data into an AS/400 table.
The DB table's code page is 424 (Host Code Page 424).
The EBCDIC 424 code page does not support many of the characters that may come from the client.
For example, the sign → (ASCII 26, hex 1A).
The result is an incorrect translation.
Is there any built-in way in the Toolbox to remove the unsupported characters?
You could try to create a logical file over your CCSID 424 physical file with a different code page. On the AS/400 it is possible to create logical files with different code pages for individual columns, by adding the keyword CCSID(<num>). You can even set it to a Unicode character set, e.g. CCSID(1200) for UTF-16. Of course your physical file will still only be able to store characters that are in the 424 code page, and anything outside it will be replaced by a substitution character, but the translation might be better that way.
There is no way to directly store characters that are not in code page 424 in a column with that code page (the only way I can think of is encoding them somehow as multiple characters, but that is most likely not what you want to do, since it will bring more problems than it "solves").
If you have control over that system, and it is possible to make some bigger changes, you could do it the other way around: create a new Unicode version of that physical file under a different name (I'd propose CCSID(1200); that's as close as you get to UTF-16 on the AS/400 as far as I know, and UTF-8 is not supported by all parts of the system in my experience; IBM does recommend 1200 for Unicode). Then transfer all data from your old file to the new one, delete the old one (back it up first!), and then create a logical file over the new physical file with the name of the old physical file. In that logical file, change all CCSID-bearing columns from 1200 to 424. That way, existing programs can still work on the data. Of course there will be invalid characters in the logical file once you insert data that is not in the CCSID 424 subset, so you will most likely have to take a look at all programs that use the new logical file.
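Not a feature of the jt400 Toolbox, but as an illustration of the client-side alternative the question hints at (replacing characters the target code page cannot represent before they are sent), here is a sketch using Python's built-in cp424 codec; the replacement policy is an assumption:

# Replace characters that EBCDIC code page 424 cannot represent before inserting them.
# Pre-filtering on the client like this is an assumption, not a jt400 feature.
def strip_unsupported(text: str, codepage: str = "cp424") -> str:
    # Round-trip through the target code page; unmappable characters become "?".
    return text.encode(codepage, errors="replace").decode(codepage)

print(strip_unsupported("shalom → world"))  # the arrow is not in CCSID 424 and is replaced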