What encoding setting do I need in MySQL to support mathematical symbols? - mysqli

I need to support the following symbols: π, ∑, ≥, ≠, ≤, ∞, α, Ω, ←, ◊ in a C# application with a mysql back end.
I have tried setting charset = utf8 (in both the database and the connection string) and collation = utf8_unicode_ci,
and I get "Incorrect string value" errors when trying to save.

UPDATE:
I've just installed MySQL Server and the HeidiSQL client on my PC, selecting UTF-8 as the default server charset.
I also created a test database and a table as follows:
The database:
CREATE DATABASE `test` /*!40100 CHARACTER SET utf8 COLLATE utf8_general_ci */
The table:
CREATE TABLE `math` (
`id` INT(10) NOT NULL,
`symbol` CHAR(1) NULL DEFAULT NULL,
PRIMARY KEY (`id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
ROW_FORMAT=DEFAULT
Then I inserted some symbols one by one, copying and pasting them from your post and from this other page; the table stored and displayed all of them correctly.
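The inserts were of this form (the ids and symbols below are only an illustration of the pattern), and HEX() is a handy way to check which bytes actually get stored:
INSERT INTO `math` (`id`, `symbol`) VALUES (1, 'π');
INSERT INTO `math` (`id`, `symbol`) VALUES (2, '∑');
INSERT INTO `math` (`id`, `symbol`) VALUES (3, '≠');
SELECT `id`, `symbol`, HEX(`symbol`) FROM `math`;
-- with a utf8 column, π shows up as CF80, ∑ as E28891 and ≠ as E289A0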
My server configuration uses utf8 as the default server character set.
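To check the same settings on any server, the character set and collation variables can be listed with:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';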
I hope this information is useful to you.
Also, check these links:
Unicode: Free On-line Unicode Character Map
Gives you the possibility to see the different characters that are supported (or not!) in your browser, and to see which code is used if you need it. A nice feature is that you can easily enlarge the text in your browser to see the characters better ([Ctrl]+[+] in Mozilla). If you are interested in math symbols, check list "22 Mathematical Operators". For chemists looking for arrows, list "21" might be interesting.
Mathematical UTF-8 Special Characters
Unicode 6.0 Character Code Charts
Mathematical Symbols in Unicode
Collation chart for utf8_general_ci, European alphabets (MySQL 6.0.4): Blocks: Basic Latin, Latin1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional, Latin ligatures, Greek, Greek Extended, Cyrillic, Cyrillic Supplement, Armenian, Georgian
Other MySQL collation charts

Related

How to find UTF-8 codes via LIKE '%\xC2\xA0%'?

I have a column that contains NO-BREAK SPACE (\xC2\xA0) instead of SPACE, and I need to find those rows.
Copy-pasting works:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '% %'
but using the code points does not:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%\xC2\xA0%'
How can I find the rows where a column contains such symbols via \x code points?
Try
WHERE regexp_like(PRODUCT_NAME, UNISTR('\00A0'))
Depending on your database character set, you may try CHR(160) instead of UNISTR('\00A0').
\xC2\xA0 is not the code point; it is the binary representation of the Unicode character U+00A0 (NO-BREAK SPACE) encoded as UTF-8. The Unicode code point is U+00A0, or 160 in decimal.
Exasol does not support \x escape sequences. "\xC2\xA0" is not a code point, but the UTF-8 encoding of the code point U+00A0 (decimal 160). See https://www.compart.com/en/unicode/U+00A0 for more information.
However, you can use UNICODECHR(...):
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%'||unicodechr(160)||'%'
Note that you have to give it the decimal Unicode code point, not the UTF-8 encoding. You can use SELECT TO_NUMBER('00A0', 'XXXX'); to convert hex to decimal.
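Putting those pieces together (still using the placeholder table name from the question), the lookup can be written in one statement:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%'||UNICODECHR(TO_NUMBER('00A0', 'XXXX'))||'%'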
If you absolutely need to use \x escape sequences, you can create a UDF. See https://docs.exasol.com/7.1/database_concepts/udf_scripts.htm for more information.

Replace non UTF-8 from String

I have a table that has strings with non-UTF-8 characters, like �. I need to change them so that they get their accents and other Latin characters back, e.g. cap� to capó. The field is a VARCHAR.
So far, I have tried:
SELECT "Column Name", regexp_replace("Column Name", '[^\w]+','') FROM table
And:
CONVERT("Column Name", 'UTF8', 'LATIN1') but don't work at all.
For instance, the error I get is: "Regexp encountered an invalid UTF-8 character (...)"
I have seen other solutions, but I cannot use them because I am not an administrator and cannot change the table.
Is there any way to achieve this?
If the database encoding is UTF8, then all your strings will contain only UTF8 characters. They just happen to be different characters than you want.
First, you have to find out what characters are in the strings. In the case you show, � is Unicode code point U+FFFD.
So you could use the replace function in PostgreSQL to replace it with ó (Unicode code point U+00F3) like this:
SELECT replace(mycol, E'\uFFFD', E'\u00f3') FROM mytab;
This uses the Unicode character literal syntax of PostgreSQL; don't forget to prefix any string that contains escapes with E for the extended string literal syntax.
Chances are that the character is not really �, because that is the REPLACEMENT CHARACTER, which is often used to represent characters that are not representable.
In that case, use psql and run a query like this to display the hexadecimal UTF-8 contents of your fields:
SELECT mycol::bytea FROM mytab WHERE id = 12345;
From the UTF-8 encoding of the character you can deduce what character it really is and use that in your call to replace.
If you have several characters, you will need several calls to replace to translate them all.
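For example, two chained calls could look like this (the second substitution is purely hypothetical, just to show the pattern):
SELECT replace(
         replace(mycol, E'\uFFFD', E'\u00f3'),  -- U+FFFD -> ó, as above
         E'\u009d', '')                         -- hypothetical: strip a stray U+009D control character
FROM mytab;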

Force Unicode on Data Transfer utility for iSeries AS400 for TSV tab delimited files

I am using the Data Transfer utility for IBM i in order to create TSV files from my AS400s and then import them into my SQL Server data warehouse.
Following this SO question about an SSIS encoding script, I want to stop using the conversion in the SSIS task and have the data ready from the source.
I have tried various code pages when creating the TSV (1200, etc.), but 1208 only does half the trick: it creates UTF-8, which I then have to convert to Unicode as shown in the other question.
Which CCSID do I have to use to get Unicode from the start?
Utility Screenshot:
On IBM i, CCSID support is intended to be seamless. Imagine the situation where the table is in German encoding, your job is in English and you are creating a new table in French - all on a system whose default encoding is Chinese. Use the appropriate CCSID for each of these and the operating system will do the character encoding conversion for you.
Unfortunately, many midrange systems aren't configured properly. Their system default CCSID is 'no CCSID / binary' - a remnant of a time some 20 years ago, before CCSID support. DSPSYSVAL QCCSID will tell you what the default CCSID is for your system. If it's 65535, that's 'binary'. This causes no end of problems, because the operating system can't figure out what the true character encoding is. Because CCSID(65535) was set for many years, almost all the tables on the system have this encoding. All the jobs on the system run under this encoding. When everything on the system is 65535, then the OS doesn't need to do any character conversion, and all seems well.
Then, someone needs multi-byte characters. It might be an Asian language, or as in your case, Unicode. If the system as a whole is 'binary / no conversion' it can be very frustrating because, essentially, the system admins have lied to the operating system with respect to the character encoding that is in effect for the database and jobs.
I'm guessing that you are dealing with a CCSID(65535) environment. I think you are going to have to request some changes. At the very least, create a new work table using an appropriate CCSID such as EBCDIC US English (37), as sketched below. Use a system utility like CPYF to populate this table. Now try to download that, using a CCSID of, say, 13488. If that does what you need, then perhaps all you need is an intermediate table to pass your data through.
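As a rough sketch only (MYLIB, WORKTBL, and the column names are invented for illustration), such a work table can be declared in SQL with an explicit EBCDIC CCSID on its character column:
CREATE TABLE MYLIB.WORKTBL (
  ID       INTEGER NOT NULL PRIMARY KEY,
  TEXTDATA CHAR(100) CCSID 37
)
Populate it with CPYF as described above, then download it while requesting CCSID 13488, so the character conversions are done by the operating system.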
Ultimately, the right solution is a proper CCSID configuration. Have the admins set the QCCSID system value and consider changing the encoding on the existing tables. After that, the system will handle multiple encodings seamlessly, as intended.
The CCSID on IBM i called 13488 is Unicode of type UCS-2 (big-endian; the fixed-width subset of UTF-16). There is no single "Unicode"; there are several Unicode encoding forms. I looked at your other question: 1208 is also Unicode, in UTF-8 form. So what exactly is meant by "to get Unicode to begin with" is not clear (you are already getting Unicode to begin with, in UTF-8 form). But then I read your other question, and the function you mention does not say what kind of "Unicode" it expects:
using (StreamWriter writer = new StreamWriter(to, false, Encoding.Unicode, 1000000))
By default, IBM i mostly stores data in EBCDIC database tables, and only some rare applications built on this system use Unicode natively. The operating system will translate the data into whatever type of Unicode it supports.
As for SQL Server and Java, I am fairly sure they use UCS-2 style Unicode, so if you use CCSID 13488 on the AS/400 side for the transfer, it may let you avoid the extra conversion from UTF-8, because CCSID 13488 is UCS-2 style Unicode.
https://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.html
There are two CCSIDs for UTF-8 Unicode on System i: 1208 and 1209. 1208 is UTF-8 with IBM PUA; 1209 is plain UTF-8. See the link above.

Localized COLLATE on a SQLite string comparison

I want to compare two strings in a SQLite DB without caring about accents or case. I mean that "Événement" should be equal to "evenèment".
On Debian Wheezy, the SQLite package doesn't provide ICU. So I compiled the official SQLite package (version 3.7.15.2 2013-01-09 11:53:05), which contains an ICU module. Now I do have better Unicode support (the original lower() applied only to ASCII characters; now it works on other letters). But I can't manage to apply a collation to a comparison.
SELECT icu_load_collation('fr_FR', 'FRENCH');
SELECT 'événement' COLLATE FRENCH = 'evenement';
-- 0 (should be 1)
SELECT 'Événement' COLLATE FRENCH = 'événement';
-- 0 (should be 1 if collation was case-insensitive)
SELECT lower('Événement') = 'événement';
-- 1 (at least lower() works as expected with Unicode strings)
The SQLite documentation confirms that this is the right way to apply a collation. I think the documentation of this ICU extension is a bit light (few examples, nothing on case sensitivity for collations).
I don't understand why the COLLATE operator has no effect in my example above. Please help.
It took me hours to understand the situation... The way ICU collations are defined in SQLite has (almost) no effect on comparisons; the one exception, according to ICU, is Hebrew text with cantillation marks. That is simply the default behavior of the ICU library's collations. With SQLite, LIKE becomes case-insensitive when ICU is loaded, but normalization of accented letters cannot be obtained this way.
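For instance, with the ICU extension loaded, one would expect something like this:
SELECT 'Événement' LIKE 'évéNEMENT';
-- 1 (LIKE is now case-insensitive, even for non-ASCII letters)
SELECT 'Événement' LIKE 'evenement';
-- 0 (accents are still significant, so LIKE alone does not help here)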
I finally understood that what I needed was to set the strength of the collation to the primary level instead of the default tertiary level.
I found no way to set this through the locale (e.g. several variants of SELECT icu_load_collation('fr_FR,strength=0', 'french') were useless).
So the only solution was to patch the code of SQLite. It was easy thanks to the ucol_setStrength() function in the ICU API.
The minimal change is a one-line patch: add the line ucol_setStrength(pUCollator, 0); after pUCollator = ucol_open(zLocale, &status); in the function icuLoadCollation().
For a backwards-compatible change, I added an optional third parameter to icu_load_collation() that sets the strength: 0 for the default, 1 for primary, and so on, up to 4 for quaternary.
See the diff.
At last I have what I wanted:
SELECT icu_load_collation('fr_FR', 'french_ci', 1); -- collation with strength=primary
SELECT 'Événement' COLLATE french_ci = 'evenèment';
-- 1

Converting accented characters in PostgreSQL?

Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively.
The closest thing I could find is the translate function, given the example in the comments section found here.
Some commonly used accented characters can be searched using the following function:
translate(search_terms,
'\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215\303\216\303\217\303\221\303\222\303\223\303\224\303\225\303\226\303\230\303\231\303\232\303\233\303\234\303\235\303\237\303\240\303\241\303\242\303\243\303\244\303\245\303\246\303\247\303\250\303\251\303\252\303\253\303\254\303\255\303\256\303\257\303\261\303\262\303\263\303\264\303\265\303\266\303\270\303\271\303\272\303\273\303\274\303\275\303\277','AAAAAAACEEEEIIIINOOOOOOUUUUYSaaaaaaaceeeeiiiinoooooouuuuyy')
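For example, applied to the characters mentioned in the question (the table and column names below are invented):
SELECT translate(city_name, 'åøÅØ', 'aoAO') AS plain_name
FROM cities;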
Are you doing this just for indexing/sorting? If so, you could use this postgresql extension, which provides proper Unicode collation. The same group has a postgresql extension for doing normalization.