Firebird 2.5 WIN1258 charset does not support Vietnamese characters?

I created a table in Firebird 2.5 with the following code:
CREATE TABLE DMSV (
  MASV CHAR(8) CHARACTER SET ASCII NOT NULL,
  TENSV VARCHAR(120) CHARACTER SET UTF8 NOT NULL,
  LOP CHAR(10) CHARACTER SET ASCII NOT NULL,
  SDT VARCHAR(11) CHARACTER SET ASCII NOT NULL,
  EMAIL VARCHAR(100) CHARACTER SET ASCII,
  FACE VARCHAR(100) CHARACTER SET UTF8,
  PRIMARY KEY (MASV));
When I type Vietnamese characters into the column "TENSV", some special Vietnamese characters are not shown correctly; they appear as "?" (for example, "?" instead of "ể"). I changed the character set to WIN1258, but that did not solve the problem. What should I do to store the characters correctly? Thanks a lot for any help. :D

The character (ể) is Unicode code point U+1EC3, which is not part of WIN1258, and therefore won't be stored (or displayed) using WIN1258. Make sure you use UTF8 for both your connection character set and your database/column character set, and that the application, console, or whatever you use for display also supports this character.
See also http://www.scarfboy.com/coding/unicode-tool?s=U%2B1ec3
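A quick way to check whether a character survives a given single-byte code page is to try encoding it, for example in Python (a sketch, assuming Python's cp1258 codec matches Firebird's WIN1258 mapping):

```python
# U+1EC3 (ể) has no precomposed slot in Windows-1258,
# so encoding to cp1258 fails, while UTF-8 handles it fine.
ch = "\u1ec3"  # ể

try:
    ch.encode("cp1258")
    in_win1258 = True
except UnicodeEncodeError:
    in_win1258 = False

print(in_win1258)          # False: WIN1258 cannot store the precomposed character
print(ch.encode("utf-8"))  # b'\xe1\xbb\x83' -- three bytes in UTF8
```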

Related

How to find UTF-8 codes via LIKE '%\xC2\xA0%'?

I have a column that contains NO-BREAK SPACE (\xC2\xA0) instead of SPACE and I need to find those rows.
Copy-pasting works:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '% %'
but using the code points does not:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%\xC2\xA0%'
How can I find the rows where a column contains such symbols via \x code points?
Try
WHERE regexp_Like(PRODUCT_NAME, UNISTR('\00A0'))
Depending on your database character set you may try CHR(160) (the decimal code point of NO-BREAK SPACE) instead of UNISTR('\00A0')
xC2 xA0 is not the code point; it is the binary representation of the Unicode character U+00A0 (NO-BREAK SPACE) encoded in UTF-8. The Unicode code point is U+00A0, or 160 in decimal.
Exasol does not support \x escape sequences. "\xC2\xA0" is not a code point, but the UTF-8 encoding of the code point U+00A0 (decimal 160). See https://www.compart.com/en/unicode/U+00A0 for more information.
However, you can use UNICODECHR(...):
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%'||unicodechr(160)||'%'
Note that you have to give it the decimal Unicode code point, not the UTF-8 encoding. You can use SELECT TO_NUMBER('00A0', 'XXXX'); to convert hex to decimal.
If you absolutely need to use \x escape sequences, you can create a UDF. See https://docs.exasol.com/7.1/database_concepts/udf_scripts.htm for more information.
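The distinction between the code point and its UTF-8 bytes can be illustrated outside the database (a Python sketch; the sample string is made up):

```python
s = "price\u00a0100"          # contains NO-BREAK SPACE (U+00A0)

# Searching by code point works on the decoded string...
print("\u00a0" in s)           # True
print(chr(160) in s)           # True -- 160 is the decimal form of U+00A0

# ...while b"\xC2\xA0" only exists at the byte level, after UTF-8 encoding.
print(b"\xc2\xa0" in s.encode("utf-8"))   # True
print("\xc2\xa0" in s)                    # False -- not two characters in the text
```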

Replace non UTF-8 from String

I have a table with strings containing non-UTF-8 characters, like �. I need to change them so they get back their accents and other Latin characters, e.g. cap� should become capó. The field is a VARCHAR.
So far, I have tried:
SELECT "Column Name", regexp_replace("Column Name", '[^\w]+', '') FROM table
And:
CONVERT("Column Name", 'UTF8', 'LATIN1')
but neither works. For instance, the error I get is: "Regexp encountered an invalid UTF-8 character (...)"
I have seen other solutions, but I can't use them because I cannot change the table; I am not an administrator.
Is there any way to achieve this?
If the database encoding is UTF8, then all your strings will contain only UTF8 characters. They just happen to be different characters than you want.
First, you have to find out what characters are in the strings. In the case you show, � is Unicode code point U+FFFD.
So you could use the replace function in PostgreSQL to replace it with ó (Unicode code point U+00F3) like this:
SELECT replace(mycol, E'\uFFFD', E'\u00f3') FROM mytab;
This uses the Unicode character literal syntax of PostgreSQL; don't forget to prefix all strings with escapes in them with E for extended string literal syntax.
Odds are that the character is not really �, because that is the "REPLACEMENT CHARACTER", often used to represent characters that are not representable.
In that case, use psql and run a query like this to display the hexadecimal UTF-8 contents of your fields:
SELECT mycol::bytea FROM mytab WHERE id = 12345;
From the UTF-8 encoding of the character you can deduce what character it really is and use that in your call to replace.
If you have several characters, you will need several calls to replace to translate them all.
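The same replacement can be sketched outside the database (Python; the sample value is made up):

```python
s = "cap\ufffd"                       # '�' is U+FFFD, REPLACEMENT CHARACTER

# Replace the replacement character with the letter we believe was intended.
fixed = s.replace("\ufffd", "\u00f3")   # U+00F3 is 'ó'
print(fixed)                            # capó

# To identify what a mystery character really is, look at its UTF-8 bytes,
# much like the mycol::bytea trick above.
print(s.encode("utf-8"))                # b'cap\xef\xbf\xbd'
```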

Replacing an invisible unicode control character in TSQL

I'm trying to search and replace an invisible unicode control character in a string in TSQL. The control character is 'LSEP' or 0x2028.
I can find the character easily enough using either of these two queries:
SELECT * FROM Email WHERE Html LIKE '%[0x2028]%'
or
SELECT * FROM Email WHERE CHARINDEX(NCHAR(0x2028) COLLATE Latin1_General_BIN2, Html) > 0
However, when I come to try and replace it, the following just doesn't work:
UPDATE Email
SET Html = REPLACE(Html, NCHAR(0x2028) COLLATE Latin1_General_BIN2, '')
WHERE Html LIKE '%[0x2028]%'
Any ideas what I'm doing wrong? I can't use the character itself with N'LSEP', because it just appears as a newline in the script when I try to paste it!
Sample input, as requested:
</span><span>
</span><span>
Try this (it replaces the Unicode LSEP with the Unicode SPACE character). Note that LIKE '%[0x2028]%' is actually a character class matching any of the characters 0, x, 2, and 8, so the CHARINDEX predicate from your question is the reliable filter:
UPDATE Email
SET Html = REPLACE(Html, NCHAR(0x2028), NCHAR(0x0020))
WHERE CHARINDEX(NCHAR(0x2028) COLLATE Latin1_General_BIN2, Html) > 0
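The effect of the replacement can be sketched outside SQL Server (Python; the sample HTML fragment is made up):

```python
html = "</span>\u2028<span>"      # U+2028 LINE SEPARATOR between the tags

# Swap the invisible LSEP for a plain ASCII space.
cleaned = html.replace("\u2028", "\u0020")

print("\u2028" in cleaned)        # False -- the control character is gone
print(cleaned)                    # </span> <span>
```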

What encoding setting do I need in MySql to support mathematical symbols?

I need to support the following symbols: π, ∑, ≥, ≠, ≤, ∞, α, Ω, ←, ◊ in a C# application with a mysql back end.
I have tried setting charset = utf8 (in both the database and the connection string) and collation = utf8_unicode_ci,
and I get "Incorrect string value" errors when trying to save.
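For what it's worth, all of the listed symbols sit in the Basic Multilingual Plane and need at most three bytes in UTF-8, so they fit MySQL's legacy utf8 (utf8mb3) charset; when they still fail, the connection charset is the usual suspect. A quick check (Python sketch):

```python
symbols = "π∑≥≠≤∞αΩ←◊"

# MySQL's legacy utf8 (utf8mb3) stores at most 3 bytes per character,
# so every symbol here is storable once the connection also uses utf8.
print(max(len(ch.encode("utf-8")) for ch in symbols))   # 3
print(all(ord(ch) <= 0xFFFF for ch in symbols))         # True -- all BMP
```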
UPDATE:
I've just installed MySQL Server and HeidiSQL client to my PC, selecting UTF-8 as default server charset.
Also I created a test database and a table as follows:
The database:
CREATE DATABASE `test` /*!40100 CHARACTER SET utf8 COLLATE utf8_general_ci */
The table:
CREATE TABLE `math` (
`id` INT(10) NOT NULL,
`symbol` CHAR(1) NULL DEFAULT NULL,
PRIMARY KEY (`id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
ROW_FORMAT=DEFAULT
Then I inserted some symbols one by one, copying and pasting them from your post and from this other page. This is the table after the inserts:
The following is my server configuration:
I hope this information is useful for you.
Also, check these links:
Unicode:
Free On-line Unicode Character Map — lets you see which characters are supported (or NOT!) in your browser and which code is used, if you need that. A nice feature is that you can easily enlarge the text in your browser to see the characters better ([Ctrl]+[+] in Mozilla). If you are interested in math symbols, check list "22 Mathematical Operators". For chemists looking for arrows, list "21" might be interesting.
Mathematical UTF-8 Special Characters
Unicode 6.0 Character Code Charts
Mathematical Symbols in Unicode
Collation chart for utf8_general_ci, European alphabets (MySQL 6.0.4): Blocks: Basic Latin, Latin1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional, Latin ligatures, Greek, Greek Extended, Cyrillic, Cyrillic Supplement, Armenian, Georgian
Other MySQL collation charts

How do I replace literal \xNN with their characters in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is, when I open the text file for viewing I see a lot of hex escapes like \x92 and \x93, which stand for single and double quotes, I guess.
I am using the DBI->quote function to escape the special characters before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both tables is latin1.
How do I get rid of those hex escapes and get the characters to show in the text file?
ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
\x92 and \x93 are unassigned positions (in the C1 control range) in the latin1 character set. If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)
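The Windows-1252 explanation above can be checked directly (a Python sketch; the sample bytes are made up):

```python
data = b"It\x92s a \x93test\x94"    # bytes 0x92/0x93/0x94 from the table

# Decoding as Latin-1 keeps them as invisible C1 controls...
latin1 = data.decode("latin-1")
print(hex(ord(latin1[2])))          # 0x92 -- a control character, not a quote

# ...while decoding as Windows-1252 reveals the intended punctuation.
cp1252 = data.decode("cp1252")
print(cp1252)                       # It's a "test" (with curly quotes)
```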