How to find UTF-8 codes via LIKE '%\xC2\xA0%'?

I have a column that contains NO-BREAK SPACE (\xC2\xA0) instead of SPACE, and I need to find those rows.
Copy-pasting works:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '% %'
but using the code points does not:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%\xC2\xA0%'
How can I find the rows where a column contains such characters via \x code points?

Try
WHERE REGEXP_LIKE(PRODUCT_NAME, UNISTR('\00A0'))
Depending on your database character set, you may try CHR(160) instead of UNISTR('\00A0').
\xC2\xA0 is not the code point; it is the byte representation of the Unicode character U+00A0 (NO-BREAK SPACE) encoded in UTF-8. The Unicode code point is U+00A0, or 160 in decimal.
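To see the distinction between the bytes and the code point, here is a quick sketch in Python (used purely as an illustration):

# The two bytes 0xC2 0xA0 are the UTF-8 encoding of one code point, U+00A0.
raw = b"\xC2\xA0"
char = raw.decode("utf-8")
print(hex(ord(char)))        # 0xa0 -> code point U+00A0
print(ord(char))             # 160  -> the same code point in decimal
print(char.encode("utf-8"))  # b'\xc2\xa0' -> back to the UTF-8 bytes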

Exasol does not support \x escape sequences. "\xC2\xA0" is not a code point, but the UTF-8 encoding of the code point U+00A0 (decimal 160). See https://www.compart.com/en/unicode/U+00A0 for more information.
However, you can use UNICODECHR(...):
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%'||unicodechr(160)||'%'
Note that you have to give it the decimal Unicode code point, not the UTF-8 encoding. You can use SELECT TO_NUMBER('00A0', 'XXXX'); to convert hex to decimal.
If you absolutely need to use \x escape sequences, you can create a UDF. See https://docs.exasol.com/7.1/database_concepts/udf_scripts.htm for more information.
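Outside the database, the hex-to-decimal conversion and the resulting character can be sanity-checked, for example in Python (illustration only):

# Convert the hex code point '00A0' to decimal and build the character.
code_point = int("00A0", 16)    # 160
nbsp = chr(code_point)          # '\xa0', NO-BREAK SPACE
print(code_point)               # 160
print("foo\xa0bar".find(nbsp))  # 3 -> the character is found in the string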

Related

Difference between Unicode 0001 and 2401?

I am trying to use the SOH character as a delimiter for a CSV file that my code generates. However, it looks like there are two Unicode characters for SOH?
https://www.compart.com/en/unicode/U+2401
https://www.compart.com/en/unicode/U+0001
I am not sure what the difference between the two is, or which one I should use.
U+0001 is the control character. U+2401 is a symbolic picture of the character.
Example: ␁ (it may not display in all browsers, but it is a single pictograph of SOH)
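As a small illustration (in Python, purely for demonstration): chr(1) is the actual SOH control character to use as a delimiter, while chr(0x2401) is only its printable picture:

# U+0001 is the real SOH control character; U+2401 is only a picture of it.
soh = "\u0001"     # use this as the field delimiter
symbol = "\u2401"  # this is the visible glyph "␁", not a control character
row = ["field1", "field2", "field3"]
line = soh.join(row)    # "field1\x01field2\x01field3"
print(line.split(soh))  # ['field1', 'field2', 'field3']
print(soh == symbol)    # False -> two distinct code points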

Replace non-UTF-8 characters in a string

I have a table that has strings with non-UTF-8 characters, like �. I need to change them so that they get back all their accents and other Latin characters, e.g. cap� should become capó. The field is a VARCHAR.
So far, I have tried:
SELECT "Column Name", regexp_replace("Column Name", '[^\w]+','') FROM table
and:
CONVERT("Column Name", 'UTF8', 'LATIN1')
but neither works.
For instance, the error I get is: "Regexp encountered an invalid UTF-8 character (...)"
I have seen other solutions, but I can't use them because I cannot change the table, as I am not an administrator.
Is there any way to achieve this?
If the database encoding is UTF8, then all your strings will contain only UTF8 characters. They just happen to be different characters than you want.
First, you have to find out what characters are in the strings. In the case you show, � is the Unicode code point FFFD (in hexadecimal).
So you could use the replace function in PostgreSQL to replace it with ó (Unicode code point F3) like this:
SELECT replace(mycol, E'\uFFFD', E'\u00f3') FROM mytab;
This uses the Unicode character literal syntax of PostgreSQL; don't forget to prefix all strings that contain escapes with E for extended string literal syntax.
Odds are that the character is not really �, because that is the "REPLACEMENT CHARACTER", often used to represent characters that are not representable.
In that case, use psql and run a query like this to display the hexadecimal UTF-8 contents of your fields:
SELECT mycol::bytea FROM mytab WHERE id = 12345;
From the UTF-8 encoding of the character you can deduce what character it really is and use that in your call to replace.
If you have several characters, you will need several calls to replace to translate them all.
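The same replace-and-inspect approach can be sketched in Python (illustration only; the string value is made up for the example):

# Replace the REPLACEMENT CHARACTER (U+FFFD) with ó (U+00F3),
# then inspect the UTF-8 bytes to deduce what a character really is.
s = "cap\ufffd"
fixed = s.replace("\ufffd", "\u00f3")
print(fixed)                  # capó
print(fixed.encode("utf-8"))  # b'cap\xc3\xb3' -> UTF-8 bytes C3 B3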

What's the ASCII character code for '—'?

I am working on decoding text. I am trying to find the character code for the — character, not to be confused with -, in ASCII. I have tried unsuccessfully. Does anybody know how to convert it?
Quotation from Wikipedia (Em dash):
When an actual em dash is unavailable—as in the ASCII character set—a double ("--") or triple hyphen-minus ("---") is used. In Unicode, the em dash is U+2014 (decimal 8212).
The em dash character is not part of the ASCII character set.
— is known as an em dash. Its character code is \u2014. It is not an ASCII character, so you cannot decode it with the ASCII character set; it is not in the ASCII character table. You would probably want to use UTF-8 instead.
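A short Python illustration (for demonstration only) of why ASCII cannot represent the character while UTF-8 can:

# The em dash is U+2014 (decimal 8212); it has no ASCII representation.
dash = "\u2014"
print(ord(dash))             # 8212
print(dash.encode("utf-8"))  # b'\xe2\x80\x94' -> three bytes, all >= 0x80
try:
    dash.encode("ascii")
except UnicodeEncodeError as e:
    print(e)                 # 'ascii' codec can't encode character '\u2014'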
Windows
For Windows, on a keyboard with a numeric keypad:
Use Alt+0150 (en dash), Alt+0151 (em dash), or Alt+8722 (minus sign).
This character does not exist in ASCII, but only in Unicode, usually encoded in UTF-8.
In UTF-8, non-ASCII characters are encoded as sequences of two or three bytes (occasionally longer), none of which is a valid ASCII code: all of them lie outside the ASCII range of 0 through 127.
One suspects that the foregoing only partly answers your question, but if so then this is probably because your question is, inadvertently, only partly asked. For further details, you can extend your question with more specifics.
The character — is not part of the ASCII set.
But if you are looking to convert it to some other format (like U+hex), you can use this online tool. Put your character into the first green box and click "Convert" (above the box); further below you'll find a number of different codes, including U+hex:
U+2014
Feel free to edit this answer if the link breaks or leave a comment so I can find a replacement.
Alt + 0151 seems to do the trick—perhaps it doesn't work on all keyboards.
alt-196 - while holding down the 'Alt' key, type 196 on the numeric keypad, then release the 'Alt' key

Print Unicode Characters in 8086

As you know, the print function on the 8086 puts a character in 8 bits (db) and shows it on the screen. Now, I want to print Unicode characters in the 8086emu environment, not ASCII. So my challenge is: how do I use Unicode characters in my program? Does the 8086 support Unicode characters?
Thanks in advance :)
If you mean printing in text mode via interrupt 10h: you can't, as you only have a character map with just 256 characters available. You can redefine what these characters look like (load your custom font), but that still gives you only 256 characters. So you would need to identify the characters you need, somehow "render" them into the character table, and then, for printing, map each Unicode glyph to your character table indexes.
See also my answer to a similar question for more details.
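The mapping idea can be sketched like this in Python (purely illustrative; the code points and table slots below are invented for the example): pick the Unicode characters you need, assign each a slot in the 256-entry character table, and translate strings before printing.

# Illustrative sketch: map needed Unicode code points to free slots
# in the 256-entry text-mode character table (slots chosen arbitrarily here).
glyph_slot = {
    0x00E9: 0x80,  # é -> custom glyph loaded into table index 0x80
    0x20AC: 0x81,  # € -> custom glyph loaded into table index 0x81
}

def to_text_mode_bytes(text: str) -> bytes:
    """Translate a Unicode string into the custom 8-bit character set."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:                   # plain ASCII passes through
            out.append(cp)
        else:
            out.append(glyph_slot[cp])  # non-ASCII must have a custom glyph
    return bytes(out)

print(to_text_mode_bytes("caf\u00e9"))  # b'caf\x80'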

How do I replace literal \xNN with their characters in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is that when I open the text file for viewing, I am getting a lot of hex characters like \x92 and \x93, which stand for single and double quotes, I guess.
I am using the DBI->quote function to escape the special characters before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?
ISO Latin-1 does not define printable characters in the range 0x80 to 0x9F, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin-1 except that it defines additional characters (including left/right quotes) in this range.
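You can verify this by decoding the bytes as Windows-1252, for example in Python (illustration only):

# In Windows-1252, 0x92 and 0x93 are curly quotes; Latin-1 leaves
# that range to (invisible) C1 control characters.
print(b"\x92".decode("cp1252"))   # ’ (U+2019, right single quotation mark)
print(b"\x93".decode("cp1252"))   # “ (U+201C, left double quotation mark)
print(b"\x92".decode("latin-1"))  # an invisible C1 control character (U+0092)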
\x92 and \x93 are not printable characters in the Latin-1 character set. If you are certain that you are indeed dealing with Latin-1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-Latin-1 characters into Latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)