Replace non UTF-8 from String - postgresql

I have a table that has strings with non UTF-8 characters, like �. I need to replace them so that the strings get back their accents and other Latin characters, for example turning cap� into capó. The field is a VARCHAR.
So far, I have tried: SELECT "Column Name", regexp_replace("Column Name", '[^\w]+','') FROM table
And:
CONVERT("Column Name", 'UTF8', 'LATIN1') but neither works at all.
For instance, the error I get is: "Regexp encountered an invalid UTF-8 character (...)"
I have seen other solutions, but I can't use them because they require changing the table and I am not an administrator.
Is there any way to achieve this?

If the database encoding is UTF8, then all your strings will contain only UTF8 characters. They just happen to be different characters than you want.
First, you have to find out what characters are in the strings. In the case you show, � is Unicode codepoint FFFD (in hexadecimal).
So you could use the replace function in PostgreSQL to replace it with ó (Unicode code point F3) like this:
SELECT replace(mycol, E'\uFFFD', E'\u00f3') FROM mytab;
This uses the Unicode character literal syntax of PostgreSQL; don't forget to prefix all strings with escapes in them with E for extended string literal syntax.
Chances are that the character is not really �, because that is the “REPLACEMENT CHARACTER”, which is often displayed in place of characters that cannot be represented.
In that case, use psql and run a query like this to display the hexadecimal UTF-8 contents of your fields:
SELECT mycol::bytea FROM mytab WHERE id = 12345;
From the UTF-8 encoding of the character you can deduce what character it really is and use that in your call to replace.
If you have several characters, you will need several calls to replace to translate them all.
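A minimal Python sketch of how such a string can arise and what the replace call does, assuming the original data was Latin-1 that got decoded as UTF-8 somewhere along the way (that cause is an assumption, not something the question confirms). The variable names are illustrative only.

```python
# Assumption: the text was originally Latin-1 and was later read as UTF-8.
original = "cap\u00f3"                      # "capó"
latin1_bytes = original.encode("latin-1")   # b'cap\xf3'

# A lone 0xF3 byte is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
damaged = latin1_bytes.decode("utf-8", errors="replace")

# Hexadecimal UTF-8 bytes of the damaged value, comparable to the ::bytea output.
damaged_hex = damaged.encode("utf-8").hex()

# Same idea as replace(mycol, E'\uFFFD', E'\u00f3') in the query above.
repaired = damaged.replace("\ufffd", "\u00f3")

print(repr(damaged), damaged_hex, repaired)
```

The replacement only restores the right text if every � in the column stood for the same original character; U+FFFD itself carries no information about what was lost.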

Related

How to find UTF-8 codes via LIKE '%\xC2\xA0%'?

I have a column that contains NO-BREAK SPACE (\xC2\xA0) instead of SPACE and I need to find those rows.
Copy-pasting works:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '% %'
but using the code points does not:
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%\xC2\xA0%'
How can I find the rows where a column contains such symbols via \x code points?
Try
WHERE regexp_Like(PRODUCT_NAME, UNISTR('\00A0'))
Depending on your database character set you may try CHR(160) instead of UNISTR('\00A0')
xC2 xA0 is not the code point, it is the binary representation of Unicode character U+00A0 (No-Break Space) encoded in UTF-8. The Unicode code point is U+00A0 or 160 (decimal).
Exasol does not support \x escape sequences. "\xC2\xA0" is not a code point, but the UTF-8 encoding of the code point U+00A0 (decimal 160). See https://www.compart.com/en/unicode/U+00A0 for more information.
However, you can use UNICODECHR(...):
SELECT PRODUCT_NAME
FROM TABLE t
WHERE PRODUCT_NAME LIKE '%'||unicodechr(160)||'%'
Note that you have to give it the decimal unicode code point. Not the UTF-8 encoding. You can use select to_number('00A0', 'XXXX'); to convert hex to decimal.
If you absolutely need to use \x escape sequences, you can create a UDF. See https://docs.exasol.com/7.1/database_concepts/udf_scripts.htm for more information.
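The difference between the code point and its UTF-8 encoding can be checked quickly in Python. This is only an illustrative sketch outside the database; product_names is hypothetical sample data standing in for the PRODUCT_NAME column.

```python
# Hex to decimal, like to_number('00A0', 'XXXX') in the answer above.
NBSP_CODE_POINT = int("00A0", 16)

# The character itself (code point U+00A0) and its UTF-8 byte sequence.
nbsp = chr(NBSP_CODE_POINT)
nbsp_utf8 = nbsp.encode("utf-8")

# Hypothetical sample values; the second one contains a no-break space.
product_names = ["plain name", "name\u00a0with nbsp"]

# Equivalent in spirit to LIKE '%' || unicodechr(160) || '%'.
matches = [name for name in product_names if nbsp in name]

print(NBSP_CODE_POINT, nbsp_utf8, matches)
```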

Allowed characters in CSS 'content' property?

I've read that we must use Unicode values inside the content CSS property i.e. \ followed by the special character's hexadecimal number.
But what characters, other than alphanumerics, are actually allowed to be placed as is in the value of content property? (Google has no clue, hence the question.)
The rules for “escaping” characters are in the CSS 2.1 specification, clause 4.1.3 Characters and case. The special rules for quoted strings, as in content property value, are in clause 4.3.7 Strings. Within a quoted string, any character may appear as such, except for the character used to quote the string (" or '), a newline character, or a backslash character \.
The information that you must use \ escapes is thus wrong. You may use them, and may even need to use them if the character encoding of the document containing the style sheet does not let you enter all characters directly. But if the encoding is UTF-8, and is properly declared, then you can write content: '☺ Я Ω ⁴ ®'.
As far as I know, you can insert any Unicode character. (Here's a useful list of Unicode characters and their codes.)
To utilize these codes, you must escape them, like so:
U+27BA Becomes \27BA
Or, alternatively, I think you may just be able to escape the character itself:
content: '\➺';
Source: http://mathiasbynens.be/notes/css-escapes
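As a rough sketch of the escaping rule described above, the following Python helpers turn a character into its CSS hexadecimal escape and build a quoted string that is safe even when the style sheet's encoding is limited to ASCII. The function names are made up for this example, and the trailing space after each hex escape is there so that a following hex digit is not read as part of the escape.

```python
def css_escape_char(ch: str) -> str:
    """Return the CSS hex escape for one character, e.g. U+27BA -> \\27BA."""
    return "\\{:X}".format(ord(ch))


def css_string(text: str) -> str:
    """Build a single-quoted CSS string using only ASCII characters."""
    parts = []
    for ch in text:
        if ch in ('"', "'", "\\"):
            # Quotes and backslashes are escaped by prefixing a backslash.
            parts.append("\\" + ch)
        elif ch == "\n" or ord(ch) > 0x7F:
            # Newlines and non-ASCII characters become hex escapes,
            # terminated by a space.
            parts.append(css_escape_char(ch) + " ")
        else:
            parts.append(ch)
    return "'" + "".join(parts) + "'"


print(css_escape_char("\u27ba"))
print("content: " + css_string("a\u27bab") + ";")
```

With a properly declared UTF-8 style sheet none of this is needed for characters such as ➺; only the quote character, newlines, and backslashes still have to be escaped.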

Is Encoding the same as Escaping?

I am interested in the theory: is encoding the same as escaping? According to Wikipedia,
"an escape character is a character which invokes an alternative interpretation on subsequent characters in a character sequence."
My current thought is that they are different. Escaping is when you place an escape character in front of one or more metacharacters to mark them as behaving differently than they normally would.
Encoding, on the other hand, is all about transforming data into another form, and upon wanting to read the original content it is decoded back to its original form.
Escaping is a subset of encoding: You only encode certain characters by prefixing a special character instead of transferring (typically all or many) characters to another representation.
Escaping examples:
In an SQL statement: ... WHERE name='O\' Reilly'
In the shell: ls Thirty\ Seconds\ *
Many programming languages: "\"Test\" string" (or """Test""")
Encoding examples:
Replacing < with &lt; when outputting user input in HTML
The character encoding, like UTF-8
Using sequences that do not include the desired character, like \u0061 for a
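The examples in the two lists above can be tried out in Python. This is just a sketch using the standard library; the variable names are illustrative.

```python
import html

# Escaping: a backslash changes how the following quote is interpreted.
escaped_literal = "\"Test\" string"

# Replacing < with &lt; when outputting text in HTML.
html_text = html.escape("<")

# Character encoding: text is transformed into bytes and back again.
utf8_bytes = "\u00e9".encode("utf-8")      # the letter é as UTF-8 bytes
round_trip = utf8_bytes.decode("utf-8")

# A sequence that does not include the desired character: \u0061 for a.
same_letter = "\u0061"

print(escaped_literal, html_text, utf8_bytes, round_trip, same_letter)
```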
They're different, and I think you're getting the distinction correctly.
Encoding is when you transform a logical representation of a text ("logical string", e.g. Unicode) into a well-defined sequence of binary digits ("physical string", e.g. ASCII, UTF-8, UTF-16). Escaping uses a special character (typically the backslash: '\') which initiates a different interpretation of the character(s) following it; escaping is necessary when you need to encode a larger number of symbols with a smaller number of distinct (and finite) bit sequences.
They are indeed different.
You pretty much got it right.

to extract characters of a particular language

How can I extract only the characters of a particular language from a file that contains those characters mixed with alphanumeric characters and the English alphabet?
This depends on a few factors:
Is the string encoded with UTF-8?
Do you want all non-English characters, including things like symbols and punctuation marks, or only non-symbol characters from written languages?
Do you want to capture characters that are non-English or non-Latin? What I mean is, would you want characters like é and ç or would you only want characters outside of the Romance and Germanic alphabets?
and finally,
What programming language are you wanting to do this in?
Assuming that you are using UTF-8, you don't want basic punctuation but are okay with other symbols, and that you don't want any standard Latin characters but would be okay with accented characters and the like, you could use a regular expression function in whatever language you are using that searches for all non-ASCII characters. This would eliminate most of what you are probably trying to weed out.
In php it would be:
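$string2 = preg_replace('/[^\x00-\x7F]+/', '', $string1);
Note that this deletes the non-ASCII characters and keeps everything in the ASCII range, including line endings. If you instead want to keep only the non-ASCII characters, drop the ^ from the character class; that variant would also remove line endings, which you may or may not want.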
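For comparison, here is a small Python sketch of the two character classes, showing what each one keeps. The sample string is made up for the example.

```python
import re

# Made-up sample: ASCII letters, accented letters, and one line ending.
sample = "caf\u00e9\n\u00f1and\u00fa"

# Remove everything outside the ASCII range; line endings (0x0A) are ASCII
# and therefore stay.
ascii_only = re.sub(r"[^\x00-\x7F]+", "", sample)

# Remove everything inside the ASCII range; this keeps only the non-ASCII
# characters and drops the line ending as well.
non_ascii_only = re.sub(r"[\x00-\x7F]+", "", sample)

print(repr(ascii_only), repr(non_ascii_only))
```

Which of the two is wanted depends on the answers to the questions above; neither pattern knows anything about a specific language, only about the ASCII boundary.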

How do I replace literal \xNN with their character in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is, when I open the text file for viewing I am getting a lot of hex characters like \x92 and \x93, which stand for single and double quotes, I guess.
I am using the DBI->quote function to escape the special chars before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both the tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?
ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
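The Windows-1252 interpretation can be illustrated with a short sketch. It is written in Python rather than Perl purely for illustration, it assumes the file contains literal backslash-x sequences as described in the question, and replace_hex_escapes is a hypothetical helper name.

```python
import re


def replace_hex_escapes(text, encoding="cp1252"):
    """Replace literal \\xNN sequences with the character that byte
    represents in the given single-byte encoding (Windows-1252 by default)."""
    def decode_match(match):
        byte = bytes([int(match.group(1), 16)])
        # Bytes that the encoding leaves undefined become U+FFFD.
        return byte.decode(encoding, errors="replace")

    return re.sub(r"\\x([0-9A-Fa-f]{2})", decode_match, text)


# Made-up sample line containing literal \x93, \x94 and \x92 sequences.
line = r"\x93quoted\x94 and it\x92s"
fixed = replace_hex_escapes(line)
print(fixed)
```

If the data really is Windows-1252, decoding it with that encoding when reading from the database would avoid the hex sequences in the first place.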
\x92 and \x93 are empty characters in the latin1 character set (see here or here). If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)