Replace characters with multi-character strings - PostgreSQL

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written as ae instead of ä, and so on, so I can't simply translate one character into another.
Is there a more elegant way to do that? At the moment it looks like this (not complete yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
On my way trying different commands I got another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
I tried it with U&'\0220', but it didn't replace anything. Only by using ü (for the lowercase ü) was it replaced correctly. It must have something to do with Unicode, but how do I solve this issue?
Kind regards from Germany. :)

Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
And read these related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using PuTTY, for instance, one has to make sure the terminal agrees with the rest: Change Settings... Window -> Translation -> Remote character set = UTF-8.
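If you are connecting from Python with psycopg2 (an assumption here, though the follow-up below suggests it), you can check and pin the client encoding from that side as well; a minimal sketch:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection parameters
print(conn.encoding)                  # what psycopg2 will use as client_encoding
conn.set_client_encoding('UTF8')      # pin it explicitly for this connection

cur = conn.cursor()
cur.execute("SHOW client_encoding;")  # what the server reports for this session
print(cur.fetchone()[0])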
As for your first question, you already have the best solution: a handful of umlauts are best replaced with a chain of nested replace() calls.
As you seem to know already, plain one-to-one character replacements are more efficient with a single translate() call.
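If the replacement ends up being done client-side in Python instead (as in the follow-up below), the same chained-replace idea carries over directly; note that Python's str.translate, unlike SQL translate(), can map a single character to a multi-character string, so a single pass also works. A rough sketch with an illustrative mapping:

# Single characters mapped to multi-character replacements
REPLACEMENTS = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
                'Ä': 'AE', 'Ö': 'OE', 'Ü': 'UE'}

def fold_umlauts(text):
    # Chained replace, the direct equivalent of the nested REPLACE() calls in SQL
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text

# One-pass alternative: str.maketrans accepts multi-character replacement strings
TABLE = str.maketrans(REPLACEMENTS)

print(fold_umlauts('Müller, Straße'))     # Mueller, Strasse
print('Müller, Straße'.translate(TABLE))  # same result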
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string

Among other reasons, I decided to write the replacement in Python. As Erwin wrote above, there seems to be no better solution than combining replace() calls.
It turned out to be pretty simple; no explicit encoding handling was even needed. My "final" solution now looks like this:
# Characters to fold and their target strings
ger_UE = "Ü"
ger_AE = "Ä"
ger_OE = "Ö"
ger_SS = "ß"
dk_AA = "Å"
dk_OE = "Ø"
dk_AE = "Æ"

cur.execute("""SELECT addr,
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(addr,
           '%s', 'UE'), '%s', 'OE'), '%s', 'AE'), '%s', 'SS'),
           '%s', 'AA'), '%s', 'OE'), '%s', 'AE')
FROM search
WHERE x = '1';""" % (ger_UE, ger_OE, ger_AE, ger_SS, dk_AA, dk_OE, dk_AE))
I am now curious to see how fast it is when it hits the large table. If anyone would like to add comments, they are very welcome.
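One note on the string interpolation: if cur comes from a driver such as psycopg2 (an assumption on my part), you can let the driver bind the values instead of splicing them into the SQL yourself, which keeps quoting and encoding out of your hands. A possible variant:

# Sketch assuming a psycopg2-style cursor; the driver binds the values,
# so no manual %-interpolation into the SQL string is needed.
replacements = [("Ü", "UE"), ("Ö", "OE"), ("Ä", "AE"), ("ß", "SS"),
                ("Å", "AA"), ("Ø", "OE"), ("Æ", "AE")]

expr = "addr"
params = []
for src, dst in replacements:
    expr = "REPLACE({}, %s, %s)".format(expr)  # wrap the previous expression
    params.extend([src, dst])

cur.execute("SELECT addr, {} FROM search WHERE x = %s;".format(expr), params + ["1"])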

Related

Erlang and binary with Cyrillic

I need to be able to use binaries with Cyrillic characters in them. I tried just writing <<"абвгд">> but I got a badarg error.
How can I work with Cyrillic (or unicode) strings in Erlang?
If you want to enter the above expression in the Erlang shell, please read the unicode module's user manual.
The functions unicode:characters_to_binary and unicode:characters_to_list are inverses of each other. The following is an example:
(emacs@yus-iMac.local)37> io:getopts().
[{expand_fun,#Fun<group.0.33302583>},
{echo,true},
{binary,false},
{encoding,unicode}]
(emacs@yus-iMac.local)40> A = unicode:characters_to_binary("上海").
<<228,184,138,230,181,183>>
(emacs@yus-iMac.local)41> unicode:characters_to_list(A).
[19978,28023]
(emacs@yus-iMac.local)45> io:format("~s~n",[ unicode:characters_to_list(A,utf8)]).
** exception error: bad argument
in function io:format/3
called as io:format(<0.30.0>,"~s~n",[[19978,28023]])
(emacs@yus-iMac.local)46> io:format("~ts~n",[ unicode:characters_to_list(A,utf8)]).
上海
ok
If you want to use unicode:characters_to_binary("上海"). directly in source code, it is a little more complex. Try it first to see the difference.
Historically the Erlang compiler interpreted source files as ISO-8859-1 encoded text (since OTP 17 the default source encoding is UTF-8), which limits you to Latin characters. Although you may be able to bang in some ISO characters that happen to have the same byte representation as what you want in Unicode, this is not a very good idea.
You want to make sure your editor reads and writes the encoding the compiler expects, and you want to avoid literals as much as possible: source these strings from files.

Sphinx dash in author names causing problems when searching

I've read all the posts about dashes and tried pretty much everything mentioned in them, yet cannot figure out a strange problem I'm having.
For example, I have an author name like this:
Arturo Pérez-Reverte
A search for 'pérez-reverte' will not turn up anything, nor will 'pérez\-reverte', so escaping the dash is not the issue.
But a search for 'spider-man' will return hits, proving that the dash seems to be working.
However, a search for 'perez reverte' also finds a hit because it searches each word separately and finds the 'reverte' in 'perez-reverte' (but doesn't seem to find the 'perez').
A search for either 'pérez' or 'perez' finds the same number of documents, suggesting that the accent is not an issue (I do have a charset_table which accounts for accented characters).
So I'm very confused as to what's happening here. If it isn't the accent and it isn't the dash, what could it be?
I don't have any ignore_chars set, I'm using UTF-8 and have a charset_table to treat accented characters as regular characters.
The only difference between these two terms is that one of them is a title (spider-man) and the other an author, but they are both part of the same Sphinx index declaration, so I don't see that as an issue in any way.
Any help would be greatly appreciated.
After much fighting with it, I found out that even though my database is all UTF-8 with the proper collation I needed to add this in sphinx.conf for everything to work properly:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER SET utf8
After doing that, and having the proper charset_table, everything seems to be working fine.
Hope this helps someone else.

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as their Unicode equivalents:
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a traditional/simplified pair of characters looks like this:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
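If you go this route, a rough Python sketch of extracting the simplified characters from a local copy of Unihan_Variants.txt might look like this (the file is tab-separated 'U+XXXX<TAB>field<TAB>value' lines; the path is a placeholder):

def simplified_from_unihan(path="Unihan_Variants.txt"):
    # Characters that carry a kTraditionalVariant entry are used as simplified forms
    simplified = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            codepoint, field, value = line.rstrip("\n").split("\t", 2)
            if field == "kTraditionalVariant":
                simplified.add(chr(int(codepoint[2:], 16)))  # "U+673A" -> 机
    return simplified

print("机" in simplified_from_unihan())   # expected: True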
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
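A rough Python sketch of that extraction, assuming a local copy of the CC-CEDICT data file (commonly distributed as cedict_ts.u8; each entry line starts with 'traditional simplified [pinyin] /definitions/'):

def simplified_from_cedict(path="cedict_ts.u8"):
    chars = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):       # skip the header comments
                continue
            parts = line.split(" ", 2)     # traditional, simplified, rest of the entry
            if len(parts) >= 2:
                chars.update(parts[1])     # collect every character in the simplified column
    return chars

print(len(simplified_from_cedict()))       # number of distinct simplified characters seen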
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
> string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
> string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, and also that these are UTF-8 characters, not escaped representations. It has served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it hasn't come up once in 9 years), iterate over all the chars in ['一-龥'] and try to build a new list. Or run two regexes: one to check that the text is Chinese, and a second to check that it is not simplified Chinese.
According to Wikipedia, the distinction between simplified Chinese, traditional Chinese, kanji, and other forms is in many cases left up to font rendering (Han unification). So while you could put together a selection of simplified Chinese code points, such a list would not be at all complete, since many characters are not distinct code points in the first place.
I don't believe that there's a table with only the simplified code points. I think they're all lumped together in the CJK Unified Ideographs range of 0x4E00 through 0x9FFF.
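For what it's worth, testing whether a character falls inside that unified block is a one-liner in Python; it just cannot tell simplified from traditional on its own:

def is_cjk_unified(ch):
    # CJK Unified Ideographs block only; the extension blocks (A, B, ...) live elsewhere
    return 0x4E00 <= ord(ch) <= 0x9FFF

print(is_cjk_unified("机"), is_cjk_unified("機"), is_cjk_unified("a"))   # True True False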

How do I recover a document that has been sent through the character encoding wringer?

Until recently, my blog used mismatched character encoding settings for PHP and MySQL. I have since fixed the underlying problem, but I still have a ton of text that is filled with garbage. For instance, ï has become Ã¯.
Is there software that can use pattern recognition and statistics to automatically discover broken text and fix it?
For example, it looks like U+00EF (UTF-8 0xC3 0xAF) has become U+00C3 U+00AF (UTF-8 0xC3 0x83 0xC2 0xAF). In other words, each UTF-8 byte was reinterpreted as a code point in its own right and then re-encoded as UTF-8. This pattern has happened to (seemingly random) non-ASCII characters across my site.
The example you cite looks like good old utf8-over-latin1. You might quickly try out a query like:
select convert(convert(the_problem_column using binary) using utf8)
to see if it irons out the problem.
An encoding conversion along those lines should work as long as all of your data went through the same sequence of encoding transformations, and as long as none of those transformations were lossy - you're just reversing the effect of some of those transformations.
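If you would rather repair individual strings outside of MySQL, the same reversal can be sketched in Python; this assumes exactly one utf8-read-as-latin1 round trip, as in the example above:

def unmangle(text):
    # Re-encode the mojibake as the Latin-1 bytes it came from, then decode those bytes as UTF-8
    return text.encode("latin-1").decode("utf-8")

print(unmangle("Ã¯"))   # prints: ï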
If you can't rely on the data having gone through the same set of encoding transformations, then it's a matter of scanning through the data for garbage characters and replacing them with the intended character, which is risky because it depends on somebody's definition of what was garbage and what was intended.
There is some discussion in this answer of how you might do that kind of repair using handmade scripts. I don't know of a tool that is aware of the full range of natural languages and encodings, takes a more advanced statistical approach to spotting possible problems, and recommends the exact transformation to fix them - something like that would be useful.
You probably want to look into regex, http://en.wikipedia.org/wiki/Regular_expression.
Using this you can then search out and replace the characters in question.
Here is the MySQL regex documentation, http://dev.mysql.com/doc/refman/5.1/en/regexp.html.

Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?
Thanks
If you are looking for a ready-made solution, you might want to try Enca.
However, if you only want to detect the presence of what could possibly be decoded as UTF-8 Cyrillic characters (without any complete UTF-8 validity check), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n consecutive UTF-8-encoded Russian Cyrillic characters). For an additional check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).
Both methods have their good and bad sides and may sometimes give wrong results.
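A small Python sketch of the grep approach, reading the file as bytes so that invalid UTF-8 elsewhere in the file does not get in the way (the byte ranges mirror the regexp above; the run length 3 stands in for n):

import re

# UTF-8 byte sequences for Russian Cyrillic: Ё, А-я, ё
CYRILLIC_RUN = re.compile(rb'(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){3,}')

def has_cyrillic(path):
    with open(path, "rb") as f:
        return bool(CYRILLIC_RUN.search(f.read()))

print(has_cyrillic("somefile.txt"))   # the path is a placeholder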
IIRC the ICU library has code that does character set detection, though it's basically a best-effort guess.
Edit: I did remember correctly; check out this paper / tutorial.