Sphinx search with CJK (Japanese) - sphinx

I'm trying to index some Japanese content in my project with Sphinx, but it returns:
ERROR: index 'idx_jp_main': sql_fetch_row: Incorrect string value: '\xE3\x83\x94\xE3\x83\xBC...' for column 'title' at row 1.
Can somebody help me? I've tried utf8, utf8mb4, cp932, sjis and ujis, but nothing works. I have also added the CJK language ranges to charset_table and ngram_chars, with the same result.
Thanks!
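For reference, this is roughly the shape of the sphinx.conf pieces under discussion; the source/index names, SQL query and Unicode range are illustrative assumptions, not the asker's actual configuration:

source src_jp_main
{
    type            = mysql
    # host/user/password/db omitted
    # ask MySQL to return rows as UTF-8 before indexing
    sql_query_pre   = SET NAMES utf8
    sql_query       = SELECT id, title, body FROM articles
}

index idx_jp_main
{
    source          = src_jp_main
    path            = /var/lib/sphinx/idx_jp_main
    # index CJK text as unigrams
    ngram_len       = 1
    ngram_chars     = U+3000..U+2FA1F
}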

Related

PDFBox generates garbled Japanese text if it contains Kangxi radicals

We use PDFBox 2.0.20 and try to generate a PDF file that contains the following text with NotoSansJP (https://fonts.google.com/noto/specimen/Noto+Sans+JP). Note that the ⽤ in the text is not the valid kanji (0xe794a8); it is the Kangxi radical "use" (0xe2bda4).
注文書兼注文請書Webページのボタン⽤GIF画像編集
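For reference, a minimal sketch of how such a PDF could be produced with PDFBox 2.0 and an embedded NotoSansJP font; the font file name, coordinates and output path are assumptions, not our actual code:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class JpPdfSample {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            // assumed font file name; NotoSansJP must contain a glyph for the Kangxi radical ⽤ (U+2F64)
            PDType0Font font = PDType0Font.load(doc, new File("NotoSansJP-Regular.otf"));
            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(font, 12);
                cs.newLineAtOffset(50, 700);
                cs.showText("注文書兼注文請書Webページのボタン⽤GIF画像編集");
                cs.endText();
            }
            doc.save("sample.pdf");
        }
    }
}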
The result becomes the garbled text below.
The strange thing is that if I copy and paste the garbled text from the PDF here, the result seems correct:
注⽂書兼注⽂請書Webページのボタン用GIF画像編集
except that the 用 in the text has become the valid kanji (0xe794a8).
So to me it seems that when the text contains an invalid kanji such as a Kangxi radical, a different code page is used.
But the fact that the Kangxi radical character seems to be changed into the valid kanji may be related to Unicode normalization.
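Normalization would in fact explain exactly this substitution: NFKC maps the Kangxi radical U+2F64 to the unified ideograph U+7528. A quick illustrative check with the JDK:

import java.text.Normalizer;

public class NfkcCheck {
    public static void main(String[] args) {
        String radical = "\u2F64";  // ⽤ KANGXI RADICAL USE (UTF-8 0xe2bda4)
        String nfkc = Normalizer.normalize(radical, Normalizer.Form.NFKC);
        // prints "7528", the regular kanji 用 (UTF-8 0xe794a8)
        System.out.println(Integer.toHexString(nfkc.codePointAt(0)));
    }
}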
Does anyone experience the same situation? Any thoughts about the reason?
regards,
EDIT: our nervous customer complained that the problematic text contains sensitive data, so I have changed the text a bit.

Escape Cyrillic, Chinese, Arabic, Hebrew characters in postgresql

I'm trying to load records from a flat file into a Postgres table. I'm doing it with the COPY command, which has worked well so far.
But now I am receiving fields with words in Chinese, Japanese, Cyrillic and other languages, and when I try to load them I get an error.
How could I escape those characters in Postgres? I have searched, but I have not found any reference to this kind of topic.
You should not escape the characters, you should load them as they are.
Your database encoding is UTF8, so that's no problem. If your database encoding is not UTF8, change that.
For each file, figure out what its encoding is and use the ENCODING option of COPY or the environment variable PGCLIENTENCODING so that PostgreSQL knows which encoding the file is in.
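For example, assuming one incoming file is in Windows-1251 and another is already UTF-8 (table name, paths and encodings here are placeholders):

COPY my_table FROM '/path/to/data_cp1251.csv' WITH (FORMAT csv, ENCODING 'WIN1251');
COPY my_table FROM '/path/to/data_utf8.csv' WITH (FORMAT csv, ENCODING 'UTF8');
-- or set the client encoding for the whole session instead:
-- SET client_encoding = 'WIN1251';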

Turkish characters aren't possible in the TYPO3 RTE with a UTF-8 installation?

I want to insert a Turkish name into my TYPO3 rich text editor (RTE, sysExt. rtehtmlarea), for example: "Özoğuz". The special letter "ğ" is my problem here: I only see a question mark after saving my text content element (see pictures).
My charset is UTF-8 (setup.ts) and the database is also utf-8
config.metaCharset = utf-8
I also tried to insert the HTML entity for "ğ" instead of the character itself in code view (<>), but I got an error, see the second picture.
Maybe the Turkish language needs ISO 8859-9 (Latin-5)?
How can I allow Turkish on my German TYPO3 website?
Backend:
Frontend:
UTF-8 handles Turkish chars correctly. The DB error on save and the later lack of the Turkish special chars definitely indicate that your DB, or at least the table or column, doesn't use UTF-8. Note that for TYPO3 6.0+ you are also required to create the UTF-8 tables yourself, and $TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;' will be ignored (read the notice).
Make sure that your MySQL server is configured to work with UTF-8 by default, and also convert any wrongly encoded tables/fields to use it.
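A sketch of that check and conversion (the database and table names are examples; back up first, and make sure the stored bytes really are UTF-8 before converting):

-- inspect the current charset/collation
SHOW TABLE STATUS FROM typo3_db LIKE 'tt_content';
SHOW FULL COLUMNS FROM tt_content;

-- convert a wrongly created table and its text columns to UTF-8
ALTER TABLE tt_content CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;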

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312), which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file, which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt), but I'm not sure whether it's accurate.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as their Unicode equivalents: http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a traditional/simplified character pair looks like this:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
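A rough Ruby sketch of pulling the simplified side out of that file (the file path is an assumption, and characters that are simultaneously traditional and simplified are not filtered out here):

# every code point that carries a kTraditionalVariant entry is a simplified form
simplified = []
File.foreach("Unihan_Variants.txt") do |line|
  next if line.start_with?("#")
  codepoint, field, = line.split("\t")
  next unless field == "kTraditionalVariant"
  simplified << Integer(codepoint.sub("U+", ""), 16).chr(Encoding::UTF_8)
end
puts simplified.first(10).join(" ")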
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
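A Ruby sketch of that extraction (the dictionary file name is an assumption; each CC-CEDICT line has the form "traditional simplified [pinyin] /gloss/"):

require "set"

simplified_chars = Set.new
File.foreach("cedict_ts.u8") do |line|
  next if line.start_with?("#")                  # skip the header comments
  _traditional, simplified, = line.split(" ", 3) # second column is the simplified form
  next unless simplified
  simplified.each_char { |ch| simplified_chars << ch }
end
puts simplified_chars.size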
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
> string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
> string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it hasn't come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regexes: one to check that it is Chinese, and one to check that it is not simplified Chinese.
According to Wikipedia, whether a character renders as simplified Chinese, traditional Chinese, kanji, or another form is in many cases left up to the font. So while you could have a selection of simplified Chinese code points, this list would not be at all complete, since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.

Converting accented characters in PostgreSQL?

Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively.
The closest thing I could find is the translate function, given the example in the comments section found here.
Some commonly used accented characters can be searched using the following function:
translate(search_terms,
'\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215\303\216\303\217\303\221\303\222\303\223\303\224\303\225\303\226\303\230\303\231\303\232\303\233\303\234\303\235\303\237\303\240\303\241\303\242\303\243\303\244\303\245\303\246\303\247\303\250\303\251\303\252\303\253\303\254\303\255\303\256\303\257\303\261\303\262\303\263\303\264\303\265\303\266\303\270\303\271\303\272\303\273\303\274\303\275\303\277','AAAAAAACEEEEIIIINOOOOOOUUUUYSaaaaaaaceeeeiiiinoooooouuuuyy')
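To make the effect of that call easier to see, here is the same idea with literal characters and an illustrative input (the values are made up):

SELECT translate('Ångström, Øst', 'ÅØåøö', 'AOaoo');
-- => 'Angstrom, Ost'

-- if the contrib modules are installed, the unaccent extension does the same job:
-- CREATE EXTENSION unaccent;
-- SELECT unaccent('Ångström, Øst');   -- => 'Angstrom, Ost'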
Are you doing this just for indexing/sorting? If so, you could use this postgresql extension, which provides proper Unicode collation. The same group has a postgresql extension for doing normalization.