How to fulltext index both Chinese and English characters together using the ngram parser in MySQL 5.7?

I have a table named 'comp' with a column 'compName'. The compName values contain characters from different languages. I am using MySQL 5.7 with the ngram parser. Searching for a Chinese word works fine, but searching for an English word gives bad results. Looking at the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE table, I found that English words are also tokenized character by character: 'abc', for example, is tokenized as 'ab', 'bc'. But as I understand it, English should be tokenized on spaces, right? So how do I resolve this kind of case when using the ngram parser in MySQL 5.7?
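For reference, here is a minimal sketch of the setup the question describes. Only the table name 'comp' and the column 'compName' come from the question; the index definition, character set, and sample queries are assumptions based on the documented MySQL 5.7 ngram syntax, not something confirmed by the original poster.

-- Assumed schema; only 'comp' and 'compName' are taken from the question.
CREATE TABLE comp (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    compName VARCHAR(200) NOT NULL,
    FULLTEXT INDEX ft_compname (compName) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;

-- A Chinese search term ('公司' is just an example) matches as expected.
SELECT * FROM comp WHERE MATCH(compName) AGAINST('公司' IN NATURAL LANGUAGE MODE);

-- An English term is also split into n-grams ('abc' becomes 'ab', 'bc'),
-- which is exactly what INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE shows.
SELECT * FROM comp WHERE MATCH(compName) AGAINST('abc' IN NATURAL LANGUAGE MODE);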

Related

Is it possible to combine multiple text search configurations in FTS on PostgreSQL?

I tried to combine multiple text search configurations for full text search on PostgreSQL.
I tried:
CREATE TEXT SEARCH CONFIGURATION test (
    COPY = english, french
);
But this didn't work:
text search configuration parameter "french" not recognized
I have a column with a mix of English and French words, and I want to combine multiple text search configurations to query it.
Example:
to_tsvector('test', words) @@ to_tsquery('test', 'activité')
to_tsvector('test', words) @@ to_tsquery('test', 'mystery')
How can I mix different text search configurations so that I get results whether I look for a French or an English word?
The French text search configuration uses French stemming (the french_stem dictionary), while for English english_stem is used.
How do you want to stem for both? You could create a text search configuration that applies both stemmers, but I suspect the result would not be convincing. The same goes for stop words.
If you know which language you are searching for, you can explicitly specify the text search configuration in the query.
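To make that concrete, here is a hedged sketch of both options in SQL; the table name 'docs' is made up. Note that PostgreSQL consults the dictionaries of a mapping in order, and a Snowball stemmer accepts every token, so listing french_stem before english_stem effectively means only french_stem is applied, which is one reason the combined configuration may not be convincing.

-- Option 1: one configuration that lists both stemmers.
CREATE TEXT SEARCH CONFIGURATION test (COPY = english);
ALTER TEXT SEARCH CONFIGURATION test
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH french_stem, english_stem;

-- Option 2: choose the configuration explicitly per query.
SELECT * FROM docs WHERE to_tsvector('french', words) @@ to_tsquery('french', 'activité');
SELECT * FROM docs WHERE to_tsvector('english', words) @@ to_tsquery('english', 'mystery');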

Foreign language words in a text

I have a French text with some words in English, and I want to find those words and highlight all of them at once. Is there any program that can help me do that? Is it possible to do this with any other foreign language?
I'm using Microsoft Word.
Word can do this IF the English words are formatted with the English language (and the rest with the French language). In that case, the advanced options of Word's Find functionality can search for the language formatting instead of the text.

Weird characters in a Microsoft Word document won't export/can't be searched

I have a document which has been sloppily authored. It's a dictionary that contains Cyrillic characters. Most of the dictionary is manageable, but I'm stuck with one thing I need help with. Words have accented letters in them, and they're mostly formatted properly as a letter with a Unicode combining accent (thus forming a single letter). However, there are some very peculiar letters that look similar, for example, to: a;´ (where "a" is any arbitrary Cyrillic letter). You'd expect á in its place. It wouldn't be a problem per se if only this thing could be exported to, say, HTML and manipulated in a text editor. The problem is that Word treats this "thing" as a single character/entity and:
when exporting, it is COMPLETELY omitted
when copied, it can only be pasted into Notepad (which translates it into three separate characters); when pasted into WordPad, it just won't appear at all
when a search is run in Word, it won't find the letter, neither the actual character nor the exactly copied/pasted combination
the letter disappears when the document is opened in any other software, such as LibreOffice
At this point I'm trying to:
understand what this combination is exactly
run a search/replace operation to find and weed out all of those errors
Here's a sample Word file.
Here's a screenshot of the word/letter in question, which when typed correctly should appear as "скре́пка".
The 'character' appears to be a Word field of type 'eq' (equation); this becomes visible when field codes are toggled on.
If it is a large document, you could try to create a VBA routine that removes the fields and replaces them with the corresponding characters.
Assuming that @Anonimista's analysis is correct, as I think it is, you could fix the file by running some search and replace operations in Word, replacing e.g. ^19eq \o(е;´)^21 with е́ (the latter being the Cyrillic letter е followed by the combining acute accent U+0301). This is tedious because you would need to do it for each vowel separately (and for uppercase vowels too). I cannot find a way to use wildcards in this context; the codes ^19 and ^21 for the start and end of a field work only when wildcards are not enabled.

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312) which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file, which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt), but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as the Unicode equivalents: http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a traditional/simplified pair of characters looks like this:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
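If you would rather keep the lookup in a database, here is a hypothetical sketch of loading the relevant lines of Unihan_Variants.txt (a tab-separated file: code point, field name, value, with '#' comment lines) into a table and querying it; the table and column names are invented for illustration.

-- Invented table; one row per (code point, field) line of Unihan_Variants.txt.
CREATE TABLE unihan_variants (
    codepoint text,   -- e.g. 'U+673A'
    field     text,   -- e.g. 'kTraditionalVariant' or 'kSimplifiedVariant'
    variant   text    -- e.g. 'U+6A5F'
);

-- After loading the file (skipping the '#' comment lines):
-- characters that have a traditional variant, i.e. that act as simplified forms.
SELECT codepoint FROM unihan_variants WHERE field = 'kTraditionalVariant';

-- The simplified form of U+6A5F (機):
SELECT variant FROM unihan_variants WHERE codepoint = 'U+6A5F' AND field = 'kSimplifiedVariant';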
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
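As a hedged SQL sketch of that idea (assuming the dictionary has already been loaded into an invented 'cedict' table with traditional, simplified, pinyin and definition columns):

-- PostgreSQL: every distinct character that appears in the simplified column.
-- string_to_array with a NULL delimiter splits a string into single characters.
SELECT DISTINCT ch
FROM cedict, unnest(string_to_array(cedict.simplified, NULL)) AS ch
WHERE ch ~ '[一-龥]';   -- keep CJK ideographs, drop digits, dots, etc.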
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
> string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
> string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters that I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, and also that these are UTF-8 characters, not escaped representations. It has served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it hasn't come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regexes: one to check that the text is Chinese, and one to check that it is not simplified Chinese.
According to Wikipedia, whether a character appears as simplified Chinese, traditional Chinese, kanji, or another form is in many cases left up to the font rendering. So while you could have a selection of simplified Chinese code points, this list would not be at all complete, since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.

Converting accented characters in PostgreSQL?

Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively.
The closest thing I could find is the translate function, given the example in the comments section found here.
Some commonly used accented characters can be searched using the following function:
translate(search_terms,
'\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215\303\216\303\217\303\221\303\222\303\223\303\224\303\225\303\226\303\230\303\231\303\232\303\233\303\234\303\235\303\237\303\240\303\241\303\242\303\243\303\244\303\245\303\246\303\247\303\250\303\251\303\252\303\253\303\254\303\255\303\256\303\257\303\261\303\262\303\263\303\264\303\265\303\266\303\270\303\271\303\272\303\273\303\274\303\275\303\277','AAAAAAACEEEEIIIINOOOOOOUUUUYSaaaaaaaceeeeiiiinoooooouuuuyy')
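For illustration, here is a hedged example of applying that translate() approach to the two characters mentioned in the question; the table and column names are invented.

-- Each character of the second argument is replaced by the character at the
-- same position in the third argument, so å/ø become a/o.
SELECT translate(city_name, 'åøÅØ', 'aoAO') AS plain_name
FROM cities;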
Are you doing this just for indexing/sorting? If so, you could use this PostgreSQL extension, which provides proper Unicode collation. The same group has a PostgreSQL extension for doing normalization.