Unicode range for Japanese - unicode

I am trying to separate English and Japanese characters, so I need to find the Unicode ranges that cover all Japanese characters. What are the Unicode ranges of all Japanese characters?

As zawhtut mentioned, this page has a reference for several unicode ranges. To summarize the ranges:
Japanese-style punctuation ( 3000 - 303f)
Hiragana ( 3040 - 309f)
Katakana ( 30a0 - 30ff)
Full-width roman characters and half-width katakana ( ff00 - ffef)
CJK unified ideographs - Common and uncommon kanji ( 4e00 - 9faf)
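As a minimal sketch of how the ranges above could be used to separate Japanese from English text, the following Python snippet checks each character's code point against them. The names JAPANESE_RANGES, is_japanese and split_japanese are illustrative, not part of any library, and the kanji range also matches characters used in Chinese.

```python
# Ranges copied from the summary above (inclusive, hexadecimal code points).
JAPANESE_RANGES = [
    (0x3000, 0x303F),  # Japanese-style punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xFF00, 0xFFEF),  # Full-width roman characters and half-width katakana
    (0x4E00, 0x9FAF),  # CJK unified ideographs (kanji)
]


def is_japanese(char):
    """Return True if the character falls in one of the ranges above."""
    code = ord(char)
    return any(low <= code <= high for low, high in JAPANESE_RANGES)


def split_japanese(text):
    """Split text into (japanese_chars, other_chars), preserving order."""
    japanese = "".join(c for c in text if is_japanese(c))
    other = "".join(c for c in text if not is_japanese(c))
    return japanese, other


japanese, other = split_japanese("Hello 世界、こんにちは!")
print(japanese)
print(other)
```

Whether punctuation and full-width Latin letters should count as "Japanese" depends on the use case; dropping the corresponding tuples from the list changes that.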

Although this question already has an answer, this blog post is probably more complete.
Please visit the site and get their metrics up, but for posterity here's a copy-paste.
Hiragana
Unicode code points regex: [\x3041-\x3096]
Unicode block property regex: \p{Hiragana}
ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く ぐ け げ こ ご さ ざ し じ す ず せ ぜ そ ぞ た だ ち ぢ っ
つ づ て で と ど な に ぬ ね の は ば ぱ ひ び ぴ ふ ぶ ぷ へ べ ぺ ほ ぼ ぽ ま み む め も ゃ や ゅ ゆ
ょ よ ら り る れ ろ ゎ わ ゐ ゑ を ん ゔ ゕ ゖ ゙ ゚ ゛ ゜ ゝ ゞ ゟ
Katakana (Full Width)
Unicode code points regex: [\x30A0-\x30FF]
Unicode block property regex: \p{Katakana}
゠ ァ ア ィ イ ゥ ウ ェ エ ォ オ カ ガ キ ギ ク グ ケ ゲ コ ゴ サ ザ シ ジ ス ズ セ ゼ ソ ゾ タ ダ チ ヂ
ッ ツ ヅ テ デ ト ド ナ ニ ヌ ネ ノ ハ バ パ ヒ ビ ピ フ ブ プ ヘ ベ ペ ホ ボ ポ マ ミ ム メ モ ャ ヤ ュ
ユ ョ ヨ ラ リ ル レ ロ ヮ ワ ヰ ヱ ヲ ン ヴ ヵ ヶ ヷ ヸ ヹ ヺ ・ ー ヽ ヾ ヿ
Kanji
Unicode code points regex: [\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]
Unicode block property regex: \p{Han}
漢字 日本語 文字 言語 言葉 etc. Too many characters to list.
This regular expression will match all the kanji, including those used
in Chinese.
Kanji Radicals
Unicode code points regex: [\x2E80-\x2FD5]
⺀ ⺁ ⺂ ⺃ ⺄ ⺅ ⺆ ⺇ ⺈ ⺉ ⺊ ⺋ ⺌ ⺍ ⺎ ⺏ ⺐ ⺑ ⺒ ⺓ ⺔ ⺕ ⺖ ⺗ ⺘ ⺙ ⺚ ⺛ ⺜ ⺝ ⺞ ⺟ ⺠ ⺡ ⺢
⺣ ⺤ ⺥ ⺦ ⺧ ⺨ ⺩ ⺪ ⺫ ⺬ ⺭ ⺮ ⺯ ⺰ ⺱ ⺲ ⺳ ⺴ ⺵ ⺶ ⺷ ⺸ ⺹ ⺺ ⺻ ⺼ ⺽ ⺾ ⺿ ⻀ ⻁ ⻂ ⻃ ⻄ ⻅
⻆ ⻇ ⻈ ⻉ ⻊ ⻋ ⻌ ⻍ ⻎ ⻏ ⻐ ⻑ ⻒ ⻓ ⻔ ⻕ ⻖ ⻗ ⻘ ⻙ ⻚ ⻛ ⻜ ⻝ ⻞ ⻟ ⻠ ⻡ ⻢ ⻣ ⻤ ⻥ ⻦ ⻧ ⻨
⻩ ⻪ ⻫ ⻬ ⻭ ⻮ ⻯ ⻰ ⻱ ⻲ ⻳ ⼀ ⼁ ⼂ ⼃ ⼄ ⼅ ⼆ ⼇ ⼈ ⼉ ⼊ ⼋ ⼌ ⼍ ⼎ ⼏ ⼐ ⼑ ⼒ ⼓ ⼔ ⼕ ⼖ ⼗
⼘ ⼙ ⼚ ⼛ ⼜ ⼝ ⼞ ⼟ ⼠ ⼡ ⼢ ⼣ ⼤ ⼥ ⼦ ⼧ ⼨ ⼩ ⼪ ⼫ ⼬ ⼭ ⼮ ⼯ ⼰ ⼱ ⼲ ⼳ ⼴ ⼵ ⼶ ⼷ ⼸ ⼹ ⼺
⼻ ⼼ ⼽ ⼾ ⼿ ⽀ ⽁ ⽂ ⽃ ⽄ ⽅ ⽆ ⽇ ⽈ ⽉ ⽊ ⽋ ⽌ ⽍ ⽎ ⽏ ⽐ ⽑ ⽒ ⽓ ⽔ ⽕ ⽖ ⽗ ⽘ ⽙ ⽚ ⽛ ⽜ ⽝
⽞ ⽟ ⽠ ⽡ ⽢ ⽣ ⽤ ⽥ ⽦ ⽧ ⽨ ⽩ ⽪ ⽫ ⽬ ⽭ ⽮ ⽯ ⽰ ⽱ ⽲ ⽳ ⽴ ⽵ ⽶ ⽷ ⽸ ⽹ ⽺ ⽻ ⽼ ⽽ ⽾ ⽿ ⾀
⾁ ⾂ ⾃ ⾄ ⾅ ⾆ ⾇ ⾈ ⾉ ⾊ ⾋ ⾌ ⾍ ⾎ ⾏ ⾐ ⾑ ⾒ ⾓ ⾔ ⾕ ⾖ ⾗ ⾘ ⾙ ⾚ ⾛ ⾜ ⾝ ⾞ ⾟ ⾠ ⾡ ⾢ ⾣
⾤ ⾥ ⾦ ⾧ ⾨ ⾩ ⾪ ⾫ ⾬ ⾭ ⾮ ⾯ ⾰ ⾱ ⾲ ⾳ ⾴ ⾵ ⾶ ⾷ ⾸ ⾹ ⾺ ⾻ ⾼ ⾽ ⾾ ⾿ ⿀ ⿁ ⿂ ⿃ ⿄ ⿅ ⿆
⿇ ⿈ ⿉ ⿊ ⿋ ⿌ ⿍ ⿎ ⿏ ⿐ ⿑ ⿒ ⿓ ⿔ ⿕
Katakana and Punctuation (Half Width)
Unicode code points regex: [\xFF5F-\xFF9F]
⦅ ⦆ 。 「 」 、 ・ ヲ ァ ィ ゥ ェ ォ ャ ュ ョ ッ ー ア イ ウ エ オ カ キ ク ケ コ サ シ ス セ ソ タ チ
ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ヤ ユ ヨ ラ リ ル レ ロ ワ ン ゙
Japanese Symbols and Punctuation
Unicode code points regex: [\x3000-\x303F]
、 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 『 』 【 】 〒 〓 〔 〕 〖 〗 〘 〙 〚 〛 〜 〝 〞 〟 〠 〡 〢 〣
〤 〥 〦 〧 〨 〩 〪 〫 〬 〭 〮 〯 〰 〱 〲 〳 〴 〵 〶 〷 〸 〹 〺 〻 〼 〽 〾 〿
Miscellaneous Japanese Symbols and Characters
Unicode code points regex: [\x31F0-\x31FF\x3220-\x3243\x3280-\x337F]
ㇰ ㇱ ㇲ ㇳ ㇴ ㇵ ㇶ ㇷ ㇸ ㇹ ㇺ ㇻ ㇼ ㇽ ㇾ ㇿ ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ ㈪ ㈫ ㈬ ㈭ ㈮ ㈯ ㈰ ㈱ ㈲
㈳ ㈴ ㈵ ㈶ ㈷ ㈸ ㈹ ㈺ ㈻ ㈼ ㈽ ㈾ ㈿ ㉀ ㉁ ㉂ ㉃ ㊀ ㊁ ㊂ ㊃ ㊄ ㊅ ㊆ ㊇ ㊈ ㊉ ㊊ ㊋ ㊌ ㊍ ㊎ ㊏ ㊐ ㊑
㊒ ㊓ ㊔ ㊕ ㊖ ㊗ ㊘ ㊙ ㊚ ㊛ ㊜ ㊝ ㊞ ㊟ ㊠ ㊡ ㊢ ㊣ ㊤ ㊥ ㊦ ㊧ ㊨ ㊩ ㊪ ㊫ ㊬ ㊭ ㊮ ㊯ ㊰ ㊱ ㊲ ㊳ ㊴
㊵ ㊶ ㊷ ㊸ ㊹ ㊺ ㊻ ㊼ ㊽ ㊾ ㊿ ㋀ ㋁ ㋂ ㋃ ㋄ ㋅ ㋆ ㋇ ㋈ ㋉ ㋊ ㋋ ㋐ ㋑ ㋒ ㋓ ㋔ ㋕ ㋖ ㋗ ㋘ ㋙ ㋚ ㋛
㋜ ㋝ ㋞ ㋟ ㋠ ㋡ ㋢ ㋣ ㋤ ㋥ ㋦ ㋧ ㋨ ㋩ ㋪ ㋫ ㋬ ㋭ ㋮ ㋯ ㋰ ㋱ ㋲ ㋳ ㋴ ㋵ ㋶ ㋷ ㋸ ㋹ ㋺ ㋻ ㋼ ㋽ ㋾
㌀ ㌁ ㌂ ㌃ ㌄ ㌅ ㌆ ㌇ ㌈ ㌉ ㌊ ㌋ ㌌ ㌍ ㌎ ㌏ ㌐ ㌑ ㌒ ㌓ ㌔ ㌕ ㌖ ㌗ ㌘ ㌙ ㌚ ㌛ ㌜ ㌝ ㌞ ㌟ ㌠ ㌡ ㌢
㌣ ㌤ ㌥ ㌦ ㌧ ㌨ ㌩ ㌪ ㌫ ㌬ ㌭ ㌮ ㌯ ㌰ ㌱ ㌲ ㌳ ㌴ ㌵ ㌶ ㌷ ㌸ ㌹ ㌺ ㌻ ㌼ ㌽ ㌾ ㌿ ㍀ ㍁ ㍂ ㍃ ㍄ ㍅
㍆ ㍇ ㍈ ㍉ ㍊ ㍋ ㍌ ㍍ ㍎ ㍏ ㍐ ㍑ ㍒ ㍓ ㍔ ㍕ ㍖ ㍗ ㍘ ㍙ ㍚ ㍛ ㍜ ㍝ ㍞ ㍟ ㍠ ㍡ ㍢ ㍣ ㍤ ㍥ ㍦ ㍧ ㍨
㍩ ㍪ ㍫ ㍬ ㍭ ㍮ ㍯ ㍰ ㍱ ㍲ ㍳ ㍴ ㍵ ㍶ ㍻ ㍼ ㍽ ㍾ ㍿
Alphanumeric and Punctuation (Full Width)
Unicode code points regex: [\xFF01-\xFF5E]
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C
D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f
g h i j k l m n o p q r s t u v w x y z { | } ~
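A hedged sketch of how these character classes might look in Python: the standard re module's \x escape takes exactly two hex digits, so four-digit code points are written with \u escapes here instead of the \x notation used above. The \p{...} property classes are not supported by re (the third-party regex package does support them). The names JAPANESE_RUN and japanese_runs are illustrative.

```python
import re

# Character classes built from the code point ranges listed above,
# rewritten with \uXXXX escapes, which Python's re module understands.
HIRAGANA = r"\u3041-\u3096"
KATAKANA = r"\u30A0-\u30FF"
KANJI = r"\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A"
HALFWIDTH_KATAKANA = r"\uFF5F-\uFF9F"
PUNCTUATION = r"\u3000-\u303F"

# One or more consecutive characters from any of the classes above.
JAPANESE_RUN = re.compile(
    "[" + HIRAGANA + KATAKANA + KANJI + HALFWIDTH_KATAKANA + PUNCTUATION + "]+"
)


def japanese_runs(text):
    """Return every maximal run of Japanese characters found in text."""
    return JAPANESE_RUN.findall(text)


print(japanese_runs("I like 寿司 and ラーメン very much"))
```

Other regex flavors spell the same ranges differently (for example \x{3041} in PCRE), so the escapes need adapting to the engine in use.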

Please see this page for a reference. It contains Katakana, Hiragana and Kanji unicode ranges.

CJK (Chinese, Japanese, and Korean), Hiragana, and Katakana (including Halfwidth Katakana):
http://www.unicode.org/charts/

What is Unicode range of all Japanese characters?
Have a look at "The WiLI benchmark dataset for written language identification", especially Table II. The number in brackets is the share of the language that you capture with the given Unicode code point range (in decimal).
12352 - 12543: Japanese (48.73%), English (0.00%)
19000 - 44000: Japanese (32.78%), English (0.00%)
20 - 128: English (99.74%), Japanese (11.58%)
You can see that 20 - 128 captures English really well and that all 3 blocks are important for Japanese, but still big parts are missing.
Those numbers are created with lidtk and WiLI-2018.
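To make the idea behind those percentages concrete, here is a small Python sketch that measures what fraction of a text falls inside a decimal code point range. It is not the lidtk implementation; range_coverage is a hypothetical helper and the sample strings are made up for illustration.

```python
def range_coverage(text, low, high):
    """Fraction of characters in text whose code point lies in [low, high]."""
    if not text:
        return 0.0
    hits = sum(1 for c in text if low <= ord(c) <= high)
    return hits / len(text)


sample = "ひらがな and 漢字"
# The three decimal ranges quoted above.
for low, high in [(12352, 12543), (19000, 44000), (20, 128)]:
    print(low, high, round(range_coverage(sample, low, high), 2))
```

Run over a large corpus per language, the same calculation would yield figures of the kind shown in the table.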

Related

html2pdf not showing character correctly, encoding for ē

I'm struggling with some characters in a PDF I'm trying to create with html2pdf. The following code creates the PDF, but ē is shown as e.
$html2pdf=new Html2Pdf();
$html2pdf->writeHTML('<h1>Fēnix</h1>');
$html2pdf->output();
When getting the name from my database, ē is shown as ?.
$query=$mysqli->query('SELECT name FROM table WHERE id=1;');
$result=$query->fetch_assoc();
$html2pdf=new Html2Pdf();
$html2pdf->writeHTML('<h1>'.$result['name'].'</h1>');
$html2pdf->output();
This is the way I connect to my database:
$mysqli=new mysqli('host', 'user', 'pass', 'db');
I have also tried adding a charset:
$mysqli->set_charset('utf8');
Or initiating the class with parameters:
$html2pdf=new Html2Pdf('P', 'A4', 'nl');
$html2pdf=new Html2Pdf('P', 'A4', 'nl', true, 'UTF8');
Other characters that are giving issues are: Ś ą ł ś
Both server and database are UTF-8.
The solution is to apply a UTF-8 font to all elements.
* { font-family:freeserif; }

How to disable subwords embedding training when using fasttext?

Here is a snippet of the corpus I try to use for training word embedding.
news_subent_12402 news_dlsub_00322 news_dlsub_00001 news_sub_00035 news_subent_07737 news_sub_00038 news_dlsub_00925 news_subent_07934 news_sub_00057 news_dlsub_01826 news_dlsub_00437 news_sub_00037 news_sub_00050 news_dlsub_00205 news_sub_00270 news_subent_05735 news_dlsub_00143 news_subent_12439 news_sub_00051 news_subent_08446 news_dlsub_00091 news_sub_00222 news_dlsub_00009 news_dlsub_00126 news_subent_15202 news_dlsub_00019 news_sub_00076 news_dlsub_00059 news_subent_11158 news_subent_10981 news_dlsub_00634 news_dlsub_00018 news_subent_03496 news_subent_16059 news_subent_08005 news_dlsub_00020 news_subent_15460 news_dlsub_00908 news_subent_12712 news_sub_00258 news_sub_00048 news_dlsub_00022 news_dlsub_00206 news_dlsub_00106 news_sub_00248 news_sub_00047 news_subent_02476 news_subent_14554 news_dlsub_00134 news_sub_00070 news_subent_06676 news_dlsub_00306 news_subent_11635 news_dlsub_01137 news_sub_00081 news_dlsub_00024 news_dlsub_00242 news_dlsub_00920 news_dlsub_00198 news_subent_02562 news_subent_09358 news_dlsub_00101 news_subent_02696 news_subent_17124 news_sub_00244 news_dlsub_00045 news_sub_00049 news_dlsub_00575 news_dlsub_00163 news_subent_03497 news_subent_10972 news_subent_05406 news_sub_00039 news_subent_14976 news_subent_20148 news_subent_02955 news_sub_00245 news_subent_02399 news_dlsub_00669 news_subent_12423 news_dlsub_00180 news_dlsub_00013 news_dlsub_00075 news_sub_00264 news_dlsub_01833 news_sub_00040 news_sub_00257 news_dlsub_00021 news_subent_14967 news_subent_03495 news_dlsub_00035 news_subent_21377 news_sub_00059 news_dlsub_01260 news_sub_00232 news_dlsub_00316 news_dlsub_00014 news_dlsub_00023 news_dlsub_00046 news_subent_02007 news_dlsub_00458 news_dlsub_00269 news_subent_04653 news_subent_06231 news_dlsub_01751 news_dlsub_00186 news_dlsub_00043 news_dlsub_00128 news_subent_05276 news_sub_00259 news_dlsub_00102 news_sub_00268 news_dlsub_00185 news_sub_00041 news_subent_09122 news_dlsub_00116 news_subent_09210 news_subent_07733 
news_subent_06393 news_dlsub_00244 news_dlsub_00622 news_sub_00226 news_sub_00043 news_dlsub_00067
news_subent_03827 news_dlsub_00065 news_sub_00251 news_dlsub_01826 news_subent_17688 news_subent_07649 news_subent_02941 news_dlsub_00100 news_subent_08198 news_subent_02990 news_dlsub_00033 news_subent_02562 news_dlsub_00043 news_dlsub_00024 news_dlsub_00015 news_subent_07628 news_subent_07045 news_dlsub_00234 news_subent_09178 news_dlsub_00458 news_subent_02923 news_sub_00226 news_dlsub_00120 news_sub_00247 news_dlsub_00014 news_dlsub_01830 news_subent_02946 news_dlsub_00086 news_dlsub_00046 news_dlsub_00038 news_subent_16554 news_subent_03073 news_dlsub_00128 news_dlsub_00098 news_subent_02905 news_subent_09117 news_dlsub_00021 news_dlsub_00143 news_subent_03054 news_dlsub_00126 news_subent_16372 news_dlsub_01833 news_subent_03495 news_sub_00245 news_dlsub_00101 news_sub_00258 news_subent_11431 news_sub_00148 news_subent_09320 news_sub_00232 news_subent_02460 news_dlsub_00032 news_dlsub_00067 news_dlsub_00064 news_dlsub_00045 news_dlsub_00116 news_subent_11663 news_subent_03501 news_subent_02030 news_dlsub_00035 news_dlsub_00476 news_dlsub_00039 news_subent_14505 news_dlsub_00091 news_sub_00244 news_sub_00268 news_dlsub_00130 news_subent_02007 news_subent_03014 news_dlsub_00022 news_dlsub_00019 news_subent_09358 news_dlsub_00270 news_subent_17124 news_dlsub_00071 news_sub_00266 news_subent_06429 news_subent_02621 news_sub_00248
news_subent_03497 news_subent_03495 news_dlsub_01326 news_sub_00151 news_sub_00070 news_dlsub_00143 news_dlsub_00012 news_dlsub_00212 news_subent_04653 news_subent_02022 news_dlsub_00101 football_club_187 news_subent_02902 news_dlsub_00116 news_dlsub_00925 news_sub_00137 news_dlsub_00120 news_sub_00036 news_subent_02889 news_subent_14976 news_dlsub_00269 news_dlsub_00687 news_subent_15202 news_dlsub_00669 news_dlsub_00126 news_sub_00248 news_dlsub_00437 news_sub_00071 news_dlsub_00177 news_dlsub_00694 news_dlsub_00618 news_sub_00051 news_sub_00043 news_subent_14997 news_subent_02411 news_subent_16059 news_sub_00245 news_subent_02923 news_dlsub_00035 news_sub_00069 news_subent_05320 news_sub_00082 news_sub_00259 news_dlsub_01035 news_dlsub_00413 news_sub_00072 news_dlsub_00020 news_sub_00052 news_dlsub_00023 news_subent_03496 news_subent_02893 news_subent_16508 news_sub_00065 news_sub_00047 news_subent_05740 news_subent_13389 news_sub_00055 news_subent_09439 news_subent_02991 news_sub_00268 news_dlsub_00003 news_subent_04609 news_subent_03509 news_subent_04069 news_dlsub_00128 news_dlsub_00099 news_dlsub_00206 news_dlsub_00582 news_sub_00037 news_dlsub_00021 news_sub_00247 news_dlsub_01179 news_sub_00057 news_dlsub_00046 news_sub_00039 news_sub_00050 news_subent_03014 news_sub_00042 news_dlsub_01826 news_sub_00038 news_dlsub_00410 news_subent_12422 news_sub_00048 news_subent_13648 news_dlsub_01807 news_subent_20148 news_sub_00084 news_sub_00049 news_dlsub_00029 news_subent_11392 news_dlsub_00412 news_sub_00246 news_sub_00244 news_subent_16385 news_dlsub_00634 news_subent_13536 news_subent_03073 news_sub_00226 news_subent_11478 news_sub_00035 news_subent_14967 football_club_192 news_sub_00232 news_sub_00054 news_subent_06587 news_dlsub_00014 news_subent_02399 news_dlsub_00013 news_dlsub_00102 news_sub_00040 news_subent_01990 news_dlsub_00007 news_subent_07675 news_subent_07719 news_sub_00041 news_subent_04655 news_dlsub_00300 news_dlsub_00019 news_subent_07756 
news_dlsub_00234 news_sub_00076
Every line is a sentence, and each token such as news_dlsub_00001 is a single intact word. I do not want fastText to construct subword embeddings; I only want embeddings for the intact words, such as news_dlsub_01326, news_subent_12402 and so on.
There are 15354 distinct words in my corpus and about 10 million rows (sentences) overall.
Here is the training script:
./fasttext skipgram -input user_profile_tags_rows.txt -output model_user_tags -lr 0.01 -epoch 50 -wordNgrams 1 -bucket 200000 -dim 128 -loss hs -thread 80 -ws 5 -minCount 1
So how can I change the training script so that it skips training embeddings for subwords, for efficiency? Thanks.
If you want to train word embeddings with no subword information, you can set the -maxn parameter to 0. This means that you only use character ngrams with a max length of 0, i.e., no character ngrams are used.
Set both options to zero: -maxn 0 -minn 0
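To illustrate why a maximum length of 0 disables subwords, here is a simplified Python sketch of character n-gram extraction. It is not fastText's actual code; char_ngrams is a hypothetical helper that assumes the word is wrapped in < and > boundary markers and that n-grams of every length from minn to maxn are collected.

```python
def char_ngrams(word, minn, maxn):
    """Return character n-grams of length minn..maxn for a word wrapped in < >."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        if n <= 0:
            # Length-0 n-grams do not exist, so minn = maxn = 0 yields nothing.
            continue
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("news", 3, 3))
print(char_ngrams("news", 0, 0))
```

With both bounds at zero the list is empty, so only the whole-word vector is left to train.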

Emoji value range

I am trying to strip all emoji characters from a string (like a sanitizer), but I cannot find a complete set of emoji values.
What is the complete set of emoji chars' UTF16 values?
The Unicode standard's Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):
...
21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ; emoji ; L1 ; none ; j # V1.1 (⌚) WATCH
231B ; emoji ; L1 ; none ; j # V1.1 (⌛) HOURGLASS
...
I believe you would want to remove each character listed in this document which had a Default_Emoji_Style of emoji.
There is no way, other than reference to a definition list like this, to identify the emoji characters in Unicode. As the reference to the FAQ says, they are spread throughout different blocks.
I have composed a list based on Joe's and Doctor.Who's answers:
U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0
unicode-range: U+0080-02AF, U+0300-03FF, U+0600-06FF, U+0C00-0C7F, U+1DC0-1DFF, U+1E00-1EFF, U+2000-209F, U+20D0-214F, U+2190-23FF, U+2460-25FF, U+2600-27EF, U+2900-29FF, U+2B00-2BFF, U+2C60-2C7F, U+2E00-2E7F, U+3000-303F, U+A490-A4CF, U+E000-F8FF, U+FE00-FE0F, U+FE30-FE4F, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0;
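A minimal Python sketch of a sanitizer built from a subset of the first list above. EMOJI_RANGES, EMOJI_PATTERN and strip_emoji are illustrative names, and the ranges are only as complete as that list. Combining characters such as the variation selector U+FE0F or the zero-width joiner are not covered and would be left behind.

```python
import re

# Subset of the ranges from the first list above, as (low, high) code points.
EMOJI_RANGES = [
    (0x00A9, 0x00A9), (0x00AE, 0x00AE), (0x203C, 0x203C), (0x2049, 0x2049),
    (0x2194, 0x2199), (0x21A9, 0x21AA), (0x231A, 0x231B), (0x23E9, 0x23F3),
    (0x2600, 0x27EF), (0x2B00, 0x2BFF), (0x1F000, 0x1F02F),
    (0x1F0A0, 0x1F0FF), (0x1F100, 0x1F64F), (0x1F680, 0x1F6FF),
    (0x1F910, 0x1F96B), (0x1F980, 0x1F9E0),
]

# Build one character class using \UXXXXXXXX escapes for each range.
EMOJI_PATTERN = re.compile(
    "[" + "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in EMOJI_RANGES) + "]"
)


def strip_emoji(text):
    """Remove every character that falls in one of the ranges above."""
    return EMOJI_PATTERN.sub("", text)


print(strip_emoji("Hi \U0001F600 there \u2708 ok"))
```

Because new emoji are added with each Unicode release, a table like this needs regenerating from the current emoji data file rather than being treated as final.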
Emoji ranges are updated for every new version of Unicode Emoji. Ranges below are correct for version 14.0
Here is my gist for an advanced version of this code.
def is_contains_emoji(p_string_in_unicode):
    """
    Instead of searching all chars of a text in a emoji lookup dictionary this function just
    checks whether any char in the text is in unicode emoji range
    It is much faster than a dictionary lookup for a large text
    However it only tells whether a text contains an emoji. It does not return the found emojis
    """
    range_min = ord(u'\U0001F300')  # 127744
    range_max = ord(u"\U0001FAF6")  # 129782
    range_min_2 = 126980
    range_max_2 = 127569
    range_min_3 = 169
    range_max_3 = 174
    range_min_4 = 8205
    range_max_4 = 12953
    if p_string_in_unicode:
        for a_char in p_string_in_unicode:
            char_code = ord(a_char)
            if range_min <= char_code <= range_max:
                # or range_min_2 <= char_code <= range_max_2 or range_min_3 <= char_code <= range_max_3 or range_min_4 <= char_code <= range_max_4:
                return True
            elif range_min_2 <= char_code <= range_max_2:
                return True
            elif range_min_3 <= char_code <= range_max_3:
                return True
            elif range_min_4 <= char_code <= range_max_4:
                return True
        return False
    else:
        return False
You can get ranges of characters meeting any requirements specified by their category and properties from the Official UnicodeSet Utility
According to their search result, the full range of emoji is:
[\U0001F3FB-\U0001F3FF * # \U0001F600 \U0001F603 \U0001F604 \U0001F601 \U0001F606 \U0001F605 \U0001F923 \U0001F602 \U0001F642 \U0001F643 \U0001FAE0 \U0001F609 \U0001F60A \U0001F607 \U0001F970 \U0001F60D \U0001F929 \U0001F618 \U0001F617 \u263A \U0001F61A \U0001F619 \U0001F972 \U0001F60B \U0001F61B \U0001F61C \U0001F92A \U0001F61D \U0001F911 \U0001F917 \U0001F92D \U0001FAE2 \U0001FAE3 \U0001F92B \U0001F914 \U0001FAE1 \U0001F910 \U0001F928 \U0001F610 \U0001F611 \U0001F636 \U0001FAE5 \U0001F60F \U0001F612 \U0001F644 \U0001F62C \U0001F925 \U0001FAE8 \U0001F60C \U0001F614 \U0001F62A \U0001F924 \U0001F634 \U0001F637 \U0001F912 \U0001F915 \U0001F922 \U0001F92E \U0001F927 \U0001F975 \U0001F976 \U0001F974 \U0001F635 \U0001F92F \U0001F920 \U0001F973 \U0001F978 \U0001F60E \U0001F913 \U0001F9D0 \U0001F615 \U0001FAE4 \U0001F61F \U0001F641 \u2639 \U0001F62E \U0001F62F \U0001F632 \U0001F633 \U0001F97A \U0001F979 \U0001F626-\U0001F628 \U0001F630 \U0001F625 \U0001F622 \U0001F62D \U0001F631 \U0001F616 \U0001F623 \U0001F61E \U0001F613 \U0001F629 \U0001F62B \U0001F971 \U0001F624 \U0001F621 \U0001F620 \U0001F92C \U0001F608 \U0001F47F \U0001F480 \u2620 \U0001F4A9 \U0001F921 \U0001F479-\U0001F47B \U0001F47D \U0001F47E \U0001F916 \U0001F63A \U0001F638 \U0001F639 \U0001F63B-\U0001F63D \U0001F640 \U0001F63F \U0001F63E \U0001F648-\U0001F64A \U0001F48B \U0001F48C \U0001F498 \U0001F49D \U0001F496 \U0001F497 \U0001F493 \U0001F49E \U0001F495 \U0001F49F \u2763 \U0001F494 \u2764 \U0001F9E1 \U0001F49B \U0001F49A \U0001F499 \U0001F49C \U0001FA75-\U0001FA77 \U0001F90E \U0001F5A4 \U0001F90D \U0001F4AF \U0001F4A2 \U0001F4A5 \U0001F4AB \U0001F4A6 \U0001F4A8 \U0001F573 \U0001F4A3 \U0001F4AC \U0001F5E8 \U0001F5EF \U0001F4AD \U0001F4A4 \U0001F44B \U0001F91A \U0001F590 \u270B \U0001F596 \U0001FAF1-\U0001FAF4 \U0001F44C \U0001F90C \U0001F90F \u270C \U0001F91E \U0001FAF0 \U0001F91F \U0001F918 \U0001F919 \U0001F448 \U0001F449 \U0001F446 \U0001F595 \U0001F447 \u261D \U0001FAF5 \U0001F44D \U0001F44E \u270A 
\U0001F44A \U0001F91B \U0001F91C \U0001F44F \U0001F64C \U0001FAF6 \U0001F450 \U0001F932 \U0001F91D \U0001F64F \U0001FAF7 \U0001FAF8 \u270D \U0001F485 \U0001F933 \U0001F4AA \U0001F9BE \U0001F9BF \U0001F9B5 \U0001F9B6 \U0001F442 \U0001F9BB \U0001F443 \U0001F9E0 \U0001FAC0 \U0001FAC1 \U0001F9B7 \U0001F9B4 \U0001F440 \U0001F441 \U0001F445 \U0001F444 \U0001FAE6 \U0001F476 \U0001F9D2 \U0001F466 \U0001F467 \U0001F9D1\U0001F471 \U0001F468\U0001F9D4 \U0001F469 \U0001F9D3 \U0001F474 \U0001F475 \U0001F64D \U0001F64E \U0001F645 \U0001F646 \U0001F481 \U0001F64B \U0001F9CF \U0001F647 \U0001F926 \U0001F937 \U0001F46E \U0001F575 \U0001F482 \U0001F977 \U0001F477 \U0001FAC5 \U0001F934 \U0001F478 \U0001F473 \U0001F472 \U0001F9D5 \U0001F935 \U0001F470 \U0001F930 \U0001FAC3 \U0001FAC4 \U0001F931 \U0001F47C \U0001F385 \U0001F936 \U0001F9B8 \U0001F9B9 \U0001F9D9-\U0001F9DF \U0001F9CC \U0001F486 \U0001F487 \U0001F6B6 \U0001F9CD \U0001F9CE \U0001F3C3 \U0001F483 \U0001F57A \U0001F574 \U0001F46F \U0001F9D6 \U0001F9D7 \U0001F93A \U0001F3C7 \u26F7 \U0001F3C2 \U0001F3CC \U0001F3C4 \U0001F6A3 \U0001F3CA \u26F9 \U0001F3CB \U0001F6B4 \U0001F6B5 \U0001F938 \U0001F93C-\U0001F93E \U0001F939 \U0001F9D8 \U0001F6C0 \U0001F6CC \U0001F46D \U0001F46B \U0001F46C \U0001F48F \U0001F491 \U0001F46A \U0001F5E3 \U0001F464 \U0001F465 \U0001FAC2 \U0001F463 \U0001F9B0 \U0001F9B1 \U0001F9B3 \U0001F9B2 \U0001F435 \U0001F412 \U0001F98D \U0001F9A7 \U0001F436 \U0001F415 \U0001F9AE \U0001F429 \U0001F43A \U0001F98A \U0001F99D \U0001F431 \U0001F408 \U0001F981 \U0001F42F \U0001F405 \U0001F406 \U0001F434 \U0001FACE \U0001FACF \U0001F40E \U0001F984 \U0001F993 \U0001F98C \U0001F9AC \U0001F42E \U0001F402-\U0001F404 \U0001F437 \U0001F416 \U0001F417 \U0001F43D \U0001F40F \U0001F411 \U0001F410 \U0001F42A \U0001F42B \U0001F999 \U0001F992 \U0001F418 \U0001F9A3 \U0001F98F \U0001F99B \U0001F42D \U0001F401 \U0001F400 \U0001F439 \U0001F430 \U0001F407 \U0001F43F \U0001F9AB \U0001F994 \U0001F987 \U0001F43B \U0001F428 \U0001F43C \U0001F9A5 
\U0001F9A6 \U0001F9A8 \U0001F998 \U0001F9A1 \U0001F43E \U0001F983 \U0001F414 \U0001F413 \U0001F423-\U0001F427 \U0001F54A \U0001F985 \U0001F986 \U0001F9A2 \U0001F989 \U0001F9A4 \U0001FAB6 \U0001F9A9 \U0001F99A \U0001F99C \U0001FABD \U0001FABF \U0001F438 \U0001F40A \U0001F422 \U0001F98E \U0001F40D \U0001F432 \U0001F409 \U0001F995 \U0001F996 \U0001F433 \U0001F40B \U0001F42C \U0001F9AD \U0001F41F-\U0001F421 \U0001F988 \U0001F419 \U0001F41A \U0001FAB8 \U0001FABC \U0001F40C \U0001F98B \U0001F41B-\U0001F41D \U0001FAB2 \U0001F41E \U0001F997 \U0001FAB3 \U0001F577 \U0001F578 \U0001F982 \U0001F99F \U0001FAB0 \U0001FAB1 \U0001F9A0 \U0001F490 \U0001F338 \U0001F4AE \U0001FAB7 \U0001F3F5 \U0001F339 \U0001F940 \U0001F33A-\U0001F33C \U0001F337 \U0001FABB \U0001F331 \U0001FAB4 \U0001F332-\U0001F335 \U0001F33E \U0001F33F \u2618 \U0001F340-\U0001F343 \U0001FAB9 \U0001FABA \U0001F347-\U0001F34D \U0001F96D \U0001F34E-\U0001F353 \U0001FAD0 \U0001F95D \U0001F345 \U0001FAD2 \U0001F965 \U0001F951 \U0001F346 \U0001F954 \U0001F955 \U0001F33D \U0001F336 \U0001FAD1 \U0001F952 \U0001F96C \U0001F966 \U0001F9C4 \U0001F9C5 \U0001F344 \U0001F95C \U0001FAD8 \U0001F330 \U0001FADA \U0001FADB \U0001F35E \U0001F950 \U0001F956 \U0001FAD3 \U0001F968 \U0001F96F \U0001F95E \U0001F9C7 \U0001F9C0 \U0001F356 \U0001F357 \U0001F969 \U0001F953 \U0001F354 \U0001F35F \U0001F355 \U0001F32D \U0001F96A \U0001F32E \U0001F32F \U0001FAD4 \U0001F959 \U0001F9C6 \U0001F95A \U0001F373 \U0001F958 \U0001F372 \U0001FAD5 \U0001F963 \U0001F957 \U0001F37F \U0001F9C8 \U0001F9C2 \U0001F96B \U0001F371 \U0001F358-\U0001F35D \U0001F360 \U0001F362-\U0001F365 \U0001F96E \U0001F361 \U0001F95F-\U0001F961 \U0001F980 \U0001F99E \U0001F990 \U0001F991 \U0001F9AA \U0001F366-\U0001F36A \U0001F382 \U0001F370 \U0001F9C1 \U0001F967 \U0001F36B-\U0001F36F \U0001F37C \U0001F95B \u2615 \U0001FAD6 \U0001F375 \U0001F376 \U0001F37E \U0001F377-\U0001F37B \U0001F942 \U0001F943 \U0001FAD7 \U0001F964 \U0001F9CB \U0001F9C3 \U0001F9C9 \U0001F9CA \U0001F962 
\U0001F37D \U0001F374 \U0001F944 \U0001F52A \U0001FAD9 \U0001F3FA \U0001F30D-\U0001F310 \U0001F5FA \U0001F5FE \U0001F9ED \U0001F3D4 \u26F0 \U0001F30B \U0001F5FB \U0001F3D5 \U0001F3D6 \U0001F3DC-\U0001F3DF \U0001F3DB \U0001F3D7 \U0001F9F1 \U0001FAA8 \U0001FAB5 \U0001F6D6 \U0001F3D8 \U0001F3DA \U0001F3E0-\U0001F3E6 \U0001F3E8-\U0001F3ED \U0001F3EF \U0001F3F0 \U0001F492 \U0001F5FC \U0001F5FD \u26EA \U0001F54C \U0001F6D5 \U0001F54D \u26E9 \U0001F54B \u26F2 \u26FA \U0001F301 \U0001F303 \U0001F3D9 \U0001F304-\U0001F307 \U0001F309 \u2668 \U0001F3A0 \U0001F6DD \U0001F3A1 \U0001F3A2 \U0001F488 \U0001F3AA \U0001F682-\U0001F68A \U0001F69D \U0001F69E \U0001F68B-\U0001F68E \U0001F690-\U0001F699 \U0001F6FB \U0001F69A-\U0001F69C \U0001F3CE \U0001F3CD \U0001F6F5 \U0001F9BD \U0001F9BC \U0001F6FA \U0001F6B2 \U0001F6F4 \U0001F6F9 \U0001F6FC \U0001F68F \U0001F6E3 \U0001F6E4 \U0001F6E2 \u26FD \U0001F6DE \U0001F6A8 \U0001F6A5 \U0001F6A6 \U0001F6D1 \U0001F6A7 \u2693 \U0001F6DF \u26F5 \U0001F6F6 \U0001F6A4 \U0001F6F3 \u26F4 \U0001F6E5 \U0001F6A2 \u2708 \U0001F6E9 \U0001F6EB \U0001F6EC \U0001FA82 \U0001F4BA \U0001F681 \U0001F69F-\U0001F6A1 \U0001F6F0 \U0001F680 \U0001F6F8 \U0001F6CE \U0001F9F3 \u231B \u23F3 \u231A \u23F0-\u23F2 \U0001F570 \U0001F55B \U0001F567 \U0001F550 \U0001F55C \U0001F551 \U0001F55D \U0001F552 \U0001F55E \U0001F553 \U0001F55F \U0001F554 \U0001F560 \U0001F555 \U0001F561 \U0001F556 \U0001F562 \U0001F557 \U0001F563 \U0001F558 \U0001F564 \U0001F559 \U0001F565 \U0001F55A \U0001F566 \U0001F311-\U0001F31C \U0001F321 \u2600 \U0001F31D \U0001F31E \U0001FA90 \u2B50 \U0001F31F \U0001F320 \U0001F30C \u2601 \u26C5 \u26C8 \U0001F324-\U0001F32C \U0001F300 \U0001F308 \U0001F302 \u2602 \u2614 \u26F1 \u26A1 \u2744 \u2603 \u26C4 \u2604 \U0001F525 \U0001F4A7 \U0001F30A \U0001F383 \U0001F384 \U0001F386 \U0001F387 \U0001F9E8 \u2728 \U0001F388-\U0001F38B \U0001F38D-\U0001F391 \U0001F9E7 \U0001F380 \U0001F381 \U0001F397 \U0001F39F \U0001F3AB \U0001F396 \U0001F3C6 \U0001F3C5 
\U0001F947-\U0001F949 \u26BD \u26BE \U0001F94E \U0001F3C0 \U0001F3D0 \U0001F3C8 \U0001F3C9 \U0001F3BE \U0001F94F \U0001F3B3 \U0001F3CF \U0001F3D1 \U0001F3D2 \U0001F94D \U0001F3D3 \U0001F3F8 \U0001F94A \U0001F94B \U0001F945 \u26F3 \u26F8 \U0001F3A3 \U0001F93F \U0001F3BD \U0001F3BF \U0001F6F7 \U0001F94C \U0001F3AF \U0001FA80 \U0001FA81 \U0001F3B1 \U0001F52E \U0001FA84 \U0001F9FF \U0001FAAC \U0001F3AE \U0001F579 \U0001F3B0 \U0001F3B2 \U0001F9E9 \U0001F9F8 \U0001FA85 \U0001FAA9 \U0001FA86 \u2660 \u2665 \u2666 \u2663 \u265F \U0001F0CF \U0001F004 \U0001F3B4 \U0001F3AD \U0001F5BC \U0001F3A8 \U0001F9F5 \U0001FAA1 \U0001F9F6 \U0001FAA2 \U0001F453 \U0001F576 \U0001F97D \U0001F97C \U0001F9BA \U0001F454-\U0001F456 \U0001F9E3-\U0001F9E6 \U0001F457 \U0001F458 \U0001F97B \U0001FA71-\U0001FA73 \U0001F459 \U0001F45A \U0001FAAD \U0001FAAE \U0001F45B-\U0001F45D \U0001F6CD \U0001F392 \U0001FA74 \U0001F45E \U0001F45F \U0001F97E \U0001F97F \U0001F460 \U0001F461 \U0001FA70 \U0001F462 \U0001F451 \U0001F452 \U0001F3A9 \U0001F393 \U0001F9E2 \U0001FA96 \u26D1 \U0001F4FF \U0001F484 \U0001F48D \U0001F48E \U0001F507-\U0001F50A \U0001F4E2 \U0001F4E3 \U0001F4EF \U0001F514 \U0001F515 \U0001F3BC \U0001F3B5 \U0001F3B6 \U0001F399-\U0001F39B \U0001F3A4 \U0001F3A7 \U0001F4FB \U0001F3B7 \U0001FA97 \U0001F3B8-\U0001F3BB \U0001FA95 \U0001F941 \U0001FA98 \U0001FA87 \U0001FA88 \U0001F4F1 \U0001F4F2 \u260E \U0001F4DE-\U0001F4E0 \U0001F50B \U0001FAAB \U0001F50C \U0001F4BB \U0001F5A5 \U0001F5A8 \u2328 \U0001F5B1 \U0001F5B2 \U0001F4BD-\U0001F4C0 \U0001F9EE \U0001F3A5 \U0001F39E \U0001F4FD \U0001F3AC \U0001F4FA \U0001F4F7-\U0001F4F9 \U0001F4FC \U0001F50D \U0001F50E \U0001F56F \U0001F4A1 \U0001F526 \U0001F3EE \U0001FA94 \U0001F4D4-\U0001F4DA \U0001F4D3 \U0001F4D2 \U0001F4C3 \U0001F4DC \U0001F4C4 \U0001F4F0 \U0001F5DE \U0001F4D1 \U0001F516 \U0001F3F7 \U0001F4B0 \U0001FA99 \U0001F4B4-\U0001F4B8 \U0001F4B3 \U0001F9FE \U0001F4B9 \u2709 \U0001F4E7-\U0001F4E9 \U0001F4E4-\U0001F4E6 \U0001F4EB \U0001F4EA 
\U0001F4EC-\U0001F4EE \U0001F5F3 \u270F \u2712 \U0001F58B \U0001F58A \U0001F58C \U0001F58D \U0001F4DD \U0001F4BC \U0001F4C1 \U0001F4C2 \U0001F5C2 \U0001F4C5 \U0001F4C6 \U0001F5D2 \U0001F5D3 \U0001F4C7-\U0001F4CE \U0001F587 \U0001F4CF \U0001F4D0 \u2702 \U0001F5C3 \U0001F5C4 \U0001F5D1 \U0001F512 \U0001F513 \U0001F50F-\U0001F511 \U0001F5DD \U0001F528 \U0001FA93 \u26CF \u2692 \U0001F6E0 \U0001F5E1 \u2694 \U0001F52B \U0001FA83 \U0001F3F9 \U0001F6E1 \U0001FA9A \U0001F527 \U0001FA9B \U0001F529 \u2699 \U0001F5DC \u2696 \U0001F9AF \U0001F517 \u26D3 \U0001FA9D \U0001F9F0 \U0001F9F2 \U0001FA9C \u2697 \U0001F9EA-\U0001F9EC \U0001F52C \U0001F52D \U0001F4E1 \U0001F489 \U0001FA78 \U0001F48A \U0001FA79 \U0001FA7C \U0001FA7A \U0001FA7B \U0001F6AA \U0001F6D7 \U0001FA9E \U0001FA9F \U0001F6CF \U0001F6CB \U0001FA91 \U0001F6BD \U0001FAA0 \U0001F6BF \U0001F6C1 \U0001FAA4 \U0001FA92 \U0001F9F4 \U0001F9F7 \U0001F9F9-\U0001F9FB \U0001FAA3 \U0001F9FC \U0001FAE7 \U0001FAA5 \U0001F9FD \U0001F9EF \U0001F6D2 \U0001F6AC \u26B0 \U0001FAA6 \u26B1 \U0001F5FF \U0001FAA7 \U0001FAAA \U0001F3E7 \U0001F6AE \U0001F6B0 \u267F \U0001F6B9-\U0001F6BC \U0001F6BE \U0001F6C2-\U0001F6C5 \u26A0 \U0001F6B8 \u26D4 \U0001F6AB \U0001F6B3 \U0001F6AD \U0001F6AF \U0001F6B1 \U0001F6B7 \U0001F4F5 \U0001F51E \u2622 \u2623 \u2B06 \u2197 \u27A1 \u2198 \u2B07 \u2199 \u2B05 \u2196 \u2195 \u2194 \u21A9 \u21AA \u2934 \u2935 \U0001F503 \U0001F504 \U0001F519-\U0001F51D \U0001F6D0 \u269B \U0001F549 \u2721 \u2638 \u262F \u271D \u2626 \u262A \u262E \U0001F54E \U0001F52F \U0001FAAF \u2648-\u2653 \u26CE \U0001F500-\U0001F502 \u25B6 \u23E9 \u23ED \u23EF \u25C0 \u23EA \u23EE \U0001F53C \u23EB \U0001F53D \u23EC \u23F8-\u23FA \u23CF \U0001F3A6 \U0001F505 \U0001F506 \U0001F4F6 \U0001F4F3 \U0001F4F4 \U0001F6DC \u2640 \u2642 \u26A7 \u2716 \u2795-\u2797 \U0001F7F0 \u267E \u203C \u2049 \u2753-\u2755 \u2757 \u3030 \U0001F4B1 \U0001F4B2 \u2695 \u267B \u269C \U0001F531 \U0001F4DB \U0001F530 \u2B55 \u2705 \u2611 \u2714 \u274C \u274E \u27B0 \u27BF 
\u303D \u2733 \u2734 \u2747 \u00A9 \u00AE \u2122 \U0001F51F-\U0001F524 \U0001F170 \U0001F18E \U0001F171 \U0001F191-\U0001F193 \u2139 \U0001F194 \u24C2 \U0001F195 \U0001F196 \U0001F17E \U0001F197 \U0001F17F \U0001F198-\U0001F19A \U0001F201 \U0001F202 \U0001F237 \U0001F236 \U0001F22F \U0001F250 \U0001F239 \U0001F21A \U0001F232 \U0001F251 \U0001F238 \U0001F234 \U0001F233 \u3297 \u3299 \U0001F23A \U0001F235 \U0001F534 \U0001F7E0-\U0001F7E2 \U0001F535 \U0001F7E3 \U0001F7E4 \u26AB \u26AA \U0001F7E5 \U0001F7E7-\U0001F7E9 \U0001F7E6 \U0001F7EA \U0001F7EB \u2B1B \u2B1C \u25FC \u25FB \u25FE \u25FD \u25AA \u25AB \U0001F536-\U0001F53B \U0001F4A0 \U0001F518 \U0001F533 \U0001F532 \U0001F3C1 \U0001F6A9 \U0001F38C \U0001F3F4 \U0001F3F3 \U0001F1E6-\U0001F1FF 0-9]
Triple click to select whole line
You can choose to exclude the basic Latin characters [#*0-9] in your program.
If you only deal with English characters and emoji, I think it is doable. First convert your string to UTF-16, then check each code unit: any value of 0xD800 or above (for emoji starting at U+1F300 the high surrogate is actually >= 0xD83C) should belong to an emoji.
This is because "The Unicode standard permanently reserves the code point values between 0xD800 to 0xDFFF for UTF-16 encoding of the high and low surrogates", and of course English characters (and many other characters) won't fall in this range.
But because emoji code points start from U+1F300, their UTF-16 values actually fall in this range.
Check here for a quick reference for emoji UTF-16 values, if you would rather not work them out yourself.
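The surrogate behaviour described above can be checked with a short Python sketch. utf16_code_units and is_surrogate are illustrative helper names. BMP emoji such as U+231A (WATCH) encode to a single code unit outside the surrogate range, so this test alone would miss them.

```python
def utf16_code_units(char):
    """Return the UTF-16 code units of a single character as integers."""
    data = char.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if the code unit lies in the reserved range 0xD800..0xDFFF."""
    return 0xD800 <= unit <= 0xDFFF


for ch in ["A", "\u231A", "\U0001F300", "\U0001F600"]:
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X}", [hex(u) for u in units], [is_surrogate(u) for u in units])
```

Characters above U+FFFF always produce two units, a high surrogate followed by a low one; U+1F300 comes out as 0xD83C 0xDF00.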

two separate rt indexes sphinx

I'm able to do Unicode search in Sphinx now. The issue I'm seeing is that English no longer works when I search. Do I need separate indexes for the two languages, or should one index be enough for both?
path = /var/data/sphinx/forums
rt_field = subject
rt_attr_uint = pid
charset_type = utf-8
charset_table = charset_table = U+0622->U+0627, U+0623->U+0627, U+0624->U+0648, U+0625->U+0627, U+0626->U+064A, U+06C0->U+06D5, U+06C2->U+06C1, U+06D3->U+06D2, U+FB50->U+0671, U+FB51->U+0671, U+FB52->U+067B, U+FB53->U+067B, U+FB54->U+067B, U+FB56->U+067E, U+FB57->U+067E, U+FB58->U+067E, U+FB5A->U+0680, U+FB5B->U+0680, U+FB5C->U+0680, U+FB5E->U+067A, U+FB5F->U+067A, U+FB60->U+067A, U+FB62->U+067F, U+FB63->U+067F, U+FB64->U+067F, U+FB66->U+0679, U+FB67->U+0679, U+FB68->U+0679, U+FB6A->U+06A4, U+FB6B->U+06A4, U+FB6C->U+06A4, U+FB6E->U+06A6, U+FB6F->U+06A6, U+FB70->U+06A6, U+FB72->U+0684, U+FB73->U+0684, U+FB74->U+0684, U+FB76->U+0683, U+FB77->U+0683, U+FB78->U+0683, U+FB7A->U+0686, U+FB7B->U+0686, U+FB7C->U+0686, U+FB7E->U+0687, U+FB7F->U+0687, U+FB80->U+0687, U+FB82->U+068D, U+FB83->U+068D, U+FB84->U+068C, U+FB85->U+068C, U+FB86->U+068E, U+FB87->U+068E, U+FB88->U+0688, U+FB89->U+0688, U+FB8A->U+0698, U+FB8B->U+0698, U+FB8C->U+0691, U+FB8D->U+0691, U+FB8E->U+06A9, U+FB8F->U+06A9, U+FB90->U+06A9, U+FB92->U+06AF, U+FB93->U+06AF, U+FB94->U+06AF, U+FB96->U+06B3, U+FB97->U+06B3, U+FB98->U+06B3, U+FB9A->U+06B1, U+FB9B->U+06B1, U+FB9C->U+06B1, U+FB9E->U+06BA, U+FB9F->U+06BA, U+FBA0->U+06BB, U+FBA1->U+06BB, U+FBA2->U+06BB, U+FBA4->U+06C0, U+FBA5->U+06C0, U+FBA6->U+06C1, U+FBA7->U+06C1, U+FBA8->U+06C1, U+FBAA->U+06BE, U+FBAB->U+06BE, U+FBAC->U+06BE, U+FBAE->U+06D2, U+FBAF->U+06D2, U+FBB0->U+06D3, U+FBB1->U+06D3, U+FBD3->U+06AD, U+FBD4->U+06AD, U+FBD5->U+06AD, U+FBD7->U+06C7, U+FBD8->U+06C7, U+FBD9->U+06C6, U+FBDA->U+06C6, U+FBDB->U+06C8, U+FBDC->U+06C8, U+FBDD->U+0677, U+FBDE->U+06CB, U+FBDF->U+06CB, U+FBE0->U+06C5, U+FBE1->U+06C5, U+FBE2->U+06C9, U+FBE3->U+06C9, U+FBE4->U+06D0, U+FBE5->U+06D0, U+FBE6->U+06D0, U+FBE8->U+0649, U+FBFC->U+06CC, U+FBFD->U+06CC, U+FBFE->U+06CC, U+0621, U+0627..U+063A, U+0641..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06BF, U+06C1, U+06C3..U+06D2, U+06D5, U+06EE..U+06FC, U+06FF, U+0750..U+076D, U+FB55, U+FB59, U+FB5D, U+FB61, U+FB65, U+FB69, 
U+FB6D, U+FB71, U+FB75, U+FB79, U+FB7D, U+FB81, U+FB91, U+FB95, U+FB99, U+FB9D, U+FBA3, U+FBA9, U+FBAD, U+FBD6, U+FBE7, U+FBE9, U+FBFF
The reason I'm asking this is that when I remove the charset_table and charset_type English starts working again
Your charset_table doesn't include the basic 'English' characters, e.g.
0..9, A..Z->a..z, _, a..z,
The reason removing the charset_table works is that Sphinx then falls back to the default, which is the default sbcs charset:
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
So to get multiple languages in one index, you need to make the charset_table include the characters from ALL the languages involved.

Cucumber, Rspec: unicode symbols in output

I wonder if it is possible to make Cucumber output matching errors in Russian instead of this:
Сценарий: Успешное добавление кгиги # features/books/add_book.feature:12
Если я добавил книгу # features/step_definitions/books_steps.rb:3
То я должен увидеть добавленную книгу # features/step_definitions/books_steps.rb:15
expected there to be content "\320\235\320\260\320\267\320\262\320\260\320\275
\320\270\320\265 \320\272\320\275\320\270\320\263\320\270" in "\320\236\321\210\320\270\320
\261\320\272\320\260 502!\n...
Where "\320\235\320\260\320\267\320\262\320\260\320\275" is a Russian word. It may be a feature of Rspec. Any Ideas would be great.
Adding
$KCODE='u'
to my features/support/env.rb helped a little:
А должен увидеть сообщение о том, что пароль неверен
expected there to be content "Неверный прол\321\214"
This solution only applies to Ruby 1.8.7; in 1.9.3, the magic comment
# encoding: utf-8
works just fine
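For reference, the escaped sequences in the error output are UTF-8 bytes written as octal escapes. A small sketch (in Python, purely for illustration; decode_octal_escapes is a hypothetical helper) shows how one of them maps back to Cyrillic text:

```python
import re


def decode_octal_escapes(escaped):
    """Turn an ASCII string with \\NNN octal byte escapes into UTF-8 text."""
    raw = re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),  # one escape -> one byte
        escaped.encode("ascii"),
    )
    return raw.decode("utf-8")


print(decode_octal_escapes(r"\320\235\320\260\320\267\320\262\320\260\320\275"))
```

Each Cyrillic letter takes two bytes in UTF-8, which is why the escapes come in pairs.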