I'm able to do Unicode search in Sphinx now, but English no longer works when I search. Do I need separate indexes for each language, or should one index be enough for both languages?
path = /var/data/sphinx/forums
rt_field = subject
rt_attr_uint = pid
charset_type = utf-8
charset_table = charset_table = U+0622->U+0627, U+0623->U+0627, U+0624->U+0648, U+0625->U+0627, U+0626->U+064A, U+06C0->U+06D5, U+06C2->U+06C1, U+06D3->U+06D2, U+FB50->U+0671, U+FB51->U+0671, U+FB52->U+067B, U+FB53->U+067B, U+FB54->U+067B, U+FB56->U+067E, U+FB57->U+067E, U+FB58->U+067E, U+FB5A->U+0680, U+FB5B->U+0680, U+FB5C->U+0680, U+FB5E->U+067A, U+FB5F->U+067A, U+FB60->U+067A, U+FB62->U+067F, U+FB63->U+067F, U+FB64->U+067F, U+FB66->U+0679, U+FB67->U+0679, U+FB68->U+0679, U+FB6A->U+06A4, U+FB6B->U+06A4, U+FB6C->U+06A4, U+FB6E->U+06A6, U+FB6F->U+06A6, U+FB70->U+06A6, U+FB72->U+0684, U+FB73->U+0684, U+FB74->U+0684, U+FB76->U+0683, U+FB77->U+0683, U+FB78->U+0683, U+FB7A->U+0686, U+FB7B->U+0686, U+FB7C->U+0686, U+FB7E->U+0687, U+FB7F->U+0687, U+FB80->U+0687, U+FB82->U+068D, U+FB83->U+068D, U+FB84->U+068C, U+FB85->U+068C, U+FB86->U+068E, U+FB87->U+068E, U+FB88->U+0688, U+FB89->U+0688, U+FB8A->U+0698, U+FB8B->U+0698, U+FB8C->U+0691, U+FB8D->U+0691, U+FB8E->U+06A9, U+FB8F->U+06A9, U+FB90->U+06A9, U+FB92->U+06AF, U+FB93->U+06AF, U+FB94->U+06AF, U+FB96->U+06B3, U+FB97->U+06B3, U+FB98->U+06B3, U+FB9A->U+06B1, U+FB9B->U+06B1, U+FB9C->U+06B1, U+FB9E->U+06BA, U+FB9F->U+06BA, U+FBA0->U+06BB, U+FBA1->U+06BB, U+FBA2->U+06BB, U+FBA4->U+06C0, U+FBA5->U+06C0, U+FBA6->U+06C1, U+FBA7->U+06C1, U+FBA8->U+06C1, U+FBAA->U+06BE, U+FBAB->U+06BE, U+FBAC->U+06BE, U+FBAE->U+06D2, U+FBAF->U+06D2, U+FBB0->U+06D3, U+FBB1->U+06D3, U+FBD3->U+06AD, U+FBD4->U+06AD, U+FBD5->U+06AD, U+FBD7->U+06C7, U+FBD8->U+06C7, U+FBD9->U+06C6, U+FBDA->U+06C6, U+FBDB->U+06C8, U+FBDC->U+06C8, U+FBDD->U+0677, U+FBDE->U+06CB, U+FBDF->U+06CB, U+FBE0->U+06C5, U+FBE1->U+06C5, U+FBE2->U+06C9, U+FBE3->U+06C9, U+FBE4->U+06D0, U+FBE5->U+06D0, U+FBE6->U+06D0, U+FBE8->U+0649, U+FBFC->U+06CC, U+FBFD->U+06CC, U+FBFE->U+06CC, U+0621, U+0627..U+063A, U+0641..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06BF, U+06C1, U+06C3..U+06D2, U+06D5, U+06EE..U+06FC, U+06FF, U+0750..U+076D, U+FB55, U+FB59, U+FB5D, U+FB61, U+FB65, U+FB69, 
U+FB6D, U+FB71, U+FB75, U+FB79, U+FB7D, U+FB81, U+FB91, U+FB95, U+FB99, U+FB9D, U+FBA3, U+FBA9, U+FBAD, U+FBD6, U+FBE7, U+FBE9, U+FBFF
The reason I'm asking is that when I remove charset_table and charset_type, English starts working again.
Your charset_table doesn't include the basic 'English' characters, e.g.:
0..9, A..Z->a..z, _, a..z,
Removing the charset_table works because Sphinx then falls back to the default, which is the default SBCS charset:
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
So to get multiple languages into one index, you need to make the charset_table include the characters from ALL the languages involved.
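To see which characters a given charset_table leaves out, a small checker can help. The following is only a rough sketch of the charset_table syntax as used in this thread (single characters, `a..b` ranges, `a->b` mappings, an optional `/2` suffix); the helper names and the shortened tables are made up for illustration, and it is not Sphinx's own parser.

```python
def parse_code_point(token):
    """Turn 'U+0627' or a literal character such as 'a' into an integer code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"unrecognized charset_table token: {token!r}")


def accepted_code_points(charset_table):
    """Collect the code points a charset_table declares as valid input characters."""
    accepted = set()
    # Backslashes are only line continuations in the config file.
    for entry in charset_table.replace("\\", " ").split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Only the left-hand side of 'A->a' is declared as allowed input;
        # a trailing '/2' (checkerboard mapping) is ignored in this sketch.
        source = entry.split("->")[0].split("/")[0]
        if ".." in source:
            low, high = source.split("..")
            accepted.update(range(parse_code_point(low), parse_code_point(high) + 1))
        else:
            accepted.add(parse_code_point(source))
    return accepted


def uncovered(text, charset_table):
    """Return the distinct characters of text that the table does not list."""
    accepted = accepted_code_points(charset_table)
    return sorted({ch for ch in text if ord(ch) not in accepted})


# Shortened, hypothetical tables: a few Arabic entries, with and without Latin.
ARABIC_ONLY = "U+0622->U+0627, U+0621, U+0627..U+063A, U+0641..U+064A"
WITH_LATIN = "0..9, A..Z->a..z, _, a..z, " + ARABIC_ONLY

print(uncovered("sphinx", ARABIC_ONLY))  # every Latin letter is missing
print(uncovered("sphinx", WITH_LATIN))   # nothing is missing
```

Running it against the real table from the question should list every English letter as uncovered, which matches the symptom described.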
I'm struggling with some characters in a PDF I'm trying to create with html2pdf. The following code creates the PDF, but ē is shown as e.
$html2pdf=new Html2Pdf();
$html2pdf->writeHTML('<h1>Fēnix</h1>');
$html2pdf->output();
When getting the name from my database, ē is shown as ?.
$query=$mysqli->query('SELECT name FROM table WHERE id=1;');
$result=$query->fetch_assoc();
$html2pdf=new Html2Pdf();
$html2pdf->writeHTML('<h1>'.$result['name'].'</h1>');
$html2pdf->output();
This is the way I connect to my database:
$mysqli=new mysqli('host', 'user', 'pass', 'db');
I have also tried adding a charset:
$mysqli->set_charset('utf8');
Or initiating the class with parameters:
$html2pdf=new Html2Pdf('P', 'A4', 'nl');
$html2pdf=new Html2Pdf('P', 'A4', 'nl', true, 'UTF8');
Other characters that are giving issues are: Ś ą ł ś
Both server and database are UTF-8.
The solution is to apply a UTF-8 font to all elements.
* { font-family:freeserif; }
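A quick way to spot which characters are likely to need such a font is to test them against a single-byte Western repertoire. This sketch assumes, which the thread does not state, that the default font covers roughly Windows-1252; `outside_cp1252` is a hypothetical helper name.

```python
def outside_cp1252(text):
    """Return the distinct characters of text that Windows-1252 cannot encode, in order."""
    missing = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# The characters reported as problematic in the question.
print(outside_cp1252("Fēnix Ś ą ł ś"))
```

All five characters from the question fall outside that repertoire, which is consistent with a Unicode font such as freeserif fixing the output.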
I was trying to strip all emoji characters out of a string (like a sanitizer), but I cannot find a complete set of emoji values.
What is the complete set of emoji characters' UTF-16 values?
The Unicode standard's Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):
...
21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ; emoji ; L1 ; none ; j # V1.1 (⌚) WATCH
231B ; emoji ; L1 ; none ; j # V1.1 (⌛) HOURGLASS
...
I believe you would want to remove each character listed in this document that has a Default_Emoji_Style of emoji.
There is no way, other than reference to a definition list like this, to identify the emoji characters in Unicode. As the reference to the FAQ says, they are spread throughout different blocks.
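As a minimal sketch of that approach, the lines quoted above can be parsed by splitting on `;` and keeping the rows whose second field is `emoji`. The function names are hypothetical, the sample is just the four lines shown in this answer, and newer versions of emoji-data.txt use a different layout, so treat the field positions as an assumption.

```python
SAMPLE = """\
21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ; emoji ; L1 ; none ; j # V1.1 (⌚) WATCH
231B ; emoji ; L1 ; none ; j # V1.1 (⌛) HOURGLASS
"""


def default_emoji_code_points(lines):
    """Return code points whose Default_Emoji_Style field (second column) is 'emoji'."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        # Skip multi-code-point sequences; this sketch handles single characters only.
        if len(fields) >= 2 and fields[1] == "emoji" and " " not in fields[0]:
            result.append(int(fields[0], 16))
    return result


def remove_code_points(text, code_points):
    """Drop every character of text whose code point is in code_points."""
    blocked = set(code_points)
    return "".join(ch for ch in text if ord(ch) not in blocked)


emoji = default_emoji_code_points(SAMPLE.splitlines())
print([hex(cp) for cp in emoji])
print(remove_code_points("time \u231A up \u21A9", emoji))
```

With the sample above, WATCH and HOURGLASS are removed while the two arrows, whose default style is text, are kept.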
I have composed a list based on Joe's and Doctor.Who's answers:
U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0
unicode-range: U+0080-02AF, U+0300-03FF, U+0600-06FF, U+0C00-0C7F, U+1DC0-1DFF, U+1E00-1EFF, U+2000-209F, U+20D0-214F, U+2190-23FF, U+2460-25FF, U+2600-27EF, U+2900-29FF, U+2B00-2BFF, U+2C60-2C7F, U+2E00-2E7F, U+3000-303F, U+A490-A4CF, U+E000-F8FF, U+FE00-FE0F, U+FE30-FE4F, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0;
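For the sanitizer use case, the first list above can be turned into a regular expression character class. This is a sketch rather than a vetted emoji filter: the range string is copied from the list in this answer, and `parse_ranges`, `build_pattern` and `strip_emoji` are names invented here.

```python
import re

# Ranges copied from the composed list above ("U+XXXX" or "U+XXXX-YYYY").
EMOJI_RANGES = (
    "U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, "
    "U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, "
    "U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, "
    "U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, "
    "U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, "
    "U+1F910-1F96B, U+1F980-1F9E0"
)


def parse_ranges(spec):
    """Parse 'U+2194-2199, U+231A' into a list of (low, high) integer pairs."""
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        low, _, high = item[2:].partition("-")  # strip the leading 'U+'
        ranges.append((int(low, 16), int(high or low, 16)))
    return ranges


def build_pattern(ranges):
    """Compile a character class that matches any code point in the given ranges."""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append(re.escape(chr(low)))
        else:
            parts.append(f"{re.escape(chr(low))}-{re.escape(chr(high))}")
    return re.compile("[" + "".join(parts) + "]")


EMOJI_PATTERN = build_pattern(parse_ranges(EMOJI_RANGES))


def strip_emoji(text):
    """Remove every character that falls into one of the listed ranges."""
    return EMOJI_PATTERN.sub("", text)


print(strip_emoji("a\U0001F600b\u00a9c"))
```

Note that the list is deliberately broad (it includes © and ® as well as whole symbol blocks), so it removes more than pictographic emoji.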
Emoji ranges are updated with every new version of Unicode Emoji. The ranges below are correct for version 14.0.
Here is my gist for an advanced version of this code.
def is_contains_emoji(p_string_in_unicode):
    """
    Instead of looking up every char of a text in an emoji dictionary, this function just
    checks whether any char in the text is in a Unicode emoji range.
    It is much faster than a dictionary lookup for a large text.
    However, it only tells whether a text contains an emoji. It does not return the found emojis.
    """
    range_min = ord(u'\U0001F300')  # 127744
    range_max = ord(u"\U0001FAF6")  # 129782
    range_min_2 = 126980
    range_max_2 = 127569
    range_min_3 = 169
    range_max_3 = 174
    range_min_4 = 8205
    range_max_4 = 12953
    if p_string_in_unicode:
        for a_char in p_string_in_unicode:
            char_code = ord(a_char)
            if range_min <= char_code <= range_max:
                # or range_min_2 <= char_code <= range_max_2 or range_min_3 <= char_code <= range_max_3 or range_min_4 <= char_code <= range_max_4:
                return True
            elif range_min_2 <= char_code <= range_max_2:
                return True
            elif range_min_3 <= char_code <= range_max_3:
                return True
            elif range_min_4 <= char_code <= range_max_4:
                return True
        # No character fell into any of the ranges.
        return False
    else:
        return False
You can get the ranges of characters meeting any requirements, specified by category and properties, from the official UnicodeSet utility.
According to its search result, the full set of emoji is:
[\U0001F3FB-\U0001F3FF * # \U0001F600 \U0001F603 \U0001F604 \U0001F601 \U0001F606 \U0001F605 \U0001F923 \U0001F602 \U0001F642 \U0001F643 \U0001FAE0 \U0001F609 \U0001F60A \U0001F607 \U0001F970 \U0001F60D \U0001F929 \U0001F618 \U0001F617 \u263A \U0001F61A \U0001F619 \U0001F972 \U0001F60B \U0001F61B \U0001F61C \U0001F92A \U0001F61D \U0001F911 \U0001F917 \U0001F92D \U0001FAE2 \U0001FAE3 \U0001F92B \U0001F914 \U0001FAE1 \U0001F910 \U0001F928 \U0001F610 \U0001F611 \U0001F636 \U0001FAE5 \U0001F60F \U0001F612 \U0001F644 \U0001F62C \U0001F925 \U0001FAE8 \U0001F60C \U0001F614 \U0001F62A \U0001F924 \U0001F634 \U0001F637 \U0001F912 \U0001F915 \U0001F922 \U0001F92E \U0001F927 \U0001F975 \U0001F976 \U0001F974 \U0001F635 \U0001F92F \U0001F920 \U0001F973 \U0001F978 \U0001F60E \U0001F913 \U0001F9D0 \U0001F615 \U0001FAE4 \U0001F61F \U0001F641 \u2639 \U0001F62E \U0001F62F \U0001F632 \U0001F633 \U0001F97A \U0001F979 \U0001F626-\U0001F628 \U0001F630 \U0001F625 \U0001F622 \U0001F62D \U0001F631 \U0001F616 \U0001F623 \U0001F61E \U0001F613 \U0001F629 \U0001F62B \U0001F971 \U0001F624 \U0001F621 \U0001F620 \U0001F92C \U0001F608 \U0001F47F \U0001F480 \u2620 \U0001F4A9 \U0001F921 \U0001F479-\U0001F47B \U0001F47D \U0001F47E \U0001F916 \U0001F63A \U0001F638 \U0001F639 \U0001F63B-\U0001F63D \U0001F640 \U0001F63F \U0001F63E \U0001F648-\U0001F64A \U0001F48B \U0001F48C \U0001F498 \U0001F49D \U0001F496 \U0001F497 \U0001F493 \U0001F49E \U0001F495 \U0001F49F \u2763 \U0001F494 \u2764 \U0001F9E1 \U0001F49B \U0001F49A \U0001F499 \U0001F49C \U0001FA75-\U0001FA77 \U0001F90E \U0001F5A4 \U0001F90D \U0001F4AF \U0001F4A2 \U0001F4A5 \U0001F4AB \U0001F4A6 \U0001F4A8 \U0001F573 \U0001F4A3 \U0001F4AC \U0001F5E8 \U0001F5EF \U0001F4AD \U0001F4A4 \U0001F44B \U0001F91A \U0001F590 \u270B \U0001F596 \U0001FAF1-\U0001FAF4 \U0001F44C \U0001F90C \U0001F90F \u270C \U0001F91E \U0001FAF0 \U0001F91F \U0001F918 \U0001F919 \U0001F448 \U0001F449 \U0001F446 \U0001F595 \U0001F447 \u261D \U0001FAF5 \U0001F44D \U0001F44E \u270A 
\U0001F44A \U0001F91B \U0001F91C \U0001F44F \U0001F64C \U0001FAF6 \U0001F450 \U0001F932 \U0001F91D \U0001F64F \U0001FAF7 \U0001FAF8 \u270D \U0001F485 \U0001F933 \U0001F4AA \U0001F9BE \U0001F9BF \U0001F9B5 \U0001F9B6 \U0001F442 \U0001F9BB \U0001F443 \U0001F9E0 \U0001FAC0 \U0001FAC1 \U0001F9B7 \U0001F9B4 \U0001F440 \U0001F441 \U0001F445 \U0001F444 \U0001FAE6 \U0001F476 \U0001F9D2 \U0001F466 \U0001F467 \U0001F9D1\U0001F471 \U0001F468\U0001F9D4 \U0001F469 \U0001F9D3 \U0001F474 \U0001F475 \U0001F64D \U0001F64E \U0001F645 \U0001F646 \U0001F481 \U0001F64B \U0001F9CF \U0001F647 \U0001F926 \U0001F937 \U0001F46E \U0001F575 \U0001F482 \U0001F977 \U0001F477 \U0001FAC5 \U0001F934 \U0001F478 \U0001F473 \U0001F472 \U0001F9D5 \U0001F935 \U0001F470 \U0001F930 \U0001FAC3 \U0001FAC4 \U0001F931 \U0001F47C \U0001F385 \U0001F936 \U0001F9B8 \U0001F9B9 \U0001F9D9-\U0001F9DF \U0001F9CC \U0001F486 \U0001F487 \U0001F6B6 \U0001F9CD \U0001F9CE \U0001F3C3 \U0001F483 \U0001F57A \U0001F574 \U0001F46F \U0001F9D6 \U0001F9D7 \U0001F93A \U0001F3C7 \u26F7 \U0001F3C2 \U0001F3CC \U0001F3C4 \U0001F6A3 \U0001F3CA \u26F9 \U0001F3CB \U0001F6B4 \U0001F6B5 \U0001F938 \U0001F93C-\U0001F93E \U0001F939 \U0001F9D8 \U0001F6C0 \U0001F6CC \U0001F46D \U0001F46B \U0001F46C \U0001F48F \U0001F491 \U0001F46A \U0001F5E3 \U0001F464 \U0001F465 \U0001FAC2 \U0001F463 \U0001F9B0 \U0001F9B1 \U0001F9B3 \U0001F9B2 \U0001F435 \U0001F412 \U0001F98D \U0001F9A7 \U0001F436 \U0001F415 \U0001F9AE \U0001F429 \U0001F43A \U0001F98A \U0001F99D \U0001F431 \U0001F408 \U0001F981 \U0001F42F \U0001F405 \U0001F406 \U0001F434 \U0001FACE \U0001FACF \U0001F40E \U0001F984 \U0001F993 \U0001F98C \U0001F9AC \U0001F42E \U0001F402-\U0001F404 \U0001F437 \U0001F416 \U0001F417 \U0001F43D \U0001F40F \U0001F411 \U0001F410 \U0001F42A \U0001F42B \U0001F999 \U0001F992 \U0001F418 \U0001F9A3 \U0001F98F \U0001F99B \U0001F42D \U0001F401 \U0001F400 \U0001F439 \U0001F430 \U0001F407 \U0001F43F \U0001F9AB \U0001F994 \U0001F987 \U0001F43B \U0001F428 \U0001F43C \U0001F9A5 
\U0001F9A6 \U0001F9A8 \U0001F998 \U0001F9A1 \U0001F43E \U0001F983 \U0001F414 \U0001F413 \U0001F423-\U0001F427 \U0001F54A \U0001F985 \U0001F986 \U0001F9A2 \U0001F989 \U0001F9A4 \U0001FAB6 \U0001F9A9 \U0001F99A \U0001F99C \U0001FABD \U0001FABF \U0001F438 \U0001F40A \U0001F422 \U0001F98E \U0001F40D \U0001F432 \U0001F409 \U0001F995 \U0001F996 \U0001F433 \U0001F40B \U0001F42C \U0001F9AD \U0001F41F-\U0001F421 \U0001F988 \U0001F419 \U0001F41A \U0001FAB8 \U0001FABC \U0001F40C \U0001F98B \U0001F41B-\U0001F41D \U0001FAB2 \U0001F41E \U0001F997 \U0001FAB3 \U0001F577 \U0001F578 \U0001F982 \U0001F99F \U0001FAB0 \U0001FAB1 \U0001F9A0 \U0001F490 \U0001F338 \U0001F4AE \U0001FAB7 \U0001F3F5 \U0001F339 \U0001F940 \U0001F33A-\U0001F33C \U0001F337 \U0001FABB \U0001F331 \U0001FAB4 \U0001F332-\U0001F335 \U0001F33E \U0001F33F \u2618 \U0001F340-\U0001F343 \U0001FAB9 \U0001FABA \U0001F347-\U0001F34D \U0001F96D \U0001F34E-\U0001F353 \U0001FAD0 \U0001F95D \U0001F345 \U0001FAD2 \U0001F965 \U0001F951 \U0001F346 \U0001F954 \U0001F955 \U0001F33D \U0001F336 \U0001FAD1 \U0001F952 \U0001F96C \U0001F966 \U0001F9C4 \U0001F9C5 \U0001F344 \U0001F95C \U0001FAD8 \U0001F330 \U0001FADA \U0001FADB \U0001F35E \U0001F950 \U0001F956 \U0001FAD3 \U0001F968 \U0001F96F \U0001F95E \U0001F9C7 \U0001F9C0 \U0001F356 \U0001F357 \U0001F969 \U0001F953 \U0001F354 \U0001F35F \U0001F355 \U0001F32D \U0001F96A \U0001F32E \U0001F32F \U0001FAD4 \U0001F959 \U0001F9C6 \U0001F95A \U0001F373 \U0001F958 \U0001F372 \U0001FAD5 \U0001F963 \U0001F957 \U0001F37F \U0001F9C8 \U0001F9C2 \U0001F96B \U0001F371 \U0001F358-\U0001F35D \U0001F360 \U0001F362-\U0001F365 \U0001F96E \U0001F361 \U0001F95F-\U0001F961 \U0001F980 \U0001F99E \U0001F990 \U0001F991 \U0001F9AA \U0001F366-\U0001F36A \U0001F382 \U0001F370 \U0001F9C1 \U0001F967 \U0001F36B-\U0001F36F \U0001F37C \U0001F95B \u2615 \U0001FAD6 \U0001F375 \U0001F376 \U0001F37E \U0001F377-\U0001F37B \U0001F942 \U0001F943 \U0001FAD7 \U0001F964 \U0001F9CB \U0001F9C3 \U0001F9C9 \U0001F9CA \U0001F962 
\U0001F37D \U0001F374 \U0001F944 \U0001F52A \U0001FAD9 \U0001F3FA \U0001F30D-\U0001F310 \U0001F5FA \U0001F5FE \U0001F9ED \U0001F3D4 \u26F0 \U0001F30B \U0001F5FB \U0001F3D5 \U0001F3D6 \U0001F3DC-\U0001F3DF \U0001F3DB \U0001F3D7 \U0001F9F1 \U0001FAA8 \U0001FAB5 \U0001F6D6 \U0001F3D8 \U0001F3DA \U0001F3E0-\U0001F3E6 \U0001F3E8-\U0001F3ED \U0001F3EF \U0001F3F0 \U0001F492 \U0001F5FC \U0001F5FD \u26EA \U0001F54C \U0001F6D5 \U0001F54D \u26E9 \U0001F54B \u26F2 \u26FA \U0001F301 \U0001F303 \U0001F3D9 \U0001F304-\U0001F307 \U0001F309 \u2668 \U0001F3A0 \U0001F6DD \U0001F3A1 \U0001F3A2 \U0001F488 \U0001F3AA \U0001F682-\U0001F68A \U0001F69D \U0001F69E \U0001F68B-\U0001F68E \U0001F690-\U0001F699 \U0001F6FB \U0001F69A-\U0001F69C \U0001F3CE \U0001F3CD \U0001F6F5 \U0001F9BD \U0001F9BC \U0001F6FA \U0001F6B2 \U0001F6F4 \U0001F6F9 \U0001F6FC \U0001F68F \U0001F6E3 \U0001F6E4 \U0001F6E2 \u26FD \U0001F6DE \U0001F6A8 \U0001F6A5 \U0001F6A6 \U0001F6D1 \U0001F6A7 \u2693 \U0001F6DF \u26F5 \U0001F6F6 \U0001F6A4 \U0001F6F3 \u26F4 \U0001F6E5 \U0001F6A2 \u2708 \U0001F6E9 \U0001F6EB \U0001F6EC \U0001FA82 \U0001F4BA \U0001F681 \U0001F69F-\U0001F6A1 \U0001F6F0 \U0001F680 \U0001F6F8 \U0001F6CE \U0001F9F3 \u231B \u23F3 \u231A \u23F0-\u23F2 \U0001F570 \U0001F55B \U0001F567 \U0001F550 \U0001F55C \U0001F551 \U0001F55D \U0001F552 \U0001F55E \U0001F553 \U0001F55F \U0001F554 \U0001F560 \U0001F555 \U0001F561 \U0001F556 \U0001F562 \U0001F557 \U0001F563 \U0001F558 \U0001F564 \U0001F559 \U0001F565 \U0001F55A \U0001F566 \U0001F311-\U0001F31C \U0001F321 \u2600 \U0001F31D \U0001F31E \U0001FA90 \u2B50 \U0001F31F \U0001F320 \U0001F30C \u2601 \u26C5 \u26C8 \U0001F324-\U0001F32C \U0001F300 \U0001F308 \U0001F302 \u2602 \u2614 \u26F1 \u26A1 \u2744 \u2603 \u26C4 \u2604 \U0001F525 \U0001F4A7 \U0001F30A \U0001F383 \U0001F384 \U0001F386 \U0001F387 \U0001F9E8 \u2728 \U0001F388-\U0001F38B \U0001F38D-\U0001F391 \U0001F9E7 \U0001F380 \U0001F381 \U0001F397 \U0001F39F \U0001F3AB \U0001F396 \U0001F3C6 \U0001F3C5 
\U0001F947-\U0001F949 \u26BD \u26BE \U0001F94E \U0001F3C0 \U0001F3D0 \U0001F3C8 \U0001F3C9 \U0001F3BE \U0001F94F \U0001F3B3 \U0001F3CF \U0001F3D1 \U0001F3D2 \U0001F94D \U0001F3D3 \U0001F3F8 \U0001F94A \U0001F94B \U0001F945 \u26F3 \u26F8 \U0001F3A3 \U0001F93F \U0001F3BD \U0001F3BF \U0001F6F7 \U0001F94C \U0001F3AF \U0001FA80 \U0001FA81 \U0001F3B1 \U0001F52E \U0001FA84 \U0001F9FF \U0001FAAC \U0001F3AE \U0001F579 \U0001F3B0 \U0001F3B2 \U0001F9E9 \U0001F9F8 \U0001FA85 \U0001FAA9 \U0001FA86 \u2660 \u2665 \u2666 \u2663 \u265F \U0001F0CF \U0001F004 \U0001F3B4 \U0001F3AD \U0001F5BC \U0001F3A8 \U0001F9F5 \U0001FAA1 \U0001F9F6 \U0001FAA2 \U0001F453 \U0001F576 \U0001F97D \U0001F97C \U0001F9BA \U0001F454-\U0001F456 \U0001F9E3-\U0001F9E6 \U0001F457 \U0001F458 \U0001F97B \U0001FA71-\U0001FA73 \U0001F459 \U0001F45A \U0001FAAD \U0001FAAE \U0001F45B-\U0001F45D \U0001F6CD \U0001F392 \U0001FA74 \U0001F45E \U0001F45F \U0001F97E \U0001F97F \U0001F460 \U0001F461 \U0001FA70 \U0001F462 \U0001F451 \U0001F452 \U0001F3A9 \U0001F393 \U0001F9E2 \U0001FA96 \u26D1 \U0001F4FF \U0001F484 \U0001F48D \U0001F48E \U0001F507-\U0001F50A \U0001F4E2 \U0001F4E3 \U0001F4EF \U0001F514 \U0001F515 \U0001F3BC \U0001F3B5 \U0001F3B6 \U0001F399-\U0001F39B \U0001F3A4 \U0001F3A7 \U0001F4FB \U0001F3B7 \U0001FA97 \U0001F3B8-\U0001F3BB \U0001FA95 \U0001F941 \U0001FA98 \U0001FA87 \U0001FA88 \U0001F4F1 \U0001F4F2 \u260E \U0001F4DE-\U0001F4E0 \U0001F50B \U0001FAAB \U0001F50C \U0001F4BB \U0001F5A5 \U0001F5A8 \u2328 \U0001F5B1 \U0001F5B2 \U0001F4BD-\U0001F4C0 \U0001F9EE \U0001F3A5 \U0001F39E \U0001F4FD \U0001F3AC \U0001F4FA \U0001F4F7-\U0001F4F9 \U0001F4FC \U0001F50D \U0001F50E \U0001F56F \U0001F4A1 \U0001F526 \U0001F3EE \U0001FA94 \U0001F4D4-\U0001F4DA \U0001F4D3 \U0001F4D2 \U0001F4C3 \U0001F4DC \U0001F4C4 \U0001F4F0 \U0001F5DE \U0001F4D1 \U0001F516 \U0001F3F7 \U0001F4B0 \U0001FA99 \U0001F4B4-\U0001F4B8 \U0001F4B3 \U0001F9FE \U0001F4B9 \u2709 \U0001F4E7-\U0001F4E9 \U0001F4E4-\U0001F4E6 \U0001F4EB \U0001F4EA 
\U0001F4EC-\U0001F4EE \U0001F5F3 \u270F \u2712 \U0001F58B \U0001F58A \U0001F58C \U0001F58D \U0001F4DD \U0001F4BC \U0001F4C1 \U0001F4C2 \U0001F5C2 \U0001F4C5 \U0001F4C6 \U0001F5D2 \U0001F5D3 \U0001F4C7-\U0001F4CE \U0001F587 \U0001F4CF \U0001F4D0 \u2702 \U0001F5C3 \U0001F5C4 \U0001F5D1 \U0001F512 \U0001F513 \U0001F50F-\U0001F511 \U0001F5DD \U0001F528 \U0001FA93 \u26CF \u2692 \U0001F6E0 \U0001F5E1 \u2694 \U0001F52B \U0001FA83 \U0001F3F9 \U0001F6E1 \U0001FA9A \U0001F527 \U0001FA9B \U0001F529 \u2699 \U0001F5DC \u2696 \U0001F9AF \U0001F517 \u26D3 \U0001FA9D \U0001F9F0 \U0001F9F2 \U0001FA9C \u2697 \U0001F9EA-\U0001F9EC \U0001F52C \U0001F52D \U0001F4E1 \U0001F489 \U0001FA78 \U0001F48A \U0001FA79 \U0001FA7C \U0001FA7A \U0001FA7B \U0001F6AA \U0001F6D7 \U0001FA9E \U0001FA9F \U0001F6CF \U0001F6CB \U0001FA91 \U0001F6BD \U0001FAA0 \U0001F6BF \U0001F6C1 \U0001FAA4 \U0001FA92 \U0001F9F4 \U0001F9F7 \U0001F9F9-\U0001F9FB \U0001FAA3 \U0001F9FC \U0001FAE7 \U0001FAA5 \U0001F9FD \U0001F9EF \U0001F6D2 \U0001F6AC \u26B0 \U0001FAA6 \u26B1 \U0001F5FF \U0001FAA7 \U0001FAAA \U0001F3E7 \U0001F6AE \U0001F6B0 \u267F \U0001F6B9-\U0001F6BC \U0001F6BE \U0001F6C2-\U0001F6C5 \u26A0 \U0001F6B8 \u26D4 \U0001F6AB \U0001F6B3 \U0001F6AD \U0001F6AF \U0001F6B1 \U0001F6B7 \U0001F4F5 \U0001F51E \u2622 \u2623 \u2B06 \u2197 \u27A1 \u2198 \u2B07 \u2199 \u2B05 \u2196 \u2195 \u2194 \u21A9 \u21AA \u2934 \u2935 \U0001F503 \U0001F504 \U0001F519-\U0001F51D \U0001F6D0 \u269B \U0001F549 \u2721 \u2638 \u262F \u271D \u2626 \u262A \u262E \U0001F54E \U0001F52F \U0001FAAF \u2648-\u2653 \u26CE \U0001F500-\U0001F502 \u25B6 \u23E9 \u23ED \u23EF \u25C0 \u23EA \u23EE \U0001F53C \u23EB \U0001F53D \u23EC \u23F8-\u23FA \u23CF \U0001F3A6 \U0001F505 \U0001F506 \U0001F4F6 \U0001F4F3 \U0001F4F4 \U0001F6DC \u2640 \u2642 \u26A7 \u2716 \u2795-\u2797 \U0001F7F0 \u267E \u203C \u2049 \u2753-\u2755 \u2757 \u3030 \U0001F4B1 \U0001F4B2 \u2695 \u267B \u269C \U0001F531 \U0001F4DB \U0001F530 \u2B55 \u2705 \u2611 \u2714 \u274C \u274E \u27B0 \u27BF 
\u303D \u2733 \u2734 \u2747 \u00A9 \u00AE \u2122 \U0001F51F-\U0001F524 \U0001F170 \U0001F18E \U0001F171 \U0001F191-\U0001F193 \u2139 \U0001F194 \u24C2 \U0001F195 \U0001F196 \U0001F17E \U0001F197 \U0001F17F \U0001F198-\U0001F19A \U0001F201 \U0001F202 \U0001F237 \U0001F236 \U0001F22F \U0001F250 \U0001F239 \U0001F21A \U0001F232 \U0001F251 \U0001F238 \U0001F234 \U0001F233 \u3297 \u3299 \U0001F23A \U0001F235 \U0001F534 \U0001F7E0-\U0001F7E2 \U0001F535 \U0001F7E3 \U0001F7E4 \u26AB \u26AA \U0001F7E5 \U0001F7E7-\U0001F7E9 \U0001F7E6 \U0001F7EA \U0001F7EB \u2B1B \u2B1C \u25FC \u25FB \u25FE \u25FD \u25AA \u25AB \U0001F536-\U0001F53B \U0001F4A0 \U0001F518 \U0001F533 \U0001F532 \U0001F3C1 \U0001F6A9 \U0001F38C \U0001F3F4 \U0001F3F3 \U0001F1E6-\U0001F1FF 0-9]
You can choose to exclude the basic Latin characters [#*0-9] in your program.
If you only deal with English characters and emoji, I think it is doable. First convert your string to UTF-16 code units; then any code unit whose value is 0xD800 or greater (for emoji it is actually >= 0xD83C) should belong to an emoji.
This is because "The Unicode standard permanently reserves the code point values between 0xD800 to 0xDFFF for UTF-16 encoding of the high and low surrogates", and of course English characters (and many other characters) won't fall in this range.
But because emoji code points start from U+1F300, their UTF-16 values actually fall in this range.
Check here for a quick reference of emoji UTF-16 values if you'd rather not work them out yourself.
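The surrogate check can be illustrated as follows. This is a sketch with hypothetical helper names; it also flags any other character above U+FFFF (for example rare CJK ideographs), so it only works under the "English plus emoji" assumption made in this answer.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def contains_surrogate_pair(text):
    """True if any UTF-16 code unit of text lies in the surrogate range 0xD800..0xDFFF."""
    return any(0xD800 <= unit <= 0xDFFF for unit in utf16_code_units(text))


# 'A' stays a single unit; U+1F300 becomes a high and a low surrogate.
print([hex(unit) for unit in utf16_code_units("A\U0001F300")])
print(contains_surrogate_pair("plain English"))
print(contains_surrogate_pair("storm \U0001F300"))
```

For U+1F300 the two code units are 0xD83C and 0xDF00, so the high surrogate of the first emoji in that block is 0xD83C.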
I am trying to automatically set the language in my TYPO3 6.2 one-tree page.
In my setup, I use RealURL to add the language to the URL, with the default L parameter. I DON'T use ISO codes for the languages, but I use static_info_tables to set the ISO code. For the language switch I am trying to use the extension rlmp_language_detection, but it does not work.
My language config (TYPO3 name, official ISO code selected with static_info_tables, ID used for the L parameter):
default, -, 0
en-jp, en, 1
en-us, en, 2
jp-jp, ja, 3
My TypoScript for the plugin:
plugin.tx_rlmplanguagedetection_pi1 {
useOneTreeMethod = 1
defaultLang = en
}
My TypoScript for languages:
config {
sys_language_uid = 0
language = en
locale_all = en-eu
}
[globalVar = GP:L = 1]
config {
sys_language_uid = 1
language = en
locale_all = en-us
}
[global]
[globalVar = GP:L = 2]
config {
sys_language_uid = 2
language = en
locale_all = en-jp
}
[global]
[globalVar = GP:L = 3]
config {
sys_language_uid = 3
language = jp
locale_all = jp-jp
}
[global]
To test it I set my first language to Japanese, and when I request the root page this is in my request header:
Accept-Language:ja,de-DE;q=0.8,de;q=0.6,en-US;q=0.4,en;q=0.2
http://mybrowserinfo.com/ says:
Language:Japanese
System Language:Not detectable with this browser
User Language:de
But no L parameter is set at all, so I get the default language.
I've had the same problem.
How I made it work, step by step:
Step 1:
Install static_info_tables. Be careful: the DB must use the utf-8 charset; with latin1 (in my case) it doesn't work.
Step 2:
Install rlmp_language_detection.
Step 3:
Check whether the php-geoip PHP module is installed on your server; if not, install the extension ml_geoip or install the module some other way.
Step 4:
Include the static templates in your TS template.
Step 5:
Please don't forget to choose the ISO code of your language at the root of the page tree.
Step 6:
TS settings: add this after all the language configuration.
plugin.tx_rlmplanguagedetection_pi1 {
  # this means that you have just one page tree for all languages; for multiple trees see the manual
  useOneTreeMethod = 1
  # important: this is your website's default sys_language
  defaultLang = ru
  # use -1 while you want to test the redirect, then change it to 0
  cookieLifetime = -1
  # defines which methods are used for the redirect (browser or ip); it is better to test with just one
  testOrder = browser,ip
  # aliases can be configured like "code = lang1, lang2"
  languageAliases >
  languageAliases {
    ua = uk,en
    en = en
    ru = ru,en
  }
  # country code dependencies can be configured like "country code = lang"
  countryCodeToLanguageCode >
  countryCodeToLanguageCode {
    ua = uk
    us = en
    gb = en
    nz = en
    au = en
    ie = en
    ca = en
    by = ru
  }
  # this just limits the input params array
  limitToLanguages = ru,uk,en
}
# finally, include the extension in the page; at the beginning of your TS you should have "page = PAGE", just check
page.1007 = < plugin.tx_rlmplanguagedetection_pi1
I'd suggest using .htaccess to redirect based on the browser's Accept-Language header. This saves you from loading the whole TYPO3 instance just to do a redirect.
Depending on whether you use RealURL, it would look something like this:
RewriteCond %{HTTP:Accept-Language} ^en-us [NC]
RewriteRule ^$ /en-us/ [L,R=307]
RewriteCond %{HTTP:Accept-Language} ^ja [NC]
RewriteRule ^$ /jp/ [L,R=307]
RewriteCond %{HTTP:Accept-Language} ^en-gb [NC]
RewriteRule ^$ /en-eu/ [L,R=307]
I don't know how to target en-jp though.
On our websites we use 307 as the status code, so the browser will always request the initial page again (in case the structure changes); it also doesn't affect SEO.
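The rewrite rules above only look at the start of the header. For comparison, here is a sketch of picking a language from the whole Accept-Language header by q-value. It is written in Python purely for illustration and is not how rlmp_language_detection works internally; the mapping from language tags to L values is a hypothetical one loosely based on the table in the question.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best match first."""
    preferences = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without ;q= counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        preferences.append((tag.strip().lower(), quality, position))
    # Highest quality first; ties keep the order in which the browser sent them.
    preferences.sort(key=lambda pref: (-pref[1], pref[2]))
    return [tag for tag, quality, _ in preferences if quality > 0]


def pick_language_uid(header, language_uids, default=0):
    """Map the best supported tag (or its primary subtag) to an L value."""
    for tag in parse_accept_language(header):
        if tag in language_uids:
            return language_uids[tag]
        primary = tag.split("-")[0]
        if primary in language_uids:
            return language_uids[primary]
    return default


# Hypothetical mapping: ISO code -> value of the L parameter.
LANGUAGE_UIDS = {"ja": 3, "en": 2}

header = "ja,de-DE;q=0.8,de;q=0.6,en-US;q=0.4,en;q=0.2"
print(parse_accept_language(header))
print(pick_language_uid(header, LANGUAGE_UIDS))
```

With the header from the question, Japanese has the highest preference, so the sketch returns the L value mapped to `ja`.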
I am trying to separate English and Japanese characters. I need to find the Unicode range of all Japanese characters. What is the Unicode range of all Japanese characters?
As zawhtut mentioned, this page has a reference for several Unicode ranges. To summarize the ranges:
Japanese-style punctuation (3000 - 303f)
Hiragana (3040 - 309f)
Katakana (30a0 - 30ff)
Full-width roman characters and half-width katakana (ff00 - ffef)
CJK unified ideographs - common and uncommon kanji (4e00 - 9faf)
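The ranges summarized above can be applied directly to separate Japanese from other characters. This is a minimal sketch using exactly those five ranges; the names `JAPANESE_RANGES`, `classify` and `split_japanese` are invented for the example, and the kanji range also matches ideographs used in Chinese.

```python
# (label, first code point, last code point), taken from the summary above.
JAPANESE_RANGES = [
    ("Japanese-style punctuation", 0x3000, 0x303F),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("Full-width roman characters and half-width katakana", 0xFF00, 0xFFEF),
    ("CJK unified ideographs", 0x4E00, 0x9FAF),
]


def classify(ch):
    """Return the label of the range containing ch, or None if it is in none of them."""
    code = ord(ch)
    for label, low, high in JAPANESE_RANGES:
        if low <= code <= high:
            return label
    return None


def split_japanese(text):
    """Split text into (characters inside the ranges, all other characters)."""
    japanese = "".join(ch for ch in text if classify(ch) is not None)
    other = "".join(ch for ch in text if classify(ch) is None)
    return japanese, other


print(classify("あ"))
print(split_japanese("abc漢字かな"))
```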
Although this question already has an answer, this blog post is probably more complete.
Please visit the site and get their metrics up, but for posterity here's a copy-paste.
Hiragana
Unicode code points regex: [\x3041-\x3096]
Unicode block property regex: \p{Hiragana}
ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く ぐ け げ こ ご さ ざ し じ す ず せ ぜ そ ぞ た だ ち ぢ っ
つ づ て で と ど な に ぬ ね の は ば ぱ ひ び ぴ ふ ぶ ぷ へ べ ぺ ほ ぼ ぽ ま み む め も ゃ や ゅ ゆ
ょ よ ら り る れ ろ ゎ わ ゐ ゑ を ん ゔ ゕ ゖ ゙ ゚ ゛ ゜ ゝ ゞ ゟ
Katakana (Full Width)
Unicode code points regex: [\x30A0-\x30FF]
Unicode block property regex: \p{Katakana}
゠ ァ ア ィ イ ゥ ウ ェ エ ォ オ カ ガ キ ギ ク グ ケ ゲ コ ゴ サ ザ シ ジ ス ズ セ ゼ ソ ゾ タ ダ チ ヂ
ッ ツ ヅ テ デ ト ド ナ ニ ヌ ネ ノ ハ バ パ ヒ ビ ピ フ ブ プ ヘ ベ ペ ホ ボ ポ マ ミ ム メ モ ャ ヤ ュ
ユ ョ ヨ ラ リ ル レ ロ ヮ ワ ヰ ヱ ヲ ン ヴ ヵ ヶ ヷ ヸ ヹ ヺ ・ ー ヽ ヾ ヿ
Kanji
Unicode code points regex: [\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]
Unicode block property regex: \p{Han}
漢字 日本語 文字 言語 言葉 etc. Too many characters to list.
This regular expression will match all the kanji, including those used
in Chinese.
Kanji Radicals
Unicode code points regex: [\x2E80-\x2FD5]
⺀ ⺁ ⺂ ⺃ ⺄ ⺅ ⺆ ⺇ ⺈ ⺉ ⺊ ⺋ ⺌ ⺍ ⺎ ⺏ ⺐ ⺑ ⺒ ⺓ ⺔ ⺕ ⺖ ⺗ ⺘ ⺙ ⺛ ⺜ ⺝ ⺞ ⺟ ⺠ ⺡ ⺢
⺣ ⺤ ⺥ ⺦ ⺧ ⺨ ⺩ ⺪ ⺫ ⺬ ⺭ ⺮ ⺯ ⺰ ⺱ ⺲ ⺳ ⺴ ⺵ ⺶ ⺷ ⺸ ⺹ ⺺ ⺻ ⺼ ⺽ ⺾ ⺿ ⻀ ⻁ ⻂ ⻃ ⻄ ⻅
⻆ ⻇ ⻈ ⻉ ⻊ ⻋ ⻌ ⻍ ⻎ ⻏ ⻐ ⻑ ⻒ ⻓ ⻔ ⻕ ⻖ ⻗ ⻘ ⻙ ⻚ ⻛ ⻜ ⻝ ⻞ ⻟ ⻠ ⻡ ⻢ ⻣ ⻤ ⻥ ⻦ ⻧ ⻨
⻩ ⻪ ⻫ ⻬ ⻭ ⻮ ⻯ ⻰ ⻱ ⻲ ⻳ ⼀ ⼁ ⼂ ⼃ ⼄ ⼅ ⼆ ⼇ ⼈ ⼉ ⼊ ⼋ ⼌ ⼍ ⼎ ⼏ ⼐ ⼑ ⼒ ⼓ ⼔ ⼕ ⼖ ⼗
⼘ ⼙ ⼚ ⼛ ⼜ ⼝ ⼞ ⼟ ⼠ ⼡ ⼢ ⼣ ⼤ ⼥ ⼦ ⼧ ⼨ ⼩ ⼪ ⼫ ⼬ ⼭ ⼮ ⼯ ⼰ ⼱ ⼲ ⼳ ⼴ ⼵ ⼶ ⼷ ⼸ ⼹ ⼺
⼻ ⼼ ⼽ ⼾ ⼿ ⽀ ⽁ ⽂ ⽃ ⽄ ⽅ ⽆ ⽇ ⽈ ⽉ ⽊ ⽋ ⽌ ⽍ ⽎ ⽏ ⽐ ⽑ ⽒ ⽓ ⽔ ⽕ ⽖ ⽗ ⽘ ⽙ ⽚ ⽛ ⽜ ⽝
⽞ ⽟ ⽠ ⽡ ⽢ ⽣ ⽤ ⽥ ⽦ ⽧ ⽨ ⽩ ⽪ ⽫ ⽬ ⽭ ⽮ ⽯ ⽰ ⽱ ⽲ ⽳ ⽴ ⽵ ⽶ ⽷ ⽸ ⽹ ⽺ ⽻ ⽼ ⽽ ⽾ ⽿ ⾀
⾁ ⾂ ⾃ ⾄ ⾅ ⾆ ⾇ ⾈ ⾉ ⾊ ⾋ ⾌ ⾍ ⾎ ⾏ ⾐ ⾑ ⾒ ⾓ ⾔ ⾕ ⾖ ⾗ ⾘ ⾙ ⾚ ⾛ ⾜ ⾝ ⾞ ⾟ ⾠ ⾡ ⾢ ⾣
⾤ ⾥ ⾦ ⾧ ⾨ ⾩ ⾪ ⾫ ⾬ ⾭ ⾮ ⾯ ⾰ ⾱ ⾲ ⾳ ⾴ ⾵ ⾶ ⾷ ⾸ ⾹ ⾺ ⾻ ⾼ ⾽ ⾾ ⾿ ⿀ ⿁ ⿂ ⿃ ⿄ ⿅ ⿆
⿇ ⿈ ⿉ ⿊ ⿋ ⿌ ⿍ ⿎ ⿏ ⿐ ⿑ ⿒ ⿓ ⿔ ⿕
Katakana and Punctuation (Half Width)
Unicode code points regex: [\xFF5F-\xFF9F]
⦅ ⦆ 。 「 」 、 ・ ヲ ァ ィ ゥ ェ ォ ャ ュ ョ ッ ー ア イ ウ エ オ カ キ ク ケ コ サ シ ス セ ソ タ チ
ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ヤ ユ ヨ ラ リ ル レ ロ ワ ン ゙
Japanese Symbols and Punctuation
Unicode code points regex: [\x3000-\x303F]
、 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 『 』 【 】 〒 〓 〔 〕 〖 〗 〘 〙 〚 〛 〜 〝 〞 〟 〠 〡 〢 〣
〤 〥 〦 〧 〨 〩 〪 〫 〬 〭 〮 〯 〰 〱 〲 〳 〴 〵 〶 〷 〸 〹 〺 〻 〼 〽 〾 〿
Miscellaneous Japanese Symbols and Characters
Unicode code points regex: [\x31F0-\x31FF\x3220-\x3243\x3280-\x337F]
ㇰ ㇱ ㇲ ㇳ ㇴ ㇵ ㇶ ㇷ ㇸ ㇹ ㇺ ㇻ ㇼ ㇽ ㇾ ㇿ ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ ㈪ ㈫ ㈬ ㈭ ㈮ ㈯ ㈰ ㈱ ㈲
㈳ ㈴ ㈵ ㈶ ㈷ ㈸ ㈹ ㈺ ㈻ ㈼ ㈽ ㈾ ㈿ ㉀ ㉁ ㉂ ㉃ ㊀ ㊁ ㊂ ㊃ ㊄ ㊅ ㊆ ㊇ ㊈ ㊉ ㊊ ㊋ ㊌ ㊍ ㊎ ㊏ ㊐ ㊑
㊒ ㊓ ㊔ ㊕ ㊖ ㊗ ㊘ ㊙ ㊚ ㊛ ㊜ ㊝ ㊞ ㊟ ㊠ ㊡ ㊢ ㊣ ㊤ ㊥ ㊦ ㊧ ㊨ ㊩ ㊪ ㊫ ㊬ ㊭ ㊮ ㊯ ㊰ ㊱ ㊲ ㊳ ㊴
㊵ ㊶ ㊷ ㊸ ㊹ ㊺ ㊻ ㊼ ㊽ ㊾ ㊿ ㋀ ㋁ ㋂ ㋃ ㋄ ㋅ ㋆ ㋇ ㋈ ㋉ ㋊ ㋋ ㋐ ㋑ ㋒ ㋓ ㋔ ㋕ ㋖ ㋗ ㋘ ㋙ ㋚ ㋛
㋜ ㋝ ㋞ ㋟ ㋠ ㋡ ㋢ ㋣ ㋤ ㋥ ㋦ ㋧ ㋨ ㋩ ㋪ ㋫ ㋬ ㋭ ㋮ ㋯ ㋰ ㋱ ㋲ ㋳ ㋴ ㋵ ㋶ ㋷ ㋸ ㋹ ㋺ ㋻ ㋼ ㋽ ㋾
㌀ ㌁ ㌂ ㌃ ㌄ ㌅ ㌆ ㌇ ㌈ ㌉ ㌊ ㌋ ㌌ ㌍ ㌎ ㌏ ㌐ ㌑ ㌒ ㌓ ㌔ ㌕ ㌖ ㌗ ㌘ ㌙ ㌚ ㌛ ㌜ ㌝ ㌞ ㌟ ㌠ ㌡ ㌢
㌣ ㌤ ㌥ ㌦ ㌧ ㌨ ㌩ ㌪ ㌫ ㌬ ㌭ ㌮ ㌯ ㌰ ㌱ ㌲ ㌳ ㌴ ㌵ ㌶ ㌷ ㌸ ㌹ ㌺ ㌻ ㌼ ㌽ ㌾ ㌿ ㍀ ㍁ ㍂ ㍃ ㍄ ㍅
㍆ ㍇ ㍈ ㍉ ㍊ ㍋ ㍌ ㍍ ㍎ ㍏ ㍐ ㍑ ㍒ ㍓ ㍔ ㍕ ㍖ ㍗ ㍘ ㍙ ㍚ ㍛ ㍜ ㍝ ㍞ ㍟ ㍠ ㍡ ㍢ ㍣ ㍤ ㍥ ㍦ ㍧ ㍨
㍩ ㍪ ㍫ ㍬ ㍭ ㍮ ㍯ ㍰ ㍱ ㍲ ㍳ ㍴ ㍵ ㍶ ㍻ ㍼ ㍽ ㍾ ㍿
Alphanumeric and Punctuation (Full Width)
Unicode code points regex: [\xFF01-\xFF5E]
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C
D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f
g h i j k l m n o p q r s t u v w x y z { | } ~
Please see this page for a reference. It contains Katakana, Hiragana and Kanji unicode ranges.
CJK(Chinese Japanese and Korean), Hiragana and Katakana(include Halfwidth Katakana)
http://www.unicode.org/charts/
What is Unicode range of all Japanese characters?
Have a look at The WiLI benchmark dataset for written language identification, especially Table II. The number in brackets is the share of the language that you capture with the Unicode code point range (in decimal).
12352 - 12543: Japanese (48.73%), English (0.00%)
19000 - 44000: Japanese (32.78%), English (0.00%)
20 - 128: English (99.74%), Japanese (11.58%)
You can see that 20 - 128 captures English really well and that all 3 blocks are important for Japanese, but still big parts are missing.
Those numbers are created with lidtk and WiLI-2018.
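To get a feel for what such a percentage means, the share of a text that falls into a decimal code point range can be computed directly. This sketch counts characters in a single string and is not the lidtk implementation; `share_in_range` is a hypothetical name.

```python
def share_in_range(text, low, high):
    """Fraction of the characters in text whose code point lies in [low, high]."""
    if not text:
        return 0.0
    inside = sum(1 for ch in text if low <= ord(ch) <= high)
    return inside / len(text)


# 12352..12543 is 0x3040..0x30FF, i.e. the Hiragana and Katakana blocks.
print(share_in_range("あいAB", 12352, 12543))
print(share_in_range("hello", 12352, 12543))
```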
We are using Sphinx with MySQL. Our MySQL database is UTF-8 and contains Chinese characters, so we need Sphinx to support CJK. Here's what we have in sphinx.conf:
charset_type = utf-8
charset_table = 0..9, U+27, U+41..U+5a->U+61..U+7a, U+61..U+7a, \
U+aa, U+b5, U+ba, \
U+c0..U+d6->U+e0..U+f6, U+d8..U+de->U+f8..U+fe, U+df..U+f6, \
U+f8..U+ff, U+100..U+12f/2, U+130->U+69, \
U+131, U+132..U+137/2, U+138, \
...
...
...
ngram_chars = U+3400..U+4DB5, U+4E00..U+9FA5, U+20000..U+2A6D6,U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, \
U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, \
U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, \
U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
ngram_len = 1
And the MySQL configuration:
character_set_client:utf8
character_set_connection:utf8
character_set_database:utf8
character_set_results:utf8
character_set_server:utf8
character_set_system:utf8
collation_connection:utf8_general_ci
collation_database:utf8_general_ci
collation_server:utf8_general_ci
init_connect:SET NAMES utf8
It manages to index weird characters such as these as Chinese: 今宵离别åŽä½•æ—¥å›å†æ¥
But real Chinese like this shows up as ??? in Sphinx: 后来
My belief is that there is an encoding problem somewhere, but I don't know where.
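The garbled tail of the first sample ("åŽ", "æ¥") looks like the UTF-8 bytes of 后 and 来 read as a single-byte Western encoding. The sketch below reproduces that effect with Latin-1; it only illustrates the symptom and does not identify which component in the MySQL/Sphinx chain performs the wrong conversion. The function names are made up.

```python
def misdecode_utf8_as_latin1(text):
    """Simulate mojibake: encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_mojibake(garbled):
    """Undo the mistake above, provided no bytes were lost along the way."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = misdecode_utf8_as_latin1("后来")
# Each Chinese character becomes three Latin-1 characters, some of them invisible controls.
print([hex(ord(ch)) for ch in garbled])
print(repair_latin1_mojibake(garbled))
```

If rows stored in the database already look like the first sample, they were probably double-encoded at write time, whereas "???" usually means a lossy conversion to a charset that cannot represent the characters; both are assumptions to verify rather than a diagnosis.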