Wubihua / Traditional Chinese / Open-source index

I have been looking for an open-source index of Chinese characters (Traditional) indexed for the Wubihua (五筆劃) input method. I have found only partial lists (up to four digits), and only for Simplified. I know there are lists out there, since all the phones in Hong Kong have Wubihua for Traditional installed... Any pointers?

Google (in this case Google T-9) to the rescue:
https://code.google.com/p/ibus-t9/issues/detail?id=3
[This page has the link to the table.txt file].
Not perfect because it lists both Traditional and Simplified characters, but very complete.

What is the difference between IBM874 and MS874?

I am trying to add Thai collation support in my driver, and to do so I need to use the appropriate character encoding. After some research, I am left with two options:
Code page 874, which is also known as CP874 and IBM874
and
Code page 1162, which is also known as windows-874, CP1162, IBM1162, MS874, x-windows-874, and x-IBM874
Both seem to belong to the ISO/IEC 8859-11 family and differ from it by only a handful of symbols (8 to 9); ISO/IEC 8859-11 itself is nearly identical to the Thai standard TIS-620.
My question is: which of the two (IBM874 or MS874) would be the better choice for supporting Thai collation?
I tried both, one after the other, and both seem to do the job. I cannot find much information about either of them on Google.
Can someone please help me understand which of the two is the more appropriate or comprehensive choice?
P.S.: I found an Oracle doc which mentions the two, and the only notable difference I see is that:
MS874 is described as "Windows Thai" and is categorized under "Extended Encoding Set" - International Version,
whereas
IBM874 is described as "IBM Thai" and falls under "Basic Encoding Set" - European Version.
The 'International Version' seems to support all encodings listed on the Oracle page, so I am guessing it is the more extensive or appropriate choice, and I am planning to go ahead with MS874. Am I missing something?
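For what it's worth, here is one way to see where these Thai code pages actually diverge, using Python's built-in codecs. Python has no separate IBM874 codec, so its iso8859_11 codec stands in for the comparison against cp874 (Python's name for windows-874); treat this as a sketch of the pattern rather than an exact IBM874-vs-MS874 diff:

# The Thai letters themselves (0xA1 and up) decode identically everywhere...
thai = bytes(range(0xA1, 0xDB))
assert thai.decode("cp874") == thai.decode("iso8859_11")

# ...the differences sit in the 0x80-0x9F range, where windows-874 adds
# typographic extras on top of the base standard:
for b in (b"\x80", b"\x85", b"\x96"):
    print(b.hex(), repr(b.decode("cp874")), repr(b.decode("iso8859_11")))
# 80 '€' '\x80'  (euro sign vs. C1 control)
# 85 '…' '\x85'  (ellipsis)
# 96 '–' '\x96'  (en dash)

Since plain Thai text never touches that 0x80-0x9F range, either choice will appear to "do the job"; and note that the actual sort order comes from your collator, not from the code page itself.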

Typeahead Bloodhound - Filter

My index contains the word 'dog'. How can I also find this entry if I type 'dogs'? I would like to match all parts of the word ('dogs', 'dog', 'do') down to a minimum length of 2 or 3 characters.
I'm not an expert on Bloodhound, but what you're talking about here is called stemming, and it seems like the kind of thing that you could do using the datumTokenizer and the queryTokenizer.
There are stemmers of varying quality for most languages, but I think the one most people are using for English these days is the Snowball stemmer. There are a number of JavaScript implementations floating around.
In general, for things to work properly you'll want to stem both the user's query and the indexed results.
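To make the idea concrete, here is a minimal sketch in Python, with a deliberately naive suffix-stripping stemmer standing in for a real Snowball stemmer; in typeahead.js the same transformation would live inside your custom datumTokenizer and queryTokenizer functions:

# Naive stemmer for illustration only -- use a real Snowball stemmer in practice.
def stem(word):
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

def tokenize(text):
    # The same stem-after-split transformation must be applied to
    # both the indexed data and the incoming query.
    return [stem(t) for t in text.lower().split()]

index = {}
for entry in ("dog", "hot dogs", "dog house"):
    for token in tokenize(entry):
        index.setdefault(token, set()).add(entry)

matches = set()
for token in tokenize("dogs"):          # 'dogs' stems to 'dog'
    matches |= index.get(token, set())
print(matches)                          # all three entries match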

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started, and I would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web:
Character Encodings -> Sets of rules that tell the OS how to store characters, e.g., ISO 8859-1, MSWIN1252, UTF-8, UCS-2, UTF-16. These rules are also called code pages/character sets, and they map individual characters to numbers. Apparently Unicode handles this a bit differently from the others: instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [ http://www.joelonsoftware.com/articles/Unicode.html ]
Fonts -> These are implementations of character encodings. They are files in different formats (TrueType, OpenType, PostScript) that contain a mapping from each character in an encoding to a number.
Glyphs -> These are the visual representations of characters, stored in the font files.
And based on the above understanding I have the below questions,
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports the encoding suffice? Is it okay to draw an analogy between an encoding and a network protocol such as TCP, since both are just sets of rules? (Which of course begs the question: how does the OS understand network protocols when I don't install them? :-p)
2) Will a font always have the complete implementation of a code page, or just part of it? Is there a tool I can use to see each character in a font (.TTF file)? [Windows Font Viewer shows what a style of the font looks like, but gives no information about the list of characters in the font file.]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking so many questions, but I have had these on my mind for some time, and I couldn't find any site that explains this simply enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know):
Developing International Software - it might be 8 years old, but it is still worth every cent you're going to spend on it. The programming examples may relate to Windows (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said it saved him about 2 years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often about i18n support on the Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up-to-date resources on that topic (that is, publicly available ones). The only current resource I know of is the Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries that help deal with the problem. The most notable examples are the Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately a lot is missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I read this, and as far as I can tell, it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently Unicode handles this a bit differently from the others: instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
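As a quick illustration, here is the 𝄞 example above reproduced with Python's built-in codecs:

ch = "\U0001D11E"                                # CCS: the code point U+1D11E
print(f"U+{ord(ch):04X}")                        # -> U+1D11E
print(ch.encode("utf-16-be").hex(" ").upper())   # CEF code units: D8 34 DD 1E
print(ch.encode("utf-16-le").hex(" ").upper())   # UTF-16LE CES bytes: 34 D8 1E DD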
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
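For example, a minimal sketch of such application-level conversion, using Python's codec machinery in the role that ICU or libiconv would play elsewhere:

# Transcode at the application level; no OS-level encoding support required.
thai_bytes = "สวัสดี".encode("cp874")    # Thai text serialized as windows-874
text = thai_bytes.decode("cp874")        # bytes -> Unicode text
utf8_bytes = text.encode("utf-8")        # re-serialize in a different encoding
print(text, utf8_bytes.hex(" "))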
2) Will a font always have the complete implementation of a code page, or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
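If you want to enumerate exactly which characters a given font does cover, the third-party fontTools library can dump the font's cmap table; a short sketch (the font path here is hypothetical):

from fontTools.ttLib import TTFont   # third-party: pip install fonttools

font = TTFont("SomeFont.ttf")        # hypothetical path to a .ttf/.otf file
cmap = font["cmap"].getBestCmap()    # {code point: glyph name}
print(len(cmap), "characters covered")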
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding, but the applications. Encodings are used to convert a stream of bytes to a sequence of characters. For example, in C#, reading UTF-8 bytes will automatically produce a UTF-16 string if you tell it to treat the bytes as a string.
No matter what encoding you use, C# simply uses UTF-16 internally, and when you want to, for example, print a string from a foreign encoding, it converts it to UTF-16 first, then looks up the corresponding characters in the character tables (fonts) and shows the glyphs.
I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine.
Bonus answer: on "how does the OS understand network protocols it doesn't know?": again, it's not the OS that handles them but the applications. As long as the OS knows where to redirect the traffic (i.e., to which application), it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to let the OS know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 - http://en.wikipedia.org/wiki/GB_2312 - which contains only simplified characters. Surely I can use this to get what I need?
I have also found this file, which maps GB2312 to Unicode - http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt - but I'm not sure whether it's accurate.
If that table isn't correct, maybe someone could point me to one that is, or maybe just to a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table, and even a Java program to generate a file with all the GB characters and their Unicode equivalents: http://www.herongyang.com/gb2312/
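Incidentally, a quick way to sanity-check any such mapping file is to regenerate it from Python's built-in gb2312 codec:

# Enumerate the full GB2312 grid and decode each cell to Unicode.
mapping = {}
for hi in range(0xA1, 0xF8):            # rows (zones 1-87)
    for lo in range(0xA1, 0xFF):        # cells within each row
        try:
            mapping[f"{hi:02X}{lo:02X}"] = bytes([hi, lo]).decode("gb2312")
        except UnicodeDecodeError:
            pass                        # unassigned cell in the grid
print(len(mapping), "GB2312 characters mapped to Unicode")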
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
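A short sketch of extracting those pairs with Python, assuming you have downloaded and unpacked Unihan_Variants.txt from the Unihan database:

# Map each traditional character to its simplified variant(s).
simplified_of = {}
with open("Unihan_Variants.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue                    # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1] == "kSimplifiedVariant":
            trad = chr(int(fields[0][2:], 16))          # "U+6A5F" -> 機
            simplified_of[trad] = [chr(int(v[2:], 16))  # may list several
                                   for v in fields[2].split()]
print(simplified_of["機"])              # -> ['机']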
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
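For instance, a sketch of that harvesting step in Python (cedict_ts.u8 is the usual name of the CC-CEDICT download, but treat the path as an assumption):

# Collect every character appearing in CC-CEDICT's simplified column.
simplified_chars = set()
with open("cedict_ts.u8", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#"):
            continue                    # skip header comments
        parts = line.split(" ", 2)      # traditional, simplified, the rest
        if len(parts) == 3:
            simplified_chars.update(parts[1])
print(len(simplified_chars), "distinct characters in the simplified column")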
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than individual characters, and also that these are UTF-8 characters, not escaped representations. It has served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it hasn't come up once in 9 years), iterate over all the chars in ['一-龥'] and build a new list. Or run two regexes: one to check that the text is Chinese, and one to check that it is not simplified Chinese, as sketched below.
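A sketch of that two-regex approach in Python; note the simplified class here is a two-character stand-in, since the real one lives in the pastebin above:

import re

is_cjk = re.compile(r"[\u4E00-\u9FFF]")   # any CJK Unified Ideograph
is_simplified = re.compile(r"[们东]")      # stand-in for the full pastebin class

def chinese_but_not_simplified(text):
    return bool(is_cjk.search(text)) and not is_simplified.search(text)

print(chinese_but_not_simplified("東京"))   # True: CJK with no simplified chars
print(chinese_but_not_simplified("东京"))   # False: 东 is a simplified form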
According to Wikipedia, the choice between simplified Chinese, traditional Chinese, kanji, and other regional forms is in many cases left up to font rendering. So while you could assemble a selection of simplified Chinese code points, such a list would not be at all complete, since many characters are no longer distinct.
I don't believe there's a table with only simplified code points. I think they're all lumped together in the CJK range of U+4E00 through U+9FFF.

URL Shortening: What's the best encoding to use?

I'm adding a feature to my project where we are generating links to internal stuff of our website, and we want these links to be as short as possible, so we'll be making our own "URL Shortener".
I'm wondering which encoding/alphabet is best for the generated short URLs.
This is largely a subjective question, I'd like to know what your opinions are regarding the best approach / trade-off.
Several options I've thought of:
- Digits, uppercase + lowercase (base 62)
- Digits, only lowercase (base 36)
- Base 32 (http://www.crockford.com/wrmg/base32.html)
- linkpot.net (using common short english words)
Of course, the last two are better for uses other than clicking, and the first two are better for Twitter.
Also, if I'm going with "clickable-only" URLs, I'd like to make the alphabet as large as possible, adding other symbols.
What symbols can I use in URLs that won't get URL encoded?
What symbols should I use? Could some of these prove problematic? I'm thinking slash and dot, for example.
What do you think?
NOTE: The main target for these URLs is Twitter. Keeping this in mind, we should probably have the largest alphabet possible, since most people will be clicking. However, I'm interested in your experience with people using short URLs in other ways (over the phone, on printed paper, etc.). How likely is it that this will happen?
NOTE 2: I'm not making "yet another URL shortener", please don't condemn me with downvotes. We are generating short URLs for internal stuff in our site, not allowing anyone to shorten any URL. Imagine Google Maps giving you short URLs when you generate a link to a specific coordinate.
I would go with base 62; it's the shortest. A shortened URL is not meant to be entered manually anyway, so don't worry about case sensitivity.
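A minimal base-62 sketch in Python, mapping a numeric record ID to a short slug and back:

import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars

def encode_base62(n):
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode_base62(s):
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_base62(1234567))       # -> '5BAN': 7 digits shrink to 4 characters
assert decode_base62("5BAN") == 1234567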
If these are "clickable-only" URLs, I'd probably go with a base-64 encoding. MIME's base-64 uses a couple of characters you shouldn't use, but there are enough unreserved safe characters in URLs that you can just swap them out. (Also, you don't need the padding that MIME's base-64 uses, since you know when your URL ends.)
Here's a page that discusses one way to do this.
You can look at RFC 2396 to figure out exactly which characters are safe in URIs, if you want to double-check.
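That character swap is exactly what the standard "URL-safe" base-64 variant does; for instance, in Python:

import base64, os

# URL-safe base 64: '+' and '/' become '-' and '_'; padding is stripped
# because the end of the URL is unambiguous.
token = base64.urlsafe_b64encode(os.urandom(6)).rstrip(b"=").decode("ascii")
print(token)    # e.g. 'XG9-q2_3': 8 URL-safe characters for 48 random bits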
I'd be curious to know a little more about the implementation. How will these URLs be "unshortened", or will the internal pages being accessed be saved as shortened URLs? In either case, even if you went with the encoding set of [A-Z] you'd be able to reference 26 * 26 * 26 = 17,576 pages with only 3 characters; how many internal web pages are you talking about?
In general, I would base the choice of encoding set on your use-case requirements. Are you planning on having these links available for "uses other than clicking"? What would those uses be, and how might they constrain the encoding? (For example, using part of the URL as a file name on a case-insensitive file system reduces the available character set.)
Here's an informative page on the character set you have available to you when writing a URL.