Different Languages in Mallet

I would like to use Mallet on Wikipedia articles in English, Spanish, German, French, Russian and Hindi. It seems to run well on the first five languages, but not Hindi. In the output, the Hindi words are missing their vowel signs and conjunct consonants. Does anyone have any advice?
Also, is there a library of stop-words for other languages?
Thanks

You need to modify the token regular expression. The default regex looks for sequences of Unicode letter characters, possibly with embedded punctuation (for example don't or multi-word); these are \p{L} and \p{P} in Java regexes.
South Asian scripts often include Unicode "mark" characters, which are \p{M} in regex. Here's an example using the Hindi Wikipedia article for South Korea:
$ bin/mallet import-file --input hindi.txt --print-output
name: 1
target: Hindi
input: 대한민국(0)=1.0
大韩民国(1)=1.0
सबस(2)=3.0
नगर(3)=2.0
लगत(4)=1.0
एकम(5)=1.0
सकल(6)=2.0
रहव(7)=2.0
यवस(8)=1.0
ययन(9)=1.0
करन(10)=1.0
eps(11)=1.0
करत(12)=1.0
$ bin/mallet import-file --input hindi.txt --print-output --token-regex '[\p{L}\p{M}]+'
name: 1
target: Hindi
input: दक्षिण(0)=4.0
कोरिया(1)=7.0
कोरियाई(2)=4.0
대한민국(3)=1.0
देहान्(4)=1.0
मिन्गुक(5)=1.0
大韩民国(6)=1.0
हंजा(7)=2.0
पूर्वी(8)=1.0
एशिया(9)=2.0
में(10)=7.0
स्थित(11)=2.0
एक(12)=4.0
देश(13)=6.0
...
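To see the same effect outside Mallet, here is a small plain-Java sketch (illustrative only, not Mallet code; the word दक्षिण is taken from the output above) showing how \p{L} alone breaks a Devanagari word at its vowel signs and virama, while adding \p{M} keeps it whole:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    public static void main(String[] args) {
        String word = "दक्षिण"; // contains combining vowel signs and a virama (Unicode category Mark)
        printTokens(Pattern.compile("\\p{L}+"), word);          // letters only: the word is split apart
        printTokens(Pattern.compile("[\\p{L}\\p{M}]+"), word);  // letters plus marks: the word stays intact
    }

    static void printTokens(Pattern p, String text) {
        Matcher m = p.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            if (out.length() > 0) out.append(" | ");
            out.append(m.group());
        }
        System.out.println(p.pattern() + "  ->  " + out);
    }
}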
There's currently no stoplist for Hindi. Looking for words that occur at least once in more than 10% of documents would be a reasonable start.
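As a rough illustration of that heuristic (a sketch only; the documents argument and the 10% threshold are assumptions, not anything Mallet provides):
import java.util.*;

public class StoplistSketch {
    // Returns tokens that appear in more than `threshold` (e.g. 0.10) of the documents.
    // Each document is represented as the set of distinct tokens it contains.
    public static Set<String> buildStoplist(List<Set<String>> documents, double threshold) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : documents) {
            for (String token : doc) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        Set<String> stopwords = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if ((double) entry.getValue() / documents.size() > threshold) {
                stopwords.add(entry.getKey());
            }
        }
        return stopwords;
    }
}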

Related

What are the most difficult-to-render Unicode samples?

I'm trying to implement a cross-platform (desktop browsers, iOS, & Android) typography system that allows users to input any Unicode string.
What are some strings I should use to stress-test my system and ensure the most nines of users will have a good experience? Is there a standard or de-facto standard list that I can also use?
Here are some strings that I use in tests like that:
Vertically-stacked characters: Z̤͔ͧ̑̓ä͖̭̈̇lͮ̒ͫǧ̗͚̚o̙̔ͮ̇͐̇
Right-to-left words: اختبار النص
Mixed-direction words: من left اليمين to الى right اليسار
Mixed-direction characters: a‭b‮c‭d‮e‭f‮g
Very long characters: ﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽
Emoji with skintone variations: 👱👱🏻👱🏼👱🏽👱🏾👱🏿
Emoji with gender variations: 🧟‍♀️🧟‍♂️
Emoji created by combining codepoints: 👨‍❤️‍💋‍👨👩‍👩‍👧‍👦🏳️‍⚧️🇵🇷
Some others:
Mirrored characters in right-to-left scripts: for example, parentheses are reversed for display in Hebrew. The Unicode spec has a whole list of these mirrored characters.
Scripts with complex letter shaping: Arabic, Devanagari, etc.
There are a lot of good examples in the Big List of Naughty Strings:
https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt
I cannot include the whole file, but here are a few lines:
# Unicode Subscript/Superscript/Accents
#
# Strings which contain unicode subscripts/superscripts; can cause rendering issues
ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็
# Two-Byte Characters
#
# Strings which contain two-byte characters: can cause rendering issues or character-length issues
田中さんにあげて下さい
# Strings which contain two-byte letters: can cause issues with naïve UTF-16 capitalizers which think that 16 bits == 1 character
𐐜 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 𐐙𐐊𐐡𐐝𐐓/𐐝𐐇𐐗𐐊𐐤𐐔 𐐒𐐋𐐗 𐐒𐐌 𐐜 𐐡𐐀𐐖𐐇𐐤𐐓𐐝 𐐱𐑂 𐑄 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 𐐏𐐆𐐅𐐤𐐆𐐚𐐊𐐡𐐝𐐆𐐓𐐆
# Special Unicode Characters Union
#
# A super string recommended by VMware Inc. Globalization Team: can effectively cause rendering issues or character-length issues to validate product globalization readiness.
表ポあA鷗ŒéB逍Üߪąñ丂㐀𠀀
# Ogham Text
#
# The only unicode alphabet to use a space which isn't empty but should still act like a space.
᚛ᚄᚓᚐᚋᚒᚄ ᚑᚄᚂᚑᚏᚅ᚜
᚛                 ᚜
# iOS Vulnerabilities
#
# Strings which crashed iMessage in various versions of iOS
Powerلُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ冗
🏳0🌈️
జ్ఞ‌ా

Strategy for defining Unicode Ranges by Culture

I am new to Unicode and have been given the requirement to look at some translated text, iterate over all of the characters of that translation, and determine whether all the characters are valid for the target culture (language and location).
For example, if I am translating a document from English to Greek, I want to detect if there are any English/ASCII "A"s in the Greek translation and report that as an error. This may likely be the case from corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there any existing strategy for developing this kind of grouping? I see that there is some grouping of characters at http://www.unicode.org/charts/, but at first glance it does not seem to be quite what I am looking for.
Does anything exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"?
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: “main/standard”, “auxiliary”, “index”, and “punctuation”.
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data at CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that in normal Greek texts, Latin letters are not uncommon, especially in technical areas. For example, the international symbol for the ampere is “A” as a Latin letter; the symbol for the kilogram is “kg”, in Latin letters, even though the word for it is written in Greek letters in Greek.
Thus, no matter how you run the analysis, the occurrence of Latin “A” in Greek text could be flagged as potentially suspicious, but not an error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.
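For example, here is a minimal sketch using ICU4J (assuming it is on the classpath; the string and variable names are illustrative) that flags letters outside the CLDR "main" exemplar set for Greek:
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

public class ExemplarCheck {
    public static void main(String[] args) {
        // CLDR "main" exemplar characters for Greek
        UnicodeSet greek = LocaleData.getExemplarSet(new ULocale("el"), 0, LocaleData.ES_STANDARD);

        String translation = "Αλφάβητο with a stray Latin A";
        translation.codePoints()
                   .filter(cp -> Character.isLetter(cp) && !greek.contains(cp))
                   .forEach(cp -> System.out.printf("Suspicious character: %c (U+%04X)%n", cp, cp));
    }
}
As noted above, a hit here is only a hint that the character deserves review, not proof of an error.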

Does American/British English use non-ASCII characters?

I am a developer who is working with Chinese characters. I am trying to convert part of my project into English. I am currently rewriting the project's internationalization module.
I am unfamiliar with the conventions for English, so I don't know whether non-ASCII characters are widely used.
If so, please tell me some characters that are used frequently.
Standard English typography uses the en dash (–) and curly quotation marks (“, ”, ‘, ’); American English also uses the em dash (—). Depending on conventions and preferences, several non-ASCII letters may be used too, especially in words of French or Latin origin, such as é, ë, ç, and æ. Moreover, even in nonspecialized texts, various special characters such as the superscript two (²), micro sign (µ), and degree sign (°) may be seen.

Romanization of Unicode text

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.
Examples:
Greek:Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")
Japanese:Romanize("しんばし") returns "shimbashi" (or "sinbasi")
Russian:Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")
It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data-driven and extensible, using data from the Unicode Consortium, the USA, the EU, or the UN. The code should be open source and written in .NET or Java.
Does such a library exist?
The problem is a lot more complex than you think.
Greek, Cyrillic, Indic scripts, Georgian -> trivial, you could program that in an hour
Thai, Japanese Kana -> doable with a bit more effort
Japanese Kanji, Chinese -> these are not alphabets/syllabaries, so you're not in fact transliterating; you're looking up the pronunciation of each symbol in a hopefully large dictionary (EDICT and CCDICT should work), and a lot of times you'll get it wrong unless you're also considering the context, especially in Japanese
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database, I'm not aware of any
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of times your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
You can use UnidecodeSharp: a C# port of the Python Unidecode package, which is itself a port of the Perl unidecode module.
(there are also PHP and Ruby implementations available)
Usage:
using BinaryAnalysis.UnidecodeSharp;

// ...

string _Greek = "Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());
string _Japan = "しんばし";
MessageBox.Show(_Japan.Unidecode());
string _Russian = "яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());
I hope it works well for you.
I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose that can deal with the icky cases like Chinese words, Japanese multiple readings, and Arabic incomplete orthography.
The Unicode Common Locale Data Repository has some transliteration mappings you could use.
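A minimal sketch of those CLDR transforms via ICU4J's Transliterator (the transform ID shown is one common choice, and the expected outputs in the comments are only approximate; Han readings and Arabic remain subject to the caveats above):
import com.ibm.icu.text.Transliterator;

public class RomanizeDemo {
    public static void main(String[] args) {
        // "Any-Latin" converts whatever script it encounters into Latin;
        // append "; Latin-ASCII" if plain unaccented ASCII is wanted instead.
        Transliterator toLatin = Transliterator.getInstance("Any-Latin");
        System.out.println(toLatin.transliterate("Αλφαβητικός"));  // roughly "Alphabētikós"
        System.out.println(toLatin.transliterate("яйца Фаберже")); // roughly "jajca Faberže"
    }
}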

How do I use unicode (UTF-8) characters in Clojure regular expressions?

This is a double question for you amazingly kind Stack Overflow Wizards out there.
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters to swank-clojure, and using the command-line REPL garbles things.
It's really easy to do regular expressions on Latin text:
(re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")
But what if I had some Japanese? I thought that this would work, but I can't test it:
(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:
(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当?")
Thanks!
Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.
On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:
(re-seq #"\w+" "prøve") =>("pr" "ve") ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große") => ("gro" "e") ; German
(re-seq #"\w+" "plaît") => ("pla" "t") ; French
The \w skips the extended chars. Using [(?u)\w]+ instead makes no difference, same with the Japanese.
But see this regex reference: \p{L} matches any Unicode character in category Letter, so it actually works for Norwegian
(re-seq #"\p{L}+" "prøve")
=> ("prøve")
as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark):
(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")
There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.
Edit: More on Unicode in Java
A quick reference to other points of potential interest when working with Unicode.
Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.
This is all Java, most of this stuff does not have a Clojure wrapper (at least not yet).
java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
java.lang.String - lets you specify a charset when creating a String from an array of bytes.
java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.
Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so characters outside the Basic Multilingual Plane (including many symbols used by non-Latin scripts) need two chars, a surrogate pair, to represent one symbol.
When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
unicode.org - the Unicode standard and code charts.
I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.
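As a small plain-Java illustration of the char versus code point distinction (the example character is a Deseret capital letter, chosen only because it lies outside the Basic Multilingual Plane):
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "𐐜"; // a Deseret capital letter, outside the Basic Multilingual Plane

        System.out.println(s.length());                      // 2 -- two UTF-16 chars (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1 -- one Unicode code point

        int cp = s.codePointAt(0);
        System.out.println(Character.isLetter(cp));          // true -- category lookups want the code point
    }
}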
I'll answer half a question here:
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?
A more interactive way:
M-x customize-group
"slime-lisp"
Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.
Or place this in your .emacs:
(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))
That's what the interactive menu will do anyway.
Works on Emacs 23 and works on my machine
For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:
user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当?")
("スペース")
Hiragana, for what it's worth:
user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当?")
("の" "には" "が" "ないって")
I'd be pretty amazed if any regex could detect Japanese word breaks.
For international characters you need to use Java character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+ to match any word character. \w only covers ASCII; see the java.util.regex.Pattern documentation.
Prefix your regex with (?U) like so: (re-matches #"(?U)\w+" "ñé2_hi") => "ñé2_hi".
This sets the UNICODE_CHARACTER_CLASS flag to true so that the typical character classes do what you want with non-ASCII Unicode.
See here for more info: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
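For comparison, the same flag from plain Java (a small sketch reusing the Spanish example from the earlier answer):
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    public static void main(String[] args) {
        Pattern ascii   = Pattern.compile("\\w+");
        Pattern unicode = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println(ascii.matcher("mañana").matches());   // false -- \w stops at ñ
        System.out.println(unicode.matcher("mañana").matches()); // true  -- \w now covers Unicode word characters
    }
}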