Building an ngram frequency table and dealing with multibyte runes - unicode

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table and then calculate the distance between a given text and each known corpus.
This allows one to effectively determine which corpus is the best match by computing the cosine similarity of the vector representations of the two ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like it to work with multibyte Unicode as well. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are two bytes long, two ngrams appear whose first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am overanalysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!

If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
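In case the playground link rots, the heart of that change looks something like this (a minimal sketch rather than the exact playground code; it keeps your Parse signature and whitespace normalisation):

package main

import (
    "fmt"
    "strings"
)

// Parse builds an ngram frequency table over runes rather than bytes,
// so multibyte UTF-8 characters are never split mid-sequence.
func Parse(text string, n int) map[string]int {
    table := make(map[string]int)
    // Normalise whitespace as in the original, then decode the whole
    // string into runes up front; we only convert back to a string
    // once an ngram is complete.
    runes := []rune(strings.Join(strings.Fields(text), " ") + " ")
    for i := 0; i+n <= len(runes); i++ {
        table[string(runes[i:i+n])]++
    }
    return table
}

func main() {
    fmt.Println(Parse("Größe ändern", 2))
}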

By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a missing key from a map returns the zero value (which for int is 0), so the unknown-key check in your code is redundant.
func Parse(text string, n int) map[string]int {
    // chars is a double-length buffer: every rune is written at k and
    // mirrored at n+k, so chars[k:k+n] is always a contiguous window
    // of the last n runes seen.
    chars := make([]rune, 2*n)
    table := make(map[string]int)
    k := 0
    for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
        chars[n+k] = chars[k]
        k = (k + 1) % n
        table[string(chars[k:k+n])]++
    }
    return table
}
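As a quick usage sketch (assuming the Parse above is in scope):

func main() {
    for gram, count := range Parse("Größe ändern", 3) {
        fmt.Printf("%q: %d\n", gram, count)
    }
}

One caveat, unless I'm misreading the buffer logic: until the first n runes have been seen, the chars[k:k+n] window still contains the zero-valued runes the buffer was initialised with, so the first n-1 recorded ngrams contain U+0000. If that matters, skip recording until n runes have arrived.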


Phonetic Algorithms for Postgresql

Please, I am working on a PoC for Person Real-time Identification, and one of the critical aspects of it is supporting both minor misspellings and phonetic variations of first, middle, and last names, like HarinGton == HarrinBton or RaphEAl == RafAEl. It's working for longer names, but it's a bit more imprecise for names like Lee and John.
I am using Double Metaphone through dmetaphone() and dmetaphone_alt() in PostgreSQL 13.3 (Supabase.io). And although I appreciate Double Metaphone, it produces a (too?) short string as its outcome. metaphone() has parameters to make the resulting phonetic representation longer; I investigated dmetaphone() and couldn't find anything other than the default function.
Is there a way of making dmetaphone() and dmetaphone_alt() return a longer phonetic representation similar to metaphone()'s, but with an ALT variation?
Any help would be much appreciated.
Thanks
Looking at the Postgres docs for these functions, you don't have parametric control over the length of the encoded string for Double Metaphone. In the case of single Metaphone, you can only truncate the output string:
max_output_length sets the maximum length of the output metaphone code; if longer, the output is truncated to this length.
However, you may get much better results by using trigram similarity or Levenshtein distance on the encoded output from either of the Metaphone methods; this can be a more powerful way to handle phonetic permutations using Metaphones.
Example
Consider all the spelling permutations possible for the artist Cyndi Lauper. Using Double Metaphone with trigram similarity, we can achieve 100% similarity between the incorrect string cindy lorper and the correct spelling:
SELECT similarity(dmetaphone('cindy lorper'), dmetaphone('cyndi lauper'));
yields: similarity real: 1 (ie: 100% similarity)
Which means the encodings are identical for both input strings using Double Metaphone. When using Metaphone, they're slightly different. All of the following yield SNTLRPR:
SELECT metaphone('cyndy lorper', 10);
SELECT metaphone('sinday lorper', 10);
SELECT metaphone('cinday laurper', 10);
whereas
SELECT metaphone('cyndi lauper', 10);
yields SNTLPR, which is only one character different from SNTLRPR.
You can also use Levenshtein Distance to calculate it, which gives you a filterable parameter to work with:
SELECT levenshtein(metaphone('sinday lorper', 10), metaphone('cyndi lauper', 10));
yields: levenshtein integer: 1
"It's working for longer names, but it's a bit more imprecise for names like Lee and John."
It's a bit difficult to see exactly what you're having trouble with without a more complete reprex.
SELECT similarity(dmetaphone('lee'), dmetaphone('leigh'));
SELECT similarity(dmetaphone('jon'), dmetaphone('john'));
both yield: similarity real: 1 (ie: 100% similarity)
Edit: here's an easy-to-follow guide to fuzzy matching with Postgres

Unicode NFC quick check and supplemental characters example in UAX#15 section 9

If I look in UAX#15 Section 9, there is sample code to check for normalization. That code uses the NFC_QC property and checks CCC ordering, as expected. It looks great, except that this line puzzles me: if (Character.isSupplementaryCodePoint(ch)) ++i;. It seems to say that if a character is supplementary (i.e. >= 0x10000), then I can just assume the next character passes the quick check, without bothering to check the NFC_QC property or CCC ordering on it.
Theoretically, I can have, say, a starter code point, followed by a supplementary code point with CCC > 0, followed by a third code point whose CCC is > 0 but lower than that of the second code point (or whose NFC_QC is No), and it will STILL pass the NFC quick check, even though that would seem not to be in NFC form. There are a bunch of supplementary code points with CCC = 7, 9, 216, 220, or 230, so there seem to be a lot of possibilities to hit this case. I guess this can work if we can assume it will always be the case, throughout future versions of Unicode, that all supplementary characters with CCC > 0 also have NFC_QC == No.
Is this sample code correct? If so, why is this supplemental check valid? Are there cases that would produce incorrect results if that supplemental check were removed?
Here is the code snippet copied directly from that link.
public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}
The sample code is correct,¹ but the part that concerns you has little to do with Unicode normalization. No characters in the string are actually skipped, it's just that Java makes iterating over a string's characters somewhat awkward.
The extra increment is a workaround for a historical wart in Java (which it happens to share with JavaScript and Windows, two other early adopters of Unicode): Java Strings are arrays of Java chars, but Java chars are not Unicode (abstract) characters or (concrete numeric) code points, they are 16-bit UTF-16 code units. This means that every character with code point C < 1 0000h takes up one position in a Java String, containing C, but every character with code point C ≥ 1 0000h takes two, as specified by UTF-16: the high or leading surrogate D800h + (C − 1 0000h) div 400h and the low or trailing surrogate DC00h + (C − 1 0000h) mod 400h (no Unicode characters are or ever will be assigned code points in the range [D800h, DFFFh], so the two cases are unambiguously distinguishable).
Because Unicode normalization operates in terms of a sequence of Unicode characters and cares little for the particulars of UTF-16, the sample code calls String.codePointAt(i) to decode the code point that occupies either position i or the two positions i and i+1 in the provided string, processes it, and uses Character.isSupplementaryCodePoint to figure out whether it should advance one or two positions. The way the loop is written treats the “supplementary” two-unit case like an unwanted stepchild, but that’s the accepted Java way of treating them.
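To make the surrogate arithmetic concrete, here's a small sketch of the same encoding rule (in Go rather than Java; U+1D11E is just an arbitrary supplementary code point picked for illustration):

package main

import "fmt"

// surrogates splits a supplementary code point (>= 0x10000) into its
// UTF-16 high (leading) and low (trailing) surrogate code units.
func surrogates(c rune) (hi, lo uint16) {
    v := uint32(c) - 0x10000
    hi = uint16(0xD800 + v>>10)   // D800h + (C - 1 0000h) div 400h
    lo = uint16(0xDC00 + v&0x3FF) // DC00h + (C - 1 0000h) mod 400h
    return hi, lo
}

func main() {
    hi, lo := surrogates(0x1D11E) // MUSICAL SYMBOL G CLEF
    fmt.Printf("U+1D11E -> %04X %04X\n", hi, lo) // prints D834 DD1E
}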
¹ Well, correct up to a small spelling error: codepointAt should be codePointAt.

How do I split a piece of Chinese text into individual characters?

I'm working on a machine learning project, where I'm building a Naive Bayes classifier over Chinese text. I want to use n-grams of Chinese characters as features, so I need to be able to split the text into unigrams (individual characters), bigrams (sequences of two characters), and so forth. (I don't care about special tokenization and such -- I just want raw characters as n-grams.)
How do I do this in Scala? I tried text.sliding(2) to get bigrams, but this doesn't quite seem to work. (I'm guessing because Chinese characters are not a single byte like they are in English?)
In general this is a question about the proper handling of Unicode in Java, and therefore in Scala as well. From my cursory glance at the internet there doesn't seem to be one true way to handle Unicode in Java. I'm not an NLP person, so my understanding of what you want to do may be incorrect.
val text = "囗土夊米"
val unigrams = text.toCharArray
/* Constraint: unigrams.length must be even; without the toString you get weird coercions */
val bigrams =
  for (i <- 0 until unigrams.length if i % 2 == 0)
    yield unigrams(i).toString + unigrams(i + 1)
Something like that should be easy to generalize into a set of n-gram functions that extract what you need; these are simple naive implementations, of course.
Try MeCab. I use MeCab to create tokens for Japanese and Chinese. Once MeCab is installed, try the Python APIs.
See this reference:
n-gram name analysis in non-English languages (CJK, etc)
See this on how to install MeCab:
http://nlp.solutions.asia/?tag=ubuntu

Count the number of words in NSString

I'm trying to implement a word count function for my app that uses UITextView.
There's a space between two words in English, so it's really easy to count the number of words in an English sentence.
The problem occurs with Chinese and Japanese word counting because usually there is no space in the entire sentence.
I checked three different iPad text editors that have a word count feature and compared them with MS Word.
For example, here's a series of Japanese characters meaning the world's idea: 世界(the world)の('s)アイデア(idea)
世界のアイデア
1) Pages for iPad and MS Word count each character as one word, so the sentence above contains 7 words.
2) iPad text editor P*** counts the entire sentence as one word --> they just use spaces to separate words.
3) iPad text editor i*** counts it as three words --> I believe they used CFStringTokenizer with kCFStringTokenizerUnitWord, because I could get the same result.
I've researched on the Internet, and Pages' and MS Word's word counting seems to be correct because each Chinese character has a meaning.
I couldn't find any class that counts words like Pages or MS Word do, and it would be very hard to implement from scratch because, besides Japanese and Chinese, the iPad supports a lot of different foreign languages.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
Is there a way to count words in an NSString like Pages and MS Word do?
Thank you
I recommend you keep using CFStringTokenizer. Because it's a platform feature, it will be upgraded along with the platform, and many people at Apple are working hard to reflect real cultural differences that are hard for regular developers to know about.
This is hard because it isn't essentially a programming problem; it's a human cultural-linguistic problem. You need a language specialist for each culture. For Japanese, you need a Japanese culture specialist. However, I don't think Japanese users seriously need a word count feature because, as I've heard, the concept of a word itself is not so important in Japanese culture. You should define the concept of a word first.
And I can't understand why you want to force the concept of a word count onto character counting. Take the Kanji example you gave: splitting it by meaning is like counting universe as two words by splitting it into uni + verse. Splitting words by meaning is sometimes completely wrong and useless, because the definition of a word itself differs between cultures. In my language, Korean, a word is just a formal unit, not a unit of meaning. The idea that each word maps to one meaning is right only in Roman-character cultures.
Just offer another feature, like character counting, for users in East Asia if you think they need it. Counting characters in a Unicode string is easy with the -[NSString length] method.
I'm a Korean speaker (so maybe outside your case :) and in many cases we count characters instead of words. In fact, I have never seen people counting words in my whole life. I laughed at the word counting feature in MS Word because I guessed nobody would use it. (However, now I know it's important in Roman-character cultures.) I have used the word counting feature only once, to check that it really works :) I believe this is similar for Chinese and Japanese. Maybe Japanese users use word counting because their basic alphabet is similar to Roman characters, which have no concept of composition; however, they also use Kanji heavily, which is a completely compositional, character-centric system.
Even if you make a word counting feature that works well on those languages (which are used by people who don't even feel any need to split sentences into smaller formal units!), it's hard to imagine anyone using it. And without a linguistic specialist, the feature won't be correct.
This is a really hard problem if your string doesn't contain tokens identifying word breaks (like spaces). One way I know of, derived from attempting to solve anagrams, is this:
At the start of the string you start with one character. Is it a word? It could be a word like "A" but it could also be a part of a word like "AN" or "ANALOG". So the decision about what is a word has to be made considering all of the string. You would consider the next characters to see if you can make another word starting with the first character following the first word you think you might have found. If you decide the word is "A" and you are left with "NALOG" then you will soon find that there are no more words to be found. When you start finding words in the dictionary (see below) then you know you are making the right choices about where to break the words. When you stop finding words you know you have made a wrong choice and you need to backtrack.
A big part of this is having dictionaries sufficient to contain any word you might encounter. The English resource would be TWL06 or SOWPODS or other scrabble dictionaries, containing many obscure words. You need a lot of memory to do this, because if you check the words against a simple array containing all of the possible words, your program will run incredibly slowly. If you parse your dictionary, persist it as a plist, and recreate the dictionary, your checking will be quick enough, but it will require a lot more space on disk and more space in memory. One of these big scrabble dictionaries can expand to about 10MB, with the actual words as keys and a simple NSNumber as a placeholder for the value - you don't care what the value is, just that the key exists in the dictionary, which tells you that the word is recognised as valid.
If you maintain an array as you count you get to do [array count] in a triumphal manner as you add the last word containing the last characters to it, but you also have an easy way of backtracking. If at some point you stop finding valid words you can pop the lastObject off the array and replace it at the start of the string, then start looking for alternative words. If that fails to get you back on the right track pop another word.
I would proceed by experimentation, looking for a potential three words ahead as you parse the string - when you have identified three potential words, take the first away, store it in the array and look for another word. If you find it is too slow to do it this way and you are getting OK results considering only two words ahead, drop it to two. If you find you are running up too many dead ends with your word division strategy then increase the number of words ahead you consider.
Another way would be to employ natural language rules - for example "A" and "NALOG" might look OK because a consonant follows "A", but "A" and "ARDVARK" would be ruled out because it would be correct for a word beginning in a vowel to follow "AN", not "A". This can get as complicated as you like to make it - I don't know if this gets simpler in Japanese or not but there are certainly common verb endings like "ma su".
(edit: started a bounty, I'd like to know the very best way to do this if my way isn't it.)
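For what it's worth, here's a compact sketch of that backtracking search (in Go for brevity; the toy dict stands in for the big scrabble dictionary described above, and the recursion does the same job as popping words off the array):

package main

import "fmt"

// segment tries to split the text into dictionary words, recursing on
// the remainder and backtracking (trying a shorter prefix) on dead ends.
func segment(runes []rune, dict map[string]bool) ([]string, bool) {
    if len(runes) == 0 {
        return nil, true
    }
    // Greedy: try the longest prefix first, then fall back.
    for i := len(runes); i >= 1; i-- {
        prefix := string(runes[:i])
        if dict[prefix] {
            if rest, ok := segment(runes[i:], dict); ok {
                return append([]string{prefix}, rest...), true
            }
        }
    }
    return nil, false // no valid split from here; caller backtracks
}

func main() {
    dict := map[string]bool{"A": true, "AN": true, "ANALOG": true, "LOG": true}
    words, ok := segment([]rune("ANALOG"), dict)
    fmt.Println(words, ok, len(words)) // the word count is len(words)
}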
If you are using iOS 4, you can do something like
__block int count = 0;
[string enumerateSubstringsInRange:range
                           options:NSStringEnumerationByWords
                        usingBlock:^(NSString *word,
                                     NSRange wordRange,
                                     NSRange enclosingRange,
                                     BOOL *stop)
{
    count++;
}];
More information in the NSString class reference.
There is also a WWDC 2010 session, number 110, about advanced text handling, which explains this around minute 10 or so.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
That's right; you have to iterate through the text and simply count the number of word tokens encountered along the way.
Not a native Chinese/Japanese speaker, but here are my 2 cents.
Each Chinese character does have a meaning, but the concept of a word is a combination of letters/characters representing an idea, isn't it?
In that sense, there are probably 3 words in "sekai no aidia" (or 2 if you don't count particles like NO/GA/DE/WA, etc). Same as English - "world's idea" is two words, while "idea of world" is 3, and let's forget about the required 'the', hehe.
That given, counting words is not as useful in non-Roman languages, in my opinion, similar to what Eonil mentioned. It's probably better to count the number of characters for those languages. Check with native Chinese/Japanese speakers and see what they think.
If I were to do it, I would tokenize the string with spaces and particles (at least for Japanese and Korean) and count the tokens. Not sure about Chinese.
With Japanese you can create a grammar parser, and I think it is the same with Chinese. However, that is easier said than done, because natural language tends to have many exceptions; still, it is not impossible.
Please note it won't really be efficient, since you have to parse each sentence before being able to count the words.
I would also recommend using a parser generator rather than building one yourself, so that at least you can concentrate on writing the grammar rather than on the parser itself. It's not efficient, but it should get the job done.
Also have a fallback algorithm in case your grammar doesn't parse the input correctly (perhaps the input really didn't make sense to begin with); you can fall back to the length of the string to make it easier on yourself.
If you build it, there could be a market opportunity for you to offer it as a natural-language domain-specific language for Japanese/Chinese business rules as well.
Just use the length method:
[#"世界のアイデア" length]; // is 7
That being said, as a Japanese speaker, I think 3 is the right answer.

How can I generate this hash?

I'm new to programming (just started!) and have hit a wall recently. I am making a fansite for World of Warcraft, and I want to link to a popular site (wowhead.com). The following page shows what I'm trying to figure out: http://www.wowhead.com/?talent#ozxZ0xfcRMhuVurhstVhc0c
From what I understand, the "ozxZ0xfcRMhuVurhstVhc0c" part of the link is a hash. It contains all the information about that particular talent spec on the page, and it changes whenever I add or remove points in a talent. I want to be able to recreate this part so that I can link my users directly to wowhead to view their talent trees, but I haven't the foggiest idea how to do that. Can anyone provide some guidance?
The first character indicates the class:
0 Druid
c Hunter
o Mage
s Paladin
b Priest
f Rogue
h Shaman
I Warlock
L Warrior
j Death Knight
The remaining characters indicate where in each tree points have been allocated. Each tree is separate, delimited by 'Z'. So if e.g. all the points are in the third tree, then the 2nd and 3rd characters will be "ZZ" indicating "end of first tree" and "end of second tree".
To generate the code for a given tree, split the talents up into pairs, going left-to-right and top-to-bottom. Each pair of talents is represented by a single character. So for example, in the DK's Blood tree segment, the first character will indicate the number of points allocated to Butchery and Subversion, and the second character will stand for Blade Barrier and Bladed Armor.
What character represents each allocation among the pair? I'm sure there's an algorithm, probably based on the ASCII character set, but all I've worked out so far is this lookup table. Find the number of points in the first talent along the top, and the number of points in the second talent along the left side. The encoded character is at the intersection.
     0  1  2  3  4  5
  0  0  o  b  h  L  x
  1  z  k  d  u  p  t
  2  M  R  r  G  T  g
  3  c  s  f  I  j  e
  4  m  a  w  N  n  v
  5  V  q  i  A  y  E
So if our Death Knight has one point in Butchery and two points in Subversion, the first character is 'R'. If instead we put no points in those two and five in Blade Barrier, the first two characters will be "0x". Trailing '0's (all the other pairs in the tree with no points allocated) can be omitted, as can trailing 'Z' delimiters (when there are no points in the subsequent trees). For one final example, the entire code for a DK with just a single point in Toughness would be "jZ0o": "Death Knight", "End of the first tree", "No points in the first pair of talents", "one point in the first talent of the second pair".
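Reading the table above row by row flattens it into the string 0obhLxzkduptMRrGTgcsfIjemawNnvVqiAyE, and the character for a pair is simply that string indexed by second*6 + first. Here's a sketch of a per-tree encoder in Go (hypothetical function name; this just mechanises the table and examples above):

package main

import (
    "fmt"
    "strings"
)

// The lookup table above, flattened row by row; a pair encodes as
// pairChars[second*6+first], where first and second are the points
// (0-5) in the two talents of the pair.
const pairChars = "0obhLxzkduptMRrGTgcsfIjemawNnvVqiAyE"

// encodeTree turns per-talent point counts (in tree order) into that
// tree's code segment. An odd number of talents would need a padding 0.
func encodeTree(points []int) string {
    var b strings.Builder
    for i := 0; i+1 < len(points); i += 2 {
        first, second := points[i], points[i+1]
        b.WriteByte(pairChars[second*6+first])
    }
    // Trailing '0' pairs are omitted, as described above.
    return strings.TrimRight(b.String(), "0")
}

func main() {
    fmt.Println(encodeTree([]int{1, 2, 0, 0})) // "R": 1 in Butchery, 2 in Subversion
    fmt.Println(encodeTree([]int{0, 0, 5, 0})) // "0x": 5 points in Blade Barrier
}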
Can anyone work out what function generates the lookup table above? There's probably a clue in the codes for the classes: in alphabetical order (except for the DK which was added to the game after the others), they correspond to a series in the lookup table of (0,0), (0,3), (1,0), (1,3), (2,0), etc.
If you go to http://www.wowhead.com/?talent and start using the talent tree, you can see the mysterious code being built up in the address bar as you click on the various boxes. So it's definitely not a hash, but some kind of encoded structured data.
Since the code is built up as you click, the logic for building it will be in the JavaScript on that page.
So my advice is: do a view-source on the page, download the JavaScript files, and have a look at them.
I think it isn't a hash value, because hash values are normally one-way values. This means you cannot (easily) restore the original information from which the hash code was generated.
The best thing would be to contact someone at wowhead.com and ask them how to interpret this information. I am sure they will help you out with some information about what type of encoding they use for the parameters. But without any help from the developers at wowhead.com, it is almost impossible to figure out what information is encoded in this parameter.
I am not even sure the parameter you mentioned contains the talents of your character. Maybe it's just a session id or something like that. Take a look at the POST data your browser sends to the server; it may contain a hidden field with the value you are searching for (you can use the Tamper Data Firefox add-on).
I don't think ozxZ0xfcRMhuVurhstVhc0c is a hash value. I think it is a key (probably encrypted/encoded in some way). The server uses this key to retrieve information from its database. Since you don't have access to the database, you don't know which key is needed, let alone how to encode it.
You need the original function that generates the hash.
I don't think that's public though :(
Check this out: hash (Wikipedia)
Good luck learning how to program!
These hashes are hard to 'reverse engineer' unless you know how they were generated.
For example, it could be:
    s1 = "random_string-" + score;
    hash = encrypt(s1)
    ...etc
so it is hard to get the original data back from the hash (that is the whole point anyway).
Your best bet would be to link to the profile that has the latest score, etc.