How to use a Multiset? (Guava)

Is it possible to use a Multiset to count the frequencies of the first letters of words? The words are stored in a list.
Example: [the, quick, brown, fox, jumped, over, the, lazy, dog]
Output, most common first character: [t, q, b, f, j, o, l, d]
Output, most common first character ignoring word frequency: [t]
I have just started researching how to use Guava for this.

You'd just need to create a Multiset<Character>, then iterate over the words and add their first character to your multiset (note: there are i18n issues with this, as a general thing). You could either keep track of the most common character as you go or iterate over the multiset later to get it.
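For example, a minimal sketch of that approach (the class name, variable names, and the final prints are placeholders, not from the question):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

import java.util.Arrays;
import java.util.List;

public class FirstLetterFrequency {
    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog");

        // Add the first character of every word to the multiset.
        Multiset<Character> firstChars = HashMultiset.create();
        for (String word : words) {
            firstChars.add(word.charAt(0));
        }

        // Iterate over the distinct entries afterwards to find the most common one.
        Character mostCommon = null;
        int bestCount = 0;
        for (Multiset.Entry<Character> entry : firstChars.entrySet()) {
            if (entry.getCount() > bestCount) {
                bestCount = entry.getCount();
                mostCommon = entry.getElement();
            }
        }

        System.out.println(firstChars);   // counts for every first character
        System.out.println(mostCommon);   // t (it appears twice)
    }
}

If you want the whole ranking rather than just the top element, Multisets.copyHighestCountFirst(firstChars) returns the entries ordered by count.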

Related

Is there a way to include underscores when using word movements (e, b, w) in VSCode Vim?

I am always frustrated when using motion keys such as e, b, or w on Python variables or strings that are separated with underscores, because instead of stopping at the last/next underscore, the cursor stops at the start/end of the whole string/variable.
Is there a way to include underscores when using word movements (e, b, w)?

Eclipse CDT autocomplete for traditional C identifiers

Eclipse autocomplete works OK for CamelCaseIdentifiers, but it is completely useless for MORE_TRADITIONAL_style_identifiers, which have upper-case prefixes and are separated by "_"s.
Something like MTsi should match the latter, just like CCI matches the former.
Is there a way to do that? I could not find any preference.
Incidentally there is MTst*id.
It looks like this already works, as long as you capitalize every letter in the query:
int MORE_TRADITIONAL_style_identifier();

int main() {
    int x = MTSI/*complete*/  // <-- completes MORE_TRADITIONAL_style_identifier
}
But it doesn't if some of the letters in the query are not capitalized, such as MTsi. I think capital letters are the signal to the matching algorithm that two subsequent letters are potentially the beginnings of two different segments, whereas a sequence of lowercase letters like si just expects to find that substring verbatim.
If you feel the matching algorithm could be improved to handle mixed-case queries like this better, you could consider filing a bug and/or contributing a patch.
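To make the described behaviour concrete, here is a rough Java sketch of that heuristic. It is only an illustration of the idea (capitals start new query chunks, lowercase runs must match verbatim as part of a prefix), not CDT's actual matching code:

import java.util.ArrayList;
import java.util.List;

public class SegmentMatcher {
    // Split an identifier into segments at underscores and camel-case humps,
    // e.g. MORE_TRADITIONAL_style_identifier -> [MORE, TRADITIONAL, style, identifier]
    // and CamelCaseIdentifiers -> [Camel, Case, Identifiers].
    static List<String> identifierSegments(String id) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < id.length(); i++) {
            char c = id.charAt(i);
            if (c == '_') {                  // underscores always separate segments
                flush(out, cur);
                continue;
            }
            boolean lowerToUpper = cur.length() > 0 && Character.isUpperCase(c)
                    && !Character.isUpperCase(cur.charAt(cur.length() - 1));
            if (lowerToUpper) {
                flush(out, cur);
            }
            cur.append(c);
        }
        flush(out, cur);
        return out;
    }

    // Split the query so that every capital letter starts a new chunk and the
    // lowercase letters after it belong to that chunk:
    // MTSI -> [M, T, S, I], MTsi -> [M, Tsi].
    static List<String> queryChunks(String query) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : query.toCharArray()) {
            if (Character.isUpperCase(c)) {
                flush(out, cur);
            }
            cur.append(c);
        }
        flush(out, cur);
        return out;
    }

    static void flush(List<String> out, StringBuilder cur) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    // Each query chunk must be a (case-insensitive) prefix of the identifier
    // segment in the same position.
    static boolean matches(String query, String identifier) {
        List<String> segments = identifierSegments(identifier);
        List<String> chunks = queryChunks(query);
        if (chunks.size() > segments.size()) {
            return false;
        }
        for (int i = 0; i < chunks.size(); i++) {
            if (!segments.get(i).toLowerCase().startsWith(chunks.get(i).toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("MTSI", "MORE_TRADITIONAL_style_identifier")); // true
        System.out.println(matches("CCI", "CamelCaseIdentifiers"));               // true
        System.out.println(matches("MTsi", "MORE_TRADITIONAL_style_identifier")); // false: "si" must follow T verbatim
    }
}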

ORD and CHAR in Scratch

I am coming across more and more situations in Scratch where I have to convert a number to its ASCII character or vice versa. There is no built-in block for this.
My solution is to create a list of size 26 and append the letters A-Z one at a time, using a variable called alphabet = "abcdefghijklmnopqrstuvwxyz", iterating over it with a Repeat block and appending LETTER COUNT of ALPHABET to the list. The result is a list data structure with the letters A-Z in locations 1 to 26, in effect creating my own ASCII table.
To do a conversion, say from the number 26 to 'Z', I have to iterate over the list to get the correct CHAR value. This really slows down a program that is heavily dependent on the CHR() feature. Is there a better or more efficient solution?
My solution is to create a list of size 26 and append letters A-Z into each sequence using a variable called alphabet = "abcdefghijklmnopqrstuvwxyz"
Stop right there. You don't even need to convert it into a list. You can just look it up directly from the string.
To get a character from an index is very easy. Just use the (letter () of []) block.
To get the index of a character is more complex. Unfortunately, Scratch doesn't have a built-in way to do that. What I would do here is define an index of [] in [] custom pseudo-reporter block:
define index of (char) in (str)
set [i v] to [1]
repeat until <<(i) = (length of (str))> or <(letter (i) of (str)) = (char)>>
change [i v] by (1)
view online
You can then call the block as index of [a] in (alphabet) and it will set the i variable to 1.
This code doesn't handle the case where the character isn't found, but the link I provided does include that, if you need it.
You could also use Snap!, which is similar to Scratch but has more blocks. Snap! has a unicode block that will convert a character to its ASCII or Unicode value.
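For comparison, here is the same lookup-string idea expressed in Java (purely illustrative; the ALPHABET constant and the 1-based indexing mirror the Scratch logic above):

public class AlphabetLookup {
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // CHR: the n-th letter of the lookup string (1-based, like Scratch's "letter n of").
    static char chr(int n) {
        return ALPHABET.charAt(n - 1);
    }

    // ORD: scan the lookup string until the character is found,
    // just like the custom "index of [] in []" block.
    static int ord(char c) {
        for (int i = 0; i < ALPHABET.length(); i++) {
            if (ALPHABET.charAt(i) == c) {
                return i + 1;  // 1-based index
            }
        }
        return 0;              // not found
    }

    public static void main(String[] args) {
        System.out.println(chr(26));   // z
        System.out.println(ord('a'));  // 1
    }
}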

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table and then calculate the difference between a given text and known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
import "strings"

// Parse builds a table of rune ngrams of length n. A circular buffer of the
// last n runes is kept, so multi-byte UTF-8 characters are never split.
func Parse(text string, n int) map[string]int {
    chars := make([]rune, 2*n) // mirrored buffer, so a window is always a contiguous slice
    table := make(map[string]int)
    k := 0
    // Ranging over a string yields runes (code points), not bytes.
    for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
        chars[n+k] = chars[k] // mirror the new rune into the second half
        k = (k + 1) % n
        table[string(chars[k:k+n])]++ // chars[k:k+n] holds the last n runes
    }
    return table
}

hash function to index similar text

I'm looking for a sort of hash function to index similar text. For example, if we have two very long texts called "A" and "B", where A and B do not differ much, then the hash function (called H) applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar text.
I tried "DoubleMetaphone" (I use Italian-language text), but I saw that it depends very strongly on the string prefix. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
And this is not so good for me, because strings with the same prefix could be considered similar, and I don't want that.
Could anyone suggest another approach?
see http://en.wikipedia.org/wiki/Locality_sensitive_hashing
Your problem is (close to) insoluble for many distance functions between strings.
Most distance functions (e.g. edit distance) allow you to transform a string into another string via a sequence of 1-distance transformations:
"AAAA" -> "AAAB" -> "AAABC"
According to your requirements, the first and second strings should have the same hash value. But so must the second and the third, and so on. So all the strings would have to share the same hash, if we allow a pair with distance 1 to have the same hash value.
Even if we impose a higher threshold on the distance (maybe in relation to string length), we'll end up with a messy result.
A better (IMO) approach is to find an equivalence relation on the set of strings, such that each string in each equivalence class has the same hash. A possibility is to define classes by their distance to a predefined string (e.g. edit distance from "AAAAA"), and the distance itself would be the hash value. Probably this approach would not be the best in your case, but maybe with some extra info on the problem we can come up with a better equivalence relation.
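As a rough illustration of that last idea only (the reference string "AAAAA" and plain Levenshtein distance are arbitrary placeholder choices, not a recommendation for this specific problem):

public class DistanceHash {
    static final String REFERENCE = "AAAAA"; // arbitrary fixed anchor string

    // Classic dynamic-programming Levenshtein (edit) distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // The "hash" of a string is its distance to the fixed reference, so all
    // strings at the same distance fall into the same equivalence class.
    static int hash(String s) {
        return editDistance(s, REFERENCE);
    }

    public static void main(String[] args) {
        System.out.println(hash("AAAA"));   // 1
        System.out.println(hash("AAAB"));   // 2
        System.out.println(hash("AAABC"));  // 2
    }
}

Note that this groups together many strings that are not similar to each other at all; it only guarantees that strings in different classes lie at different distances from the reference.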