Optimizing word count - hash

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. The file is stored on permanent media, so read speeds are slow, and I want to read through it just once, linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?

I think a trie with the counts at the leaves could be faster.
Any decent hash table implementation will require reading the word fully, running it through a hash function, and finally doing a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search faster than the hashing algorithm can hash the word, the trie approach should come out ahead.
However, this is total overkill. I rambled on because you said it was purely hypothetical; I figured you'd like a hypothetical sort of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
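Since this is hypothetical anyway, here is a minimal sketch of the counting-trie idea in Python (plain dict-of-dicts nodes with a count at the end of each word; this is the naive version, without the prefix-skipping trick described above):
def make_node():
    # each node holds its children plus the count of words ending here
    return {"children": {}, "count": 0}

root = make_node()

def add_word(word):
    node = root
    for ch in word:
        node = node["children"].setdefault(ch, make_node())
    node["count"] += 1

def dump(node, prefix=""):
    # walk the trie and print every stored word with its count
    if node["count"]:
        print("%s: %d" % (prefix, node["count"]))
    for ch, child in sorted(node["children"].items()):
        dump(child, prefix + ch)

for w in ["tortoise", "tortoise", "torte"]:
    add_word(w)
dump(root)   # prints "torte: 1" and "tortoise: 2"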

I'd use a Dictionary object where the key is the word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.

Given the slow reads, the choice of data structure probably isn't going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work on optimizing. For the algorithm (mostly the data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.

A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.

I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)

I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
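If you do benchmark, a rough harness might look something like this (the file name sample_words.txt and the single candidate shown are placeholders; plug in whatever implementations you want to compare):
import time
import collections

with open("sample_words.txt") as f:        # a subset of the real data
    words = [line.strip() for line in f]

def count_with_dict(words):
    counts = collections.defaultdict(int)
    for w in words:
        counts[w] += 1
    return counts

candidates = {"dict": count_with_dict}     # add a trie-based counter here, etc.

for name, fn in candidates.items():
    start = time.perf_counter()
    fn(words)
    print(name, time.perf_counter() - start)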

Use Python!
As you read the file line by line, add each word to a set before consulting the dictionary at all. Only once a word is already in the set do you add it to the dictionary, starting its count at 2 (it has already been seen once via the set).
This moves some of the memory and computation away from querying the dictionary on every single line, and it handles words that occur only once more cheaply: at the end of the run, every word that is in the set but not in the dictionary gets a count of 1 (i.e., take the difference between the set and the dictionary's keys), as in the sketch below.
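A minimal sketch of that scheme (the file name words.txt is borrowed from the script further down):
seen = set()
counts = {}

with open("words.txt") as f:
    for line in f:
        word = line.strip()
        if word not in seen:
            seen.add(word)                     # first sighting: only touch the set
        else:
            # second and later sightings are tracked in the dictionary
            counts[word] = counts.get(word, 1) + 1

# words seen exactly once never made it into the dictionary
for word in seen - counts.keys():
    counts[word] = 1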

To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?

A simple Python script:
import collections

counts = collections.defaultdict(int)          # missing words start at 0
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1              # one word per line

print("\n".join("%s: %d" % (word, count) for word, count in counts.items()))

Related

What is the efficiency of an ASSIGN statement in Progress 4GL?

Why is an ASSIGN statement more efficient than not using ASSIGN?
Co-workers say that:
assign
a=3
v=7
w=8.
is more efficient than:
a=3.
v=7.
w=8.
Why?
You could always test it yourself and see... but, yes, it is slightly more efficient. Or it was the last time I tested it. The reason is that the compiler combines the statements and the resulting r-code is a bit smaller.
But efficiency is almost always a poor reason to do it. Saving a microsecond here and there pales next to avoiding disk I/O or picking a more efficient algorithm. Good reasons:
Back in the dark ages there was a limit of 63k of r-code per program. Combining statements with ASSIGN was a way to reduce the size of r-code and stay under that limit (ok, that might not be a "good" reason). One additional way this helps is that you could also often avoid a DO ... END pair and further reduce r-code size.
When creating or updating a record the fields that are part of an index will be written back to the database as they are assigned (not at the end of the transaction) -- grouping all assignments into a single statement helps to avoid inconsistent dirty reads. Grouping the indexed fields into a single ASSIGN avoids writing the index entries multiple times. (This is probably the best reason to use ASSIGN.)
Readability -- you can argue that grouping consecutive assignments more clearly shows your intent and is thus more readable. (I like this reason but not everyone agrees.)
basically doing:
a=3.
v=7.
w=8.
is the same as:
assign a=3.
assign v=7.
assign w=8.
which is three separate statements, so there is a little more overhead; it is therefore less efficient.
Progress treats ASSIGN as one statement whether one or more variables are being assigned. If you do not write ASSIGN explicitly, it is implied, so you end up with three statements instead of one. Using a single ASSIGN statement gives roughly a 20% - 40% reduction in r-code size and a 15% - 20% performance improvement. Why this is can only be speculated on, as I cannot find any source that explains it. For database fields, and especially key/index fields, it makes perfect sense; for variables I can only assume it has to do with how Progress manages its buffers and copies data to and from them.
ASSIGN will combine multiple statements into one. If a, v and w are fields in your db, that means it will do something like INSERT INTO (a,v,w)...
rather than
INSERT INTO (a)...
INSERT INTO (v)
etc.

Lucene: Loading Index files while searching?

Can anyone explain how index files are loaded into memory while searching?
Is the whole file (.fnm, .tis, .fdt, etc.) loaded at once or in chunks?
How are individual segments loaded, and in which order?
How can a Lucene index be encrypted?
The main point of having index segments is that you can rarely load the whole index into memory.
The most important limitation taken into account while designing the index format is that disk seek time is relatively long (on platter-based hard drives, which are still the most widely used). A good estimate is that the transfer time per byte is about 0.01 to 0.02 μs, while the average seek time of the disk head is about 5 ms!
So the part that is kept in memory is typically only the dictionary, used to find the beginning block of the postings list on disk*. The other parts are loaded only on demand and then purged from memory to make room for other searches.
As for encryption, it depends on whether you need to keep the index encrypted all the time (even when in memory) or whether it suffices to encrypt only the index files. For the latter, I think an encrypted file system will be enough. The former is certainly possible as well, since various index compression techniques are already in place; however, I don't think it is widely used, because the first and foremost requirement for a full-text engine is speed.
[*] It's not really that simple: since we're performing binary searches against the dictionary, we would need all entries in that structure to have equal length. As that's clearly not the case for normal dictionary words, and padding would be too costly (think of the word lengths of some chemical substances), we actually maintain two levels of dictionary. The first one (which needs to fit in memory and is stored in the .tii files) keeps a sorted list of the starting positions of terms in the second one (the .tis files). The second level is a concatenated array of all terms in increasing order, along with pointers into the .frq file. The second level often fits in memory and is loaded at start-up, but that can be impossible, e.g. for bigram indexes. Also note that for some time now Lucene by default does not use individual files, but so-called compound files (with the .cfs extension), to cut down the number of open files.
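A toy sketch of that two-level idea in Python (an illustration of the concept only, not Lucene's actual data structures or file formats; the names mirror the explanation above):
import bisect

# the ".tis" level: all terms, sorted (kept on disk in real Lucene)
terms = sorted(["apple", "banana", "cherry", "date", "elderberry",
                "fig", "grape", "kiwi", "lemon", "mango"])

INTERVAL = 4
# the ".tii" level: every INTERVAL-th term with its position; small, kept in memory
tii_terms = terms[::INTERVAL]
tii_positions = list(range(0, len(terms), INTERVAL))

def lookup(term):
    # binary-search the in-memory level to find the block that could hold the term
    block = bisect.bisect_right(tii_terms, term) - 1
    if block < 0:
        return None
    start = tii_positions[block]
    # then scan only that small block of the full term list ("one disk read")
    for i in range(start, min(start + INTERVAL, len(terms))):
        if terms[i] == term:
            return i   # in real life this is where the postings pointer comes from
    return None

print(lookup("fig"))    # found: its position in the full, sorted term list
print(lookup("plum"))   # not in the index -> None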

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might want to try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here are a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
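A language-agnostic sketch of that first (sorted-merge) strategy, shown in Python for brevity; the same pointer-walking idea translates directly to Scala over sorted arrays:
def merge_sorted_unique(a, b):
    """Merge two sorted lists into one sorted list without duplicates."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        elif a[i] > b[j]:
            out.append(b[j]); j += 1
        else:                      # element present in both inputs: keep one copy
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])              # whichever input has elements left
    out.extend(b[j:])
    return out

print(merge_sorted_unique([1, 3, 5, 7], [3, 4, 7, 9]))   # [1, 3, 4, 5, 7, 9]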

Perl Multi hash vs Single hash

I want to read and process sets of input from a file and then print it out.
There are 3 keys which I need to use to store data.
Assume the 3 keys are k1, k2, k3
Which of the following will give better performance?
$hash{k1}->{k2}->{k3} = $val;
or
$hash{"k1,k2,k3"} = $val;
For my previous question, I got the answer that all Perl hash keys are treated as strings.
Unless you're really dealing with large datasets, use whichever one produces cleaner code. I may be wrong but this reeks of premature optimization.
If it isn't, this may depend on the range of possible keys. If ordering is not an issue, arrange your data so that k1 is the smallest set of keys and k3 the largest. I suspect you'll use less memory on hashes that way. Depending on your data sets, it may even be prudent to presize your hashes (I think keys(%hash) = 100 does the trick).
As to which is faster, only profiling will tell. Try both and see for yourself.
Also, note that $hash{k1}->{k2}->{k3} is unnecessary. You can write $hash{k1}{k2}{k3}; the arrow isn't needed between adjacent subscripts, whether square or curly brackets.
Hash lookup speed is independent of the number of items in the hash, so the version which only does one hash lookup will perform the hash lookup portion of the operation faster than the version which does three hash lookups. But, on the other hand, the single-lookup version has to concatenate the three keys into a single string before they can be used as a combined key; if this string is anonymous (e.g., $hash{"$a,$b,$c"}), this will likely involve some fun stuff like memory allocation. Overall, I would expect the concatenation to be quick enough that the one-lookup version would be faster than the three-lookup version in most cases, but the only way to know which is faster in your case would be to write the same code in both styles and Benchmark the difference.
However, like everyone else has already said, this is a premature and worthless micro-optimization. Unless you know that you have a performance problem (or you have historical performance data which shows that a problem is developing and will be upon you in the near future) and you have profiled your code to determine that hash lookups are the cause of your performance problem, you're wasting your time worrying about this. Hash lookups are fast. It's hardly a real benchmark, but:
$ time perl -e '$foo{bar} for 1 .. 1_000_000'
real 0m0.089s
user 0m0.088s
sys 0m0.000s
In this trivial (and, admittedly, highly flawed) example, I got a rate equivalent to roughly 11 million hash lookups per second. In the time you spent asking the question, your computer could have done hundreds of millions, if not billions, of hash lookups.
Write your hash lookups in whatever style is most readable and most maintainable in your application. If you try to optimize this to be as fast as possible, the wasted programmer time will be (many!) orders of magnitude larger than any processing time that you could ever hope to save with the optimizations.
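For what it's worth, the shape of such a comparison, sketched in Python rather than Perl purely as an illustration (use the Benchmark module for the real Perl measurement):
import timeit

nested = {"k1": {"k2": {"k3": 42}}}
flat = {"k1,k2,k3": 42}

def lookup_nested():
    return nested["k1"]["k2"]["k3"]           # three separate hash lookups

def lookup_flat():
    k1, k2, k3 = "k1", "k2", "k3"
    return flat[",".join((k1, k2, k3))]       # one lookup, but pay for concatenation

print("nested:", timeit.timeit(lookup_nested, number=1_000_000))
print("flat:  ", timeit.timeit(lookup_flat, number=1_000_000))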
If you have memory concerns, I would suggest using Devel::Size from CPAN in an early phase of development to get the size of both alternatives.
Otherwise, use the one that seems friendliest to you!

Cost of isEqualToString: vs. Numerical comparisons

I'm working on a project designing a Core Data system for searching and cataloguing images and documents. One of the objects in my data model is a 'key word' object. Every time I add a new key word, I first want to run through all of the existing keywords to make sure it doesn't already exist in the current context.
I've read in posts here, and in a lot of my other reading, that string comparisons are far more expensive than some other comparison operations. Since I could easily end up having to check many thousands of words before a new addition, I'm wondering if it would be worth using some method that represents the key word strings numerically for the purpose of this check. Possibly breaking down each character in the string into a number formed from its UTF code and then storing that in an ID property for each key word.
I was wondering if anyone else thought any benefit might come from this approach or if anyone else had any better ideas.
What you might find useful is a suitable hash function to convert your text strings into (probably) unique numbers. (You might still have to check for collisions.)
Comparing plain numbers in C code is much faster for several reasons: it avoids the Objective-C runtime dispatch overhead, it requires accessing less total memory, and the executable code for each comparison is usually just an instruction or three, rather than a loop with counters and several decision points.
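A language-agnostic sketch of that hash-then-confirm idea, in Python for brevity (in the Core Data case the numeric hash would presumably live in an indexed attribute on the keyword entity; that detail is an assumption, not something from your model):
existing = {}   # numeric hash -> list of keyword strings that share that hash

def add_keyword(word):
    h = hash(word)                        # any stable hash function would do
    bucket = existing.setdefault(h, [])
    # only on a hash match do we pay for full string comparisons (collision check)
    if word in bucket:
        return False                      # already present, skip the insert
    bucket.append(word)
    return True

print(add_keyword("tortoise"))   # True  (new keyword)
print(add_keyword("tortoise"))   # False (duplicate detected)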