When does it make sense to presize a hash? - perl

From perldata:
You can preallocate space for a hash by assigning to the keys() function.
This rounds up the allocated buckets to the next power of two:
keys(%users) = 1000; # allocate 1024 buckets
Is there a rule of thumb for when presizing a hash will improve performance?

The rule of thumb is that the larger you know the hash will be, the more likely you are to get value out of pre-sizing it. Consider a hash with 10 slots: as you add items one after another, the number of expansions will a) be few (if any at all) and b) cheap (since there is little data to move).
But if you KNOW you're going to need at least 1M items, there's no reason to repeatedly expand and copy the underlying, ever-growing data structures over and over while the table grows.
Will you NOTICE this expansion? Eh, maybe. Modern machines are pretty darn fast; it may not come up. But it's a grand opportunity for heap growth, with the reallocation work and cascade of all sorts of other things that brings. So, if you know you're going to use the space, it's a "cheap" fix to tweak out a few more smidgens of performance.
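As a minimal sketch of the idea (the data and sizes here are made up for illustration): if you already know how many entries a bulk load will produce, assign that count to keys() before the loop, as perldata describes, and the table never has to grow mid-load.

use strict;
use warnings;

# Hypothetical bulk load where the record count is known up front.
my @records = map { [ "key$_", $_ ] } 1 .. 100_000;

my %index;
keys(%index) = scalar @records;   # rounded up internally to the next power of two

$index{ $_->[0] } = $_->[1] for @records;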

I tried to benchmark the expansion cost as the hash grows:
use Benchmark qw(cmpthese);

# few values
cmpthese(-4, {
    prealloc => sub {
        my %hash;
        keys(%hash) = 17576;
        $hash{$_} = $_ for 'aaa' .. 'zzz';
    },
    normal => sub {
        my %hash;
        $hash{$_} = $_ for 'aaa' .. 'zzz';
    },
});

# more values
cmpthese(-8, {
    prealloc => sub {
        my %hash;
        keys(%hash) = 456976;
        $hash{$_} = $_ for 'aaaa' .. 'zzzz';
    },
    normal => sub {
        my %hash;
        $hash{$_} = $_ for 'aaaa' .. 'zzzz';
    },
});
The results do not suggest a big optimization; however, reducing heap fragmentation, as mentioned by Will Hartung, might be a benefit. This was run with perl 5.12 on a WinXP machine.
             Rate normal prealloc
normal     48.3/s     --      -2%
prealloc   49.4/s     2%       --
(warning: too few iterations for a reliable count)

           s/iter normal prealloc
normal       3.62     --      -1%
prealloc     3.57     1%       --

Basically, it opens the door to tuning hash performance. Hash performance depends heavily both on the hashing algorithm used and on the data you are handling, so it is almost impossible to come up with a rule of thumb. Anyway, something can be said.
You know that each data structure offers a given balance between space and time efficiency. Hash tables are especially good as to time efficiency, offering appealing constant (O(1)) time access.
This holds true unless there is a collision. When a collision happens, access time is linear in the size of the bucket corresponding to the collision value. (Have a look at this for more details.) Collisions, apart from being "slower", are mostly a disruption of the access-time guarantee, which is often the single most important reason for choosing a hash table in the first place.
Ideally, hash tables could aim at what is known as "perfect hashing" (which is actually feasible only when you can fine-tune the algorithm to the kind of data you will handle), but this is not so easy to attain in the general case (and that is a euphemism, actually). Anyway, it is a matter of fact that bigger hash tables (together with a good hashing algorithm) can reduce the frequency of collisions, and thus improve performance, at the expense of memory. Smaller hash tables will see more collisions (hence lower performance and a weaker access-time guarantee) but occupy less memory.
So, if you profile your program and see that hash table access is a bottleneck (for any reason), you have a chance to solve this by reserving more memory for the hash space (if you have memory to give).
In any case, I would not increase this value at random, but only after thorough profiling, since it is also true that the algorithm perl uses is compiled in (AFAIK) and this also has a big effect on hash performance (in other words, you could have a lot of collisions even if you make the hash space bigger).
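For what it's worth, here is a rough way to peek at how a given data set fills the table while profiling. This is a minimal sketch, assuming a perl older than 5.26, where a non-empty hash in scalar context reports bucket usage as "used/allocated" (newer perls return the key count instead; there, Hash::Util's bucket_ratio reportedly gives the old string).

use strict;
use warnings;

my %h;
$h{$_} = 1 for 'aaa' .. 'zzz';

# Prints something like "used/allocated". Many keys crowded into few used
# buckets means long chains, i.e. collisions, for this particular data set.
print scalar(%h), "\n";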
As usual with performance-related things, it may or may not be useful; it depends on your concrete case.

Related

One billion length List in Scala?

Just as a load test, I was playing with different data structures in Scala. Just wondering what it takes to work with, or even create, a one-billion-element list. 100 million seems to be no problem; of course, there's no real magic about the number 1,000,000,000. I'm just seeing how far you can push it.
I had to bump up memory on most of the tests. export JAVA_OPTS="-Xms4g -Xmx8g"
// insanity begins ...
val buf = (0 to 1000000000 - 1).par.map { i => i }.toList
// java.lang.OutOfMemoryError: GC overhead limit exceeded
However, preallocating an Array[Int] works pretty well. It takes about 9 seconds to iterate and build the object. Interestingly, doing almost anything with ListBuffer seems to automatically take advantage of all cores. However, the code above will not finish (at least with an 8 GB Xmx).
I understand that this is not a common case and I'm just messing around. But if you had to pull some massive thing into memory, is there a more efficient technique? Is a typed Array as efficient as it gets?
The per-element overhead of a List is considerable. Each element is held in a cons cell (case class ::) which means there is one object with two fields for every element. On a 32-bit JVM that's 16 bytes per element (not counting the element value itself). On a 64-bit JVM it's going to be even higher.
List is not a good container type for extremely large contents. Its primary feature is very efficient head / tail decomposition. If that's something you need then you may just have to deal with the memory cost. If it's not, try to choose a more efficient representation.
For what it's worth, memory overhead is one consideration that justifies using Array. There are lots of caveats around using arrays, so be careful if you go that way.
Given that the JVM can sensibly arrange an Array of Ints in memory, if you really need to iterate over them it would indeed be the most efficient approach. It would generate much the same code if you did exactly the same thing with Java.

collection.mutable.OpenHashMap vs collection.mutable.HashMap

For put and get operations, OpenHashMap outperforms HashMap by about 5 times: https://gist.github.com/1423303
Are there any cases when HashMap should be preferred over OpenHashMap?
Your code exactly matches one of the use cases for OpenHashMap. Your code:
println ("scala OpenHashMap: " + time (warmup) {
val m = new scala.collection.mutable.OpenHashMap[Int,Int];
var i = 0;
var start = System.currentTimeMillis();
while(i<100000) { m.put(i,i);i=i+1;};
})
The explanation for OpenHashMap (scaladoc):
A mutable hash map based on an open hashing scheme. The precise scheme
is undefined, but it should make a reasonable effort to ensure that an
insert with consecutive hash codes is not unnecessarily penalised. In
particular, mappings of consecutive integer keys should work without
significant performance loss.
My emphasis. Which explains your findings. When to use OpenHashMap rather than HashMap? See Wikipedia. From there:
Chained hash tables with linked lists are popular because they require
only basic data structures with simple algorithms, and can use simple
hash functions that are unsuitable for other methods.
The cost of a table operation is that of scanning the entries of the
selected bucket for the desired key. If the distribution of keys is
sufficiently uniform, the average cost of a lookup depends only on the
average number of keys per bucket—that is, on the load factor.
Chained hash tables remain effective even when the number of table
entries n is much higher than the number of slots. Their performance
degrades more gracefully (linearly) with the load factor. For example,
a chained hash table with 1000 slots and 10,000 stored keys (load
factor 10) is five to ten times slower than a 10,000-slot table (load
factor 1); but still 1000 times faster than a plain sequential list,
and possibly even faster than a balanced search tree.
For separate-chaining, the worst-case scenario is when all entries
were inserted into the same bucket, in which case the hash table is
ineffective and the cost is that of searching the bucket data
structure. If the latter is a linear list, the lookup procedure may
have to scan all its entries; so the worst-case cost is proportional
to the number n of entries in the table.
This is a generic explanation. As ever with these things, your performance will vary depending upon the use case; if you care about it, you need to measure it.

Perl Multi hash vs Single hash

I want to read and process sets of input from a file and then print it out.
There are 3 keys which I need to use to store data.
Assume the 3 keys are k1, k2, k3
Which of the following will give better performance?
$hash{k1}->{k2}->{k3} = $val;
or
$hash{"k1,k2,k3"} = $val;
For my previous question I got the answer that all perl hash keys are treated as strings.
Unless you're really dealing with large datasets, use whichever one produces cleaner code. I may be wrong but this reeks of premature optimization.
If it isn't, this may depend on the range of possible keys. If ordering is not an issue, arrange your data so that k1 is the smallest set of keys and k3 is the largest. I suspect you'll use less memory on hashes that way. Depending on your datasets it may even be prudent to presize your hashes (keys(%hash) = 100 does the trick).
As to which is faster, only profiling will tell. Try both and see for yourself.
Also, note that the arrows in $hash{k1}->{k2}->{k3} are unnecessary. You can write $hash{k1}{k2}{k3}. Arrows aren't needed between subscripts, either square or curly.
Hash lookup speed is independent of the number of items in the hash, so the version which does only one hash lookup will perform the hash-lookup portion of the operation faster than the version which does three. On the other hand, the single-lookup version has to concatenate the three keys into a single string before it can be used as a combined key; if this string is anonymous (e.g., $hash{"$a,$b,$c"}), this will likely involve some fun stuff like memory allocation. Overall, I would expect the concatenation to be quick enough that the one-lookup version would be faster than the three-lookup version in most cases, but the only way to know which is faster in your case is to write the same code in both styles and Benchmark the difference.
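For instance, a rough Benchmark sketch along those lines (keys and sizes are made up; adapt to your real data) might look like:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative only: compare nested lookups against a single concatenated key.
my (%nested, %flat);
for my $i (1 .. 1000) {
    $nested{k1}{k2}{$i} = $i;
    $flat{"k1,k2,$i"}   = $i;
}

cmpthese(-2, {
    nested => sub { my $v = $nested{k1}{k2}{500} },
    flat   => sub { my $v = $flat{"k1,k2,500"} },
});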
However, like everyone else has already said, this is a premature and worthless micro-optimization. Unless you know that you have a performance problem (or you have historical performance data which shows that a problem is developing and will be upon you in the near future) and you have profiled your code to determine that hash lookups are the cause of your performance problem, you're wasting your time worrying about this. Hash lookups are fast. It's hardly a real benchmark, but:
$ time perl -e '$foo{bar} for 1 .. 1_000_000'
real 0m0.089s
user 0m0.088s
sys 0m0.000s
In this trivial (and, admittedly, highly flawed) example, I got a rate equivalent to roughly 11 million hash lookups per second. In the time you spent asking the question, your computer could have done hundreds of millions, if not billions, of hash lookups.
Write your hash lookups in whatever style is most readable and most maintainable in your application. If you try to optimize this to be as fast as possible, the wasted programmer time will be (many!) orders of magnitude larger than any processing time that you could ever hope to save with the optimizations.
If you have memory concerns, I would suggest using Devel::Size from CPAN in an early phase of development to get the size of both alternatives (see the sketch below).
Otherwise, use the one which seems friendlier to you!
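Something along these lines, assuming Devel::Size is installed (the key layout is just an example):

use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module, assumed installed

my (%nested, %flat);
for my $i (1 .. 10_000) {
    $nested{k1}{k2}{$i} = $i;
    $flat{"k1,k2,$i"}   = $i;
}

printf "nested: %d bytes, flat: %d bytes\n",
    total_size(\%nested), total_size(\%flat);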

How can I efficiently group a large list of URLs by their host name in Perl?

I have a text file that contains over one million URLs. I have to process this file in order to assign URLs to groups, based on host address:
{
'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}
My current basic solution takes about 600 MB of RAM to do this (size of file is about 300 MB). Could you provide some more efficient ways?
My current solution simply reads line by line, extracts host address by regex and puts the url into a hash.
EDIT
Here is my implementation (I've cut off irrelevant things):
use Storable;

while ($line = <STDIN>) {
    chomp($line);
    $line =~ /(http:\/\/.+?)(\/|$)/i;
    $host = "$1";
    push @{$urls{$host}}, $line;
}
store \%urls, 'out.hash';
One approach that you could take is tying your url hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.
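One possible shape of that, sketched with MLDBM over DB_File and Storable (illustrative only; the module and backend choice, the file name, and the fetch-modify-store pattern are assumptions to check against the MLDBM docs):

use strict;
use warnings;
use Fcntl;
use MLDBM qw(DB_File Storable);   # CPAN modules, assumed installed

tie my %urls, 'MLDBM', 'urls.db', O_CREAT | O_RDWR, 0640
    or die "Cannot open urls.db: $!";

# With MLDBM, nested values must be fetched, modified, and stored back whole.
my $list = $urls{'http://www.ex1.com'} || [];
push @$list, 'http://www.ex1.com/some/path';
$urls{'http://www.ex1.com'} = $list;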
If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).
But depending on how you are going to use the data in the hash, it might be worth considering storing the data in a database, and querying it for the information you need.
EDIT:
Based on the code you have posted, a quick optimization would be to not store the entire line but just the relative url. After all you already have the host name as a key in your hash.
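A minimal sketch of that idea (the regex here is illustrative, not the poster's original):

use strict;
use warnings;

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)(/.*)?$}i;
    # The host is already the key, so keep only the path to save memory.
    push @{ $urls{$1} }, defined $2 ? $2 : '/';
}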
Other than by storing your data structures to disk (tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
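For example, a later run could reuse the structure written by store() in the posted code rather than re-parsing the text file (a sketch, assuming out.hash was produced as above):

use strict;
use warnings;
use Storable qw(retrieve);

my $urls = retrieve('out.hash');
printf "%s => %d urls\n", $_, scalar @{ $urls->{$_} } for keys %$urls;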
What exactly are you trying to achieve? If you are going for some complex analysis, storing to a database is a good idea; if the grouping is just an intermediary step, you might just sort the text file and then process it sequentially, directly deriving the results you are looking for.

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search on a word faster than the hashing algorithm can hash it, you should be able to come out ahead.
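A toy sketch of the idea in Perl, using nested hashes as trie nodes (purely illustrative; a real implementation would be more careful):

use strict;
use warnings;

# Toy trie: each node is a hash keyed by single characters; the empty-string
# key holds the count for the word ending at that node (it cannot clash with
# any one-character child key).
my %trie;
for my $word (qw(tortoise torch tort tortoise)) {
    my $node = \%trie;
    $node = $node->{$_} ||= {} for split //, $word;
    $node->{''}++;
}

print 'tortoise count: ', $trie{t}{o}{r}{t}{o}{i}{s}{e}{''}, "\n";   # 2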
However, this is total overkill. Since you said it was purely hypothetical, I rambled on; I figured you'd like a hypothetical kind of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
I'd use a Dictionary object where the key is the word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
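To illustrate, a minimal sketch of that hash-of-counts approach, assuming the file really is one word per line and case should be folded:

use strict;
use warnings;

my %count;
open my $fh, '<', 'words.txt' or die "words.txt: $!";
while (my $word = <$fh>) {
    chomp $word;
    $count{ lc $word }++ if length $word;
}
close $fh;

printf "%s: %d\n", $_, $count{$_} for sort keys %count;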
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
As you go line by line, add each word to a set before asking whether it is in the dictionary. Only when a word is already in the set do you touch the dictionary, giving it a value of 2 (it was already seen once when it entered the set) and incrementing from there.
This takes some of the memory and computation away from querying the dictionary for every single word, and handles words that occur only once better: at the end, dump every word that is in the set but not in the dictionary with a count of 1 (the set difference between the set and the dictionary's keys).
To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
a simple python script:
import collections

f = file('words.txt')
counts = collections.defaultdict(int)
for line in f:
    counts[line.strip()] += 1

print "\n".join("%s: %d" % (word, count) for (word, count) in counts.iteritems())