One billion length List in Scala? - scala

Just as a load test, I was playing with different data structures in Scala. Just wondering what it takes to work or even create a one billion length array. 100 million seems to be no problem, of course there's no real magic about the number 1,000,000,000. I'm just seeing how far you can push it.
I had to bump up memory on most of the tests. export JAVA_OPTS="-Xms4g -Xmx8g"
// insanity begins ...
val buf = (0 to 1000000000 - 1).par.map { i => i }.toList
// java.lang.OutOfMemoryError: GC overhead limit exceeded
However preallocating an ArrayInt works pretty well. It takes about 9 seconds to iterate and build the object. Interestingly, doing almost anything with ListBuffer seems to automatically take advantage of all cores. However, the code above will not finish (at least with 8gb Xmx).
I understand that this is not a common case and I'm just messing around. But if you had to pull some massive thing into memory, is there a more efficient technique? Is Array with type as efficient as it gets?

The per-element overhead of a List is considerable. Each element is held in a cons cell (case class ::) which means there is one object with two fields for every element. On a 32-bit JVM that's 16 bytes per element (not counting the element value itself). On a 64-bit JVM it's going to be even higher.
List is not a good container type for extremely large contents. Its primary feature is very efficient head / tail decomposition. If that's something you need then you may just have to deal with the memory cost. If it's not, try to choose a more efficient representation.
For what it's worth, I consider memory overhead considerations to be one thing that justifies using Array. There are lots of caveats around using arrays, so be careful if you go that way.

Given that the JVM can sensibly arrange an Array of Ints in memory, if you really need to iterate over them it would indeed be the most efficient approach. It would generate much the same code if you did exactly the same thing with Java.

Related

Handling large iterators - aggregation

Let's say we have an Iterator of a (String, String)-Tuple.
Said Iterator has a kazillion elements, possibly exhausting main memory.
What would you do if you had to aggregate it like the following:
The tuples have the form of (entityname, attributename) and you would have to populate a list of the attributenames. Also the iterator would be completely unordered and would never fit into memory.
(for instance the last and the first attibutename could correspond to the same entityname).
A concrete example:
("stackoverflow","users"),
("bear","claws"),
("stackoverflow","usesAjaxTechnology"),
("bear","eyes")
after the aggregation -> :
("stackoverflow",List("users","usesAjaxTechnology")),
("bear",List("claws","eyes")).
I know that there are statemenst like groupBy and so on but this would assuming the iterator has a lof ot elements never work due to memory problems?
Well, lets take a look at what groupBy does:
scala> res0.groupBy(x => x._1)
res2: scala.collection.immutable.Map[String,List[(String, String)]] =
Map( bear -> List((bear,claws), (bear,eyes)),
stackoverflow -> List((stackoverflow,users), (stackoverflow,usesAjaxTechnology))
)
As you can see, it creates a Map of elements. Since it does so in memory, you'd obviously run into memory issues as the data grows larger than your RAM.
On the other hand, it'd be possible to construct a Map-like structure that, instead of holding all data in memory, writes them to the filesystem. The simplest such Map would create a file for every key (e.g. "bear" or "stackoverflow") in a certain directory, and write all the attributes into the corresponding files. This would require almost no memory usage, replacing it with very high disk usage.
I'm wondering if this is sort of an artificial requirement, or if you are actually facing a real problem where this is an issue. Also, I'm really interested in hearing what the real functional programming pros here have to say :)
If you have that many elements I would assume that they are in a database or file of some type . I would take them out in manageable chunks and process them that way, writing them back to a db or new file. That would solve your problem with memory and allow you to perform this kind of processing.
If you are using MongoDb( which I recommend) your find query could easily pull out only the stackoverflow users and then your next statement could write that to a new collection. Same with the bear.

Write performance scala immutable collections

Quick question. I'm currently designing some database queries to extract reasonably large, but not massive datasets into memory, say approximately 10k-100k records.
So far I've been testing loading these resultsets into a scala.collection.immutable.Seq and have discovered it seems to take an incredibly long time to build the collection. Whereas if I change to a Vector or List the write into memory takes fractions of a second.
MY question is therefore why is Seq so slow in this case? If so in what cases would using Seq be more appropriate than Vector?
Thanks
It would help if you'd post the relevant snippet and which operations you call on the sequence -- immutable.Seq is represented using a List (see https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/immutable/Seq.scala#L42). My guess is that you've been using :+ on the immutable.Seq, which under the hood appends to the end of the list by copying it (probably giving you quadratic overall performance), and when you switched to using immutable.List directly, you've been attaching to the beginning using :: (giving you linear performance).
Since Seq is just a List under the hood, you should use it when you attach to the beginning of the sequence -- the cons operator :: only creates a one node and links it to the rest of the list, which is as fast as it can get when it comes to immutable data structures. Otherwise, if you add to the end, and you insist on immutability, you should use a Vector (or the upcoming Conc lists!).
If you would like a validation of these claims, see this link where the performance of the two operations is compared using ScalaMeter -- lists are 8 times faster than vectors when you add to the beginning.
However, the most appropriate data structure should be either an ArrayBuffer or a VectorBuilder. These are mutable data structures that resize dynamically and if you build them using += you will get a reasonable performance. This is assuming that you are not storing primitives.

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).

Immutable Map implementation for huge maps

If I have an immutable Map which I might expect (over a very short period of time - like a few seconds) to be adding/removing hundreds of thousands of items from, is the standard HashMap a bad idea? Let's say I want to pass 1Gb of data through the Map in <10 seconds in such a way that the maximum size of the Map at any once instant is only 256Mb.
I get the impression that the map keeps some kind of "history" but I will always be accessing the last-updated table (i.e. I do not pass the map around) because it is a private member variable of an Actor which is updated/accessed only from within reactions.
Basically I suspect that this data structure may be (partly) at fault for issues I am seeing around JVMs going out of memory when reading in large amounts of data in a short time.
Would I be better off with a different map implementation and, if so, what is it?
Ouch. Why do you have to use an immutable map? Poor garbage collector! Immutable maps generally require (log n) new objects per operation in addition to (log n) time, or they really just wrap mutable hash maps and layer changesets on top (which slows things down and can increase the number of object creations).
Immutability is great, but this does not seem to me like the time to use it. If I were you, I'd stick with scala.collection.mutable.HashMap. If you need concurrent access, wrap the Java util.concurrent one instead.
You also might want to increase the size of the young generation in the JVM: -Xmn1G or more (assuming you're running with -Xmx3G). Also, use the throughput (parallel) garbage collector.
That would be awful. You say you always want to access the last-updated table, that means you only need an ephemeral data structure, there is no need to pay the cost for a persistent data structure - it's like trading time and memory to gain completely arguable "style points". You are not building your karma by using blindly persistent structures when they are not called for.
Also, a hashtable is a particularly difficult structure to make persistent. In other words, "very, very slow" (basically it is usable when reads greatly outnumber writes - and you seem to talk about many writes).
By the way, a ConcurrentHashMap wouldn't make sense in this design, given that the map is accessed from a single actor (that's what I understand from the description).
Scala's so-called(*) immutable Map is broken beyond basic usage up to Scala 2.7. Don't trust me, just look up the number of open tickets for it. And the solution is just "it will be replaced with something else on Scala 2.8" (which it did).
So, if you want an immutable map for Scala 2.7.x, I'd advise looking for it in something other than Scala. Or just use TreeHashMap instead.
(*) Scala's immutable Map isn't really immutable. It is a mutable data structure internally, which requires lot of synchronization.

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline searching faster on a word faster than the hashing algorithm can hash, you should be able to be faster.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add these elements to a set data type as you go line by line, before asking whether it is in the hash table. After you know it is in the set, then add a dictionary value of 2, since you already added it to the set once before.
This will take some of the memory and computation away from asking the dictionary every single time, and instead will handle unique valued words better, at the end of the call just dump all the words that are not in the dictionary out of the set with a value of 1. (Intersect the two collections in respect to the set)
To a large extent, it depends on what you want you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
a simple python script:
import collections
f = file('words.txt')
counts = collections.defaultdict(int)
for line in f:
counts[line.strip()] +=1
print "\n".join("%s: %d" % (word, count) for (word, count) in counts.iteritems())