Merging huge sets (HashSet) in Scala - scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?

You can get slightly better performance with Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)

There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.

Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
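As a rough illustration of the first two strategies, here is a minimal Python sketch (not the Scala code the answer refers to). It assumes both inputs are already sorted ascending with no internal duplicates for the merge, and non-negative integers below a known bound for the bitset. The names merge_sorted_unique, bitset_union and upper_bound are made up for this example.

```python
def merge_sorted_unique(a, b):
    """Union of two ascending, duplicate-free lists via a two-pointer walk."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:
            # Value present in both inputs: emit it once, advance both pointers.
            out.append(a[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def bitset_union(values_a, values_b, upper_bound):
    """Union of non-negative ints below upper_bound, using one bit per value."""
    bits = bytearray((upper_bound + 7) // 8)
    for values in (values_a, values_b):
        for v in values:
            bits[v >> 3] |= 1 << (v & 7)
    # Scan the bitset to list the values that were set, in ascending order.
    return [v for v in range(upper_bound) if bits[v >> 3] & (1 << (v & 7))]
```

The merge does one pass over each input and no hashing at all, which is why it can pay off when the data is (or can cheaply be) sorted. The bitset only makes sense when the value range is small enough to allocate one bit per possible value.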

Is it a good idea to have a big set/list as column in `Scylla DB`?

Is it a good idea to have a table in Scylla DB with column type set with couple of thousands elements in it, e.g 5000 elements?
In Scylla documentation it's stated that:
Collections are meant for storing/denormalizing a relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all messages sent by a user”, “events registered by a sensor”…), then collections are not appropriate, and a specific table (with clustering columns) should be used. ~ [source]
My column is much bigger than "the phone numbers of a given user", but much smaller than “all messages sent by a user” (column set is going to be 'frozen', if that matters), so I am confused what to do?
If your set is frozen, you can be a little more relaxed about it. This is because ScyllaDB will not have to break it into components and re-create it so often as it does with non-frozen sets.
So if you're sure the frozen set won't be larger than a megabyte or so, it will be fine. For simple read/write queries it will be treated as a blob.
The main downside of having a large individual cell - frozen set, string, or even an unfrozen set - is that the CQL API does not give you an efficient way to read or write only part of that cell. For example, every time you want to access your set, Scylla will need to read it entirely into memory. This takes time and effort. Even worse, it also increases the latency of other requests because Scylla's scheduling is cooperative and does not switch tasks in the middle of handling a single cell because it is assumed to be fairly small.
Whether 5,000 elements specifically is too much also depends on the size of each element - 5,000 elements of 10 bytes each total 50K, but at 100 bytes each they total 500K. A 500K cell will certainly increase tail latency noticeably, but this may or may not be important for your application. If you can't think of a data model that doesn't involve large collections, then you can definitely try the one you thought of, and check whether the performance is acceptable to you.
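The size estimate above is plain multiplication. A tiny Python sketch, assuming only the raw element payload counts (any per-element serialization overhead is ignored, which is a simplification); estimated_cell_bytes is a made-up helper name:

```python
def estimated_cell_bytes(num_elements, bytes_per_element):
    # Raw payload only; per-element serialization overhead is ignored (assumption).
    return num_elements * bytes_per_element


for element_size in (10, 100):
    print(element_size, "bytes/element:", estimated_cell_bytes(5000, element_size), "bytes")
```

Plugging in your own element count and a realistic element size gives a quick first check against the "megabyte or so" guideline mentioned earlier.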
In any case, if your use case involves unbounded collections - i.e., 5,000 elements is not a hard limit but some sort of average, and if in some rows you actually have a million elements, you're in for a world of pain :-( You can start to see huge latencies (as one single 1-million-cell row delays many other requests waiting in line) and in extreme cases even allocation failures. So you will somehow need to avoid this problem. Avoiding it isn't always easy - Scylla doesn't have a feature that prevents your 5,000-element set growing into a million-element set (see https://github.com/scylladb/scylladb/issues/10070).

Write performance scala immutable collections

Quick question. I'm currently designing some database queries to extract reasonably large, but not massive datasets into memory, say approximately 10k-100k records.
So far I've been testing loading these resultsets into a scala.collection.immutable.Seq and have discovered it seems to take an incredibly long time to build the collection. Whereas if I change to a Vector or List the write into memory takes fractions of a second.
My question is therefore: why is Seq so slow in this case? And in what cases would using Seq be more appropriate than Vector?
Thanks
It would help if you'd post the relevant snippet and which operations you call on the sequence -- immutable.Seq is represented using a List (see https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/immutable/Seq.scala#L42). My guess is that you've been using :+ on the immutable.Seq, which under the hood appends to the end of the list by copying it (probably giving you quadratic overall performance), and when you switched to using immutable.List directly, you've been attaching to the beginning using :: (giving you linear performance).
Since Seq is just a List under the hood, you should use it when you attach to the beginning of the sequence -- the cons operator :: only creates one node and links it to the rest of the list, which is as fast as it can get when it comes to immutable data structures. Otherwise, if you add to the end, and you insist on immutability, you should use a Vector (or the upcoming Conc lists!).
If you would like a validation of these claims, see this link where the performance of the two operations is compared using ScalaMeter -- lists are 8 times faster than vectors when you add to the beginning.
However, the most appropriate data structure should be either an ArrayBuffer or a VectorBuilder. These are mutable data structures that resize dynamically and if you build them using += you will get a reasonable performance. This is assuming that you are not storing primitives.
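The difference between "copy on every append" and "fill a growable buffer, then freeze" can be shown with a Python analogue. This is not Scala: tuple concatenation stands in for :+ on a list-backed immutable sequence, and a Python list stands in for ArrayBuffer. Both function names are invented for this sketch.

```python
def build_by_copying(rows):
    # Each step allocates a new tuple and copies every earlier element,
    # so building n elements costs O(n^2) copies overall.
    result = ()
    for row in rows:
        result = result + (row,)
    return result


def build_with_buffer(rows):
    # Append into a growable buffer (amortized O(1) per element),
    # then convert to an immutable sequence once at the end.
    buffer = []
    for row in rows:
        buffer.append(row)
    return tuple(buffer)
```

Both functions return the same immutable result; only the cost of getting there differs, and the gap widens as the number of records grows.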

what is the efficiency of an assign statement in progress-4gl

why is an assign statement more efficient than not using assign?
co-workers say that:
assign
a=3
v=7
w=8.
is more efficient than:
a=3.
v=7.
w=8.
why?
You could always test it yourself and see... but, yes, it is slightly more efficient. Or it was the last time I tested it. The reason is that the compiler combines the statements and the resulting r-code is a bit smaller.
But efficiency is almost always a poor reason to do it. Saving a micro-second here and there pales next to avoiding disk IO or picking a more efficient algorithm. Good reasons:
Back in the dark ages there was a limit of 63k of r-code per program. Combining statements with ASSIGN was a way to reduce the size of r-code and stay under that limit (ok, that might not be a "good" reason). One additional way this helps is that you could also often avoid a DO ... END pair and further reduce r-code size.
When creating or updating a record the fields that are part of an index will be written back to the database as they are assigned (not at the end of the transaction) -- grouping all assignments into a single statement helps to avoid inconsistent dirty reads. Grouping the indexed fields into a single ASSIGN avoids writing the index entries multiple times. (This is probably the best reason to use ASSIGN.)
Readability -- you can argue that grouping consecutive assignments more clearly shows your intent and is thus more readable. (I like this reason but not everyone agrees.)
basically doing:
a=3.
v=7.
w=8.
is the same as:
assign a=3.
assign v=7.
assign w=8.
which is 3 separate statements so a little more overhead. Therefore less efficient.
Progress executes an ASSIGN as one statement whether one or more variables are being assigned. If you do not write ASSIGN, it is implied for each assignment, so you execute 3 statements instead of 1. There is a 20% - 40% reduction in r-code and a 15% - 20% performance improvement when using one ASSIGN statement. Why this is so can only be speculated on, as I cannot find any source that explains it. For database fields, and especially key/index fields, it makes perfect sense. For variables I can only assume it has to do with how Progress manages its buffers and copies data to and from them.
ASSIGN will combine multiple statements into one. If a, v and w are fields in your db, that means it will do something like INSERT INTO (a,v,w)...
rather than
INSERT INTO (a)...
INSERT INTO (v)
etc.

One billion length List in Scala?

Just as a load test, I was playing with different data structures in Scala. Just wondering what it takes to work or even create a one billion length array. 100 million seems to be no problem, of course there's no real magic about the number 1,000,000,000. I'm just seeing how far you can push it.
I had to bump up memory on most of the tests. export JAVA_OPTS="-Xms4g -Xmx8g"
// insanity begins ...
val buf = (0 to 1000000000 - 1).par.map { i => i }.toList
// java.lang.OutOfMemoryError: GC overhead limit exceeded
However, preallocating an Array[Int] works pretty well. It takes about 9 seconds to iterate and build the object. Interestingly, doing almost anything with ListBuffer seems to automatically take advantage of all cores. However, the code above will not finish (at least with 8gb Xmx).
I understand that this is not a common case and I'm just messing around. But if you had to pull some massive thing into memory, is there a more efficient technique? Is Array with type as efficient as it gets?
The per-element overhead of a List is considerable. Each element is held in a cons cell (case class ::) which means there is one object with two fields for every element. On a 32-bit JVM that's 16 bytes per element (not counting the element value itself). On a 64-bit JVM it's going to be even higher.
List is not a good container type for extremely large contents. Its primary feature is very efficient head / tail decomposition. If that's something you need then you may just have to deal with the memory cost. If it's not, try to choose a more efficient representation.
For what it's worth, I consider memory overhead considerations to be one thing that justifies using Array. There are lots of caveats around using arrays, so be careful if you go that way.
Given that the JVM can sensibly arrange an Array of Ints in memory, if you really need to iterate over them it would indeed be the most efficient approach. It would generate much the same code if you did exactly the same thing with Java.
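A back-of-the-envelope calculation makes the gap concrete. The sketch below is plain arithmetic in Python, using the 16 bytes per cons cell quoted above for a 32-bit JVM and the 4 bytes of a JVM Int. It ignores the array header and the cost of the boxed element values in the List, so the List figure is a lower bound.

```python
INT_BYTES = 4          # a JVM Int is 32 bits
CONS_CELL_BYTES = 16   # per-element List overhead quoted above (32-bit JVM)


def array_of_int_bytes(n):
    # Payload of a primitive int array; the small fixed array header is ignored.
    return n * INT_BYTES


def list_overhead_bytes(n):
    # Cons cells only; the element values themselves are not counted.
    return n * CONS_CELL_BYTES


n = 1_000_000_000
print("Array of Int payload:", array_of_int_bytes(n) / 2**30, "GiB")
print("List cons cells only:", list_overhead_bytes(n) / 2**30, "GiB")
```

For one billion elements this comes to about 3.7 GiB for the array payload versus about 14.9 GiB for the cons cells alone, which is already beyond the 8g maximum heap used in the question.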

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search on a word faster than the hashing algorithm can hash it, the trie should come out ahead.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
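For reference, a trie that keeps the count at the node where a word ends can be sketched in a few lines of Python using nested dicts. This is only a minimal illustration: it does not implement the prefix-skipping trick described above, and COUNT, trie_add and trie_count are names invented here.

```python
COUNT = ""  # sentinel key for a node's count; a single character can never equal it


def trie_add(root, word):
    """Walk or create one node per character, then bump the count at the last node."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[COUNT] = node.get(COUNT, 0) + 1


def trie_count(root, word):
    """Return how many times word was added (0 if it was never added)."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return 0
    return node.get(COUNT, 0)


root = {}
for w in ["tortoise", "tort", "tortoise", "hare"]:
    trie_add(root, w)
```

Here trie_count(root, "torto") is 0 even though the prefix exists, because a count is stored only on nodes where a complete word ended.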
I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add these elements to a set data type as you go line by line, before asking whether it is in the hash table. After you know it is in the set, then add a dictionary value of 2, since you already added it to the set once before.
This takes some of the memory and computation away from asking the dictionary every single time, and handles words that occur only once better. At the end of the run, just dump all the words that are in the set but not in the dictionary with a value of 1 (i.e., take the set minus the dictionary's keys).
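One way to read the set-plus-dictionary idea is sketched below. It assumes one word per line, as in the question. count_words, seen and repeats are illustrative names, and whether this actually beats a plain dictionary would need to be measured on real data.

```python
def count_words(lines):
    seen = set()    # every word encountered at least once
    repeats = {}    # only words encountered two or more times
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        if word not in seen:
            seen.add(word)          # first occurrence: the set only
        elif word in repeats:
            repeats[word] += 1      # third or later occurrence
        else:
            repeats[word] = 2       # second occurrence: start counting at 2
    # Words in the set but not in the dictionary occurred exactly once.
    counts = {w: 1 for w in seen if w not in repeats}
    counts.update(repeats)
    return counts
```

For a real run you would pass an open file object as lines, so the file is still read once, linearly.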
To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
a simple python script:
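import collections

counts = collections.defaultdict(int)
# Read the file once, line by line; each line holds one word.
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1
# Print one "word: count" pair per line.
print("\n".join("%s: %d" % (word, count) for word, count in counts.items()))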