Write performance of Scala immutable collections

Quick question. I'm currently designing some database queries to extract reasonably large, but not massive, datasets into memory, say approximately 10k-100k records.
So far I've been testing loading these result sets into a scala.collection.immutable.Seq and have discovered that it seems to take an incredibly long time to build the collection, whereas if I change to a Vector or List the write into memory takes fractions of a second.
My question is therefore: why is Seq so slow in this case? And in what cases would using Seq be more appropriate than Vector?
Thanks

It would help if you'd post the relevant snippet and which operations you call on the sequence -- immutable.Seq is represented using a List (see https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/immutable/Seq.scala#L42). My guess is that you've been using :+ on the immutable.Seq, which under the hood appends to the end of the list by copying it (probably giving you quadratic overall performance), and when you switched to using immutable.List directly, you've been attaching to the beginning using :: (giving you linear performance).
Since Seq is just a List under the hood, you should use it when you attach to the beginning of the sequence -- the cons operator :: only creates a single node and links it to the rest of the list, which is as fast as it can get when it comes to immutable data structures. Otherwise, if you add to the end, and you insist on immutability, you should use a Vector (or the upcoming Conc lists!).
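To make the difference concrete, here is a minimal sketch (the record values are made up) contrasting the two ways of building the sequence:

// Appending to an immutable Seq backed by a List copies the list on every :+,
// so this loop is roughly O(n^2) overall.
var slow: Seq[String] = Seq.empty[String]
for (i <- 1 to 100000) slow = slow :+ ("record" + i)

// Prepending with :: allocates a single cons cell per element, so this is O(n);
// reverse at the end if insertion order matters.
var fast: List[String] = Nil
for (i <- 1 to 100000) fast = ("record" + i) :: fast
val inOrder = fast.reverse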
If you would like validation of these claims, see this link where the performance of the two operations is compared using ScalaMeter -- lists are 8 times faster than vectors when you add to the beginning.
However, the most appropriate data structure here should probably be either an ArrayBuffer or a VectorBuilder. These are mutable data structures that resize dynamically, and if you build them using += you will get reasonable performance. This is assuming that you are not storing primitives.
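For completeness, a rough sketch of the buffer-based approach -- here resultRows is just a placeholder for whatever your query actually returns:

import scala.collection.immutable.VectorBuilder
import scala.collection.mutable.ArrayBuffer

val resultRows: Iterator[String] = Iterator.tabulate(100000)(i => "record" + i) // placeholder data

val buf = new ArrayBuffer[String]()   // resizes its backing array geometrically
resultRows.foreach(buf += _)          // amortised O(1) per append
val frozen = buf.toVector             // freeze into an immutable Vector afterwards

// or build the Vector directly:
val vb = new VectorBuilder[String]()
vb ++= Seq("a", "b", "c")
val vec = vb.result()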

Related

Difference between range() and xrange()

I am reading through a tutorial on Python and it says "There is a variant xrange() which avoids the cost of building the whole list for performance sensitive cases (in Python 3000, range() will have the good performance behavior and you can forget about xrange())." Source.
I am asking about Python 2.x and not Python 3.
I'm not sure what this means. Am I to understand that range(a, b) creates a list of all the values from a to b and then iterates over them, while xrange(a, b) only creates each value as it iterates? If this is the case, then performance is only improved if the code does not actually iterate over the entire list and breaks early.
Can someone comment on this?
You understood correctly.
The difference resides in memory: when using range() the entire list is allocated in memory, whereas xrange() returns a generator (actually it returns an xrange object, which acts like a generator).
The Python documentation on xrange() is quite clear on this subject:
The advantage of xrange() over range() is minimal (since xrange() still has to create the values when asked for them) except when a very large range is used on a memory-starved machine or when all of the range’s elements are never used
In Python 3, range() doesn't generate the whole sequence at once, so it also performs well:
The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).
Documentation

Handling large iterators - aggregation

Let's say we have an Iterator of (String, String) tuples.
Said Iterator has a kazillion elements, possibly exhausting main memory.
What would you do if you had to aggregate it like the following:
The tuples have the form (entityname, attributename), and you would have to populate a list of the attributenames for each entityname. Also, the iterator is completely unordered and would never fit into memory.
(For instance, the last and the first attributename could correspond to the same entityname.)
A concrete example:
("stackoverflow","users"),
("bear","claws"),
("stackoverflow","usesAjaxTechnology"),
("bear","eyes")
after the aggregation:
("stackoverflow",List("users","usesAjaxTechnology")),
("bear",List("claws","eyes")).
I know that there are operations like groupBy and so on, but since the iterator has a lot of elements, wouldn't this run into memory problems and never work?
Well, let's take a look at what groupBy does:
scala> val data = List(("stackoverflow","users"), ("bear","claws"), ("stackoverflow","usesAjaxTechnology"), ("bear","eyes"))
data: List[(String, String)] = List((stackoverflow,users), (bear,claws), (stackoverflow,usesAjaxTechnology), (bear,eyes))

scala> data.groupBy(x => x._1)
res0: scala.collection.immutable.Map[String,List[(String, String)]] =
Map( bear -> List((bear,claws), (bear,eyes)),
stackoverflow -> List((stackoverflow,users), (stackoverflow,usesAjaxTechnology))
)
As you can see, it creates a Map of elements. Since it does so in memory, you'd obviously run into memory issues as the data grows larger than your RAM.
On the other hand, it'd be possible to construct a Map-like structure that, instead of holding all data in memory, writes it to the filesystem. The simplest such Map would create a file for every key (e.g. "bear" or "stackoverflow") in a certain directory, and write all the attributes into the corresponding files. This would require almost no memory usage, replacing it with very high disk usage.
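A hedged sketch of that idea -- the directory layout is made up, and opening a writer per tuple keeps it simple at the cost of speed:

import java.io.{File, FileWriter}

// Append each attributename to a file named after its entityname, so nothing
// but the current tuple is ever held in memory.
def spillToDisk(tuples: Iterator[(String, String)], dir: File): Unit = {
  dir.mkdirs()
  tuples.foreach { case (entity, attribute) =>
    val out = new FileWriter(new File(dir, entity), true)  // true = append
    try out.write(attribute + "\n") finally out.close()
  }
}
// Afterwards, each file under `dir` contains the attribute list for one entity.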
I'm wondering if this is sort of an artificial requirement, or if you are actually facing a real problem where this is an issue. Also, I'm really interested in hearing what the real functional programming pros here have to say :)
If you have that many elements, I would assume that they are in a database or a file of some type. I would take them out in manageable chunks and process them that way, writing them back to a db or a new file. That would solve your problem with memory and allow you to perform this kind of processing.
If you are using MongoDB (which I recommend), your find query could easily pull out only the stackoverflow users, and then your next statement could write that to a new collection. The same with the bear.

What are the real advantages of immutable collections?

Scala provides immutable collections, such as Set, List, Map. I understand that immutability has advantages in concurrent programs. However, what exactly are the advantages of immutability in regular data processing?
What if I enumerate subsets, permutations and combinations, for example? Do immutable collections have any advantage here?
What exactly are the advantages of immutability in regular data processing?
Generally speaking, immutable objects are easier/simpler to reason about.
They do. Since you're enumerating over a collection, presumably you want to be certain that elements are not inadvertently added or removed while you're enumerating.
Immutability is very much a paradigm in functional programming. Making collections immutable allows one to think of them much like primitive data types (i.e. modifying a collection or any other object results in creating a different object just as adding 2 to 3 doesn't modify 3, but creates 5)
To expand Matt's answer: From my personal experience I can say that implementations of algorithms based on search trees (e.g. breadth first, depth first, backtracking) using mutable collections regularly end up as a steaming pile of crap: either you forget to copy a collection before a recursive call, or you fail to take back changes correctly when you get the collection back. In that area immutable collections are clearly superior. I ended up writing my own immutable list in Java when I couldn't get a problem right with Java's collections. Lo and behold, the first "immutable" implementation worked immediately.
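To illustrate that point, here is a small hedged sketch of a backtracking search over an invented immutable graph encoding: because each recursive call gets its own visited set, there is nothing to copy up front and nothing to undo on failure.

// Depth-first search for a path in a directed graph given as Map(node -> neighbours).
def findPath(graph: Map[Int, List[Int]], from: Int, to: Int,
             visited: Set[Int] = Set.empty): Option[List[Int]] =
  if (from == to) Some(List(to))
  else {
    graph.getOrElse(from, Nil)
      .filterNot(visited)
      .iterator
      .map(n => findPath(graph, n, to, visited + from))  // `visited + from` is a new set; the caller's is untouched
      .collectFirst { case Some(path) => from :: path }
  }

// findPath(Map(1 -> List(2, 3), 2 -> List(4)), 1, 4)  returns Some(List(1, 2, 4))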
If your data doesn't change after creation, use immutable data structures. The type you choose will identify the intent of usage. Anything more specific would require knowledge about your particular problem space.
You may really be looking for a subset, permutation, or combination generator, and then the discussion of data structures is moot.
Also, you mentioned that you understand the concurrent advantages. Presumably, you're throwing some algorithm at permutations and subsets, and there's a good chance that algorithm can be parallelized to some extent. If that's the case, using immutable structures up front ensures your initial implementation of algorithm X will be easily transformed into concurrent algorithm X.
I have a couple of advantages to add to the list:
Immutable collections can't be invalidated out from under you
That is, it's totally fine to have immutable public val members of a Scala class. They are read-only by definition. Compare to Java where not only do you have to remember to make the member private but also write a get method that returns a copy of the object so the original is not modified by the calling code.
Immutable data structures are persistent. This means that the immutable collection obtained by calling filter on your TreeSet actually shares some of its nodes with the original. This translates to time and space savings and offsets some of the penalties incurred by using immutability.
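A tiny hedged illustration of both points (the class and values are invented):

import scala.collection.immutable.TreeSet

class PriceBook(val prices: TreeSet[Int])     // public val is safe: callers can read it but never change it

val book  = new PriceBook(TreeSet(1, 5, 10, 50, 100))
val cheap = book.prices.filter(_ < 20)        // a new set; per the point above it can share nodes
                                              // with the original instead of copying everything
// book.prices is still TreeSet(1, 5, 10, 50, 100)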
Some advantages of immutability:
1 - A smaller margin for error (you always know what's in your collections and read-only variables).
2 - You can write concurrent programs without worrying about threads stepping on each other when modifying variables and collections.

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
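A rough sketch of the sizeHint idea, using a builder and assuming the combined size is roughly known up front (the element type and sets below are placeholders):

import scala.collection.mutable

val setOne = mutable.HashSet(1, 2, 3)            // stand-ins for the two huge sets
val setTwo = mutable.HashSet(3, 4, 5)

val builder = mutable.HashSet.newBuilder[Int]
builder.sizeHint(setOne.size + setTwo.size)      // pre-size to avoid repeated hash-table resizing
builder ++= setOne
builder ++= setTwo
val merged = builder.result()                    // HashSet(1, 2, 3, 4, 5)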
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
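A hedged sketch of that first (sorted-merge) strategy, assuming the elements are plain Ints and that both sides have already been dumped into sorted arrays:

// Two-pointer merge of two sorted, duplicate-free arrays into one sorted array.
def mergeSorted(a: Array[Int], b: Array[Int]): Array[Int] = {
  val out = Array.newBuilder[Int]
  var i = 0; var j = 0
  while (i < a.length && j < b.length) {
    if (a(i) < b(j))      { out += a(i); i += 1 }
    else if (a(i) > b(j)) { out += b(j); j += 1 }
    else                  { out += a(i); i += 1; j += 1 }  // element in both sets: keep one copy
  }
  while (i < a.length) { out += a(i); i += 1 }
  while (j < b.length) { out += b(j); j += 1 }
  out.result()
}
// mergeSorted(Array(1, 3, 5), Array(2, 3, 6))  returns Array(1, 2, 3, 5, 6)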

Immutable Map implementation for huge maps

If I have an immutable Map which I might expect (over a very short period of time - like a few seconds) to be adding/removing hundreds of thousands of items from, is the standard HashMap a bad idea? Let's say I want to pass 1 GB of data through the Map in <10 seconds in such a way that the maximum size of the Map at any one instant is only 256 MB.
I get the impression that the map keeps some kind of "history" but I will always be accessing the last-updated table (i.e. I do not pass the map around) because it is a private member variable of an Actor which is updated/accessed only from within reactions.
Basically I suspect that this data structure may be (partly) at fault for issues I am seeing around JVMs going out of memory when reading in large amounts of data in a short time.
Would I be better off with a different map implementation and, if so, what is it?
Ouch. Why do you have to use an immutable map? Poor garbage collector! Immutable maps generally require O(log n) new objects per operation in addition to O(log n) time, or they really just wrap mutable hash maps and layer changesets on top (which slows things down and can increase the number of object creations).
Immutability is great, but this does not seem to me like the time to use it. If I were you, I'd stick with scala.collection.mutable.HashMap. If you need concurrent access, wrap the Java util.concurrent one instead.
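A minimal sketch of that suggestion, assuming the map is only touched from inside the actor (key and value types are placeholders):

import scala.collection.mutable

val table = mutable.HashMap.empty[String, Array[Byte]]
table("chunk-1") = Array[Byte](1, 2, 3)   // insert/overwrite in place -- no tree copying, no changeset layering
table.remove("chunk-1")                   // removing drops the reference so the value can be collected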
You also might want to increase the size of the young generation in the JVM: -Xmn1G or more (assuming you're running with -Xmx3G). Also, use the throughput (parallel) garbage collector.
That would be awful. You say you always want to access the last-updated table; that means you only need an ephemeral data structure, so there is no need to pay the cost of a persistent one - it's like trading time and memory to gain completely arguable "style points". You are not building your karma by blindly using persistent structures when they are not called for.
Also, a hashtable is a particularly difficult structure to make persistent. In other words, "very, very slow" (basically it is usable when reads greatly outnumber writes - and you seem to talk about many writes).
By the way, a ConcurrentHashMap wouldn't make sense in this design, given that the map is accessed from a single actor (that's what I understand from the description).
Scala's so-called(*) immutable Map is broken beyond basic usage up to Scala 2.7. Don't trust me, just look up the number of open tickets for it. And the solution is just "it will be replaced with something else in Scala 2.8" (which it was).
So, if you want an immutable map for Scala 2.7.x, I'd advise looking for it in something other than Scala. Or just use TreeHashMap instead.
(*) Scala's immutable Map isn't really immutable. It is a mutable data structure internally, which requires a lot of synchronization.