Immutable Map implementation for huge maps - scala

If I have an immutable Map from which I might expect (over a very short period of time - like a few seconds) to be adding/removing hundreds of thousands of items, is the standard HashMap a bad idea? Let's say I want to pass 1 GB of data through the Map in under 10 seconds in such a way that the maximum size of the Map at any one instant is only 256 MB.
I get the impression that the map keeps some kind of "history" but I will always be accessing the last-updated table (i.e. I do not pass the map around) because it is a private member variable of an Actor which is updated/accessed only from within reactions.
Basically, I suspect that this data structure may be (partly) at fault for the issues I am seeing with JVMs running out of memory when reading in large amounts of data in a short time.
Would I be better off with a different map implementation and, if so, what is it?

Ouch. Why do you have to use an immutable map? Poor garbage collector! Immutable maps generally require O(log n) new objects per operation in addition to O(log n) time, or they really just wrap mutable hash maps and layer changesets on top (which slows things down and can increase the number of object creations).
Immutability is great, but this does not seem to me like the time to use it. If I were you, I'd stick with scala.collection.mutable.HashMap. If you need concurrent access, wrap the Java util.concurrent one instead.
You also might want to increase the size of the young generation in the JVM: -Xmn1G or more (assuming you're running with -Xmx3G). Also, use the throughput (parallel) garbage collector.
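For illustration, here is a minimal sketch of what that could look like with the old scala.actors API the question describes (the message types and names are made up): the map is only ever touched inside the actor's react loop, so a plain mutable HashMap needs no synchronization and allocates no extra objects per update.

```scala
import scala.actors.Actor
import scala.actors.Actor.loop
import scala.collection.mutable

// Hypothetical message types for the sketch.
case class Put(key: String, value: Array[Byte])
case class Remove(key: String)

class Cache extends Actor {
  // Private to the actor, only ever read/written from within react.
  private val table = mutable.HashMap.empty[String, Array[Byte]]

  def act(): Unit = loop {
    react {
      case Put(k, v) => table(k) = v
      case Remove(k) => table -= k
    }
  }
}
```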

That would be awful. You say you always want to access the last-updated table; that means you only need an ephemeral data structure, so there is no need to pay the cost of a persistent one - you would be trading time and memory for entirely debatable "style points". You don't build karma by blindly using persistent structures where they are not called for.
Also, a hash table is a particularly difficult structure to make persistent. In other words, "very, very slow" (it is basically usable only when reads greatly outnumber writes - and you seem to be talking about many writes).
By the way, a ConcurrentHashMap wouldn't make sense in this design, given that the map is accessed from a single actor (that's what I understand from the description).

Scala's so-called(*) immutable Map is broken beyond basic usage up to Scala 2.7. Don't trust me; just look up the number of open tickets for it. And the solution is just "it will be replaced with something else in Scala 2.8" (which it was).
So, if you want an immutable map for Scala 2.7.x, I'd advise looking for it in something other than Scala. Or just use TreeHashMap instead.
(*) Scala's immutable Map isn't really immutable. It is a mutable data structure internally, which requires a lot of synchronization.

Related

Scala: Lensing vs mutable design

My basic understanding of lensing is that "a lens is a value representing maps between a complex type and one of its constituents. This map works both ways - we can get or "access" the constituent and set or "mutate" it".
I came across this when I was designing a machine learning library (neural nets), which demands keeping a big data structure of parameters, groups of which need to be updated at different stages of the algorithm. I wanted to make the whole parameter data structure immutable, but changing a single group of parameters requires copying all the parameters and recreating a new data structure, which sounds inefficient. Not surprisingly, other people have thought about this too. Some suggest using lensing, which, in a sense, lets you modify immutable data structures, while others suggested just using mutable structures for this. Unfortunately I couldn't find anything comparing the two paradigms speed-wise, space-wise, code-complexity-wise, etc.
Now the question is: what are the pros/cons of using lensing vs a mutable design?
The trade-offs between the two are pretty much as you surmised. Lenses are less complex than tracking changes to a large immutable data structure manually, but they still require more complex code than a mutable data structure, and there is some amount of runtime overhead. To know how much, you would have to measure, but it's probably less than you think, because much of the updated structure isn't copied but shared.
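To make the lens idea concrete, here is a minimal hand-rolled sketch (all names are made up; in practice you would probably reach for a lens library such as Monocle rather than rolling your own):

```scala
object LensSketch {
  // A lens packages a getter and a setter for one "slot" of a larger value.
  case class Lens[S, A](get: S => A, set: (S, A) => S) {
    def modify(s: S)(f: A => A): S = set(s, f(get(s)))
  }

  // Hypothetical parameter structure for a tiny network.
  case class Layer(weights: Vector[Double], bias: Double)
  case class Network(hidden: Layer, output: Layer)

  val hiddenWeights = Lens[Network, Vector[Double]](
    n => n.hidden.weights,
    (n, w) => n.copy(hidden = n.hidden.copy(weights = w))
  )

  val net = Network(Layer(Vector(0.1, 0.2), 0.0), Layer(Vector(0.5), 0.0))

  // "Updating" one parameter group yields a new Network; the untouched
  // output layer is shared between the old and new values, not copied.
  val scaled = hiddenWeights.modify(net)(_.map(_ * 0.9))
}
```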
Mutable data structures are simpler and somewhat faster to modify, but harder to reason about, because now you have to take the order functions are called into account, worry about concurrency, and so forth.
Your third option is to make a bunch of small immutable data structures instead of one big one. Mutability often forces a single large data structure because of the need for a single source of truth, and to ensure that all references to data change at the same time. With immutability, this is a lot easier to control.
For example, you can have two separate Maps with the same type of key and different types of simple values, instead of one Map with a more complex value. Not only does this have performance benefits, it also makes it much easier to modularize your code.
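As a tiny illustration of that last point (the names here are made up):

```scala
object SplitMaps {
  // Two small, independent immutable maps instead of one Map with a
  // complex value type.
  val emails: Map[Long, String] = Map(1L -> "a@example.com")
  val scores: Map[Long, Double] = Map(1L -> 0.87)

  // Updating a score produces a new scores map; emails is left untouched
  // and can be shared freely with code that only cares about emails.
  val updatedScores: Map[Long, Double] = scores.updated(1L, 0.91)
}
```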

Best PostgreSQL hierarchical tree for both performance and moving nodes from GUI?

Since I'm using PostgreSQL, there is a module called ltree, which satisfies at least one of my needs: performance (I don't know about scalability; someone says materialized path trees do not scale well).
Since the application I'm developing is a CMS built entirely around a big tree (nodes, subtrees, etc.), performance when querying these nodes is absolutely essential. But since it is a large (and growing) hierarchical tree that users work on and manipulate from the GUI (CRUD), I also want to make it possible for them to drag and drop to reorder nodes and subtrees while updating the tree (child records) in the database correctly.
As I understand it, moving and reordering nodes/subtrees is not really what ltree/materialized path trees are good for, so what I hope you can help me with is either pointing me to the tree structure model that is best for both performance AND moving subtrees and nodes, or - if ltree is not just a leftover from the past but still worth using - explaining how to achieve this with PostgreSQL's ltree module. And why/why not use ltree in this case?
Requirements:
Query performance is of course my top priority (all nodes, subtrees, leaves).
The tree should support deep nesting and sorting.
And of course the tree should support growing large and scaling well.
I can live with a little waiting time while reordering from the GUI, if a single "jack-of-all-trades" tree implementation doesn't exist or is too complex to be worth it.
I'm also considering Closure tables aka Bridge tables (a lot!), Nested Intervals (not sure I understand exactly how to implement them, and no good examples or gists currently exist?) or B-tree models. I'm just not quite sure yet how these would satisfy my four requirements above. Reorganizing subtrees and nodes in nested intervals seems straightforward and performance seems good. It's quite hard to choose the right one to go with.
Since I definitely need performance (query/read performance), scalability and sorting, I thought that Closure tables WITH sort order could be very close, but I just can't imagine how big the closure tables and the disk-space overhead will become as my tree and nodes grow large. Closure tables and scalability, I'm just not too sure of. Am I wrong to worry about this, and what might the best solution for this task be?
The typical data structures used to index trees stored in SQL are designed and optimized for read performance on sets that don't change often.
As an example, if you're using the nested set model, adding or deleting a node would involve updating the entire tree (which typically means rewriting the entire table): great for reads, not so great for writes.
When write performance is important for you, you'll usually be better off working on the raw (id, parent_id) tuples with recursive queries, while setting tree indexes you know for sure are dirty to null as you go. In those areas of the app where read-performance is more important, do a sanity check by checking for null values in the tree index, and re-index the tree as needed before actually using it. That way, you'll avoid incessant rewrites of your tree, and instead re-index it only when needed for a read.
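As a rough sketch of the raw (id, parent_id) approach (the schema, table name and connection string below are assumptions, not anything from the question), a recursive CTE issued over JDBC can pull a whole subtree from plain adjacency rows:

```scala
import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

object SubtreeQuery {
  // Assumes a hypothetical table nodes(id bigint, parent_id bigint null).
  private val subtreeSql =
    """WITH RECURSIVE subtree AS (
      |  SELECT id, parent_id FROM nodes WHERE id = ?
      |  UNION ALL
      |  SELECT n.id, n.parent_id
      |  FROM nodes n JOIN subtree s ON n.parent_id = s.id
      |)
      |SELECT id, parent_id FROM subtree""".stripMargin

  def fetchSubtree(rootId: Long): List[(Long, Option[Long])] = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/cms")
    try {
      val st = conn.prepareStatement(subtreeSql)
      st.setLong(1, rootId)
      val rs = st.executeQuery()
      val rows = ListBuffer.empty[(Long, Option[Long])]
      while (rs.next()) {
        // parent_id is NULL for the root, hence the Option.
        val parent = Option(rs.getObject("parent_id")).map(_ => rs.getLong("parent_id"))
        rows += ((rs.getLong("id"), parent))
      }
      rows.toList
    } finally conn.close()
  }
}
```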
An alternative, albeit (much) more difficult, approach is to use a variation of e.g. nested sets or nested intervals, but using reals or floats instead of integers. This allows you to insert, move and delete nodes for free, at the cost of some storage and arithmetic/read overhead and the loss of some properties such as child node counts in the case of nested sets. However, it also requires that you keep an eye out for pathological edge cases: you'll need to periodically -- and sometimes preemptively -- "garbage collect" and re-index large enough chunks of the tree's index in order to fit new nodes when you run into the floating point type's precision limits.
(A variation of the latter is to use a numeric without any precision in order to try to dodge the problem. But that just kicks the can down the road: you'll still be limited by Postgres internals to a few thousand digits of precision, and in my own tests from a few years back the storage and arithmetic overheads became material, compared to just using a floating point type, long before you ran into that limit.)
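A toy illustration of the float-keyed variant (purely illustrative numbers): a new node gets the midpoint of its neighbours' positions, so nothing else has to move - until repeated splits in the same spot exhaust the float's precision and force a re-index.

```scala
object FloatKeys {
  // Insert between two existing positions without renumbering anything else.
  def between(left: Double, right: Double): Double = (left + right) / 2.0

  val a = 1.0
  val b = 2.0
  val c = between(a, b)   // 1.5
  val d = between(a, c)   // 1.25 -- each split halves the remaining gap,
                          // so eventually the midpoint stops being distinct
                          // and that region of the index must be rebuilt.
}
```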
As for a "The Best" structure or approach, there really is no magic bullet... Each has pros and cons based on the use-case (frequency of reads vs writes) and the size of the set. There's plenty of literature on the web that compare and explain each of them, which I'm sure you've found already.
That being said, for a CMS I'd advise that you go with whichever method you're most comfortable with: either re-index the tree on the fly as writes occur, or mark the tree as dirty on writes and then re-index it on demand. The point here is that, if re-indexing is done right (= using a plpgsql function or equivalent, rather than a gazillion queries issued by your app), re-indexing an entire tree of a few hundred thousand nodes will take a few hundred milliseconds at most. Assuming the tree isn't constantly getting updated, that's a perfectly acceptable overhead for end-users.

What are the real advantages of immutable collections?

Scala provides immutable collections such as Set, List, and Map. I understand that immutability has advantages in concurrent programs. However, what exactly are the advantages of immutability in regular data processing?
What if I enumerate subsets, permutations and combinations, for example? Do immutable collections have any advantage here?
What exactly are the advantages of immutability in regular data processing?
Generally speaking, immutable objects are easier/simpler to reason about.
It does. Since you're enumerating on a collection, presumably you'd want to be certain that elements are not inadvertently added or removed while you're enumerating.
Immutability is very much a paradigm in functional programming. Making collections immutable allows one to think of them much like primitive data types (i.e. modifying a collection or any other object results in creating a different object just as adding 2 to 3 doesn't modify 3, but creates 5)
To expand Matt's answer: from my personal experience I can say that implementations of algorithms based on search trees (e.g. breadth-first, depth-first, backtracking) using mutable collections regularly end up as a steaming pile of crap: either you forget to copy a collection before a recursive call, or you fail to undo your changes correctly when the call returns. In that area immutable collections are clearly superior. I ended up writing my own immutable list in Java when I couldn't get a problem right with Java's collections. Lo and behold, the first "immutable" implementation worked immediately.
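A small sketch of why backtracking works out so much nicer with immutable collections (subset-sum is just an arbitrary example problem):

```scala
object Backtracking {
  // Depth-first search for a subset summing to `target`. Because `chosen`
  // is an immutable List, each recursive call sees its own version for
  // free: nothing to copy beforehand, nothing to undo afterwards.
  def subsetSum(remaining: List[Int], target: Int, chosen: List[Int] = Nil): Option[List[Int]] =
    if (target == 0) Some(chosen)
    else remaining match {
      case Nil => None
      case x :: rest =>
        subsetSum(rest, target - x, x :: chosen)   // branch that takes x
          .orElse(subsetSum(rest, target, chosen)) // branch that skips x
    }

  // Usage: subsetSum(List(3, 9, 8, 4, 5, 7), 15) returns Some(List(4, 8, 3)).
}
```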
If your data doesn't change after creation, use immutable data structures. The type you choose will identify the intent of usage. Anything more specific would require knowledge about your particular problem space.
You may really be looking for a subset, permutation, or combination generator, and then the discussion of data structures is moot.
Also, you mentioned that you understand the concurrent advantages. Presumably, you're throwing some algorithm at permutations and subsets, and there's a good chance that algorithm can be parallelized to some extent. If that's the case, using immutable structures up front ensures your initial implementation of algorithm X will be easily transformed into concurrent algorithm X.
I have a couple of advantages to add to the list:
Immutable collections can't be invalidated out from under you
That is, it's totally fine to have immutable public val members of a Scala class. They are read-only by definition. Compare to Java where not only do you have to remember to make the member private but also write a get method that returns a copy of the object so the original is not modified by the calling code.
Immutable data structures are persistent. This means that the immutable collection obtained by calling filter on your TreeSet actually shares some of its nodes with the original. This translates to time and space savings and offsets some of the penalties incurred by using immutability.
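For example (a throwaway sketch):

```scala
import scala.collection.immutable.TreeSet

object Persistence {
  val original = TreeSet(1 to 1000: _*)
  val evens    = original.filter(_ % 2 == 0)

  // Both values remain usable and independent: deriving `evens` did not
  // mutate or invalidate `original`, and persistent collections can share
  // internal structure between related versions instead of copying it all.
  val stillThere  = original.contains(999)   // true
  val filteredOut = evens.contains(999)      // false
}
```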
Some advantages of immutability:
1 - Smaller margin for error (you always know what's in your collections and read-only variables).
2 - You can write concurrent programs without worrying about threads stepping on each other when modifying variables and collections.

scala/akka/stm design for large shared state?

I am new to Scala and Akka and am considering using them to solve a problem. Suppose I have a calculation engine (that searches for a solution). I'd like to parallelize that search both across CPUs and across nodes by giving each CPU on each node its own engine instance.
The engine inputs consist of a small number of scalar inputs and a very large hash table. Each engine instance would use its scalar inputs to make some small local change to the hash table, calculate a goodness, then discard its changes (they do not need to be committed/seen by any other engine instance). The goodness value would be returned to some coordinator that would choose among the results.
I was reading a bit about the STM TransactionalMap as a vehicle for shared state. This seems ideal, but I don't really see any complete examples of using it as shared state.
Questions:
Does the actor/stm model seem right for this problem?
Can you show a specific example of how to distribute the shared state? (Is it a Ref[TransactionalMap[,]] sent as a message?)
Is there anything different about distributing the shared state within a node as opposed to across different nodes?
Inquiring Minds Want to Know,
Allan
In terms of handling shared memory it doesn't sound like STM would be the right fit here because you don't want the changes made in engine instances to commit to the shared copy of the hash table.
Instead, an immutable HashMap might be a better fit. The elements that do not change in the map can be shared by the engine instances with only the differences in each map taking additional memory space.
The actor model would fit very well what you want to do. Set up one actor for each engine instance you want and pass it a message with the scalar values and the hashmap. Then have it return the results to the coordinator.
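A rough sketch of that layout with classic Akka actors - all message and class names here are made up, and the "goodness" calculation is just a stand-in:

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical messages: the scalar inputs plus the (immutable) table.
case class Evaluate(scalars: Vector[Double], table: Map[String, Double])
case class Goodness(value: Double)

class Engine extends Actor {
  def receive: Receive = {
    case Evaluate(scalars, table) =>
      // A "local change": updated() returns a new map that shares most of
      // its structure with `table`, so the shared copy is never modified.
      val local    = table.updated("tweak", scalars.sum)
      val goodness = local.values.sum          // stand-in for the real scoring
      sender() ! Goodness(goodness)
  }
}

object Search extends App {
  val system  = ActorSystem("search")
  val engines = (1 to 4).map(i => system.actorOf(Props[Engine], s"engine-$i"))
  // A coordinator actor would fan Evaluate messages out to `engines` and
  // keep the best Goodness it gets back.
}
```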

Memcached best practices - small objects and lots of keys or big objects and few keys?

I use memcached to store the integer result of a complex calculation. I've got hundreds of integer objects that I could cache! Should I cache them under a single key in a more complex object or should I use hundreds of different keys for the objects? (the objects I'm caching do not need to be invalidated more than once a day)
I would say lots of little keys. This way you can get the exact result you want in 1 call with minimal serialization effort.
If you store it in another object (an array for example) you will have to fetch the array from cache and then fetch the item you actually want again from that array, plus you have the overhead of serializing/deserializing the whole complex object again. Depending on your language of choice this might mean manually writing a serialization/deserialization function from scratch.
I wrote a somewhat large analysis at http://dammit.lt/2008/12/25/memcached-for-small-objects/ - it outlines how to optimize memcached for small object storage and may shed quite a bit of light on the issue.
It depends on your application. While memcached is very fast, it does require some request transmission and memory lookup time per request. Those numbers increase depending on whether or not the server is on the local machine (localhost), on the local network, or across a wide area. The size of your cache generally doesn't affect the lookup speed.
So, if your application is using MANY objects per processing unit (per request, method, or what-have-you), then it's generally better to define your cache in a way which lowers total number of hits to the cache while at the same time trying not to duplicate cache data. Like everything else, it's a balance.
For example, if you have a web request that pulls a list of blog posts, it would be more beneficial to cache the entire list as one memcached key, rather than (and this is a somewhat bad example, obviously) caching an array of cache keys for that list, each of which relates to an individually memcached object.
The less processing you have to do of the cached values, the better. So why not just dump them into the cache individually?
I would say you should store values individually and use some kind of helper class to retrieve values with multiget and generate a complex dataobject for you.
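Something along these lines, for example - this sketch assumes the spymemcached Java client and made-up key names, so treat it as an outline rather than a drop-in:

```scala
import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient
import scala.jdk.CollectionConverters._

// Each integer result lives under its own key; getBulk (multiget) fetches
// any subset of them in a single round trip.
class ResultCache(client: MemcachedClient, ttlSeconds: Int = 86400) {
  def put(key: String, value: Int): Unit =
    client.set(key, ttlSeconds, Int.box(value))

  def getAll(keys: Seq[String]): Map[String, Int] =
    client.getBulk(keys.asJava).asScala.collect {
      case (k, v: Integer) => k -> v.intValue
    }.toMap
}

object ResultCacheExample extends App {
  val cache = new ResultCache(new MemcachedClient(new InetSocketAddress("localhost", 11211)))
  cache.put("calc:42", 1234)
  println(cache.getAll(Seq("calc:42", "calc:43")))   // missing keys are simply absent
}
```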
It depends on what those numbers are. If you could, for example, group them in ranges, then you could optimize the storage. If you could hash them into a map or hashtable, storing that map serialized in memcached would be good too.
Anyway, you can store many little keys; just make sure you configure the slabs to use small chunk sizes so you don't waste memory.