Scala collections flowchart

There is a nice flowchart (taken from here) for choosing a particular container in C++:
Is there something similar for the Scala collections? I'm still somewhat overwhelmed with the options.

I am not aware of such flowcharts for Scala, but I guess one would be useful.
I made one for you -- larger picture here.
Note that there is some added complexity, since Scala has more collections and there is both the mutable and the immutable package. Where possible, I added both alternatives to the rectangle.
I tried to follow the C++ STL flow diagram as much as possible, but I thought that the lower left part was complicating things a bit too much, so I changed the flow there slightly.
EDIT: fixed some typos.
EDIT: As Travis suggested, note that in the majority of situations you only need to pick between a Map, Set, List, ArrayBuffer or Vector.
if you need key-value lookup, use a Map
if you need to check for presence of elements, use a Set
if you need to store elements and traverse them, use a List or an ArrayBuffer
if you don't need a persistent collection, but random access is really important, use ArrayBuffer
if you need relatively fast random access and persistent sequences, use Vector
If that does not help and you have a more exotic use-case, use this chart.
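To make the short list above concrete, here is a minimal sketch of the five usual choices. The object name and the sample values are made up for illustration; only the standard library collections are real.

```scala
import scala.collection.mutable.ArrayBuffer

object CollectionChoices {
  // Key-value lookup: Map (immutable by default)
  val ages: Map[String, Int] = Map("ann" -> 31, "bob" -> 27)

  // Presence checks: Set
  val seen: Set[Int] = Set(1, 2, 3)

  // Store and traverse: List (immutable, constant-time prepend)
  val todo: List[String] = "first" :: List("second", "third")

  // Random access matters and persistence does not: ArrayBuffer (mutable)
  val scratch: ArrayBuffer[Int] = ArrayBuffer(10, 20, 30)
  scratch(1) = 21 // overwrite in place
  scratch += 40   // append at the end

  // Relatively fast random access plus persistence: Vector
  val base: Vector[Int] = Vector(10, 20, 30)
  val patched: Vector[Int] = base.updated(1, 21) // base itself is unchanged
}
```

Note that `updated` on the Vector returns a new Vector, while the assignment on the ArrayBuffer changes it in place.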

Related

Scala: Lensing vs mutable design

My basic understanding of lensing is that, "a lens is a value representing maps between a complex type and one of its constituents. This map works both ways—we can get or "access" the constituent and set or "mutate" it"
I came across this while designing a machine learning library (neural nets), which requires keeping a big data structure of parameters, groups of which need to be updated at different stages of the algorithm. I wanted to make the whole parameter data structure immutable, but changing a single group of parameters requires copying all the parameters and creating a new data structure, which sounds inefficient. Not surprisingly, other people have thought the same. Some suggest using lenses, which in a sense let you modify immutable data structures, while others suggest just using mutable structures for this. Unfortunately, I couldn't find anything comparing these two paradigms speed-wise, space-wise, code-complexity-wise, etc.
Now the question is: what are the pros/cons of using lensing vs mutable design?
The trade-offs between the two are pretty much as you surmised. Lenses are less complex than tracking the changes to a large immutable data structure manually, but still require more complex code than a mutable data structure, and there is some amount of runtime overhead. To know how much, you would have to measure, but it's probably less than you think, because a lot of the updated structure isn't copied but shared.
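As a rough illustration of both points (the extra code and the sharing), here is a hand-rolled lens. This is only a sketch: `Lens`, `Layer`, `Params` and `LensDemo` are hypothetical names invented for this example, not the API of any particular lens library.

```scala
// A lens is a getter plus a setter that returns an updated copy.
final case class Lens[S, A](get: S => A, set: (S, A) => S) {
  // Apply f to the focused part and rebuild the whole.
  def modify(s: S)(f: A => A): S = set(s, f(get(s)))

  // Compose with a lens that looks further inside the focused part.
  def andThen[B](inner: Lens[A, B]): Lens[S, B] =
    Lens[S, B](s => inner.get(get(s)), (s, b) => set(s, inner.set(get(s), b)))
}

// Hypothetical parameter structure for a small network.
final case class Layer(weights: Vector[Double], bias: Double)
final case class Params(hidden: Layer, output: Layer)

object LensDemo {
  val hiddenL: Lens[Params, Layer] =
    Lens[Params, Layer](_.hidden, (p, l) => p.copy(hidden = l))
  val biasL: Lens[Layer, Double] =
    Lens[Layer, Double](_.bias, (l, b) => l.copy(bias = b))
  val hiddenBias: Lens[Params, Double] = hiddenL.andThen(biasL)

  val before: Params = Params(Layer(Vector(0.1, 0.2), 0.0), Layer(Vector(0.3), 0.5))
  // Only the path to the bias is rebuilt; everything else is reused by reference.
  val after: Params = hiddenBias.modify(before)(_ + 1.0)
}
```

Here `after.output` and `after.hidden.weights` are the very same objects as in `before`, because `copy` only replaces the field you name. How much this costs in a real parameter structure is something you would still have to measure.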
Mutable data structures are simpler and somewhat faster to modify, but harder to reason about, because now you have to take the order functions are called into account, worry about concurrency, and so forth.
Your third option is to make a bunch of small immutable data structures instead of one big one. Mutability often forces a single large data structure because of the need for a single source of truth, and to ensure that all references to data change at the same time. With immutability, this is a lot easier to control.
For example, you can have two separate Maps with the same type of key and different types of simple values, instead of one Map with a more complex value. Not only does this have performance benefits, it also makes it much easier to modularize your code.
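A small sketch of that idea, with made-up keys and values (the object and field names are assumptions for illustration only):

```scala
object SplitState {
  // Two simple maps sharing the same key type, instead of one map
  // from key to a (weights, learning rate) record.
  val weights: Map[String, Vector[Double]] =
    Map("hidden" -> Vector(0.1, 0.2), "output" -> Vector(0.3))
  val learningRates: Map[String, Double] =
    Map("hidden" -> 0.01, "output" -> 0.05)

  // Updating one concern produces a new map and never touches the other one.
  val decayed: Map[String, Double] = learningRates.updated("hidden", 0.005)
}
```

Code that only cares about learning rates never needs to see the weights map at all, which is where the modularity benefit comes from.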

Faster alternative to containers.Map?

This question is related to: Matlab: dynamically storing objects, alternatives to containers.Map class
I'm building a data structure that needs key-value functionality, where the key is an int and the value is an object. It also needs to be able to add elements to this key-value map dynamically.
So containers.Map would be a good option, but it is extremely slow (I have measured retrieval of values on a map of ~450 elements to be around 0.1s on my Linux machine). That's really strange, as I thought this class would be implemented as a hash map or something like that.
I need something a lot faster. I'm thinking of implementing a balanced binary search tree or something like that myself, but I don't know if this kind of dynamic recursive object would be fast in MATLAB (probably not).
Is it possible to bind std::map, or something else that is faster than containers.Map, into my application?
Edit, clarifications and code sample:
I'm running this on MATLAB 2015a on Linux. Here's a reproduction of the bad performance. In this program the performance is not as bad, because in my real program I have a much more complex class hierarchy which generates a lot of overhead (simply having a for loop that iterates over each element of the map and retrieves it takes almost 1 minute with ~450 elements). Here, I created a very simple graph class to illustrate the problem. pastebin.com/TvyzJxgK

What kind of data structure is used for immutable maps?

I know how normal mutable maps work (using hashtables), and I know how immutable lists work (recursive linked lists) and their advantage over mutable lists (constant-time prepending without messing up the original), but how do immutable maps (e.g. Scala's) work?
I know the advantage of not messing with the original map when generating new maps, but how does the underlying data structure work, and what kind of performance characteristics do they have, for example compared to mutable hash tables? Is there any standard data structure which people use to implement these, that I could go look up in CLRS/wikipedia?
Persistent hash maps are implemented using a structure called a hash trie. It was originally proposed by Phil Bagwell (who is a member of the Scala group at EPFL) but actually implemented by Clojure first. It reached Scala when 2.8 came out in 2010.
There is a great talk on functional data structures by Dan Spiewak where the mechanics of the hash trie are explained extremely lucidly (along with other things such as banker's queues)! He also explains asymptotic big-O performance very well in the talk.
Last October Phil gave another talk at the London Scala Lift Off, this time on parallel persistent data structures.
Persistent sorted maps are implemented via a red-black tree.
It could be a tree (red-black) or a hash map. Their access characteristics depend on the underlying implementation. A tree is O(log n) for read access; a hash map is O(1).
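For reference, both flavours are available in the Scala standard library as `scala.collection.immutable.HashMap` and `scala.collection.immutable.TreeMap`. A minimal sketch of how they behave from the outside (the object name and sample data are made up):

```scala
import scala.collection.immutable.{HashMap, TreeMap}

object ImmutableMaps {
  // Hash-based immutable map: no ordering guarantee on iteration.
  val hashed: HashMap[String, Int] = HashMap("a" -> 1, "b" -> 2)
  val hashedPlus: HashMap[String, Int] = hashed.updated("c", 3)

  // Tree-based immutable map: keys are kept in sorted order.
  val sorted: TreeMap[String, Int] = TreeMap("b" -> 2, "a" -> 1)
  val sortedPlus: TreeMap[String, Int] = sorted.updated("c", 3)
}
```

In both cases `updated` returns a new map and the original still has its two entries. The new map reuses most of the old one's internal nodes rather than copying everything.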

What are the real advantages of immutable collections?

Scala provides immutable collections, such as Set, List and Map. I understand that immutability has advantages in concurrent programs. However, what exactly are the advantages of immutability in regular data processing?
What if I enumerate subsets, permutations and combinations, for example? Do immutable collections have any advantage here?
What exactly are the advantages of immutability in regular data processing?
Generally speaking, immutable objects are easier/simpler to reason about.
As for enumerating subsets, permutations and combinations: it does. Since you're enumerating over a collection, presumably you'd want to be certain that elements are not inadvertently added or removed while you're enumerating.
Immutability is very much a paradigm in functional programming. Making collections immutable allows one to think of them much like primitive data types (i.e. modifying a collection or any other object results in creating a different object, just as adding 2 to 3 doesn't modify 3, but creates 5).
To expand Matt's answer: From my personal experience I can say that implementations of algorithms based on search trees (e.g. breadth first, depth first, backtracking) using mutable collections end up regularly as a steaming pile of crap: Either you forget to copy a collection before a recursive call, or you fail to take back changes correctly if you get the collection back. In that area immutable collections are clearly superior. I ended up writing my own immutable list in Java when I couldn't get a problem right with Java's collections. Lo and behold, the first "immutable" implementation worked immediately.
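To illustrate the backtracking point, here is a sketch of a permutation enumerator over immutable Lists. The object and method names are my own, and the standard library already offers `List.permutations`; the point is only to show that there is nothing to copy or undo around the recursive call.

```scala
object Backtracking {
  // Pick each remaining element in turn and recurse on what is left.
  // Both lists are immutable, so a recursive call cannot disturb the
  // caller's view of `remaining` or `chosen`.
  def permutations[A](remaining: List[A], chosen: List[A] = Nil): List[List[A]] =
    if (remaining.isEmpty) List(chosen.reverse)
    else remaining.indices.toList.flatMap { i =>
      val picked = remaining(i)
      val rest = remaining.take(i) ::: remaining.drop(i + 1)
      permutations(rest, picked :: chosen)
    }
}
```

With a mutable buffer you would have to remove `picked` before the recursive call and put it back afterwards, which is exactly the step that tends to get forgotten.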
If your data doesn't change after creation, use immutable data structures. The type you choose will identify the intent of usage. Anything more specific would require knowledge about your particular problem space.
You may really be looking for a subset, permutation, or combination generator, and then the discussion of data structures is moot.
Also, you mentioned that you understand the concurrent advantages. Presumably, you're throwing some algorithm at permutations and subsets, and there's a good chance that algorithm can be parallelized to some extent. If that's the case, using immutable structures up front ensures your initial implementation of algorithm X will be easily transformed into concurrent algorithm X.
I have a couple of advantages to add to the list:
Immutable collections can't be invalidated out from under you.
That is, it's totally fine to have immutable public val members of a Scala class. They are read-only by definition. Compare to Java where not only do you have to remember to make the member private but also write a get method that returns a copy of the object so the original is not modified by the calling code.
Immutable data structures are persistent. This means that the immutable collection obtained by calling filter on your TreeSet actually shares some of its nodes with the original. This translates to time and space savings and offsets some of the penalties incurred by using immutability.
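The sharing is easiest to observe with a List rather than a TreeSet, so this sketch uses a List instead (the object name and values are made up). `eq` is reference equality in Scala.

```scala
object Sharing {
  val original: List[Int] = List(2, 3, 4)

  // Prepending allocates one new cell whose tail is the existing list.
  val extended: List[Int] = 1 :: original

  // True: the tail of the new list is the very same object as the old list.
  val tailIsShared: Boolean = extended.tail eq original
}
```

Nothing was copied to build `extended`, and `original` is still usable and unchanged.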
Some advantages of immutability:
1 - Smaller margin for error (you always know what's in your collections and read-only variables).
2 - You can write concurrent programs without worrying about threads stepping on each other when modifying variables and collections.

Immutable Map implementation for huge maps

If I have an immutable Map which I might expect (over a very short period of time, like a few seconds) to be adding/removing hundreds of thousands of items from, is the standard HashMap a bad idea? Let's say I want to pass 1Gb of data through the Map in <10 seconds in such a way that the maximum size of the Map at any one instant is only 256Mb.
I get the impression that the map keeps some kind of "history" but I will always be accessing the last-updated table (i.e. I do not pass the map around) because it is a private member variable of an Actor which is updated/accessed only from within reactions.
Basically I suspect that this data structure may be (partly) at fault for issues I am seeing around JVMs going out of memory when reading in large amounts of data in a short time.
Would I be better off with a different map implementation and, if so, what is it?
Ouch. Why do you have to use an immutable map? Poor garbage collector! Immutable maps generally require O(log n) new objects per operation in addition to O(log n) time, or they really just wrap mutable hash maps and layer changesets on top (which slows things down and can increase the number of object creations).
Immutability is great, but this does not seem to me like the time to use it. If I were you, I'd stick with scala.collection.mutable.HashMap. If you need concurrent access, wrap the java.util.concurrent one instead.
You also might want to increase the size of the young generation in the JVM: -Xmn1G or more (assuming you're running with -Xmx3G). Also, use the throughput (parallel) garbage collector.
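A minimal sketch of the mutable alternative, assuming the map stays private to a single owner as described in the question. `Cache` is a hypothetical stand-in for the actor's state, and the key and value types are guesses.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the actor's private state.
final class Cache {
  // Never exposed, so only this object can read or change it.
  private val table = mutable.HashMap.empty[Long, Array[Byte]]

  def put(key: Long, value: Array[Byte]): Unit = table.update(key, value)

  // Returns the removed value, if the key was present.
  def evict(key: Long): Option[Array[Byte]] = table.remove(key)

  def size: Int = table.size
}
```

Updates happen in place, so adding and removing entries does not allocate a new version of the map each time.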
That would be awful. You say you always want to access the last-updated table, which means you only need an ephemeral data structure. There is no need to pay the cost of a persistent data structure; it's like trading time and memory to gain completely arguable "style points". You are not building your karma by blindly using persistent structures when they are not called for.
Also, a hashtable is a particularly difficult structure to make persistent. In other words, "very, very slow" (basically it is usable when reads greatly outnumber writes - and you seem to talk about many writes).
By the way, a ConcurrentHashMap wouldn't make sense in this design, given that the map is accessed from a single actor (that's what I understand from the description).
Scala's so-called(*) immutable Map is broken beyond basic usage up to Scala 2.7. Don't trust me, just look up the number of open tickets for it. And the solution is just "it will be replaced with something else in Scala 2.8" (which it was).
So, if you want an immutable map for Scala 2.7.x, I'd advise looking for it in something other than Scala. Or just use TreeHashMap instead.
(*) Scala's immutable Map isn't really immutable. It is a mutable data structure internally, which requires a lot of synchronization.