Indexable data structures behind Scala's for comprehension

I've just finished watching Week 6 of Martin Odersky's lectures about Scala on Coursera. In Lecture 5 he says that
"...the translation of for is not limited to lists or
sequences, or even collections;
It is based solely on the presence of the methods map, flatMap and
withFilter.
This lets you use the for syntax for your own types as well – you
must only define map, flatMap and withFilter for these types."
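To make that concrete, here is a minimal sketch of a custom type that becomes usable in for comprehensions just by defining those three methods (Box and everything in it are invented names, not from the lecture):

```scala
// A minimal container; for comprehensions desugar into these three methods.
case class Box[A](items: List[A]) {
  def map[B](f: A => B): Box[B]           = Box(items.map(f))
  def flatMap[B](f: A => Box[B]): Box[B]  = Box(items.flatMap(f(_).items))
  def withFilter(p: A => Boolean): Box[A] = Box(items.filter(p))
}

// The compiler rewrites this into flatMap/withFilter/map calls on Box:
val result = for {
  x <- Box(List(1, 2, 3))
  y <- Box(List(10, 20))
  if x % 2 == 1
} yield x * y // Box(List(10, 20, 30, 60))
```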
The problem I'm trying to solve is that we have a batch process that loads data from a couple of databases, combines the data and exports the results in some way. The data is small enough to fit in memory (a few hundred thousand records from each source system), but large enough that it is important to think about performance.
I could use a traditional in-memory database (like H2) and access it via ScalaQuery or something similar, but what I really need is just a way to search and join data from the different source systems efficiently - the equivalent of SQL's indexes and JOINs. It feels really awkward to use a full-blown relational database plus a Scala ORM for something that could be solved easily and efficiently by some data structure that is native to Scala.
My first naive approach would be a Vector data structure (for fast direct access) combined with one or more "indexes" (which could be implemented as B-trees, just as in database systems). The map, flatMap and withFilter methods of this combined data structure could be intelligent enough to use an index if one exists for the queried field(s) - or they could take a "hint" to use one. Roughly like the sketch below.
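A rough sketch of the idea (all names here are hypothetical):

```scala
// Records live in a Vector for fast positional access; an "index" maps a
// field value to the positions of matching records, like a database index.
case class Person(id: Int, name: String)

class IndexedVector[A](val data: Vector[A],
                       index: Map[String, Map[Any, Vector[Int]]]) {
  // Point lookup through an index instead of a full scan.
  def byIndex(field: String, value: Any): Vector[A] =
    index.get(field).flatMap(_.get(value)).getOrElse(Vector.empty).map(data(_))

  // The methods needed for for-comprehension support.
  def map[B](f: A => B): Vector[B]             = data.map(f)
  def flatMap[B](f: A => Vector[B]): Vector[B] = data.flatMap(f)
  def withFilter(p: A => Boolean): Vector[A]   = data.filter(p)
}

val people = Vector(Person(1, "Ann"), Person(2, "Bob"), Person(3, "Ann"))
val byName = Map("name" -> Map[Any, Vector[Int]]("Ann" -> Vector(0, 2), "Bob" -> Vector(1)))
new IndexedVector(people, byName).byIndex("name", "Ann") // both Anns, no full scan
```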
I was just wondering if such data structures already exist and are available, or do I need to implement them myself? Is there a library or a collection framework for Scala that solves this problem?

Not in the standard library (except for Vector, of course), and I don't know of any non-standard library providing them.

Related

Scala: Lensing vs mutable design

My basic understanding of lensing is that, "a lens is a value representing maps between a complex type and one of its constituents. This map works both ways—we can get or "access" the constituent and set or "mutate" it"
I came across this when I was designing a machine learning library (neural nets), which demands keeping a big data structure of parameters, groups of which need to be updated at different stages of the algorithm. I wanted to make the whole parameter data structure immutable, but changing a single group of parameters requires copying all the parameters and recreating a new data structure, which sounds inefficient. Not surprisingly, other people have thought about this too. Some suggest using lensing, which, in a sense, lets you modify immutable data structures, while others suggest just using mutable structures for this. Unfortunately I couldn't find anything comparing the two paradigms speed-wise, space-wise, code-complexity-wise, etc.
Now the question is: what are the pros and cons of using lensing vs a mutable design?
The trade-offs between the two are pretty much as you surmised. Lenses are less complex than manually tracking changes to a large immutable data structure, but still require more complex code than a mutable data structure, and there is some runtime overhead. To know how much, you would have to measure, but it's probably less than you think, because much of the updated structure isn't copied but shared.
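To give a feel for what lensing looks like, here is a minimal hand-rolled lens; real libraries (e.g. Monocle or Scalaz) add composition sugar and macros to remove the boilerplate. The Network/Layer types are invented to mirror the parameter-groups scenario:

```scala
// A lens packages a getter and a copying setter for one constituent of S.
case class Lens[S, A](get: S => A, set: (S, A) => S) {
  def modify(s: S)(f: A => A): S = set(s, f(get(s)))
  def andThen[B](inner: Lens[A, B]): Lens[S, B] =
    Lens(s => inner.get(get(s)), (s, b) => set(s, inner.set(get(s), b)))
}

case class Layer(weights: Vector[Double])
case class Network(hidden: Layer)

val hiddenL  = Lens[Network, Layer](_.hidden, (n, l) => n.copy(hidden = l))
val weightsL = Lens[Layer, Vector[Double]](_.weights, (l, w) => l.copy(weights = w))

// Update one deeply nested parameter group without mutating anything:
val net  = Network(Layer(Vector(0.1, 0.2)))
val net2 = (hiddenL andThen weightsL).modify(net)(_.map(_ * 0.9))
```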
Mutable data structures are simpler and somewhat faster to modify, but harder to reason about, because now you have to take the order functions are called into account, worry about concurrency, and so forth.
Your third option is to make a bunch of small immutable data structures instead of one big one. Mutability often forces a single large data structure because of the need for a single source of truth, and to ensure that all references to data change at the same time. With immutability, this is a lot easier to control.
For example, you can have two separate Maps with the same type of key and different types of simple values, instead of one Map with a more complex value. Not only does this have performance benefits, it also makes it much easier to modularize your code.
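A small sketch of that suggestion (the names are invented):

```scala
// Instead of one map with a compound value...
case class UserInfo(email: String, lastLogin: Long)
val combined: Map[Int, UserInfo] = Map(1 -> UserInfo("ann@example.com", 1000L))

// ...keep two simple maps that can be updated independently and cheaply:
val emails:     Map[Int, String] = Map(1 -> "ann@example.com")
val lastLogins: Map[Int, Long]   = Map(1 -> 1000L)

val emails2 = emails.updated(1, "ann@new.example") // lastLogins is untouched
```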

Is the datastore object in Matlab like a NoSQL database?

I see that you can use datastore to hold key value pairs, process data in chunks, and pass it to mapreduce. Does this mean that the datastore object in Matlab is like a NoSQL database? If not, how does it differ?
In case of any ambiguity about what characterises a NoSQL database, I am considering as a starting point these characteristics obtained from dba.stackexchange: https://dba.stackexchange.com/a/25/35729
You'll find that NoSQL databases have few common characteristics. They can be roughly divided into a few categories:
key/value stores
Bigtable inspired databases (based on the Google Bigtable paper)
Dynamo inspired databases
distributed databases
document databases
In MATLAB you can always import Java classes and use any Java library (with the one difference that there is no multithreading). So you typically won't find many libraries written in MATLAB that do the same thing as a Java library, for this reason. In general I would also say it's harder to write a library in MATLAB, which may be a factor in the lack of libraries as well.
I think your only option is to use a Java library, which IMHO is a much better choice anyway: Java is much more popular with programmers working with databases, so it will always have better-maintained libraries. The one drawback is that you can't implement Java interfaces in MATLAB (correct me if I'm wrong), which can become a massive pain.
So not really. Here is a MongoDB example on GitHub: https://github.com/HanOostdijk/matlab_mongodb

Write performance of Scala immutable collections

Quick question. I'm currently designing some database queries to extract reasonably large, but not massive, datasets into memory, say approximately 10k-100k records.
So far I've been testing loading these result sets into a scala.collection.immutable.Seq and have discovered that it seems to take an incredibly long time to build the collection, whereas if I change to a Vector or List the write into memory takes fractions of a second.
My question is therefore: why is Seq so slow in this case? And in what cases would using Seq be more appropriate than Vector?
Thanks
It would help if you'd post the relevant snippet and which operations you call on the sequence -- immutable.Seq is represented using a List (see https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/immutable/Seq.scala#L42). My guess is that you've been using :+ on the immutable.Seq, which under the hood appends to the end of the list by copying it (probably giving you quadratic overall performance), and when you switched to immutable.List directly, you prepended using :: (giving you linear performance).
Since Seq is just a List under the hood, you should use it when you prepend to the sequence -- the cons operator :: creates just one node and links it to the rest of the list, which is as fast as it gets for immutable data structures. Otherwise, if you add to the end and you insist on immutability, you should use a Vector (or the upcoming Conc lists!).
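To see the difference concretely (a sketch, assuming the collection was built element by element):

```scala
// Appending to a List-backed immutable.Seq copies the list on every step:
var s: Seq[Int] = Seq.empty
for (i <- 1 to 100000) s = s :+ i // O(n) per append, roughly O(n^2) overall

// Prepending to a List allocates a single cons cell per step:
var l: List[Int] = Nil
for (i <- 1 to 100000) l = i :: l // O(1) per prepend, O(n) overall
```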
If you would like validation of these claims, see this link, where the performance of the two operations is compared using ScalaMeter -- lists are 8 times faster than vectors when you add to the beginning.
However, the most appropriate data structure is probably either an ArrayBuffer or a VectorBuilder. These are mutable data structures that resize dynamically, and if you build them using += you will get reasonable performance. This is assuming that you are not storing primitives.
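A sketch of the builder approach (Record and fetchRow are stand-ins for your row type and database read):

```scala
import scala.collection.mutable.ArrayBuffer

case class Record(id: Int)
def fetchRow(i: Int): Record = Record(i)

// Build mutably; += is amortised O(1) per element.
val buf = ArrayBuffer.empty[Record]
for (i <- 1 to 100000) buf += fetchRow(i)

// If an immutable result is wanted, build a Vector the same way:
val builder = Vector.newBuilder[Record]
for (i <- 1 to 100000) builder += fetchRow(i)
val vec: Vector[Record] = builder.result()
```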

What kind of data structure is used for immutable maps?

I know how normal mutable maps work (using hash tables), and I know how immutable lists work (recursive linked lists) and their advantage over mutable lists (constant-time prepending without messing up the original), but how do immutable maps (e.g. Scala's) work?
I know the advantage of not messing with the original map when generating new maps, but how does the underlying data structure work, and what kind of performance characteristics do they have, for example compared to mutable hash tables? Is there any standard data structure which people use to implement these, that I could go look up in CLRS/wikipedia?
Persistent hash maps are implemented using a structure called a hash trie. It was originally proposed by Phil Bagwell (who is a member of the Scala group at EPFL) but actually implemented by Clojure first. It arrived in Scala when 2.8 came out in 2010.
There is a great talk on functional data structures by Dan Spiewak where the mechanics of the hash trie are explained extremely lucidly (along with other things such as banker's queues)! He also explains asymptotic big-O performance very well in the talk.
Last October saw Phil give another talk at the London Scala Lift Off, this time on parallel persistent data structures.
Persistent sorted maps are implemented via a red-black tree.
It could be a tree (red-black) or a hash map. Their access characteristics depend on the underlying implementation. A tree is O(log n) for read access; a hash map is O(1).
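Both flavours are in the Scala standard library, and "updates" share structure with the original instead of copying it wholesale (a quick sketch):

```scala
import scala.collection.immutable.{HashMap, TreeMap}

val hm = HashMap("a" -> 1, "b" -> 2) // hash trie: effectively constant-time lookup
val tm = TreeMap("a" -> 1, "b" -> 2) // red-black tree: O(log n), keys kept in order

// Adding a key yields a new map that shares most nodes with the old one:
val hm2 = hm + ("c" -> 3)
hm.get("c")  // None -- the original is untouched
hm2.get("c") // Some(3)
```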

What are the real advantages of immutable collections?

Scala provides immutable collections, such as Set, List and Map. I understand that immutability has advantages in concurrent programs. However, what exactly are the advantages of immutability in regular data processing?
What if I enumerate subsets, permutations and combinations, for example? Do immutable collections have any advantage here?
What exactly are the advantages of immutability in regular data processing?
Generally speaking, immutable objects are easier/simpler to reason about.
They do. Since you're enumerating over a collection, presumably you'd want to be certain that elements are not inadvertently added or removed while you're enumerating.
Immutability is very much a paradigm in functional programming. Making collections immutable allows one to think of them much like primitive data types (i.e. modifying a collection or any other object results in creating a different object, just as adding 2 to 3 doesn't modify 3, but creates 5).
To expand on Matt's answer: from my personal experience I can say that implementations of algorithms based on search trees (e.g. breadth-first, depth-first, backtracking) using mutable collections regularly end up as a steaming pile of crap: either you forget to copy a collection before a recursive call, or you fail to undo changes correctly when backtracking. In that area immutable collections are clearly superior. I ended up writing my own immutable list in Java when I couldn't get a problem right with Java's collections. Lo and behold, the first "immutable" implementation worked immediately.
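A small sketch of why this works so well: in a depth-first search over an immutable Set, every recursive call gets its own version of the visited set, so there is nothing to copy up front and nothing to undo on the way back (the graph representation here is assumed):

```scala
// Depth-first reachability; no defensive copies, no cleanup on backtrack.
def dfs(graph: Map[Int, List[Int]], node: Int, visited: Set[Int] = Set.empty): Set[Int] =
  graph.getOrElse(node, Nil).foldLeft(visited + node) { (seen, next) =>
    if (seen(next)) seen else dfs(graph, next, seen)
  }

val g = Map(1 -> List(2, 3), 2 -> List(1), 3 -> List(4))
dfs(g, 1) // Set(1, 2, 3, 4)
```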
If your data doesn't change after creation, use immutable data structures. The type you choose will identify the intent of usage. Anything more specific would require knowledge about your particular problem space.
You may really be looking for a subset, permutation, or combination generator, and then the discussion of data structures is moot.
Also, you mentioned that you understand the concurrent advantages. Presumably, you're throwing some algorithm at permutations and subsets, and there's a good chance that algorithm can be parallelized to some extent. If that's the case, using immutable structures up front ensures your initial implementation of algorithm X will be easily transformed into concurrent algorithm X.
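A sketch of that last point (Scala 2.x parallel collections; in 2.13 they live in a separate module). Because the immutable collection can be shared safely between threads, going parallel is a one-word change; score is a stand-in for whatever algorithm you apply to each permutation:

```scala
def score(p: List[Int]): Int = p.zipWithIndex.map { case (x, i) => x * i }.sum

val perms = List(1, 2, 3, 4).permutations.toVector // immutable, safe to share

val sequential = perms.map(score)
val parallel   = perms.par.map(score) // same results, computed concurrently
```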
I have a couple of advantages to add to the list:
Immutable collections can't be invalidated out from under you
That is, it's totally fine to have public val members holding immutable collections in a Scala class. They are read-only by definition. Compare with Java, where not only do you have to remember to make the member private, but you also have to write a getter that returns a copy of the object, so that the original is not modified by the calling code.
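In code (a tiny sketch with invented names):

```scala
// Safe to expose directly: callers can read items but never change it.
class Inventory(val items: List[String])

val inv = new Inventory(List("apple", "pear"))
inv.items.size     // fine
// inv.items += "x"   -- does not compile: immutable List has no +=
```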
Immutable data structures are persistent. This means that the immutable collection obtained by calling filter on your TreeSet actually shares some of its nodes with the original. This translates to time and space savings and offsets some of the penalties incurred by using immutability.
Some advantages of immutability:
1. A smaller margin for error (you always know what's in your collections and read-only variables).
2. You can write concurrent programs without worrying about threads stepping on each other when modifying variables and collections.