How to append to Rose Trees in Scala - scala

I'm in need of a Rose Tree data structure like scalaz.Tree or the following:
case class Tree[A](root: A, children: Stream[Tree[A]])
However I'm having a hard time understanding how to write a function for appending nodes. In general, I understand that appending a node involves rebuilding the tree, and doing that with immutable data structures requires recursive functions, but I just haven't been able to put it all together. I did see Scala: Tree Insert Tail Recursion With Complex Structure but since that involves binary trees, I didn't quite grasp how to implement it for a multi-way tree.
Traditionally, I would implement this mutably using Array or such. Is there some book or resource that I should read to understand functional data structures more? Or is there some example code that could be recommended for me to read over?

It isn't obvious what are your requirements for "appending nodes". You can do it in the trivial way, inserting the second node as a first child:
def append[A](tree1: Tree[A], tree2: Tree[A]) = tree1 match {
case Tree(root, children) => Tree(root, tree2 #:: children)
}
If that's not what you want, can you provide an example?
Is there some book or resource that I should read to understand functional data structures more? Or is there some example code that could be recommended for me to read over?
The standard recommendation is Structure and Interpretation of Computer Programs. The code examples aren't in Scala, but it should be easy enough to translate the knowledge.

Related

scala: data structure for multiple-root tree or forest

I am in search of a good data structure for my requirement:
Support multiple roots.
Should have links to parent and children.
Assuming each Node can be uniquely identified, I have come up with a following data structure which seems a bit immature.
case class Node[T](data: T, children: List[Node[T]], parents: List[Node[T]])
class Forest[T](val roots: List[Node[T]]) {
// other helper methods to create the Forest
}
object Forest {
def apply[T](list: List[T]): Forest[T] = {
val roots = for (l <- list) yield Node[T](l, List.empty[Node[T]], List.empty[Node[T]])
new Forest[T](roots)
}
}
Does anyone have better suggestions?
TIA
I suggest you check out Graph for Scala. The library isn't being very actively developed, but the basic functionality doesn't need it and once you get used to the way it traverses nodes, you get a lot of functionality without much effort. (It might take a little thought to best fit the parent/child relationship into the graph; you could make a directed graph to represent that, or two graphs one in each direction, etc.)
Otherwise, the basics of what you have now is missing one detail--how do you actually create the forest? Right now, a node is immutable with immutable lists of parents and children, which means that the parents and children of every node must be created before the node is, which means...uh-oh.
You can solve this by adding an extra layer of indirection (have lists of node IDs and a map associating IDs with actual nodes), or by adding mutability somewhere (generally a little iffy in case classes, but in this case maybe it's what you need).
And then, of course, you need a bunch of methods to actually do something useful with the forest and provide higher-level ways of constructing it.

Scalaz Bind[Seq] typeclass

I'm currently porting some code from traditional Scala to Scalaz style.
It's fairly common through most of my code to use the Seq trait in my exposed API signatures rather than a concrete type (i.e. List, Vector) directly. However, this poses some problem with Scalaz, since it doesn't provide an implementation of a Bind[Seq] typeclass.
i.e. This will work correctly.
List(1,2,3,4) >>= bindOperation
But this will not
Seq(1,2,3,4) >>= bindOperation
failing with the error could not find implicit value for parameter F0: scalaz.Bind[Seq]
I assume this is an intentional design decision in Scalaz - however am unsure about intended/best practice on how to precede.
Should I instead write my code directly to List/Vector as appropriate instead of using the more flexible Seq interface? Or should I simply define my own Bind[Seq] typeclass?
The collections library does backflips to accommodate subtyping: when you use map on a specific collection type (list, map, etc.), you'll (usually) get the same type back. It manages this through the use of an extremely complex inheritance hierarchy together with type classes like CanBuildFrom. It gets the job done (at least arguably), but the complexity doesn't feel very principled. It's a mess. Lots of people hate it.
The complexity is generally pretty easy to avoid as a library user, but for a library designer it's a nightmare. If I provide a monad instance for Seq, that means all of my users' types get bumped up the hierarchy to Seq every type they use a monadic operation.
Scalaz folks tend not to like subtyping very much, anyway, so for the most part Scalaz stays around the leaves of the hierarchy—List, Vector, etc. You can see some discussion of this decision on the mailing list, for example.
When I first started using Scalaz I wrote a lot of utility code that tried to provide instances for Seq, etc. and make them usable with CanBuildFrom. Then I stopped, and now I tend to follow Scalaz in only ever using List, Vector, Map, and Set in my own code. If you're committed to "Scalaz style", you should do that as well (or even adopt Scalaz's own IList, ISet, ==>>, etc.). You're not going to find clear agreement on best practices more generally, though, and both approaches can be made to work, so you'll just need to experiment to find which you prefer.

Functional implementation of Tarjan's Strongly Connected Components algorithm

I went ahead and implemented the textbook version of Tarjan's SCC algorithm in Scala. However, I dislike the code - it is very imperative/procedural with lots of mutating states and book-keeping indices. Is there a more "functional" version of the algorithm? I believe imperative versions of algorithms hide the core ideas behind the algorithm unlike the functional versions. I found someone else encountering the same problem with this particular algorithm but I have not been able to translate his Clojure code into idomatic Scala.
Note: If anyone wants to experiment, I have a good setup that generates random graphs and tests your SCC algorithm vs running Floyd-Warshall
See Lazy Depth-First Search and Linear Graph Algorithms in Haskell by David King and John Launchbury. It describes many graph algorithms in a functional style, including SCC.
The following functional Scala code generates a map that assigns a representative to each node of a graph. Each representative identifies one strongly connected component. The code is based on Tarjan's algorithm for strongly connected components.
In order to understand the algorithm it might suffice to understand the fold and the contract of the dfs function.
def scc[T](graph:Map[T,Set[T]]): Map[T,T] = {
//`dfs` finds all strongly connected components below `node`
//`path` holds the the depth for all nodes above the current one
//'sccs' holds the representatives found so far; the accumulator
def dfs(node: T, path: Map[T,Int], sccs: Map[T,T]): Map[T,T] = {
//returns the earliest encountered node of both arguments
//for the case both aren't on the path, `old` is returned
def shallowerNode(old: T,candidate: T): T =
(path.get(old),path.get(candidate)) match {
case (_,None) => old
case (None,_) => candidate
case (Some(dOld),Some(dCand)) => if(dCand < dOld) candidate else old
}
//handle the child nodes
val children: Set[T] = graph(node)
//the initially known shallowest back-link is `node` itself
val (newState,shallowestBackNode) = children.foldLeft((sccs,node)){
case ((foldedSCCs,shallowest),child) =>
if(path.contains(child))
(foldedSCCs, shallowerNode(shallowest,child))
else {
val sccWithChildData = dfs(child,path + (node -> path.size),foldedSCCs)
val shallowestForChild = sccWithChildData(child)
(sccWithChildData, shallowerNode(shallowest, shallowestForChild))
}
}
newState + (node -> shallowestBackNode)
}
//run the above function, so every node gets visited
graph.keys.foldLeft(Map[T,T]()){ case (sccs,nextNode) =>
if(sccs.contains(nextNode))
sccs
else
dfs(nextNode,Map(),sccs)
}
}
I've tested the code only on the example graph found on the Wikipedia page.
Difference to imperative version
In contrast to the original implementation, my version avoids explicitly unwinding the stack and simply uses a proper (non tail-) recursive function. The stack is represented by a persistent map called path instead. In my first version I used a List as stack; but this was less efficient since it had to be searched for containing elements.
Efficiency
The code is rather efficient. For each edge, you have to update and/or access the immutable map path, which costs O(log|N|), for a total of O(|E| log|N|). This is in contrast to O(|E|) achieved by the imperative version.
Linear Time implementation
The paper in Chris Okasaki's answer gives a linear time solution in Haskell for finding strongly connected components. Their implementation is based on Kosaraju's Algorithm for finding SCCs, which basically requires two depth-first traversals. The paper's main contribution appears to be a lazy, linear time DFS implementation in Haskell.
What they require to achieve a linear time solution is having a set with O(1) singleton add and membership test. This is basically the same problem that makes the solution given in this answer have a higher complexity than the imperative solution. They solve it with state-threads in Haskell, which can also be done in Scala (see Scalaz). So if one is willing to make the code rather complicated, it is possible to implement Tarjan's SCC algorithm to a functional O(|E|) version.
Have a look at https://github.com/jordanlewis/data.union-find, a Clojure implementation of the algorithm. It's sorta disguised as a data structure, but the algorithm is all there. And it's purely functional, of course.

Should Scala immutable case classes be defined to hold Seq[T], immutable.Seq[T], List[T] or Vector[T]?

If we want to define a case class that holds a single object, say a tuple, we can do it easily:
sealed case class A(x: (Int, Int))
In this case, retrieving the "x" value will take a small constant amount of time, and this class will only take a small constant amount of space, regardless of how it was created.
Now, let's assume we want to hold a sequence of values instead; we could it like this:
sealed final case class A(x: Seq[Int])
This might seem to work as before, except that now storage and time to read all of x is proportional to x.length.
However, this is not actually the case, because someone could do something like this:
val hugeList = (1 to 1000000000).toList
val a = A(hugeList.view.filter(_ == 500000000))
In this case, the a object looks like an innocent case class holding a single int in a sequence, but in fact it requires gigabytes of memory, and it will take on the order of seconds to access that single element every time.
This could be fixed by specifying something like List[T] as the type instead of Seq[T]; however, this seems ugly since it adds a reference to a specific implementation, while in fact other well behaved implementations, like Vector[T], would also do.
Another worrying issue is that one could pass a mutable Seq[T], so it seems that one should at least use immutable.Seq instead of scala.collection.Seq (although the compiler can't actually enforce the immutability at the moment).
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
Or perhaps Seq is being used just because it's the shortest to type, and in fact it would be best to use immutable.Seq[T], List[T], Vector[T] or something else?
New text added in edit
Looking at the class library, some of the most core functionality like scala.reflect.api.Trees does in fact use List[T], and in general using a concrete class seems a good idea.
But then, why use List and not Vector?
Vector has O(1)/O(log(n)) length, prepend, append and random access, is asymptotically smaller (List is ~3-4 times bigger due to vtable and next pointers), and supports cache efficient and parallelized computation, while List has none of those properties except O(1) prepend.
So, personally I'm leaning towards Vector[T] being the correct choice for something exposed in a library data structure, where one doesn't know what operations the library user will need, despite the fact that it seems less popular.
First of all, you talk both about space and time requirements. In terms of space, your object will always be as large as the collection. It doesn't matter whether you wrap a mutable or immutable collection, that collection for obvious reasons needs to be in memory, and the case class wrapping it doesn't take any additional space (except its own small object reference). So if your collection takes "gigabytes of memory", that's a problem of your collection, not whether you wrap it in a case class or not.
You then go on to argue that a problem arises when using views instead of eager collections. But again the question is what the problem actually is? You use the example of lazily filtering a collection. In general running a filter will be an O(n) operation just as if you were iterating over the original list. In that example it would be O(1) for successive calls if that collection was made strict. But that's a problem of the calling site of your case class, not the definition of your case class.
The only valid point I see is with respect to mutable collections. Given the defining semantics of case classes, you should really only use effectively immutable objects as arguments, so either pure immutable collections or collections to which no instance has any more write access.
There is a design error in Scala in that scala.Seq is not aliased to collection.immutable.Seq but a general seq which can be either mutable or immutable. I advise against any use of unqualified Seq. It is really wrong and should be rectified in the Scala standard library. Use collection.immutable.Seq instead, or if the collection doesn't need to be ordered, collection.immutable.Traversable.
So I agree with your suspicion:
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
No! Not good. It might be convenient, because you can pass in an Array for example without explicit conversion, but I think a cleaner design is to require immutability.

How is immutability practically implemented in the design of Scala applications?

Being new to scala and a current java developer, scala was designed to encourage the use of immutability to class design.
How does this translate practically to the design of classes? The only thing that is brought to my mind is case classes. Are case classes strongly encouraged for defining data? Example? How else is immutability encouraged in Scala design of classes?
As a java developer, classes defining data were mutable. The equivalent Scala classes should be defined as case classes?
Well, case classes certainly help, but the biggest contributor is probably the collection library. The default collections are immutable, and the methods are geared toward manipulating collections by producing new ones instead of mutating. Since the immutable collections are persistent, that doesn't require copying the whole collection, which is something one often has to do in Java.
Beyond that, for-comprehensions are monadic comprehensions, which is helpful in doing immutable tasks, there's tail recursion optimization, which is very important in immutable algorithms, and general attention to immutability in many libraries, such as parser combinators and xml.
Finally, note that you have to ask for a var to get some mutability. Parameters are immutable, and val is just as short as var. Contrast this with Java, where parameters are mutable, and you need to add a final keyword to get immutability. Whereas in Scala it is as easy or easier to stay immutable, in Java it is easier to stay mutable.
Addendum
Persistent data structures are data structures that share parts between modified versions of it. This might be a bit difficult to understand, so let's consider Scala's List, which is pretty basic and easy to understand.
A Scala List is composed of two classes, known as cons and Nil. The former is actually written :: in Scala, but I'll refer to it by the traditional name.
Nil is the empty list. It doesn't contain anything. Methods that depend on the list not being empty, such as head and tail throw exceptions, while others work ok.
Naturally, cons must then represent a non-empty list. In fact, cons has exactly two elements: a value, and a list. These elements are known as head and tail.
So a list with three elements is composed of three cons, since each cons will hold only one value, plus a Nil. It must have a Nil because a cons must point to a list. As lists are not circular, then one of the cons must point to something other than a cons.
One example of such list is this:
val list = 1 :: 2 :: 3 :: Nil
Now, the components of a Scala List are immutable. One cannot change neither the value nor the list of a cons. One benefit of immutability is that you never need to copy the collection before passing or after receiving it from some other method: you know that list cannot change.
Now, let's consider what would happen if I modified that list. Let's consider two modifications: removing the first element and prepending a new element.
We can remove one element with the method tail, whose name is not a coincidence at all. So, we write:
val list2 = list.tail
And list2 will point to the same list that list's tail is pointing. Nothing at all was created: we simply reused part of list. So, let's prepend an element to list2 then:
val list3 = 0 :: list2
We created a new cons there. This new cons has a value (a head) equal to 0, and its tail points to list2. Note that both list and list3 point to the same list2. These elements are being shared by both list and list3.
There are many other persistent data structures. The very fact that the data you are manipulating is immutable makes it easy to share components.
One can find more information about this subject on the book by Chris Okasaki, Purely Functional Data Structures, or on his freely available thesis by the same name.