I want to implement a tree in Scala. My particular tree uses Swing split panes to give multiple views of a geographical map. Any pane within a split pane can itself be further divided to give an extra view. Am I right in saying that neither TreeMap nor TreeSet provides tree functionality? Please excuse me if I've misunderstood this. It strikes me there should be standard tree collections, and it is bad practice to keep reinventing the wheel. Are there any tree implementations out there that might become the future Scala standard?
All trees have three types of elements: a root, nodes and leaves. Leaves and nodes must have a single reference to a parent. The root and nodes can have multiple references to child nodes and leaves. Leaves have zero children. Nodes and the root cannot be deleted without their children being deleted. There are probably other rules/constraints that I've missed.
This seems like enough common specification to justify a standard collection. I would also suggest a standard subclass collection for the case where the root and nodes can only have two children or a single leaf child, which is what I want in my particular case.
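For concreteness, the shape I have in mind for my split-pane case would be something like this sketch (my own names, not from any library):

    sealed trait Pane
    case class View(map: String) extends Pane                  // a leaf: one view of the map
    case class Split(first: Pane, second: Pane) extends Pane   // a divided pane with exactly two children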
Actually, a tree by itself is both pretty useless and pretty difficult to specify.
Starting with the latter, and speaking strictly about the data structure: how many children can each node have? Do nodes store values or not? Do nodes store metadata? Do children have pointers to their parents? Do you store the tree as nodes with pointers, or as positional elements in an array?
These are all questions to which the answer is "it depends". In fact, you stated that children have pointers to their parents, but that is not true for any immutable tree! You also seem to assume trees are always stored as node objects with references, when some trees are actually stored as nodes on a single array (such as a Heap).
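For illustration, a binary heap stores the whole tree in one flat array; parent/child relationships are nothing but index arithmetic (a sketch):

    // For the node stored at index i of the backing array:
    def leftChild(i: Int): Int  = 2 * i + 1
    def rightChild(i: Int): Int = 2 * i + 2
    def parent(i: Int): Int     = (i - 1) / 2

No node objects, no parent references -- yet it is a perfectly good tree.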
Furthermore, not all these requirements can be accommodated -- some are mutually exclusive. Even if you ignore those, you are still left with a data structure which is not optimized for anything and clumsy to use because you have to deal with lots of details not relevant to you.
And, then, there's the second problem, which is that a tree, by itself, is useless. TreeSet and TreeMap take advantage of specific trees whose insertion/deletion/lookup algorithms make them good data structures for sorted data. That, however, is not at all the only use for trees. Trees can be used for spatial searching, to represent tree-like real-world information, to make up filesystems, etc. Sometimes the task is finding a tree inside a graph. Each of these uses requires different representations and different algorithms -- the algorithms being what make them at all useful.
And, to top it off, writing a tree class is trivial. The problem is writing the algorithms to manipulate it.
There is a bit of a mismatch between the notion of "tree" as a GUI widget -- which you seem to be referring to -- and tree as an ordered data structure. In the former case it is just a nested sequence; in the latter, the aim is to provide, for instance, fast search algorithms, and you don't arbitrarily manipulate the internal structure: often the branching factor is constant and the tree height is kept balanced, etc. An example of the latter is collection.immutable.TreeMap, which uses a self-balancing binary tree structure called a red-black tree.
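For example:

    import scala.collection.immutable.TreeMap

    // Keys are kept sorted by the underlying red-black tree.
    val m = TreeMap(3 -> "c", 1 -> "a", 2 -> "b")
    println(m.firstKey)     // 1
    println(m.range(1, 3))  // TreeMap(1 -> a, 2 -> b)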
So these data structures are rather useless for bridging to javax.swing.TreeModel. There is little that can be done about this interface, so you'll probably stick with the default implementation DefaultTreeModel, a mutable non-generic structure (which is all that single threaded Swing needs).
For a discussion about having a scala-swing JTree wrapper, see this question. It also has a link to a Scala library for JTree.
Since you can use Java classes with Scala, take a look at the javax.swing.tree package: http://docs.oracle.com/javase/6/docs/api/javax/swing/tree/package-summary.html, especially TreeModel, TreeNode, MutableTreeNode and DefaultMutableTreeNode. They were designed to be used with Swing, but are pretty much a standard tree implementation.
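They are plain Java classes, so you can use them from Scala as they are; a quick sketch:

    import javax.swing.tree.DefaultMutableTreeNode

    val root  = new DefaultMutableTreeNode("map")
    val north = new DefaultMutableTreeNode("north view")
    val south = new DefaultMutableTreeNode("south view")
    root.add(north)                   // add() also sets the child's parent reference
    root.add(south)
    println(root.getChildCount)       // 2
    println(north.getParent eq root)  // true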
Other than that, implementing a (generic) tree should be pretty straightforward.
Since a GUI application imposes low performance demands on the tree collection, you may use a general graph library constrained to represent only tree-structured graphs: http://scala-graph.org/
TreeSet and TreeMap are both based on RedBlack:
Red-black trees are a form of balanced binary trees where some nodes are designated “red” and others designated “black.” Like any balanced binary tree, operations on them reliably complete in time logarithmic to the size of the tree.
(quote from Scala 2.8 Collections)
RedBlack is not documented very well, but if you look at the source of TreeSet and TreeMap it's pretty easy to figure out how to use it, though it doesn't fulfill all (most?) of your requirements (nodes don't have references to their parent, etc.).
I am getting started with Dataflow/Apache Beam, and I'm struggling to understand a concept. According to the documentation:
A PCollection is an immutable collection of values of type T. A PCollection can contain either a bounded or unbounded number of elements.
It is easy to understand that bounded PCollections are immutable. You get a file, you put it in a PCollection, you can't change it: Immutable.
What about unbounded PCollections? They are, by definition, without a limit on the number of elements, so elements keep being added to them indefinitely. How can something be changed perpetually and also be immutable?
An explanation would be great.
That's a good question! I believe the Programming Guide explains PCollection's immutability better than the JavaDoc. The immutability has to do with individual elements:
A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a PCollection and generate new pipeline data (as a new PCollection), but it does not consume or modify the original input collection.
Note: Beam SDKs avoid unnecessary copying of elements, so PCollection contents are logically immutable, not physically immutable. Changes to input elements may be visible to other DoFns executing within the same bundle, and may cause correctness issues. As a rule, it’s not safe to modify values provided to a DoFn.
Another way to look at it is that the set is logically immutable; it's just your view into it that's changing over time (due to the inability to see into the future). E.g. ReadFromPubSub returns the (immutable, unbounded) set of all messages that will ever be published to this topic. From the Beam API you can't modify this set as a PCollection, but you can create other immutable, unbounded PCollections that are derived from it.
This is similar to lazy, infinite structures that exist in functional languages like Haskell -- you can only ever observe a portion of it, but that doesn't mean the whole thing doesn't exist as an immutable object.
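A rough Scala analogue of that idea (this is not Beam's API, just the same concept using Scala 2.13's LazyList):

    // An immutable, conceptually unbounded structure: all natural numbers.
    val naturals: LazyList[Int] = LazyList.from(0)

    // Deriving a new unbounded structure does not modify the original.
    val evens: LazyList[Int] = naturals.filter(_ % 2 == 0)

    // You can only ever observe a finite portion of it.
    println(evens.take(5).toList)  // List(0, 2, 4, 6, 8)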
I am really, really, new to Apache Spark.
I am working on implementing Approximate LOCI (or ALOCI), an anomaly detection algorithm, in a distributed way over Spark. This algorithm is based on storing points in a QuadTree that is used to find a point's number of neighbors.
I know exactly how QuadTrees work. In fact, I implemented such a structure in Java recently. But I am completely lost as to how such a structure can work in a distributed way over Spark.
Something similar to what I need can be found in Geospark.
https://github.com/DataSystemsLab/GeoSpark/tree/b2b6f1d7f0015d5c9d663a7b28d5e1bb1043c413/core/src/main/java/org/datasyslab/geospark/spatialPartitioning/quadtree
GeoSpark in many cases uses a PointRDD class, which extends a SpatialRDD class that uses the QuadTree found in the link above to partition the spatial objects. That is what I understood, at least in theory.
In practice, I still cannot figure this out. Let's say, for example, that I have millions of records in a csv and I want to read and load them into a QuadTree.
I could read the csv into an RDD, but then what? How does this RDD logically connect to the QuadTree I am trying to build?
Of course, I don't expect a working solution here. I just need the logic here to fill the gap in my mind. How do I implement a distributed QuadTree and how do I use it?
Ok, sadly there are no answers to this, but here I am two weeks later with a working solution. Not 100% sure if it is the right approach here, though.
I created a class named Element and turned each line of my csv into an RDD[Element]. I then created a serializable class named QuadNode which has a List[Element] and an Array[String] of size 4. On adding elements to a node, these elements are added to the node's list. If the list gets more than X elements (20 in my case), the node breaks into 4 children and the elements are sent to the children. Finally, I created a class QuadTree which has an RDD[QuadNode] among its other properties. Every time a node breaks into children, these child nodes are added to the tree's RDD.
In a non-functional language, each node would have 4 pointers, one for each child. Since we are in a distributed environment, this approach could not work, so I gave each node a unique id instead. The root node has id "0". The root's children have ids "00", "01", "02" and "03". Node "00"'s children have ids "000", "001", "002" and "003". In this way, if we want to find all the descendants of a node, we filter our tree's RDD[QuadNode] by checking whether a node's id starts with our node's id. Reversing this logic helps us find a node's parent node.
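In code, the id scheme boils down to something like this sketch (QuadNode is reduced to its id here; the field names are simplified):

    import org.apache.spark.rdd.RDD

    case class QuadNode(id: String /* , elements: List[Element], ... */)

    // All descendants of a node have the node's id as a proper prefix.
    def descendants(tree: RDD[QuadNode], nodeId: String): RDD[QuadNode] =
      tree.filter(n => n.id.startsWith(nodeId) && n.id != nodeId)

    // Reversing the logic: a node's parent id is its own id minus the last digit.
    def parentId(nodeId: String): String = nodeId.dropRight(1)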
This is how I implemented my QuadTree, at least for now. If someone knows a better way of implementing this, I would love to hear it.
In the Eclipse APIs, the return and argument types are mostly arrays instead of collections. An example is the members method on IContainer, which returns IResource[].
I am interested in why this is the case. Maybe it is one of the following:
The APIs were designed before generics were available, so IResource[] was better than just Collection or List
Memory concerns, e.g. ArrayList internally holds an array which has more space than is needed (to offer an efficient implementation of add), whereas an array is always constructed for just the needed target size
It's not possible to add/remove elements on an array, so it is safe for iterating (but defensive copying is still necessary, because one can still change elements, e.g. set them to null; see the sketch below)
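To illustrate that last point, a sketch with a hypothetical class (not the actual Eclipse code):

    // Returning the internal array directly leaks mutable state:
    class Container(private val members: Array[String]) {
      def membersUnsafe: Array[String] = members          // caller can overwrite elements
      def membersSafe: Array[String]   = members.clone()  // defensive copy
    }

    val c = new Container(Array("a", "b"))
    c.membersUnsafe(0) = null  // silently corrupts the container's internal state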
Does anyone have any insights or other ideas why the API was developed that way?
Posting this as an answer, so it can be accepted.
Eclipse predates generics, and they are really serious about API stability. Also, at the low level of SWT, passing arrays seems to mirror the operating-system APIs that are being wrapped. Once you have a bunch of tooling using arrays, I guess it makes sense to keep things consistent. Also note that arrays aren't subject to all of the type-erasure issues that come up when using reflection.
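The reflection point is easy to demonstrate (a sketch):

    // Arrays keep their element type at runtime; generic collections lose it to erasure.
    val arr: Array[String] = Array("a", "b")
    println(arr.getClass.getComponentType)  // class java.lang.String

    val list: List[String] = List("a", "b")
    println(list.getClass)                  // just a List; the String is erased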
Yeah, I hear you as far as the collections API being generally much easier to work with for dynamic lists of items.
Looking at this question, where the questioner is interested in the first and last instances of some element in a List, it seems a more efficient solution would be to use a DoubleLinkedList that could search backwards from the end of the list. However there is only one implementation in the collections API and it's mutable.
Why is there no immutable version?
Because you would have to copy the whole list each time you want to make a change. With a normal linked list, you can at least prepend to the list without having to copy everything. And if you do want to copy everything on every change, you don't need a linked list for that. You can just use an immutable array.
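The sharing is easy to see (a sketch):

    val xs = List(2, 3, 4)
    val ys = 1 :: xs        // O(1): a single new cell pointing at xs
    println(ys.tail eq xs)  // true -- the tail is shared, not copied

    // A doubly linked list cannot share like this: the old head's `prev`
    // pointer would have to change, forcing every node to be rebuilt.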
There are many impediments to such a structure, but one is very pressing: a doubly linked list cannot be persistent.
The logic behind this is pretty simple: from any node on the list, you can reach any other node. So, if I added an element X to this list DL, and tried to use a part of DL, I'd face this contradiction: from the node pointing to X one can reach every element in part(DL), but, by the properties of the doubly linked list, that means from any element of part(DL) I can reach the node pointing to X. Since part(DL) is supposed to be immutable and part of DL, and since DL did not include the node pointing to X, that just cannot be.
Non-persistent immutable data structures might have some uses, but they are generally bad for most operations, since they need to be recreated whenever a derivative is produced.
Now, there's the minor matter of creating mutually referencing strict objects, but this is surmountable. One can use by-name parameters and lazy vals, or one can do like Scala's List: actually create a mutable collection, and then "freeze" it into immutable state (see ListBuffer and its toList method).
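A sketch of the by-name/lazy val trick (a hypothetical class, not from the standard library):

    // Two strictly immutable nodes that reference each other: the by-name
    // parameters defer evaluation until the lazy vals force them.
    class DNode[A](val value: A, prevRef: => Option[DNode[A]], nextRef: => Option[DNode[A]]) {
      lazy val prev: Option[DNode[A]] = prevRef
      lazy val next: Option[DNode[A]] = nextRef
    }

    object Demo extends App {
      lazy val a: DNode[Int] = new DNode(1, None, Some(b))
      lazy val b: DNode[Int] = new DNode(2, Some(a), None)
      println(a.next.get.value)          // 2
      println(a.next.get.prev.get eq a)  // true -- circular, yet never mutated
    }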
Because it is logically impossible to create a mutually (circularly) referential data structure with strict immutability.
You cannot create two nodes that point to each other due to simple existential ordering priority, in that at least one of the nodes will not exist when the other is created.
It is possible to get this circularity with tricks involving laziness (which is implemented with mutation), but the real question then becomes why you would want this thing in the first place.
As others have noted, there is no persistent implementation of a double-linked list. You will need some kind of tree to get close to the characteristics you want.
In particular, you may want to look at finger trees, which provide O(1) access to the front and back, amortized O(1) insertion to the front and back, and O(log n) insertion elsewhere. (That's in contrast to most other commonly-used trees which have O(log n) access and insertion everywhere.)
See also:
video explanation of finger trees (by the implementor of finger trees in clojure.contrib)
finger tree implementation in Scala (I haven't used it personally, but it's the top google hit)
As a supplement to the answer by @KimStebel, I'd like to add:
If you are searching for a data structure suitable for the problem that motivated your question, then you might have a look at Extreme Cleverness: Functional Data Structures in Scala by @DanielSpiewak.
At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) but others have toParSeq, toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus you may as well treat the list as an iterator, and use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing IndexedSeqOptimized, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).
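For what it's worth, in current Scala (2.13+) the parallel collections moved to the separate scala-parallel-collections module, and par is available uniformly through one import (a sketch, assuming that module is on the classpath):

    import scala.collection.parallel.CollectionConverters._

    // Arrays parallelize cheaply: splitting work by index ranges is trivial.
    val sumOfSquares = Array.tabulate(1000000)(_.toLong).par.map(x => x * x).sum

    // A List has no fast index, so the conversion to a parallel
    // collection itself costs O(n) before any parallel work starts.
    val doubled = List(1, 2, 3).par.map(_ * 2)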