How can an unbounded PCollection be immutable? - apache-beam

I am getting started in dataflow/apache beam, and I'm struggling to understand a concept. According to the documentation :
A PCollection is an immutable collection of values of type T. A PCollection can contain either a bounded or unbounded number of elements.
It is easy to understand that bounded PCollections are immutable. You get a file, you put it in a PCollection, you can't change it: Immutable.
What about unbounded PCollections? They are by definition, without a limit of number of elements, so stuff always gets added to them indefinitely; i.e. How can something be changed perpetually and also be immutable?
An explanation would be great.

That's a good question! I believe the Programming Guide explains PCollection's immutability better than the JavaDoc. The immutability has to do with individual elements:
A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a PCollection and generate new pipeline data (as a new PCollection), but it does not consume or modify the original input collection.
Note: Beam SDKs avoid unnecessary copying of elements, so PCollection contents are logically immutable, not physically immutable. Changes to input elements may be visible to other DoFns executing within the same bundle, and may cause correctness issues. As a rule, it’s not safe to modify values provided to a DoFn.

Another way to look at it is that the set is logically immutable, it's just your view into it that's changing over time (due to the inability to see into the future). E.g. ReadFromPubSub returns the (immutable, unbounded) set of all message that will ever be published to this topic. From the Beam API you can't modify this set as a PCollection, but you can create other immutable, unbounded PCollections that are derived from it.
This is similar to lazy, infinite structures that exist in functional language like Haskell--you can only ever observe a portion of it, but that doesn't mean the whole thing doesn't exist as an immutable object.

Related

Creating attributes for elements of a PCollection in Apache Beam

I'm fairly new to Apache Beam and was wondering if I can create my own attributes for elements in a PCollection.
I went through the docs but could not find anything.
Example 2 in this ParDo doc shows how to access the TimestampParam and the WindowParam, which are - from what I understand - attributes of each element in a PCollection:
class AnalyzeElement(beam.DoFn):
def process(
self,
elem,
timestamp=beam.DoFn.TimestampParam,
window=beam.DoFn.WindowParam):
yield [...]
So my question is, if it is possible to create such attributes (e.g. TablenameParam) for the elements in a PCollection and if not, if there is some kind of workaround to achieve that?
What you are describing would simply be part of the element. For your example, the TablenameParam would be a field of the type you add to the PCollection.
The reason that WindowParam and TimestampParam are treated differently is that they are often propagated implicitly, and are a part of the Beam model for every element no matter the actual data. Most code that is only operating on the main data does not need to touch them.

How is Scala so efficient with lists?

It's usually considered a bad practice to make unnecessary collections in Java as it consumes some memory and CPU. Scala seems to be pretty efficient with it and encourages to use immutable data structures.
How is Scala so efficient with Lists? What techniques are used to achieve that?
While the comments are correct that the claim that list is particularly efficient is a dubious one, it does much better than doing full copies of the collection for every operation like you would do with Java's standard collections.
The reason for this is List and the other immutable collections are not just mutable collections with mutation methods returning a copy, but are designed differently to with immutability in mind. They Take advantage of something called "structural sharing". If parts of a collection remain the same after a change, then those parts don't need to be copied and the same object can be shared across multiple collections. This works because of immutability, there is no change that they could be altered so it's safe to share.
Imagine the simplest example, prepending to a list.
You have a List(1,2,3) and you want to prepend 0
val original = List(1,2,3)
val updated = 0 :: original
You list would then look something like this
updated original
\ \
0 - - - 1 - - - 2 - - - 3
All that's needed is to create a new node and point it's tail to the head of your original list. Nothing needs to be copied. Similarly the tail and drop operations just need to return a reference to the appropriate node and nothing needs to be copied. This is why List can be quite good with the prepend and tail operations, because it doesn't do any copying even though it creates a "new" List.
Other List operations do require some amount copying, but always as little as possible. As long as part of the tail of a list is unchanged it doesn't need to be copied. For example when concatenating lists, the first list needs to be copied, but then it's tail can just point to the head of the second, so the second list doesn't need to be copied at all. This is why, when concatenating a long and short list it's better to put the shorter list on the "left" as it is the only one that needs to be copied.
Other types of collections are better at different operations. Vector for example can to both prepend and append in amortized constant time, as well as having good random access and update capabilities (though still much worse than a raw mutable array). In most cases it will be more efficient than List while still being immutable. It's implementation is quite complicated. It uses a trie datastructure, with many internal arrays to store data. The unchanged ones can be shared and only the ones that need to be altered by an update operation need to be copied.

How can I get the index of an item in an IOrderedQueryable?

Background:
I'm designing a list-like control (WinForms) that's backed by a DbSet. A chief requirement is that it doesn't load the entire list into local memory. I'm using a DataGridView in virtual mode as the underlying UI. I'm planning to implement the CellValueNeeded function as orderedQueryable.ElementAt(n).
Problem:
I need to allow the control's consumer to get/set the currently-selected value, by value rather than by index. Getting is easy--it's the same as the CellValueNeeded operation--but setting is harder: it requires me to get the index of a given element. There's not a built-in orderedQueryable.FirstIndexOf(value) operation, and although I could theoretically fake it with some sort of orderedQueryable.SkipWhile shenanigans where the expression has a side-effect, in practice the DbSet's query provider probably doesn't support doing that.
Questions:
Is there an efficient way to get the index of a particular value within an IOrderedQueryable? How?
(If this approach turns out to be untenable, I'd settle for suggestions on how I might restructure the problem to make it solvable.)
Side notes:
Elements can be inserted and removed from the list, in which case the old indices will be invalid--that's acceptable, since they're never exposed to the consumer. It's an error for the consumer to attempt to select an item that isn't actually in the list, and actually the consumer would have gotten the item from the list in the first place (although perhaps the indices have changed since then).

Tree collections in Scala

I want to implement a tree in Scala. My particular tree uses Swing Split panes to give multiple views of a geographical map. Any pane within a split pane can itself be further divided to give an extra view. Am I right in saying that neither TreeMap nor TreeSet provide Tree functionality? Please excuse me if I've misunderstood this. It strikes me there should be standard Tree collections and it is bad practice to keep reinventing the wheel. Are there any Tree implementation out there that might be the future Scala standard?
All Trees have three types of elements: a Root, Nodes and Leaves. Leaves and Nodes must have a single reference to a parent. Root and Nodes can have multiple references to child nodes and leaves. Leaves have zero children. Nodes and Root can not be deleted without their children being deleted. there's probably other rules / constraints that I've missed.
This seems like enough common specification to justify a standard collection. I would also suggest that there should be a standard subclass collection for the case where Root and Nodes can only have 2 children or a single leaf child. Which is what I want in my particular case.
Actually, a tree by itself is both pretty useless and pretty difficult to specify.
Starting with the latter, speaking strictly about the data structure, how many children can a tree have? Do nodes store values or not? Do nodes store metadata? Do children have pointers to their parents? Do you store the tree as nodes with pointers, or as positional elements on an array?
These are all questions to which the answer is "it depends". In fact, you stated that children have pointers to their parents, but that is not true for any immutable tree! You also seem to assume trees are always stored as node objects with references, when some trees are actually stored as nodes on a single array (such as a Heap).
Furthermore, not all these requirements can be accommodated -- some are mutually exclusive. Even if you ignore those, you are still left with a data structure which is not optimized for anything and clumsy to use because you have to deal with lots of details not relevant to you.
And, then, there's the second problem, which is that a tree, by itself, is useless. TreeSet and TreeMap take advantage of specific trees whose insertion/deletion/lookup algorithm makes them good data structures for sorted data. That, however, is not at all the only use for trees. Trees can be used to search for spatial algorithms, to represent tree-like real world information, to make up filesystems, etc. Sometimes the task is finding a tree inside a graph. Each of these uses require different representations and different algorithms -- the algorithms being what make them at all useful.
And, to top it off, writing a tree class is trivial. The problem is writing the algorithms to manipulate it.
There is a bit of mismatch between the notion of "tree" as a GUI widget—which you seem to be referring to—and tree as an ordered data structure. In the former case it is just a nested sequence, in the latter the aim is to provide for instance fast search algorithms, and you don't arbitrary manipulate the internal structure, where often the branching factor is constant and the tree height is kept balanced, etc. An example of the latter is collection.immutable.TreeMap which uses a self-balancing binary tree structure called Red-Black-Tree.
So these data structures are rather useless for bridging to javax.swing.TreeModel. There is little that can be done about this interface, so you'll probably stick with the default implementation DefaultTreeModel, a mutable non-generic structure (which is all that single threaded Swing needs).
For a discussion about having a scala-swing JTree wrapper, see this question. It also has a link to a Scala library for JTree.
Since you can use java classes with Scala, take a look at the javax.swing.tree package: http://docs.oracle.com/javase/6/docs/api/javax/swing/tree/package-summary.html, especially TreeModel and TreeNode, MutableTreeNode and DefaultMutableTreeNode. They were designed to be used with Swing, but are pretty much a standard tree implementation.
Other than that, implementing a (generic) tree should be pretty straightforward.
Since gui application imposes low performance demands on the tree collection you may use general graph library constrained to represent only tree structured-graph: http://scala-graph.org/
TreeSet and TreeMap are both based on RedBlack:
Red-black trees are a form of balanced binary trees where some nodes
are designated “red” and others designated “black.” Like any balanced
binary tree, operations on them reliably complete in time logarithmic
to the size of the tree.
(quote from Scala 2.8 Collections)
RedBlack is not documented very well but if you look at the source of TreeSet and TreeMap it's pretty easy to figure out how to use it, though it doesn't fill all (most?) of your requirements (nodes don't have references to the parent, etc).

Why no immutable double linked list in Scala collections?

Looking at this question, where the questioner is interested in the first and last instances of some element in a List, it seems a more efficient solution would be to use a DoubleLinkedList that could search backwards from the end of the list. However there is only one implementation in the collections API and it's mutable.
Why is there no immutable version?
Because you would have to copy the whole list each time you want to make a change. With a normal linked list, you can at least prepend to the list without having to copy everything. And if you do want to copy everything on every change, you don't need a linked list for that. You can just use an immutable array.
There are many impediments to such a structure, but one is very pressing: a doubly linked list cannot be persistent.
The logic behind this is pretty simple: from any node on the list, you can reach any other node. So, if I added an element X to this list DL, and tried to use a part of DL, I'd face this contradiction: from the node pointing to X one can reach every element in part(DL), but, by the properties of the doubly linked list, that means from any element of part(DL) I can reach the node pointing to X. Since part(DL) is supposed to be immutable and part of DL, and since DL did not include the node pointing to X, that just cannot be.
Non-persistent immutable data structures might have some uses, but they are generally bad for most operations, since they need to be recreated whenever a derivative is produced.
Now, there's the minor matter of creating mutually referencing strict objects, but this is surmountable. One can use by-name parameters and lazy vals, or one can do like Scala's List: actually create a mutable collection, and then "freeze" it in immutable state (see ListBuffer and it's toList method).
Because it is logically impossible to create a mutually (circular) referential data-structure with strict immutability.
You cannot create two nodes that point to each other due to simple existential ordering priority, in that at least one of the nodes will not exist when the other is created.
It is possible to get this circularity with tricks involving laziness (which is implemented with mutation), but the real question then becomes why you would want this thing in the first place?
As others have noted, there is no persistent implementation of a double-linked list. You will need some kind of tree to get close to the characteristics you want.
In particular, you may want to look at finger trees, which provide O(1) access to the front and back, amortized O(1) insertion to the front and back, and O(log n) insertion elsewhere. (That's in contrast to most other commonly-used trees which have O(log n) access and insertion everywhere.)
See also:
video explanation of finger trees (by the implementor of finger trees in clojure.contrib)
finger tree implementation in Scala (I haven't used it personally, but it's the top google hit)
As a supplemental to the answer of #KimStebel I like to add:
If you are searching for a data structure suitable for the question that motivated you to ask this question, then you might have a look at Extreme Cleverness: Functional Data Structures in Scala by #DanielSpiewak.