Combine vs ParDo in apache beam - apache-beam

May I know the exact difference between a ParDo and a Combine transformation in Apache Beam?
Can I see ParDo as the Map phase in the map/shuffle/reduce while Combine as the reduce phase?
Thank you!

As far as I have understand Apache Beam, there are no explicit Map and Reduce phases.
You can apply several element-wise map functions in a row, where ParDo is the most general class that can be used for own implementation.
The term reduce has been replaced by aggregation and there the corresponding class is Combine.

MapReduce is limited to graphs of the shape Map-Shuffle-Reduce, where Reduce is an elementwise operation, just like map, that is distinguished only by following the shuffle.
In Apache Beam, one can have arbitrary topologies, e.g.
Map-Map-Shuffle-Map-Shuffle-Map-Map-Shuffle-Map
so the notion of breaking phases down by that which follows shuffle no longer holds. (Beam calls Map/Shuffle ParDo and GroupByKey respectively.)
Combine operations are a special kind of Map operations that are known to be associative (think sum, max, etc. but they can be much more complicated) which allow us to push part of the work before the shuffle, e.g.
Shuffle-Sum
becomes
PartialSum-Shuffle-Sum
(Most MapReduce systems also have this notion, named combining or semi-reducing or similar.)
Note that Beam's CombinePerKey and GlobalCombine operations pair the shuffle with the CombineFn, no need to GroupByKey first.

Related

PySpark MLLib APproximate nearest neighbour search for multiple keys

I want to use ANN from PySpark. I have a DataFrame of 100K keys for which I want to perform top-10 ANN searches on an already transformed Spark DataFrame. But it seems that API of BucketedRandomProjectionLSH expects only one key at a time. I also want to avoid using approxSimilarityJoin, because it only allows you to set a threshold, but this would lead to a variable k for each key (it also fails in my case saying that for some records there are no NNs for a given threshold).
Currently, the best thing I came up with is .collect() the keys and call approxNearestNeighbors in a for loop on the driver, but it is terribly inefficient.
Does anyone know how I can get top-10 ANN searches for my 100K keys in parallel?
Thank you.

Converting Dataframe from Spark to the type used by DL4j

Is there any convenient way to convert Dataframe from Spark to the type used by DL4j? Currently using Daraframe in algorithms with DL4j I get an error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use datavec for that. I can point you at examples for that if you want. Dataframes make too many assumptions that make it too brittle to be used for real world deep learning.
Beyond that, a data frame is not typically a good abstraction for representing linear algebra. (It falls down when dealing with images for example)
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a dataset is just a pair of ndarrays just like numpy. If you have to use spark tools, and want to use ndarrays on the last mile only, then my advice would be to get the dataframe to match some form of schema that is purely numerical, map that to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off heap.
Spark has many limitations when it comes to working with their data pipelines and using the JVM for things it shouldn't be(matrix math) - we took a different approach that allows us to use gpus and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map dataframes on to a double/float array and then use Nd4j.create(float/doubleArray) or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(yourInputArray),yourLabelINDARray))
That will give you a "dataset" You need a pair of ndarrays matching your input data and a label.
The label from there is relative to the kind of problem you're solving whether that be classification or regression though.

Scala -- are map and filter operations in linear time?

Are Scala map and filter operations in linear time, or is there some parallelism under the hood for data structures like Arrays?
Independently of parallelism, map and filter operations always have to do at least O(n) work where n is the number of elements in the collection.
If the collection is e.g. Array, List, ArrayBuffer, HashMap or HashSet, then filter and map do O(n) work.
For specific collections like balanced trees -- e.g. mutable.TreeSet, immutable.TreeMap, immutable.HashSet or immutable.Vector, the filter and map take O(n logn) time, because updating them to add all the elements takes more and more work as the collection grows.
Independently of how much work is needed to traverse all the elements, many Scala collections (typically those based on trees, maps, tries and arrays) support parallel filter and map, so the total amount of work done per processor is O(n / p), where p is the number of processors your machine has. To use them call par on the collection before calling filter or map.
Read more about parallel collections here.
No, there is no parallelism at all, unless you're doing parallel collections which is kinda explicit. Even with parallelism, the map and filter are linear time operations (but spread among many workers)

When should I choose Vector in Scala?

It seems that Vector was late to the Scala collections party, and all the influential blog posts had already left.
In Java ArrayList is the default collection - I might use LinkedList but only when I've thought through an algorithm and care enough to optimise. In Scala should I be using Vector as my default Seq, or trying to work out when List is actually more appropriate?
As a general rule, default to using Vector. It’s faster than List for almost everything and more memory-efficient for larger-than-trivial sized sequences. See this documentation of the relative performance of Vector compared to the other collections. There are some downsides to going with Vector. Specifically:
Updates at the head are slower than List (though not by as much as you might think)
Another downside before Scala 2.10 was that pattern matching support was better for List, but this was rectified in 2.10 with generalized +: and :+ extractors.
There is also a more abstract, algebraic way of approaching this question: what sort of sequence do you conceptually have? Also, what are you conceptually doing with it? If I see a function that returns an Option[A], I know that function has some holes in its domain (and is thus partial). We can apply this same logic to collections.
If I have a sequence of type List[A], I am effectively asserting two things. First, my algorithm (and data) is entirely stack-structured. Second, I am asserting that the only things I’m going to do with this collection are full, O(n) traversals. These two really go hand-in-hand. Conversely, if I have something of type Vector[A], the only thing I am asserting is that my data has a well defined order and a finite length. Thus, the assertions are weaker with Vector, and this leads to its greater flexibility.
Well, a List can be incredibly fast if the algorithm can be implemented solely with ::, head and tail. I had an object lesson of that very recently, when I beat Java's split by generating a List instead of an Array, and couldn't beat that with anything else.
However, List has a fundamental problem: it doesn't work with parallel algorithms. I cannot split a List into multiple segments, or concatenate it back, in an efficient manner.
There are other kinds of collections that can handle parallelism much better -- and Vector is one of them. Vector also has great locality -- which List doesn't -- which can be a real plus for some algorithms.
So, all things considered, Vector is the best choice unless you have specific considerations that make one of the other collections preferable -- for example, you might choose Stream if you want lazy evaluation and caching (Iterator is faster but doesn't cache), or List if the algorithm is naturally implemented with the operations I mentioned.
By the way, it is preferable to use Seq or IndexedSeq unless you want a specific piece of API (such as List's ::), or even GenSeq or GenIndexedSeq if your algorithm can be run in parallel.
Some of the statements here are confusing or even wrong, especially the idea that immutable.Vector in Scala is anything like an ArrayList.
List and Vector are both immutable, persistent (i.e. "cheap to get a modified copy") data structures.
There is no reasonable default choice as their might be for mutable data structures, but it rather depends on what your algorithm is doing.
List is a singly linked list, while Vector is a base-32 integer trie, i.e. it is a kind of search tree with nodes of degree 32.
Using this structure, Vector can provide most common operations reasonably fast, i.e. in O(log_32(n)). That works for prepend, append, update, random access, decomposition in head/tail. Iteration in sequential order is linear.
List on the other hand just provides linear iteration and constant time prepend, decomposition in head/tail. Everything else takes in general linear time.
This might look like as if Vector was a good replacement for List in almost all cases, but prepend, decomposition and iteration are often the crucial operations on sequences in a functional program, and the constants of these operations are (much) higher for vector due to its more complicated structure.
I made a few measurements, so iteration is about twice as fast for list, prepend is about 100 times faster on lists, decomposition in head/tail is about 10 times faster on lists and generation from a traversable is about 2 times faster for vectors. (This is probably, because Vector can allocate arrays of 32 elements at once when you build it up using a builder instead of prepending or appending elements one by one).
Of course all operations that take linear time on lists but effectively constant time on vectors (as random access or append) will be prohibitively slow on large lists.
So which data structure should we use?
Basically, there are four common cases:
We only need to transform sequences by operations like map, filter, fold etc:
basically it does not matter, we should program our algorithm generically and might even benefit from accepting parallel sequences. For sequential operations List is probably a bit faster. But you should benchmark it if you have to optimize.
We need a lot of random access and different updates, so we should use vector, list will be prohibitively slow.
We operate on lists in a classical functional way, building them by prepending and iterating by recursive decomposition: use list, vector will be slower by a factor 10-100 or more.
We have an performance critical algorithm that is basically imperative and does a lot of random access on a list, something like in place quick-sort: use an imperative data structure, e.g. ArrayBuffer, locally and copy your data from and to it.
For immutable collections, if you want a sequence, your main decision is whether to use an IndexedSeq or a LinearSeq, which give different guarantees for performance. An IndexedSeq provides fast random-access of elements and a fast length operation. A LinearSeq provides fast access only to the first element via head, but also has a fast tail operation. (Taken from the Seq documentation.)
For an IndexedSeq you would normally choose a Vector. Ranges and WrappedStrings are also IndexedSeqs.
For a LinearSeq you would normally choose a List or its lazy equivalent Stream. Other examples are Queues and Stacks.
So in Java terms, ArrayList used similarly to Scala's Vector, and LinkedList similarly to Scala's List. But in Scala I would tend to use List more often than Vector, because Scala has much better support for functions that include traversal of the sequence, like mapping, folding, iterating etc. You will tend to use these functions to manipulate the list as a whole, rather than randomly accessing individual elements.
In situations which involve a lot random access and random mutation, a Vector (or – as the docs say – a Seq) seems to be a good compromise. This is also what the performance characteristics suggest.
Also, the Vector class seems to play nicely in distributed environments without much data duplication because there is no need to do a copy-on-write for the complete object. (See: http://akka.io/docs/akka/1.1.3/scala/stm.html#persistent-datastructures)
If you're programming immutably and need random access, Seq is the way to go (unless you want a Set, which you often actually do). Otherwise List works well, except it's operations can't be parallelized.
If you don't need immutable data structures, stick with ArrayBuffer since it's the Scala equivalent to ArrayList.

Why is this not faster using parallel collections?

I just wanted to test the parallel collections a bit and I used the following line of code (in REPL):
(1 to 100000).par.filter(BigInt(_).isProbablePrime(100))
against:
(1 to 100000).filter(BigInt(_).isProbablePrime(100))
But the parallel version is not faster. In fact it even feels a bit slower (But I haven't really measured that).
Has anyone an explanation for that?
Edit 1: Yes, I do have a multi-core processor
Edit 2: OK, I "solved" the problem myself. The implementation of isProbablePrime seems to be the problem and not the parallel collections. I replaced isProbablePrime with another function to test for primality and now I get an expected speedup.
Both with sequential and parallel ranges, filter will generate a vector data structure - a Vector or a ParVector, respectively.
This is a known problem with parallel vectors that get generated from range collections - transformer methods (such as filter) for parallel vectors do not construct the vector in parallel.
A solution for this that allows efficient parallel construction of vectors has already been developed, but was not yet implemented. I suggest you file a ticket, so that it can be fixed for the next release.