Apache Beam API to Runner instruction translation - apache-beam

Input: A sentence
Expected Output: String representation of an array generated by line.split(' ')
Transformation defined
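.apply(
    MapElements.into(TypeDescriptors.strings())
        // Needs java.util.Arrays. Arrays.asList prints the tokens, whereas
        // Collections.singletonList would wrap the array itself and print its identity hash.
        .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+")).toString()))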
Question:
Does Beam translate the above instruction, in which I call toString(), into a runner-specific implementation of toString? I want to avoid inadvertently defining a UDF that might cause subpar performance (I come from a Spark and Pig background). I'm a little hazy on how the translation between the Beam API and runner instructions happens; I'd appreciate any resources that shed light on it.

No, Beam runners should not rewrite that function. MapElements is implemented using a Beam ParDo that executes your function on incoming elements as is. Beam runners may fuse several adjacent steps into a single fused step. Performance may also depend on the JVM used by the Beam runner.
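To make the "as is" point concrete, here is a toy model in plain Python; it is not Beam code. The hypothetical `run_map_step` stands in for a runner and simply calls the user function once per element. Both function names are assumptions made for illustration, the regex is a Python approximation of the Java pattern in the question, and empty tokens are dropped for simplicity.

```python
import re


def split_to_string(line):
    # User function: split on runs of non-letter characters and
    # return the string form of the resulting list.
    tokens = [t for t in re.split(r"[\W\d_]+", line) if t]
    return str(tokens)


def run_map_step(fn, elements):
    # Hypothetical stand-in for a runner executing a map step:
    # the user function is invoked unchanged, once per element.
    return [fn(e) for e in elements]


result = run_map_step(split_to_string, ["to be or not", "hello, world"])
print(result)
```

In this sketch nothing inspects or replaces the body of `split_to_string`; the step only decides when and on which elements to call it.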

Related

How does a PCollection get created by a runner

Code similar to the snippet below gets called internally from a Read or GroupBy transform during expand. In Beam code, this results in the construction of a PCollection instance. Looking at the code, it is not clear what is actually being constructed, since it amounts to just a new operation. From the runner's point of view, what does calling new PCollection(...) mean?
PCollection.createPrimitiveOutputInternal(
input.getPipeline(),
WindowingStrategy.globalDefault(),
IsBounded.BOUNDED,
ByteArrayCoder.of())
From the Apache Beam programming guide:
A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
PCollection implements PValue; from its documentation:
Dataflow users should not construct PValue objects directly in their pipelines.
Think of it this way: when you build a pipeline with the SDK, you are constructing a directed acyclic graph whose nodes are PTransforms and whose edges are PCollections. In the DAG, a PCollection instance is abstract and represents an input or output of one or more PTransforms. When the DAG is executed on a runner, the data of each PCollection can reside on multiple machines/VMs/workers. You cannot view the data until you materialize it through some IO transform.
If you see new PCollection(...) inside the SDK, it builds the edge/node with the information the runner will need later when executing the DAG. A PCollection itself is not a data structure that holds data in memory.
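The graph-versus-data distinction can be sketched with a few toy classes in plain Python. This is not how Beam is implemented; `ToyPCollection`, `ToyTransform`, `toy_apply` and `toy_run` are hypothetical names used only to show that constructing an edge stores metadata, and that elements appear only when something executes the graph.

```python
class ToyPCollection:
    """Hypothetical graph edge: records its producer, holds no elements."""

    def __init__(self, producer=None, name=""):
        self.producer = producer
        self.name = name


class ToyTransform:
    """Hypothetical graph node: a function plus its input edge."""

    def __init__(self, fn, input_pc):
        self.fn = fn
        self.input_pc = input_pc


def toy_apply(pc, fn, name):
    # Building the pipeline only creates objects; nothing is computed here.
    return ToyPCollection(producer=ToyTransform(fn, pc), name=name)


def toy_run(pc, source_data):
    # Stand-in "runner": walks back to the root edge and only now touches data.
    if pc.producer is None:
        return list(source_data)
    upstream = toy_run(pc.producer.input_pc, source_data)
    return [pc.producer.fn(x) for x in upstream]


root = ToyPCollection(name="read")
lengths = toy_apply(toy_apply(root, str.strip, "strip"), len, "length")
print(toy_run(lengths, [" a ", "bcd  "]))
```

Until `toy_run` is called, `lengths` is only a description of the work: it knows its producer and its name, and nothing else.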

What is the difference between Flink Core and Flink CEP in respect of their capabilities?

While studying the Flink CEP library over the last few days, I've been under the impression that it doesn't add any fundamentally new functionality to Flink's standard capabilities. It seems that Flink CEP's only purpose is to make event processing easier, with clear semantics and an intuitive code structure. For example, Flink CEP offers only 5 match-skipping semantics. Although these may be enough for a wide range of cases, they may not solve specific problems, which sends us back to plain Flink.
A test case is the following pattern:
Emit an alert (represented by 'a') for each non-overlapping pair of numbers in a stream
Represented by the pattern:
Pattern.begin[EventType]("pair",skipStrategy).where(new AlwaysTrueFunction()).times(2)
So, for an input like 1 1 1 1 1 (numbers entering the stream from left to right), the expected output would be a a, but none of the 5 match-skipping strategies gives the right result:
No-skip: a a a a
Skip-to-next: a a a a
Skip-past-last-event: a a a a
Skip-to-first[1]: a a a a
Skip-to-last[1]: a a a a
Although these strategies can't produce the desired output, it could easily be done with a RichFunction that uses a ValueState counter to determine when a new alert should be emitted, transforming the input stream into a stream of events.
Thus, I would appreciate some light on these questions:
Why was the CEP library created if plain Flink seems to be more complete?
Is a pattern built with CEP more efficient (greater throughput or another metric) than one built with Flink's standard DataStream operators? (If possible, with links to articles, papers, or documentation about this.)
Thanks for playing with Flink CEP.
Flink CEP is a library on top of Flink. As such, it does not add any functionality that cannot be implemented using vanilla Flink (ProcessFunctions, etc.). In fact, under the hood it is implemented as a special operator that checks for elements matching a specific pattern, and much of its functionality could probably even be implemented as a ProcessFunction (with a lot of tooling around it).
That said, Flink CEP may not add functionality that cannot be implemented with vanilla Flink, but it adds expressivity, which makes some use cases easier to implement. The same holds for other APIs as well, for example the Windowing API in Flink, which you could implement using ProcessFunctions (with a lot of tooling around them).
Now, when it comes to efficiency, the answer is "it depends". A handcrafted ProcessFunction tailored to your use case, with every optimization your workload allows, can be more efficient than Flink CEP, since the latter is a general-purpose library. If you have the expertise and the time, the best approach is to implement PoCs with both (CEP and vanilla Flink) and choose the more efficient one for your case.
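The counter idea from the question can be sketched in plain Python. This is only the core logic, not Flink code: the local variable `count` plays the role a per-key ValueState would play inside a RichFunction or ProcessFunction, and `alerts_for_pairs` is a hypothetical name.

```python
def alerts_for_pairs(events):
    """Emit one 'a' per non-overlapping pair of consecutive events."""
    count = 0  # stands in for the ValueState counter
    out = []
    for _ in events:
        count += 1
        if count == 2:
            out.append("a")  # a full pair has been seen: emit an alert
            count = 0        # reset so the next pair cannot overlap this one
    return out


print(" ".join(alerts_for_pairs([1, 1, 1, 1, 1])))  # prints: a a
```

The fifth element leaves the counter at 1, so no third alert is emitted until another event arrives.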

Pyspark throwing error: py4j.Py4JException: Method __getstate__([]) does not exist [duplicate]

Background
My original question here was "Why using DecisionTreeModel.predict inside map function raises an exception?" and it is related to "How to generate tuples of (original lable, predicted label) on Spark with MLlib?"
With the Scala API, the recommended way of getting predictions for an RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD:
val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
Unfortunately, a similar approach in PySpark doesn't work so well:
labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Instead, the official documentation recommends something like this:
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
So what is going on here? There is no broadcast variable here, and the Scala API defines predict as follows:
/**
* Predict values for a single data point using the model trained.
*
* @param features array representing a single data point
* @return Double prediction from the trained model
*/
def predict(features: Vector): Double = {
topNode.predict(features)
}
/**
* Predict values for the given data set using the model trained.
*
* @param features RDD representing data points to be predicted
* @return RDD of predictions for each of the given data points
*/
def predict(features: RDD[Vector]): RDD[Double] = {
features.map(x => predict(x))
}
So, at least at first glance, calling predict from an action or transformation should not be a problem, since prediction seems to be a local operation.
Explanation
After some digging, I figured out that the source of the problem is the JavaModelWrapper.call method invoked from DecisionTreeModel.predict. It accesses SparkContext, which is required to call a Java function:
callJavaFunc(self._sc, getattr(self._java_model, name), *a)
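A plain-Python analogy may help here. It assumes that shipping a function to workers means serializing every object the function references; `DriverOnlyContext` and `ToyModel` are hypothetical stand-ins, and PySpark's actual serializer and checks differ in detail.

```python
import pickle


class DriverOnlyContext:
    """Hypothetical stand-in for a driver-side handle that must not be shipped."""

    def __reduce__(self):
        # Refuse serialization, so the handle can never leave the driver.
        raise RuntimeError("context can only be used on the driver")


class ToyModel:
    """Hypothetical model wrapper that keeps a reference to the context."""

    def __init__(self, ctx, threshold):
        self._ctx = ctx
        self.threshold = threshold

    def predict(self, x):
        return 1.0 if x > self.threshold else 0.0


model = ToyModel(DriverOnlyContext(), threshold=0.5)

try:
    pickle.dumps(model)  # serializing the model drags the context along
    shipped = True
except RuntimeError:
    shipped = False

# Serializing only the plain data the function needs works fine.
payload = pickle.dumps(model.threshold)
print(shipped, pickle.loads(payload))
```

In the analogy, the prediction logic itself is harmless; the failure comes from the wrapper carrying a driver-only handle.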
Question
In the case of DecisionTreeModel.predict there is a recommended workaround, and all the required code is already part of the Scala API, but is there any elegant way to handle problems like this in general?
The only solutions I can think of right now are rather heavyweight:
pushing everything down to JVM either by extending Spark classes through Implicit Conversions or adding some kind of wrappers
using Py4j gateway directly
Communication using the default Py4J gateway is simply not possible. To understand why, we have to take a look at the diagram in the PySpark Internals document [1].
Since the Py4J gateway runs on the driver, it is not accessible to the Python interpreters, which communicate with the JVM workers through sockets (see for example PythonRDD / rdd.py).
Theoretically, it would be possible to create a separate Py4J gateway for each worker, but in practice it is unlikely to be useful. Even ignoring issues like reliability, Py4J is simply not designed to perform data-intensive tasks.
Are there any workarounds?
Using Spark SQL Data Sources API to wrap JVM code.
Pros: Supported, high level, doesn't require access to the internal PySpark API
Cons: Relatively verbose and not very well documented, limited mostly to the input data
Operating on DataFrames using Scala UDFs.
Pros: Easy to implement (see Spark: How to map Python with Scala or Java User Defined Functions?), no data conversion between Python and Scala if data is already stored in a DataFrame, minimal access to Py4J
Cons: Requires access to Py4J gateway and internal methods, limited to Spark SQL, hard to debug, not supported
Creating a high-level Scala interface, similar to how it is done in MLlib.
Pros: Flexible, ability to execute arbitrarily complex code. It can be done either directly on RDDs (see for example the MLlib model wrappers) or with DataFrames (see How to use a Scala class inside Pyspark). The latter solution seems to be much friendlier, since all ser-de details are already handled by the existing API.
Cons: Low level, requires data conversion, and, like UDFs, requires access to Py4J and the internal API; not supported
Some basic examples can be found in Transforming PySpark RDD with Scala
Using external workflow management tool to switch between Python and Scala / Java jobs and passing data to a DFS.
Pros: Easy to implement, minimal changes to the code itself
Cons: Cost of reading / writing data (Alluxio?)
Using shared SQLContext (see for example Apache Zeppelin or Livy) to pass data between guest languages using registered temporary tables.
Pros: Well suited for interactive analysis
Cons: Not so much for batch jobs (Zeppelin) or may require additional orchestration (Livy)
[1] Joshua Rosen. (2014, August 04). PySpark Internals. Retrieved from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowsWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub, and it was fetched properly, but none of the transformations after .groupByKey seem to work.
sc.pubsubSubscription[String](psSubscription)
.withFixedWindows(windowSize) // apply windowing logic
.flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
.countByValue
.withWindow[IntervalWindow]
.swap
.groupByKey
.map {
s =>
println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
println(s)
}
Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping it by key requires defining aggregation triggers; otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here:
https://cloud.google.com/dataflow/model/group-by-key
Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.
If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
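The role of windowing in the note above can be sketched in plain Python. This toy model only shows how fixed windows cut a stream into finite groups; it does not model watermarks or triggers, and the function names and sample events are hypothetical.

```python
from collections import defaultdict


def window_start(ts, size):
    # Fixed windows cover [start, start + size).
    return ts - (ts % size)


def group_by_key_per_window(events, size):
    """Group (timestamp, key, value) events by (window start, key)."""
    groups = defaultdict(list)
    for ts, key, value in events:
        groups[(window_start(ts, size), key)].append(value)
    return dict(groups)


events = [(1, "a", 1), (4, "a", 2), (11, "a", 3), (12, "b", 4)]
grouped = group_by_key_per_window(events, size=10)
print(grouped)
```

Each (window, key) group is finite even if the stream never ends, which is what lets a grouping step eventually emit a result for it.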
Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.
(see https://beam.apache.org/documentation/programming-guide/#triggers)
With regard to Franz's comment above (I would reply to his comment directly if Stack Overflow would let me!), I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.
See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py
Beware, the API is unfinished as this is not "release-ready" code. But it is available.