Is there any specific reason for the collectAsList method of the Spark DataFrame API to return a java.util.List instead of a Scala List?
I believe it's mostly a convenience function for Java users, and it also makes things much simpler for the Python API. Looking at the git logs (and also the @since annotation), it was introduced in the initial merge of the DataFrame API, so it wasn't necessarily added in response to a particular need. Some of the APIs return Java types since they are easier to interface with from Python (through Py4J), but that doesn't appear to be the case here (the Python API collects by turning the DataFrame into an RDD and collecting on the RDD).
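If you do want a Scala collection on the driver side, the returned Java list can be converted with JavaConverters; a minimal sketch, assuming an existing DataFrame df:
import scala.collection.JavaConverters._
val javaRows: java.util.List[org.apache.spark.sql.Row] = df.collectAsList()
val scalaRows: Seq[org.apache.spark.sql.Row] = javaRows.asScala  // wraps the Java list as a Scala Buffer
// or skip the conversion entirely: collect() already returns Array[Row] on the Scala side
val rows: Array[org.apache.spark.sql.Row] = df.collect()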
I am trying to write streaming JSON messages directly to Parquet using Scala (no Spark). I found only a couple of posts online, plus this post; however, the ParquetWriter API it mentions is deprecated and the solution doesn't actually provide an example to follow. I read some other posts too but didn't find any descriptive explanation.
I understand I have to use the ParquetFileWriter API, but the lack of documentation is making it difficult for me to use. Can someone please provide an example of it, along with all the constructor parameters and how to create those parameters, especially the schema?
You may want to try Eel, a toolkit for manipulating data in the Hadoop ecosystem.
I recommend reading the README to gain a better understanding of the library, but to give you a sense of how it works, what you are trying to do would look somewhat like the following:
val source = JsonSource(() => new FileInputStream("input.json"))
val sink = ParquetSink(new Path("output.parquet"))
source.toDataStream().to(sink)
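For completeness, the snippet above also needs a few imports; the eel package paths below are an assumption based on the eel-sdk layout, so adjust them to the version you use:
import java.io.FileInputStream
import org.apache.hadoop.fs.Path                 // Hadoop path for the Parquet output
import io.eels.component.json.JsonSource         // assumed eel-sdk package; verify against your eel version
import io.eels.component.parquet.ParquetSink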
Background
My original question here was Why using DecisionTreeModel.predict inside map function raises an exception? and is related to How to generate tuples of (original lable, predicted label) on Spark with MLlib?
When we use the Scala API, a recommended way of getting predictions for an RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD:
val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
Unfortunately, a similar approach in PySpark doesn't work so well:
labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Instead, the official documentation recommends something like this:
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
So what is going on here? There is no broadcast variable here, and the Scala API defines predict as follows:
/**
* Predict values for a single data point using the model trained.
*
* @param features array representing a single data point
* @return Double prediction from the trained model
*/
def predict(features: Vector): Double = {
topNode.predict(features)
}
/**
* Predict values for the given data set using the model trained.
*
* @param features RDD representing data points to be predicted
* @return RDD of predictions for each of the given data points
*/
def predict(features: RDD[Vector]): RDD[Double] = {
features.map(x => predict(x))
}
so at least at first glance, calling it from an action or transformation is not a problem, since prediction seems to be a local operation.
Explanation
After some digging I figured out that the source of the problem is the JavaModelWrapper.call method invoked from DecisionTreeModel.predict. It accesses the SparkContext, which is required to call the Java function:
callJavaFunc(self._sc, getattr(self._java_model, name), *a)
Question
In the case of DecisionTreeModel.predict there is a recommended workaround, and all the required code is already part of the Scala API, but is there any elegant way to handle problems like this in general?
The only solutions I can think of right now are rather heavyweight:
pushing everything down to the JVM, either by extending Spark classes through implicit conversions or by adding some kind of wrappers
using the Py4J gateway directly
Communication using the default Py4J gateway is simply not possible. To understand why, we have to take a look at the following diagram from the PySpark Internals document [1]:
Since the Py4J gateway runs on the driver, it is not accessible to the Python interpreters, which communicate with the JVM workers through sockets (see for example PythonRDD / rdd.py).
Theoretically, it would be possible to create a separate Py4J gateway for each worker, but in practice it is unlikely to be useful. Ignoring issues like reliability, Py4J is simply not designed to perform data-intensive tasks.
Are there any workarounds?
Using Spark SQL Data Sources API to wrap JVM code.
Pros: Supported, high level, doesn't require access to the internal PySpark API
Cons: Relatively verbose and not very well documented, limited mostly to the input data
Operating on DataFrames using Scala UDFs (see the registration sketch after this list of options).
Pros: Easy to implement (see Spark: How to map Python with Scala or Java User Defined Functions?), no data conversion between Python and Scala if data is already stored in a DataFrame, minimal access to Py4J
Cons: Requires access to Py4J gateway and internal methods, limited to Spark SQL, hard to debug, not supported
Creating a high-level Scala interface in a similar way to how it is done in MLlib.
Pros: Flexible, ability to execute arbitrarily complex code. It can be done either directly on RDDs (see for example MLlib model wrappers) or with DataFrames (see How to use a Scala class inside Pyspark). The latter solution seems to be much more friendly, since all ser-de details are already handled by the existing API.
Cons: Low level, requires data conversion, and like UDFs requires access to Py4J and the internal API; not supported
Some basic examples can be found in Transforming PySpark RDD with Scala
Using an external workflow management tool to switch between Python and Scala / Java jobs and passing data via a DFS.
Pros: Easy to implement, minimal changes to the code itself
Cons: Cost of reading / writing data (Alluxio?)
Using a shared SQLContext (see for example Apache Zeppelin or Livy) to pass data between guest languages using registered temporary tables.
Pros: Well suited for interactive analysis
Cons: Not so much for batch jobs (Zeppelin) or may require additional orchestration (Livy)
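To illustrate the Scala UDF option above, here is a minimal registration sketch, assuming a Spark 2.x SparkSession shared between the JVM and PySpark; the UDF name str_len is hypothetical:
import org.apache.spark.sql.SparkSession

object RegisterUdfs {
  // Register a JVM-side UDF so it becomes callable from SQL issued in any guest language.
  def register(spark: SparkSession): Unit = {
    spark.udf.register("str_len", (s: String) => if (s == null) 0 else s.length)
  }
}
// From PySpark one would then call something like:
//   spark.sql("SELECT str_len(name) FROM people")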
Joshua Rosen. (2014, August 04) PySpark Internals. Retrieved from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
According to this
Spark Catalyst is an implementation-agnostic framework for manipulating trees of relational operators and expressions.
I want to use Spark Catalyst to parse SQL DMLs and DDLs and generate custom Scala code for them. However, it is not clear to me from reading the code whether there is any wrapper class around Catalyst that I can use. The ideal wrapper would receive a SQL statement and produce the equivalent Scala code. For my use case it would look like this:
// pseudocode: the ideal wrapper
def generate(sql: String): String = {
  // custom code
  // e.g. generate("select substring(s, 1, 3) as from t1") returns
  // custom Scala code which is executable given s as a List[String]
}
This is a simple example, but the idea is that I don't want to write another parser; I need to parse a lot of SQL functionality from a legacy system and write custom Scala implementations for it.
As a more general question: given the lack of class-level design documentation, how can someone learn the code base and make contributions?
Spark takes SQL queries via spark.sql. For example, you can just feed the string SELECT * FROM table as an argument, as in spark.sql("SELECT * FROM table"), after having registered your DataFrame as "table". To register your DataFrame as "table" for use in SQL queries, create a temporary view using
DataFrame.createOrReplaceTempView("table")
You can see examples here:
https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically
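As a minimal sketch, assuming an existing SparkSession spark (the input path people.json and the view name are just placeholders):
val df = spark.read.json("people.json")        // hypothetical input
df.createOrReplaceTempView("table")            // register the DataFrame for SQL access
val result = spark.sql("SELECT * FROM table")
result.show()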
A DataFrame is automatically translated into RDD operations, and the code is optimised along the way; this optimisation is done by Catalyst. When a programmer writes DataFrame code, it is optimised internally. For more detail, visit
Catalyst optimisation in Spark
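If you want to see what Catalyst actually does to a query, you can inspect the plans it produces; a small sketch, assuming a SparkSession spark and a temporary view t1 with a string column s:
val q = spark.sql("SELECT substring(s, 1, 3) AS prefix FROM t1")
q.explain(true)                                  // prints parsed, analyzed, optimized and physical plans
val optimized = q.queryExecution.optimizedPlan   // the Catalyst LogicalPlan, which you can walk programmatically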
In Spark's documentation, it says that the RDD method reduce requires an associative AND commutative binary function.
However, the method reduceByKey ONLY requires an associative binary function.
I did some tests, and apparently this is the behavior I get. Why this difference? Why does reduceByKey ensure that the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not?
For example, if I load some (small) text with 4 partitions (minimum):
val r = sc.textFile("file4k", 4)
then:
r.reduce(_ + _)
returns a string whose parts are not always in the same order, whereas:
r.map(x => (1,x)).reduceByKey(_ + _).first
always returns the same string (where everything is in the same order as in the original file).
(I checked with r.glom and the file content is indeed spread over 4 partitions; there is no empty partition.)
As far as I am concerned this is an error in the documentation, and the results you see are simply incidental. Practice, other resources and a simple analysis of the code show that the function passed to reduceByKey should be not only associative but commutative as well (a short illustration of the distinction follows the three points below).
practice - while it looks like the order is preserved in local mode, this is no longer true when you run Spark on a cluster, including standalone mode.
other resources - to quote Data Exploration Using Spark from AmpCamp 3:
There is a convenient method called reduceByKey in Spark for exactly this pattern. Note that the second argument to reduceByKey determines the number of reducers to use. By default, Spark assumes that the reduce function is commutative and associative and applies combiners on the mapper side.
code - reduceByKey is implemented using combineByKeyWithClassTag and creates a ShuffledRDD. Since Spark doesn't guarantee the order after shuffling, the only way to restore it would be to attach some metadata to the partially reduced records. As far as I can tell, nothing like this takes place.
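To make the associative-vs-commutative distinction concrete, here is a plain Scala sketch (no cluster needed), using the same string concatenation as in the question:
val concat: (String, String) => String = _ + _
// associative: grouping does not matter
assert(concat(concat("a", "b"), "c") == concat("a", concat("b", "c")))
// not commutative: argument order matters
assert(concat("a", "b") != concat("b", "a"))
// so if partial results are combined in an arbitrary order, as they may be on a cluster,
// the final concatenation order is not guaranteed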
On a side note, reduce as it is implemented in PySpark (partial results are collected and combined on the driver in partition order) will work just fine with a function that is only associative. It is of course just an implementation detail and not a part of the contract.
According to the code documentation, which was recently updated/corrected (thanks @zero323):
reduceByKey merges the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
So it was in fact a documentation error, as @zero323 pointed out in his answer.
You can check the following links to the code to make sure:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L304
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1560