Unpickle/convert PySpark RDD of Rows to Scala RDD[Row]

What I'm trying to achieve is to execute Scala code, convert the resulting Scala RDD[Row] to a PySpark RDD of Rows, perform some Python operations, and then convert the RDD of PySpark Rows back to Scala's RDD[Row].
To get the Scala RDD into PySpark I'm doing the following.
In Scala I have this method:
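import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.execution.python.EvaluatePython.{javaToPython, toJava}
// Convert each Row to its Java representation, then pickle the whole RDD for PySpark
def toPythonRDD(rdd: RDD[Row]): JavaRDD[Array[Byte]] = {
  javaToPython(rdd.map(r => toJava(r, r.schema)))
}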
Later, in PySpark, I create a new RDD by calling
RDD(jrdd, sc, BatchedSerializer(PickleSerializer()))
I end up with an RDD of PySpark Rows. I'd like to reverse that process.
I can easily get Scala's JavaRDD[Array[Byte]] by accessing rdd._jrdd. My main problem is that I don't know how to convert/unpickle it back to RDD[Row].
I've tried
sc._jvm.SerDe.pythonToJava(rdd._to_java_object_rdd(), True)
and
sc._jvm.SerDe.pythonToJava(rdd._jrdd, True)
Both crash with a similar exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
I know that I can easily pass a DataFrame back and forth between Scala and Python, but my records don't have a uniform schema. I'm using an RDD of Rows because I thought there would already be a pickler I could reuse, and it works, but so far in only one direction.

Related

Spark's toDS vs toDF

I understand that one can convert an RDD to a Dataset using rdd.toDS. However there also exists rdd.toDF. Is there really any benefit of one over the other?
After playing with the Dataset API for a day, I found that almost any operation takes me back to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find that another conversion to a Dataset is needed, because something brought me to a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations? Or is there a benefit to using toDS earlier?
Here is a small concrete example
spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId
  .map[(Integer,String,Double)] { case (row,id) => ... }
  .toDS // now with a Dataset API (should use toDF here?)
  .withColumnRenamed("_1", "id") // now back to a DataFrame, not type safe :(
  .withColumnRenamed("_2", "text")
  .withColumnRenamed("_3", "overall")
  .as[ParsedReview] // back to a Dataset
Michael Armbrust nicely explained the shift to Dataset and DataFrame and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs were merged into one, with a slight difference:
"DataFrame is just DataSet of generic row objects. When you don't know all the fields, DF is the answer".
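One way to act on this, sketched under assumptions: name the columns once with toDF and convert to a typed Dataset a single time at the end of the chain. The original post does not show the fields of ParsedReview, so the case class below is hypothetical, as are the helper name toReviews, the sample rows, and the local SparkSession.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical shape; the original post does not show ParsedReview's fields.
case class ParsedReview(id: Long, text: String, overall: Double)

object ToDfThenAs {
  // Names the tuple columns with toDF, then converts to a typed Dataset once.
  def toReviews(spark: SparkSession, rows: Seq[(Long, String, Double)]): Dataset[ParsedReview] = {
    import spark.implicits._
    spark.sparkContext
      .parallelize(rows)
      .toDF("id", "text", "overall") // untyped: columns are named here
      .as[ParsedReview]              // typed again from this point on
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("toDF-then-as").getOrCreate()
    val reviews = toReviews(spark, Seq((1L, "good", 4.0), (2L, "bad", 1.0)))
    // Typed lambda: the compiler checks that `overall` exists and is a Double.
    reviews.filter(_.overall > 2.0).show()
    spark.stop()
  }
}
```

Passing the names to toDF avoids the chain of withColumnRenamed calls, so the only untyped step is the one immediately before .as[ParsedReview].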

Spark Scala - Apply ML/Complex functions on a GroupBy DataFrame

I have a large DataFrame (Spark 1.6 Scala) which looks like this:
Type,Value1,Value2,Value3,...
--------------------------
A,11.4,2,3
A,82.0,1,2
A,53.8,3,4
B,31.0,4,5
B,22.6,5,6
B,43.1,6,7
B,11.0,7,8
C,22.1,8,9
C,3.2,9,1
C,13.1,2,3
From this I want to group by Type and apply machine learning algorithms and/or perform complex functions on each group.
My objective is to perform complex functions on each group in parallel.
I have tried the following approaches:
Approach 1) Convert the DataFrame to a Dataset and then use the ds.mapGroups() API. But this gives me an Iterator over each group's values.
If I want to call RandomForestClassificationModel.transform(dataset: DataFrame), I need a DataFrame containing only a particular group's values.
I was not sure whether converting the Iterator to a DataFrame within mapGroups is a good idea.
Approach 2) Take the distinct values of Type, map over them, and filter for each Type within the map:
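val types = df.select("Type").distinct()
val ff = types.map(row => {
  // `type` is a reserved word in Scala, so the value is named groupType
  val groupType = row.getString(0)
  // Column equality uses ===, not ==
  val thisGroupDF = df.filter(col("Type") === groupType)
  // Apply complex functions on thisGroupDF
  (groupType, predictedValue)
})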
For some reason, the above never completes (it seems to get into some kind of infinite loop).
Approach 3) I explored window functions, but did not find a method that provides a DataFrame of a particular group's values.
Please help.

Calling other methods/variables inside a UDF method in Spark SQL DataFrame

I have a Spark SQL DataFrame on which I am trying to call a UDF that I created using Spark SQL's udf:
val udfName = udf(somemethodName)
val newDF = df.withColumn("columnnew", udfName(col("anotherDFColumn")))
I'm trying to use another DataFrame, stored as a val, inside somemethodName, but that DataFrame comes through as null.
This happens only when I use a where clause on newDF.
Am I missing something? Is it not possible to use another variable / method inside a UDF method?
Or do I have to do something with broadcast? Currently I am running this locally, not on the cluster.
"Is it not possible to use another variable / method inside UDF method?"
It is possible if and only if that variable / method can be serialized: a UDF is a closure that must be serialized and distributed to the executors.
A DataFrame cannot be meaningfully serialized (it's a pointer to other distributed data, so there's no logical way to serialize it without collecting it into driver memory), so it appears as null when you try to use it inside the UDF.
You're probably going to need to join the two DataFrames on some key, and then use a UDF (or a standard transformation) that takes columns from the joined DataFrame.
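The join-then-UDF suggestion could look roughly like the sketch below. The column names key, value, and label, the helper enrich, and the string-combining UDF are all hypothetical, since the original post shows neither DataFrame's schema nor what somemethodName does.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object JoinThenUdf {
  // Joins the second DataFrame in first, so the UDF only ever sees plain
  // column values and never has to capture a DataFrame in its closure.
  def enrich(df: DataFrame, lookup: DataFrame): DataFrame = {
    // Hypothetical stand-in for the logic of somemethodName.
    val combine = udf((value: String, label: String) => s"$value:$label")
    df.join(lookup, Seq("key")) // inner join on the assumed shared column "key"
      .withColumn("columnnew", combine(col("value"), col("label")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-then-udf").getOrCreate()
    import spark.implicits._
    val df = Seq(("a", "x"), ("b", "y")).toDF("key", "value")
    val lookup = Seq(("a", "L1")).toDF("key", "label")
    enrich(df, lookup).show()
    spark.stop()
  }
}
```

Because the lookup values arrive as an ordinary column, the closure shipped to the executors contains only the small function, which serializes without trouble.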

Applying function to Spark Dataframe Column

Coming from R, I am used to easily doing operations on columns. Is there an easy way to take this function that I've written in Scala
def round_tenths_place(un_rounded: Double): Double = {
  // Round half up to one decimal place
  BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
}
and apply it to one column of a DataFrame, kind of like what I hoped this would do:
bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")) )
I haven't found an easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the DataFrame to an RDD, selecting the right field from the RDD of Rows, and mapping the function across all of the values, yeah? And also something more succinct than creating a SQL table and then doing this with a Spark SQL UDF?
You can define a UDF as follows:
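import org.apache.spark.sql.functions.udf
// Wrap the Scala method as a UDF so it can be applied to a Column
val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn(
  "bid_price_bucket", round_tenths_place_udf($"bid_price"))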
although the built-in round function uses the same HALF_UP logic as your function and should be more than enough, not to mention much more efficient:
import org.apache.spark.sql.functions.round
bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
See also the following:
Updating a dataframe column in spark
How to apply a function to a column of a Spark DataFrame?

Spark/Scala flatten and flatMap is not working on DataFrame

I have a DataFrame containing three DataFrames of the same type (same Parquet schema). They differ only in the content/values they contain.
I want to flatten the structure so that the three DataFrames are merged into one single Parquet DataFrame containing all of the content/values.
I tried it with flatten and flatMap, but I always get these errors:
Error: No implicit view available from org.apache.spark.sql.DataFrame => Traversable[U].
parquetFiles.flatten
Error: not enough arguments for method flatten: (implicit asTrav: org.apache.spark.sql.DataFrame => Traversable[U], implicit m: scala.reflect.ClassTag[U]). Unspecified value parameters asTrav, m.
parquetFiles.flatten
I also converted it to a List and then tried to flatten and this is also producing the same error.
Do you have any idea how to solve it or what is the problem here?
Thanks, Alex
The Scala compiler is looking for a way to convert the DataFrames to a Traversable so it can apply flatten. But a DataFrame is not a Traversable, so it fails. Also, no ClassTag is available because DataFrames are not statically typed.
The code you're looking for is
parquetFiles.reduce(_ unionAll _)
which can be optimized by the DataFrame execution engine.
It seems you want to merge these three DataFrames into one; the unionAll function works well for that. You could do parquetFiles.reduce((x, y) => x.unionAll(y)) (note that this will throw on an empty list; if that can happen, use one of the folds instead of reduce).
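A minimal sketch of an empty-safe variant follows. It uses reduceOption rather than a fold, because a fold would need an empty DataFrame with the right schema as its starting value; that swap, the helper name unionAllOption, and the sample data are assumptions and not part of the answers above. It also assumes Spark 2.0 or later for SparkSession, where union is available as well as unionAll.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SafeUnion {
  // Returns None for an empty input instead of throwing, unlike reduce.
  def unionAllOption(frames: Seq[DataFrame]): Option[DataFrame] =
    frames.reduceOption(_ unionAll _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("safe-union").getOrCreate()
    import spark.implicits._
    // Three small DataFrames with the same schema, standing in for parquetFiles.
    val parts = Seq(Seq(1, 2).toDF("n"), Seq(3).toDF("n"), Seq(4, 5).toDF("n"))
    unionAllOption(parts).foreach(_.show())
    // An empty input yields None rather than an exception.
    println(unionAllOption(Seq.empty).isDefined)
    spark.stop()
  }
}
```

The caller then decides what an empty input means, for example by falling back to a default with getOrElse.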