I have a Spark SQL DataFrame on which I am trying to call a UDF that I created using Spark SQL's udf function:
val udfName = udf(somemethodName)
val newDF = df.withColumn("columnnew", udfName(col("anotherDFColumn")))
I'm trying to use another DataFrame stored as a val inside somemethodName, but that DataFrame comes through as null.
This happens only when I use a where clause on newDF.
Am I missing something? Is it not possible to use another variable / method inside a UDF method?
Or do I have to do something with broadcast? Currently I am running this locally, not on a cluster.
Is it not possible to use another variable / method inside a UDF method?
It is possible if and only if that variable / method can be serialized - a UDF is a closure that must be serialized and distributed to executors.
A DataFrame cannot be serialized (it's a pointer to other distributed data, so there's no logical way to serialize it without collecting it into driver memory), therefore it appears as null when you try to use it inside the UDF.
You're probably going to need to join the two dataframes on some key, and then use a UDF (or a standard transformation) that takes columns from the joined Dataframe.
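For example, a minimal sketch of that join-based approach (otherDF, the join key "key" and the column "lookupValue" are hypothetical names, not taken from your code):

// import org.apache.spark.sql.functions.{udf, col}
// otherDF is the DataFrame you originally tried to reference inside the UDF;
// both DataFrames are assumed to share a "key" column.
val joined = df.join(otherDF, Seq("key"), "left")

// The UDF now only sees plain column values - nothing that needs a DataFrame.
val combine = udf((a: String, b: String) => s"$a-$b")
val newDF = joined.withColumn("columnnew", combine(col("anotherDFColumn"), col("lookupValue")))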
Context
I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.
The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:
s3://mybucket/transformed/year=2021/month=11/day=02/*.snappy.parquet
Each partition can have upwards of 100 parquet files, each between 50 MB and 500 MB in size.
Inputs
We are given a Spark Dataset[MyClass] called filesToModify which has 2 columns:
s3path: String = the complete s3 path to a parquet file in s3 that needs to be edited
ids: Set[String] = a set of IDs (rows) that need to be deleted in the parquet file located at s3path
Example input dataset filesToModify:
s3path                                                                      ids
s3://mybucket/transformed/year=2021/month=11/day=02/part-1.snappy.parquet  Set("a", "b")
s3://mybucket/transformed/year=2021/month=11/day=02/part-2.snappy.parquet  Set("b")
Expected Behaviour
Given filesToModify, I want to take advantage of parallelism in Spark and do the following for each row:
1. Load the parquet file located at row.s3path
2. Filter it so that we exclude any row whose id is in the set row.ids
3. Count the number of deleted/excluded rows per id in row.ids (optional)
4. Save the filtered data back to the same row.s3path to overwrite the file
5. Return the number of deleted rows (optional)
What I have tried
I have tried using filesToModify.map(row => deleteIDs(row.s3path, row.ids)) where deleteIDs looks like this:
def deleteIDs(s3path: String, ids: Set[String]): Int = {
  import spark.implicits._
  val data = spark
    .read
    .parquet(s3path)
    .as[DataModel]
  val clean = data
    .filter(not(col("id").isInCollection(ids)))
  // write to a temp directory and then upload to s3 with same
  // prefix as original file to overwrite it
  writeToSingleFile(clean, s3path)
  1 // dummy output for simplicity (otherwise it should correspond to the number of deleted rows)
}
However this leads to a NullPointerException when executed within the map operation. If I execute it alone, outside of the map block, it works, but I can't understand why it doesn't work inside it (something to do with lazy evaluation?).
You get a NullPointerException because you try to retrieve your SparkSession from an executor.
It is not explicit, but to perform a Spark action, your deleteIDs function needs to retrieve the active SparkSession. To do so, it calls the method getActiveSession on the SparkSession object. But when called from an executor, getActiveSession returns None, as stated in SparkSession's source code:
Returns the default SparkSession that is returned by the builder.
Note: Return None, when calling this function on executors
And thus a NullPointerException is thrown when your code starts using this None Spark session.
More generally, you can't create a dataset or use Spark transformations/actions inside transformations of another dataset.
So I see two solutions for your problem:
either rewrite the deleteIDs function without using Spark, and modify your parquet files by using parquet4s, for instance;
or transform filesToModify into a Scala collection and use Scala's map instead of Spark's (see the sketch below).
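A minimal sketch of the second option, assuming filesToModify is small enough to be collected to the driver (field names as in MyClass above):

// Bring the list of files to edit onto the driver.
val localFiles: Seq[MyClass] = filesToModify.collect().toSeq

// Plain Scala map: deleteIDs now runs on the driver, where the active
// SparkSession is available, so spark.read.parquet works inside it.
val deletedCounts: Seq[Int] = localFiles.map(row => deleteIDs(row.s3path, row.ids))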
The s3path and ids parameters that are passed to deleteIDs are not actually strings and sets respectively; they are columns instead.
In order to operate on these values you can instead create a UDF that accepts columns instead of intrinsic types, or you can collect your dataset if it is small enough so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you seek to take advantage of Spark's parallelism.
You can read about UDFs here
From PySpark, I am trying to define a custom aggregator that accumulates state. Is it possible in Spark 2.3?
AFAIK, it is now possible to define a custom UDAF in PySpark since Spark 2.3 (cf. How to define and use a User-Defined Aggregate Function in Spark SQL?), by calling pandas_udf with the PandasUDFType.GROUPED_AGG keyword. However, given that it just takes a function as a parameter, I don't think it is possible to carry state around during the aggregation.
From Scala, I see it is possible to have stateful aggregation by either extending UserDefinedAggregateFunction or org.apache.spark.sql.expressions.Aggregator, but is there a similar thing I can do on the Python side only?
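For reference, the Scala-side Aggregator approach mentioned above looks roughly like this (a sketch of a trivial sum aggregator, not related to my actual use case):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// The state (the running sum) lives in the buffer type, here a Long.
object SumAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L                                     // initial state
  def reduce(buf: Long, value: Long): Long = buf + value  // fold one input into the state
  def merge(b1: Long, b2: Long): Long = b1 + b2           // combine partial states
  def finish(buf: Long): Long = buf                       // final result from the state
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
// Usable as a typed column: ds.select(SumAgg.toColumn)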
You could use an accumulator.
You could leverage Spark Streaming's built-in state management.
A simple accumulator example for use in SQL:
from pyspark.sql.types import IntegerType
# have some data
df = spark.range(10).toDF("num")
# have a table
df.createOrReplaceTempView("num_table")
# have an accumulator
accSum = sc.accumulator(0)
# have a function that accumulates
def add_acc(int_val):
    accSum.add(int_val)
    return int_val
# register function as udf
spark.udf.register("reg_addacc", add_acc, IntegerType())
# use in sql
spark.sql("SELECT sum(reg_addacc(num)) FROM num_table").show()
# get value from accumulator
print(accSum.value)
45
I am trying to do some transformations on a dataset with Spark using Scala. I am currently using Spark SQL but want to shift the code to native Scala code. I want to know whether to use filter or map for operations like matching the values in a column and getting a single column, after the transformation, into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL; can someone tell me an alternative way to write the same using map or filter on the dataset, and also which one is faster when compared?
You can read the documentation on the Apache Spark website. The Scala API documentation is at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
// assumes spark-shell (or import spark.implicits._) so that toDF and the map encoder are available
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")

// filter: keep only rows where col1 > 1
val df1 = df.filter("col1 > 1")
df1.show()

// map: turn each remaining row into a single value (col1 + 3), giving a Dataset[Int]
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and filters rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as with your SQL. Use the dataFrame.explain(true) method to understand what Spark will do.
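For example, a quick way to compare the two plans (a sketch; assumes the table TABLE is registered and import spark.implicits._ is in scope for the $ syntax):

// Both variants should produce the same physical plan.
spark.sql("SELECT * FROM TABLE WHERE COLUMN = ''").explain(true)
spark.read.table("TABLE").where($"COLUMN".eqNullSafe("")).explain(true)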
What I'm trying to achieve is to execute Scala code, convert the resulting Scala RDD[Row] to a PySpark RDD of Rows, perform some Python operations, and convert the RDD of PySpark Rows back to a Scala RDD[Row].
To get from the Scala RDD to a PySpark RDD I'm doing this:
In Scala I have this method
import org.apache.spark.sql.execution.python.EvaluatePython.{javaToPython, toJava}
def toPythonRDD(rdd: RDD[Row]): JavaRDD[Array[Byte]] = {
  javaToPython(rdd.map(r => toJava(r, r.schema)))
}
Later, in PySpark, I create a new RDD by calling
RDD(jrdd, sc, BatchedSerializer(PickleSerializer()))
I end up with an RDD of PySpark Rows. I'd like to reverse that process.
I can easily get Scala's JavaRDD[Array[Byte]] by accessing rdd._jrdd. My main problem is that I don't know how to convert/unpickle it back to RDD[Row].
I've tried
sc._jvm.SerDe.pythonToJava(rdd._to_java_object_rdd(), True)
and
sc._jvm.SerDe.pythonToJava(rdd._jrdd, True)
both crash with a similar exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
I know that I can easily pass a DF back and forth between Scala and Python, but my records don't have a uniform schema. I'm using an RDD of Rows because I thought there would already be a pickler I could reuse, and it works, but so far only in one direction.
I understand that one can convert an RDD to a Dataset using rdd.toDS. However there also exists rdd.toDF. Is there really any benefit of one over the other?
After playing with the Dataset API for a day, I find that almost any operation takes me back to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find that another conversion to a Dataset is needed, because something brought me back to a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations? Or is there a benefit to using toDS earlier?
Here is a small concrete example
spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId
  .map[(Integer, String, Double)] { case (row, id) => ... }
  .toDS // now with a Dataset API (should use toDF here?)
  .withColumnRenamed("_1", "id")   // now back to a DataFrame, not type safe :(
  .withColumnRenamed("_2", "text")
  .withColumnRenamed("_3", "overall")
  .as[ParsedReview] // back to a Dataset
Michael Armbrust nicely explained the shift to Datasets and DataFrames and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs converged into one, with slight differences.
"A DataFrame is just a Dataset of generic Row objects. When you don't know all the fields, a DataFrame is the answer."