I'm trying to implement a custom UDAF in Spark that needs distinct values from the input.
Does Spark provide an interface I can use to declare that I need the input as distinct values?
If so, how can this be implemented?
I am trying to call two UDFs within the same groupBy.
I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns.
I have another that takes just one feature and returns a single value.
Is there a way to run both of them in the same groupBy? I run the first UDF with applyInPandas, but I can't find a way to run any other function alongside it.
I need to replace null values only in selected columns of a DataFrame. I know there is the df.na.fill option. How can I apply it only to selected columns, or is there a better option than df.na.fill?
According to the Spark documentation, fill is well suited to your need. You can do something like:
df.na.fill(0, Seq("colA", "colB"))
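For PySpark users, the same idea can be sketched as below. This is a minimal sketch, not part of the original answer: the helper names fill_selected and fill_per_column are hypothetical, and df is assumed to be an existing DataFrame. It relies on DataFrame.na.fill accepting either a subset of column names or a dict that maps column names to replacement values.

```python
def fill_selected(df, value, columns):
    """Replace nulls with `value`, but only in the listed columns."""
    # `subset` restricts the fill to these columns; all others are left untouched.
    return df.na.fill(value, subset=list(columns))


def fill_per_column(df, defaults):
    """Replace nulls with a different default per column, e.g. {"colA": 0, "colB": "n/a"}."""
    # With a dict, the keys select the columns, so no subset is needed.
    return df.na.fill(dict(defaults))
```

For example, fill_selected(df, 0, ["colA", "colB"]) mirrors the Scala call above. Note that fill skips columns whose type does not match the replacement value.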
From PySpark, I am trying to define a custom aggregator that accumulates state. Is it possible in Spark 2.3?
AFAIK, it is now possible to define a custom UDAF in PySpark since Spark 2.3 (cf. How to define and use a User-Defined Aggregate Function in Spark SQL?), by calling pandas_udf with the PandasUDFType.GROUPED_AGG keyword. However, given that it just takes a function as a parameter, I don't think it is possible to carry state around during the aggregation.
From Scala, I see it is possible to have stateful aggregation by extending either UserDefinedAggregateFunction or org.apache.spark.sql.expressions.Aggregator, but is there something similar I can do on the Python side only?
You could use an accumulator.
You could leverage spark streaming built-in state management.
Simple accumulator example for use in SQL:
from pyspark.sql.types import IntegerType
# have some data
df = spark.range(10).toDF("num")
# have a table
df.createOrReplaceTempView("num_table")
# have an accumulator
accSum = sc.accumulator(0)
# have a function that accumulates
def add_acc(int_val):
    accSum.add(int_val)
    return int_val
# register function as udf
spark.udf.register("reg_addacc", add_acc, IntegerType())
# use in sql
spark.sql("SELECT sum(reg_addacc(num)) FROM num_table").show()
# get value from accumulator
print(accSum.value)
45
I have a Spark SQL DataFrame on which I am trying to call a UDF that I created using the Spark SQL udf function.
val udfName = udf(somemethodName)
val newDF = df.withColumn("columnnew", udfName(col("anotherDFColumn")))
I'm trying to use another DataFrame, stored as a val, inside somemethodName, but the DataFrame comes through as null.
This happens only when I use a where clause on newDF.
Am I missing something? Is it not possible to use another variable / method inside a UDF method?
Or do I have to do something with broadcast? Currently I am running this locally, not on a cluster.
Is it not possible to use another variable / method inside UDF method
It is possible if and only if that variable / method can be serialized - a UDF is a closure that must be serialized and distributed to executors.
A DataFrame cannot be serialized (it's a pointer to other distributed data, so there's no logical way to serialize it without collecting it into driver memory), so it appears as null when you try to use it inside the UDF.
You're probably going to need to join the two dataframes on some key, and then use a UDF (or a standard transformation) that takes columns from the joined Dataframe.
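A rough PySpark sketch of the join approach described above. The function name attach_lookup and its parameters are hypothetical, and the join key is an assumption: the question does not say which column relates the two DataFrames.

```python
def attach_lookup(df, lookup_df, key):
    """Bring the columns of `lookup_df` alongside `df` so a UDF can take them as inputs."""
    # A left join keeps every row of `df`, with nulls where `lookup_df` has no match.
    return df.join(lookup_df, on=key, how="left")
```

After the join, the UDF receives the looked-up values as ordinary column arguments instead of reaching for a second DataFrame from inside its closure.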
I want to rewrite some of my code written with RDDs to use DataFrames. It was working quite smoothly until I found this:
events
.map(row => (row.getServiceId + row.getClientCreateTimestamp + row.getClientId, row))
.reduceByKey((e1, e2) => if(e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
.values
It is simple to start with:
events
.groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
but what's next? What if I'd like to iterate over every element in the current group? Is it even possible?
Thanks in advance.
GroupedData cannot be used directly. The data is not physically grouped; groupBy is just a logical operation. You have to apply some variant of the agg method, for example:
events
.groupBy($"service_id", $"client_create_timestamp", $"client_id")
.min("client_send_timestamp")
or
events
.groupBy($"service_id", $"client_create_timestamp", $"client_id")
.agg(min($"client_send_timestamp"))
where client_send_timestamp is a column you want to aggregate.
If you want to keep more information than the aggregate itself, join the result back to the original data or use window functions; see Find maximum row per group in Spark DataFrame.
Spark also supports User Defined Aggregate Functions - see How to define and use a User-Defined Aggregate Function in Spark SQL?
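To keep the whole earliest row per group, as the RDD version in the question does, the window-function route mentioned above could look like this in PySpark. This is a sketch under assumptions: the column names come from the question, the function name earliest_per_group and the helper column rn are made up for illustration, and events is assumed not to already have a column called rn. The window is written as a SQL expression so the snippet needs no imports.

```python
def earliest_per_group(events):
    """Keep, for each group, the row with the smallest client_send_timestamp."""
    ranked = events.selectExpr(
        "*",
        # Number the rows of each group from earliest to latest send time.
        "row_number() OVER ("
        "PARTITION BY service_id, client_create_timestamp, client_id "
        "ORDER BY client_send_timestamp) AS rn",
    )
    # Row 1 of each group is the earliest; drop the helper column afterwards.
    return ranked.where("rn = 1").drop("rn")
```

When two rows in a group share the same send timestamp, row_number picks one of them arbitrarily, so exactly one row per group is returned.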
Spark 2.0+
You could use Dataset.groupByKey which exposes groups as an iterator.