PySpark: apply multiple groupBy UDFs

I am trying to call 2 UDFs within the same groupBy call.
I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns.
I have another that takes just one feature and returns a single value.
Is there a way to run both of them in the same groupBy? I run the first UDF with applyInPandas, but I can't find a way to run any other function alongside it.
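No answer is recorded here, but a minimal sketch of one common workaround (assuming Spark 3.x and made-up column names group_key and feature) is to fold the single-value computation into the same function passed to applyInPandas, so that one grouped pass produces every column:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one key column and one numeric feature column.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group_key", "feature"]
)

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # One row per group with several columns; the "single value" UDF's
    # result is simply added as one more column.
    return pd.DataFrame({
        "group_key": [pdf["group_key"].iloc[0]],
        "mean_feature": [pdf["feature"].mean()],
        "max_feature": [pdf["feature"].max()],
    })

result = df.groupBy("group_key").applyInPandas(
    summarize,
    schema="group_key string, mean_feature double, max_feature double",
)
result.show()
An alternative is to compute the scalar aggregation with a plain groupBy().agg() and join it back onto the applyInPandas output on the group key.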

Related

Split row of tuples into two rows in RDD

I am trying to split a tuple of ints into two rows in an RDD.
vertices=edges.map(lambda x:(x[0],)).union(edges.map(lambda x:(x[1],))).distinct()
I tried this code and it works, but I want code with a shorter runtime, without using the GraphFrames package.
You can use flatMap:
edges.flatMap(lambda x: x).distinct()
In Scala, where a tuple is not itself iterable, you would need something like .flatMap { case (a, b) => Seq(a, b) } rather than .flatMap(identity).
If you use the DataFrame API, you can just use explode on your only column, e.g. df.select(explode("edge")).
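For context, a small self-contained sketch of both suggestions, using a made-up edge list:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical edges as (src, dst) tuples.
edges = sc.parallelize([(1, 2), (2, 3), (1, 3)])

# RDD approach: flatten each tuple into its elements, then deduplicate.
vertices = edges.flatMap(lambda x: x).distinct()
print(sorted(vertices.collect()))  # [1, 2, 3]

# DataFrame approach: store each edge as an array column and explode it.
edges_df = spark.createDataFrame([([1, 2],), ([2, 3],), ([1, 3],)], ["edge"])
edges_df.select(explode("edge").alias("vertex")).distinct().show()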

pyspark - difference between select and agg

What is the difference between the following two:
df.select(min("salary")).show()
and
df.agg({'salary':'min'}).show()
Also, what is the difference between these two:
df.groupBy("department").agg({'salary':'min'}).show()
and
df.groupBy("department").min('salary').show()
In Spark, there are many different ways to write the same thing. It mostly depends on whether you prefer a SQL style or a Python style of writing.
df.select(min("salary")) is the equivalent of the SQL:
select min(salary) from df
This query computes the min of the column salary without any group by clause.
It is equivalent to
from pyspark.sql import functions as F
df.groupBy().agg(F.min("salary"))
# OR
df.groupBy().agg({'salary':'min'})
As you can see, the groupBy is empty, so you do not group by anything. Spark can also interpret the dict {'salary':'min'}, which is equivalent to the function F.min("salary").
The method agg depends on the object. Applied to a DataFrame, it is equivalent to df.groupBy().agg. agg is also a method of the GroupedData object that is created when you do df.groupBy(); the official documentation shows the difference between the two methods.
When writing df.groupBy("department").agg({'salary':'min'}), you can specify several different aggregations inside agg. When using just min, you are limited to that single aggregation function. For example, you can do this:
from pyspark.sql import functions as F
df.groupBy("department").agg(F.min("salary"), F.max("age"))
# OR
df.groupBy("department").agg({'salary':'min', 'age':'max'})
# But you cannot do
df.groupBy("department").min("salary").max("age")
>> AttributeError: 'DataFrame' object has no attribute 'max'
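As a small follow-up, the function-based form also lets you name each output column with alias (the dict form keeps the auto-generated names such as min(salary)):
from pyspark.sql import functions as F

df.groupBy("department").agg(
    F.min("salary").alias("min_salary"),
    F.max("age").alias("max_age"),
).show()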

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a Scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get the latest date, for which I currently use the following code:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem (when it runs as a job, it appears it does not aggregate the max over the whole dataset, even though it looks perfectly fine when running in a notebook), but I don't understand how it should be used.
I found an implementation like the following, but I could not figure out how it should be used:
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded into the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps. My question now is: how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark DataFrame (not a plain Scala collection).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since DataFrames are immutable, you need to assign the result of the select to another variable or chain the show after the select().
This will return a dataframe with just one row holding the max value of the "updated" column.
To answer to your question:
So the idea is that we should apply the aggregation on the whole dataset loaded into the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps
When you select on a dataframe, Spark selects data from the whole dataset; there is not a partitioned version and a driver version. Spark shards your data across your cluster, and all the operations you define are performed on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark dataframe into an array (which is not distributed), and that array lives on the driver node. Bear in mind that if your dataframe's size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("updated")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe, because select(max()) returns just one row.
Take some time to read more about Spark DataFrames/RDDs and the difference between transformations and actions.
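For illustration only, the same pattern in PySpark syntax (the question is Scala, but the idea carries over): aggregate down to a single row, collect that one row, and read the value out of it.
from pyspark.sql import functions as F

# The aggregate produces exactly one row, so collect() is cheap and safe here.
row = df.agg(F.max("updated")).collect()[0]
latest = row[0]  # the latest timestamp as a plain Python value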
It sounds weird. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers on this topic:
How to get the last row from DataFrame?

Using min/max operations in groupByKey on a spark dataset

I am trying to compute min and max inside the agg of a groupByKey operation. The code looks like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
count => typedCount,
sum => typedSum }
inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
.agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
Instead of sum, I want to take min/max on a few columns. When I try to use the org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
The agg function lets us specify at most 4 columns inside it, while I have 8 columns to aggregate. How can this be achieved?
Unfortunately it seems that:
min/max are not yet supported (see the "todos" in typed.scala)
the agg function indeed only supports up to 4 columns (see KeyValueGroupedDataset.scala)
In your case a reasonable thing to do might be to define your own specialized aggregator that would aggregate InputFlowStatistics objects, so you only have a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.

Calling other methods/variables inside a UDF method in Spark SQL DataFrame

I have a Spark SQL DF, in which I am trying to call a UDF that I created using the Spark SQL udf function.
val udfName = udf(somemethodName)
val newDF = df.withColumn("columnnew", udfName(col("anotherDFColumn")))
I'm trying to use another DF, stored as a val, inside somemethodName, but that DF comes through as null.
This happens only when I use a where clause on the newDF.
Am I missing something? Is it not possible to use another variable / method inside a UDF?
Or do I have to do something with broadcast? Currently I am running this locally, not on a cluster.
Is it not possible to use another variable / method inside a UDF?
It is possible if and only if that variable / method can be serialized - a UDF is a closure that must be serialized and distributed to executors.
A DataFrame cannot be serialized (it's a pointer to other distributed data, so there's no logical way to serialize it without collecting it into driver memory), so it appears as null when you try to use it inside the UDF.
You're probably going to need to join the two dataframes on some key, and then use a UDF (or a standard transformation) that takes columns from the joined Dataframe.
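To make the join suggestion concrete, a minimal PySpark sketch (the question uses Scala, and the DataFrame and column names here are made up): instead of reading the second DataFrame inside the UDF, join it in first and let the UDF work only on plain column values.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical lookup DataFrame that the UDF previously tried to read directly.
lookup_df = spark.createDataFrame([("a", "Alpha"), ("b", "Beta")], ["key", "label"])

# Join on a shared key, then derive the new column from the joined data.
joined = df.join(lookup_df, on="key", how="left")

# A UDF can still be used if needed; it now only receives column values.
decorate = F.udf(lambda label: f"label={label}", StringType())
new_df = joined.withColumn("columnnew", decorate(F.col("label")))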