Using min/max operations in groupByKey on a spark dataset - scala

I am trying to compute min and max inside the agg of a groupByKey operation. The code looks like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
  count => typedCount,
  sum => typedSum }

inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
  .agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
       typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
Instead of sum I want to take the min/max of a few columns. When I try to use the org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
The agg function only accepts up to 4 typed columns, while I have 8 columns to aggregate. How can this be achieved?

Unfortunately it seems that:
min/max are not yet supported (see "todos" in typed.scala)
the agg function indeed only supports up to 4 columns (see KeyValueGroupedDataset.scala)
In your case a reasonable thing to do might be to define your own specialized aggregator that aggregates InputFlowStatistics objects, so you only have a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.
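For illustration, here is a rough sketch of such a single aggregator. It assumes the eight statistics are all Long values (adjust the types to match your real InputFlowRecordV1 / FlowStatistics fields); the FlowStats case class and the aggregator name are made up:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// hypothetical buffer/result type holding all eight aggregated statistics
case class FlowStats(
    minFlowTime: Long, maxFlowTime: Long,
    flowStartedCount: Long, flowEndedCount: Long,
    packetsCountFromSource: Long, bytesCountFromSource: Long,
    packetsCountFromDestination: Long, bytesCountFromDestination: Long)

object FlowStatsAggregator extends Aggregator[InputFlowRecordV1, FlowStats, FlowStats] {
  // neutral element: min starts at MaxValue, max at MinValue, sums at 0
  def zero: FlowStats = FlowStats(Long.MaxValue, Long.MinValue, 0L, 0L, 0L, 0L, 0L, 0L)

  def reduce(b: FlowStats, r: InputFlowRecordV1): FlowStats = {
    val s = r.FlowStatistics
    merge(b, FlowStats(
      s.minFlowTime, s.maxFlowTime,
      s.flowStartedCount, s.flowEndedCount,
      s.packetsCountFromSource, s.bytesCountFromSource,
      s.packetsCountFromDestination, s.bytesCountFromDestination))
  }

  def merge(a: FlowStats, b: FlowStats): FlowStats = FlowStats(
    math.min(a.minFlowTime, b.minFlowTime),
    math.max(a.maxFlowTime, b.maxFlowTime),
    a.flowStartedCount + b.flowStartedCount,
    a.flowEndedCount + b.flowEndedCount,
    a.packetsCountFromSource + b.packetsCountFromSource,
    a.bytesCountFromSource + b.bytesCountFromSource,
    a.packetsCountFromDestination + b.packetsCountFromDestination,
    a.bytesCountFromDestination + b.bytesCountFromDestination)

  def finish(b: FlowStats): FlowStats = b
  def bufferEncoder: Encoder[FlowStats] = Encoders.product[FlowStats]
  def outputEncoder: Encoder[FlowStats] = Encoders.product[FlowStats]
}

// a single TypedColumn, so agg only needs one argument
inputFlowRecords
  .groupByKey(_.FlowInformation)
  .agg(FlowStatsAggregator.toColumn.name("flowStats"))

Because everything is folded into one TypedColumn, both the missing typed min/max and the 4-column limit are worked around at once.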

Related

pyspark - difference between select and agg

What is the difference between the following two -
df.select(min("salary")).show()
and
df.agg({'salary':'min'}).show()
Also, what is the difference in these two -
df.groupBy("department").agg({'salary':'min'}).show()
and
df.groupBy("Company").min('salary').show()
In Spark, there are many different ways to write the same thing. It mostly depends on whether you prefer a SQL style or a Python style of writing.
df.select(min("salary")) is the equivalent of the SQL:
select min(salary) from df
This query computes the min of the column salary without any group by clause.
It is equivalent to
from pyspark.sql import functions as F
df.groupBy().agg(F.min("salary"))
# OR
df.groupBy().agg({'salary':'min'})
As you can see, the groupBy is empty, so you do not group by anything. Python can also interpret the dict {'salary':'min'}, which is equivalent to the function F.min("salary").
The method agg depends on the object it is called on. Applied to a DataFrame, it is equivalent to df.groupBy().agg. agg is also a method of the GroupedData object, which is created when you call df.groupBy(). I added the link to the official doc where you can see the difference between the two methods.
When writing df.groupBy("department").agg({'salary':'min'}), you can specify several different aggregations inside the agg method. When using just min, you are limited to one column. For example, you can do this:
from pyspark.sql import functions as F
df.groupBy("department").agg(F.min("salary"), F.max("age"))
# OR
df.groupBy("department").agg({'salary':'min', 'age':'max'})
# But you cannot do
df.groupBy("department").min("salary").max("age")
>> AttributeError: 'DataFrame' object has no attribute 'max'

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem (when it runs as a job it does not appear to aggregate the max over the whole dataset, although it looks perfectly fine when running in a notebook), but I don't understand how it should be used.
I found an implementation like the following, but I could not figure out how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps. My question now is, how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark DataFrame (not a plain Scala collection).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since DataFrames are immutable, you need to assign the result of the select to another variable or add the show() after the select.
This will return a dataframe with just one row containing the max value of the "updated" column.
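For example, a minimal sketch of the two options just mentioned (chaining show() directly, or keeping the result in a new value):
df.select(max("updated")).show()
// or keep the resulting one-row dataframe around
val latestDf = df.select(max("updated"))
latestDf.show()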
To answer your question:
So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps
When you select on a dataframe, Spark selects data from the whole dataset; there is no separate "partitioned version" and "driver version". Spark shards your data across your cluster, and all the operations that you define are done on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark DataFrame into an array (which is not distributed), and that array lives on the driver node. Bear in mind that if your DataFrame size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("updated")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe because select(max(...)) returns just one row.
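If you want the value itself rather than a Row, a minimal sketch (assuming "updated" is a TimestampType column):
import org.apache.spark.sql.functions.max

val latest: java.sql.Timestamp = df
  .agg(max("updated"))
  .collect()
  .head
  .getTimestamp(0) // the single result row holds one column: the max value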
Take some time to read more about Spark DataFrames/RDDs and the difference between transformations and actions.
It sounds weird. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers on this topic:
How to get the last row from DataFrame?
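The usual approach from that linked question is to sort and take the first row, which still only brings a single row to the driver. A sketch, assuming the id and updated columns from the question:
import org.apache.spark.sql.functions.col

val lastRow = df.orderBy(col("updated").desc).head()
val latestTs = lastRow.getAs[java.sql.Timestamp]("updated")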

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, and some of them expect mllib vectors while some expect ml vectors. Also, I have to feed the output of one algorithm to another, and each uses different types.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the dataset at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert an mllib vector to an ml vector. A UDF and usage example can look like this:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.
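Alternatively, and closer to the method named in the question title, MLUtils.convertVectorColumnsToML can be applied to the DataFrame directly rather than inside a UDF. Note that it converts the column in place (replacing it) instead of adding a second column, so this sketch only fits if you do not need to keep the mllib version around:
import org.apache.spark.mllib.util.MLUtils

// converts the listed mllib vector columns to ml vectors, replacing them in the result
val converted = MLUtils.convertVectorColumnsToML(df, "mllibVector")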

How to pass a group of RelationalGroupedDataset to a function?

I am reading a CSV as a DataFrame like below:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("D:/ModelData.csv")
Then I group by three columns as below, which returns a RelationalGroupedDataset:
df.groupBy("col1", "col2","col3")
And I want each grouped dataframe to be sent through the function below:
def ModelFunction(daf: DataFrame) = {
//do some calculation
}
For example, if col1 has 2 unique values (0, 1), col2 has 2 unique values (1, 2) and col3 has 3 unique values (1, 2, 3), then I would like to pass each combination's grouping to the model function. For col1=0, col2=1, col3=1 I will have a dataframe, and I want to pass that to ModelFunction, and so on for each combination of the three columns.
I tried
df.groupBy("col1", "col2","col3").ModelFunction();
But it throws an error.
Any help is appreciated.
The short answer is that you cannot do that. You can only apply aggregate functions to a RelationalGroupedDataset (either ones you write as a UDAF or the built-in ones in org.apache.spark.sql.functions).
The way I see it you have several options:
Option 1: The amount of data for each unique combination is small enough and not skewed too much compared to other combinations.
In this case you can do:
val grouped = df.groupBy("col1", "col2", "col3").agg(collect_list(struct(/* all other columns */)))
grouped.as[/* some case class representing the data, including the combination */].map(/* your own logistic regression function */)
Option 2: If the total number of combinations is small enough you can do:
val values = df.select("col1", "col2", "col3").distinct().collect()
and then loop through them, creating a new dataframe for each combination with a filter, as sketched below.
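A rough sketch of this option, reusing the hypothetical col1/col2/col3 names and the ModelFunction from the question:
import org.apache.spark.sql.functions.col

val combinations = df.select("col1", "col2", "col3").distinct().collect()

combinations.foreach { row =>
  // build the slice of df matching this particular combination and hand it to the model
  val slice = df.filter(
    col("col1") === row.get(0) &&
    col("col2") === row.get(1) &&
    col("col3") === row.get(2))
  ModelFunction(slice)
}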
Option 3: Write your own UDAF
This would probably not be good enough, as the data comes in as a stream without the ability to do iteration; however, if you have an implementation of logistic regression that fits this model, you can try to write a UDAF to do it. See for example: How to define and use a User-Defined Aggregate Function in Spark SQL?

How to do custom operations on GroupedData in Spark?

I want to rewrite some of my code written with RDDs to use DataFrames. It was working quite smoothly until I found this:
events
.keyBy(row => (row.getServiceId + row.getClientCreateTimestamp + row.getClientId, row) )
.reduceByKey((e1, e2) => if(e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
.values
it is simple to start with
events
.groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
but what's next? What if I'd like to iterate over every element in the current group? Is it even possible?
Thanks in advance.
GroupedData cannot be used directly. The data is not physically grouped; grouping is just a logical operation. You have to apply some variant of the agg method, for example:
events
.groupBy($"service_id", $"client_create_timestamp", $"client_id")
.min("client_send_timestamp")
or
events
.groupBy($"service_id", $"client_create_timestamp", $"client_id")
.agg(min($"client_send_timestamp"))
where client_send_timestamp is a column you want to aggregate.
If you want to keep information other than the aggregate, either join the result back or use window functions - see Find maximum row per group in Spark DataFrame
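For the window-function route, a minimal sketch that keeps the full row(s) holding the earliest client_send_timestamp per group:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

val w = Window.partitionBy(col("service_id"), col("client_create_timestamp"), col("client_id"))

val earliest = events
  .withColumn("min_ts", min(col("client_send_timestamp")).over(w))
  .where(col("client_send_timestamp") === col("min_ts"))
  .drop("min_ts")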
Spark also supports User Defined Aggregate Functions - see How to define and use a User-Defined Aggregate Function in Spark SQL?
Spark 2.0+
You could use Dataset.groupByKey which exposes groups as an iterator.
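A minimal sketch of that Dataset route, assuming the events columns can be mapped onto a case class like the one below (the field names must match the column names; the Long timestamps are an assumption):
case class Event(
    service_id: String,
    client_create_timestamp: Long,
    client_id: String,
    client_send_timestamp: Long)

import spark.implicits._ // assumes an existing SparkSession named spark

val earliestPerKey = events.as[Event]
  .groupByKey(e => (e.service_id, e.client_create_timestamp, e.client_id))
  .reduceGroups((e1, e2) =>
    if (e1.client_send_timestamp <= e2.client_send_timestamp) e1 else e2)
  .map(_._2) // reduceGroups yields (key, value) pairs; keep only the reduced rows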