How to do custom operations on GroupedData in Spark? - scala

I want to rewrite some of my code written with RDDs to use DataFrames. It was working quite smoothly until I found this:
events
.keyBy(row => row.getServiceId + row.getClientCreateTimestamp + row.getClientId)
.reduceByKey((e1, e2) => if(e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
.values
It is simple to start with:
events
.groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
but what's next? What if I'd like to iterate over every element in the current group? Is it even possible?
Thanks in advance.

GroupedData cannot be used directly. The data is not physically grouped; groupBy is just a logical operation. You have to apply some variant of the agg method, for example:
events
.groupBy($"service_id", $"client_create_timestamp", $"client_id")
.min("client_send_timestamp")
or
events
.groupBy($"service_id", $"client_create_timestamp", $"client_id")
.agg(min($"client_send_timestamp"))
where client_send_timestamp is a column you want to aggregate.
If you want to keep more information than just the aggregate, either join back on the aggregated result or use window functions - see Find maximum row per group in Spark DataFrame
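For the window-function route, a minimal sketch (assuming events has the columns used above; on Spark 1.x window functions require a HiveContext) that reproduces the original reduceByKey semantics, i.e. keeps the whole row with the smallest client_send_timestamp per key:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Window over the same grouping keys, ordered by send timestamp
val w = Window
  .partitionBy($"service_id", $"client_create_timestamp", $"client_id")
  .orderBy($"client_send_timestamp")

// Keep the earliest row per group and drop the helper column
val earliest = events
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")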
Spark also supports User Defined Aggregate Functions - see How to define and use a User-Defined Aggregate Function in Spark SQL?
Spark 2.0+
You could use Dataset.groupByKey which exposes groups as an iterator.
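A sketch of that approach, assuming a case class whose fields mirror the event columns (field types are assumptions) and spark is your SparkSession:
import spark.implicits._

case class Event(
  service_id: String,
  client_create_timestamp: Long,
  client_id: String,
  client_send_timestamp: Long)

val earliest = events.as[Event]
  .groupByKey(e => (e.service_id, e.client_create_timestamp, e.client_id))
  .reduceGroups((e1, e2) => if (e1.client_send_timestamp <= e2.client_send_timestamp) e1 else e2)
  .map(_._2) // drop the key, keep the reduced event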

Related

how to insert the data from delta table to a variable in order to apply drools rule on them

I am using Spark with Scala, receiving streaming data from Event Hubs and storing it in a Delta table. In order to apply Drools rules to it, I need to pass the data through variables... I am stuck at getting the data from the Delta table into a variable.
It really depends on what data you need to pass to the Drools rules and what you need to return. You can either use:
A user-defined function (UDF) - you define a function that receives one or more parameters (the column values of specific rows). (more examples)
The map function of the Dataset / DataFrame class to process the whole Row (doc, and examples)
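For the UDF route, a minimal sketch (the column name "amount" and the rule body are placeholders, and df is assumed to be the DataFrame loaded from the Delta table as shown below; in practice the rule engine itself should be instantiated on the executors, e.g. inside mapPartitions, rather than serialized from the driver):
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical rule wrapped in a UDF; replace the body with a call into your rule logic
val ruleFired = udf { amount: Double => amount > 1000.0 }

val withRuleResult = df.withColumn("rule_fired", ruleFired(col("amount")))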
Delta Tables can be read into DataFrames. A variable can be assigned to point to the DataFrame.
val df = spark.read.format("delta").load("some/delta/path")
Once the Delta Table is read, you can apply your custom transformations:
val transformed_df = df.transform(first_transform).transform(second_transform)
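where each transform step is just a function from DataFrame to DataFrame, for example (the function name, column and value are placeholders):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Placeholder transformation: a plain function composes naturally with .transform
def first_transform(df: DataFrame): DataFrame =
  df.filter(col("event_type") === "ORDER")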
Hope this helps point you in the right direction.

Using min/max operations in groupByKey on a spark dataset

I am trying to achieve min and max inside agg of a groupByKey operation. The code looks like below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
count => typedCount,
sum => typedSum }
inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
.agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
Instead of sum I want to take min/max on a few columns. When I try to use org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
The agg function lets us specify at most 4 columns inside it, while I have 8 columns to aggregate. How can this be achieved?
Unfortunately it seems that:
min/max are not yet supported (see "todos" in typed.scala)
agg function indeed only supports up to 4 columns (see in KeyValueGroupedDataset.scala)
In your case a reasonable thing to do might be to define your own specialized aggregator that aggregates InputFlowStatistics objects, so you only have a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.
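A minimal sketch of such an aggregator, assuming the statistics fields are Longs and that the class names (InputFlowRecordV1 with its FlowInformation key and FlowStatistics fields) match your model; adjust types as needed:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Buffer/output type holding all eight aggregated values; field types are assumptions
case class FlowStatsAgg(
  minFlowTime: Long, maxFlowTime: Long,
  flowStartedCount: Long, flowEndedCount: Long,
  packetsCountFromSource: Long, bytesCountFromSource: Long,
  packetsCountFromDestination: Long, bytesCountFromDestination: Long)

object FlowStatsAggregator extends Aggregator[InputFlowRecordV1, FlowStatsAgg, FlowStatsAgg] {

  def zero: FlowStatsAgg =
    FlowStatsAgg(Long.MaxValue, Long.MinValue, 0L, 0L, 0L, 0L, 0L, 0L)

  def reduce(b: FlowStatsAgg, r: InputFlowRecordV1): FlowStatsAgg = {
    val s = r.FlowStatistics
    FlowStatsAgg(
      math.min(b.minFlowTime, s.minFlowTime),
      math.max(b.maxFlowTime, s.maxFlowTime),
      b.flowStartedCount + s.flowStartedCount,
      b.flowEndedCount + s.flowEndedCount,
      b.packetsCountFromSource + s.packetsCountFromSource,
      b.bytesCountFromSource + s.bytesCountFromSource,
      b.packetsCountFromDestination + s.packetsCountFromDestination,
      b.bytesCountFromDestination + s.bytesCountFromDestination)
  }

  def merge(b1: FlowStatsAgg, b2: FlowStatsAgg): FlowStatsAgg =
    FlowStatsAgg(
      math.min(b1.minFlowTime, b2.minFlowTime),
      math.max(b1.maxFlowTime, b2.maxFlowTime),
      b1.flowStartedCount + b2.flowStartedCount,
      b1.flowEndedCount + b2.flowEndedCount,
      b1.packetsCountFromSource + b2.packetsCountFromSource,
      b1.bytesCountFromSource + b2.bytesCountFromSource,
      b1.packetsCountFromDestination + b2.packetsCountFromDestination,
      b1.bytesCountFromDestination + b2.bytesCountFromDestination)

  def finish(b: FlowStatsAgg): FlowStatsAgg = b

  def bufferEncoder: Encoder[FlowStatsAgg] = Encoders.product[FlowStatsAgg]
  def outputEncoder: Encoder[FlowStatsAgg] = Encoders.product[FlowStatsAgg]
}

// A single typed column now covers all eight aggregates
inputFlowRecords
  .groupByKey(_.FlowInformation)
  .agg(FlowStatsAggregator.toColumn.name("flowStats"))
The result is a Dataset keyed by FlowInformation with a single struct column holding all eight aggregates, which also sidesteps the 4-column limit of agg.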

How to avoid the need for nested calls to Spark dataframes - which don't work

Suppose I have a Spark dataframe called trades which has in its schema a few columns, some dimensions (let's say Product and Type) and some facts (let's say Price and Volume).
Rows in the dataframe which have the same dimension columns belong logically to the same group.
What I need is to map each dimension pair (Product, Type) to a numeric value, so that I end up with a dataframe stats which has as many rows as there are distinct dimension pairs and a value - this is the critical part - which is obtained from all the rows in trades for that (Product, Type) and which must be computed sequentially, in order, because the function applied row by row is neither associative nor commutative and cannot be parallelized.
I managed to handle the sequential function I need to apply to each subset by repartitioning to 1 single chunk each dataframe and sorting the rows, so to get exactly what I need.
The thing I am struggling with is how to do the map from trades to stats as a Spark job: in my scenario the master is remote and can leverage multiple executors, while the deploy mode is local and the local machine is poorly equipped.
So I don't want to do looping over the driver, but push it down to the cluster.
If this was not Spark, I'd have done something like:
val dimensions = trades.select("Product", "Type").distinct()
val stats = dimensions.map { row =>
  val product = row.getAs[String]("Product")
  val productType = row.getAs[String]("Type") // "type" is a reserved word in Scala
  val inScope = col("Product") === product and col("Type") === productType
  val tradesInScope = trades.filter(inScope)
  Row(product, productType, callSequentialFunction(tradesInScope))
}
This seemed fine to me, but it's absolutely not working: I am trying to do a nested call on trades, and it seems that is not supported. Indeed, when running this the Spark job compiles, but when actually performing an action I get a NullPointerException, because the dataframe trades is null within the map.
I am new to Spark, and I don't know any other way of achieving the same intent in a valid way. Could you help me?
You get a NullPointerException because you cannot use dataframes within executor-side code; they only live on the driver. Also, your code would not ensure that callSequentialFunction is called sequentially, because map on a dataframe will run in parallel (if you have more than 1 partition). What you can do is something like this:
val dimensions = trades.select("Product", "Type").distinct().as[(String, String)].collect()
val stats = dimensions.map { case (product, productType) =>
  val inScope = col("Product") === product and col("Type") === productType
  val tradesInScope = trades.filter(inScope)
  (product, productType, callSequentialFunction(tradesInScope))
}
But note that the order in dimensions is somewhat arbitrary, so you should sort dimensions according to your needs.
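And to keep the per-group processing ordered and on the executors (matching the repartition-to-one-chunk approach you already use), callSequentialFunction could look roughly like this; the ordering column, the result type and the fold body are placeholders:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._

def callSequentialFunction(tradesInScope: DataFrame): Double =
  tradesInScope
    .repartition(1)                         // one partition => one task on a single executor
    .sortWithinPartitions(col("EventTime")) // assumed ordering column
    .mapPartitions { rows =>
      // rows arrive in sorted order; replace the fold body with the real sequential logic
      Iterator.single(rows.foldLeft(0.0)((acc, row) => acc + row.getAs[Double]("Price")))
    }
    .head()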

Spark Scala - Apply ML/Complex functions on a GroupBy DataFrame

I have a large DataFrame (Spark 1.6 Scala) which looks like this:
Type,Value1,Value2,Value3,...
--------------------------
A,11.4,2,3
A,82.0,1,2
A,53.8,3,4
B,31.0,4,5
B,22.6,5,6
B,43.1,6,7
B,11.0,7,8
C,22.1,8,9
C,3.2,9,1
C,13.1,2,3
From this I want to group by Type and apply machine learning algorithms and/or perform complex functions on each group.
My objective is to perform complex functions on each group in parallel.
I have tried the following approaches:
Approach 1) Convert the Dataframe to a Dataset and then use the ds.mapGroups() api. But this gives me an Iterator over each group's values.
If I want to perform RandomForestClassificationModel.transform(dataset: DataFrame), I need a DataFrame with only a particular group's values.
I was not sure whether converting the Iterator to a Dataframe within mapGroups is a good idea.
Approach 2) Distinct on Type, then map on them and then filter for each Type within the map loop:
val types = df.select("Type").distinct()
val ff = types.map(row => {
  val rowType = row.getString(0) // "type" is a reserved word in Scala
  val thisGroupDF = df.filter(col("Type") === rowType)
  // Apply complex functions on thisGroupDF
  (rowType, predictedValue)
})
For some reason, the above never completes (it seems to get into some kind of infinite loop).
Approach 3) Exploring Window functions, but I did not find a method which can provide a dataframe of a particular group's values.
Please help.

Using groupBy in Spark and getting back to a DataFrame

I am having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing, stored in a new DataFrame, to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply different aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or equivalent method.
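For example, filling in that ... with a simple count gets you back to a regular DataFrame:
import org.apache.spark.sql.functions.count

val eventsPerMachine = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count("*").as("n_events"))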
A groupBy in Spark followed by an aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")