How to use combineByKey on dataframe - scala

I am trying to achieve secondary sorting in Spark. To be precise, for all events of a user session, I want to sort them based on timestamp. After the secondary sort, I need to iterate through each event of a session to apply some business logic. I am doing it as follows:
def createCombiner = (row: Row) => Array(row)

def mergeValue = (rows: Array[Row], row: Row) => rows :+ row

def mergeCombiner = (rows1: Array[Row], rows2: Array[Row]) => rows1 ++ rows2

def attribute(eventsList: List[Row]): List[Row] = {
  for (row <- eventsList) yield {
    // some logic
    row
  }
}

val groupedAndSortedRows = rawData.rdd
  .map(row => (row.getAs[String]("session_id"), row))
  .combineByKey(createCombiner, mergeValue, mergeCombiner)
  .mapValues(_.toList.sortBy(_.getAs[String]("client_ts")))
  .mapValues(attribute)
But I fear this is not the most time-efficient way to do it, as converting to an RDD requires deserialization and serialization, which I believe is not needed when working with DataFrames/Datasets.
I am not sure whether there is an aggregate function that returns entire rows:
rawData.groupBy("session_id").someAggregateFunction()
I want someAggregateFunction() to return a list of Rows. I do not want to aggregate over particular columns; I want the list of entire Rows corresponding to a session_id. Is it possible to do this?

The answer is yes, but it may not be what you expect. Depending on how complicated your business logic is, there are two alternatives to combineByKey.
If you just need mean, min, max or another function already defined in spark.sql.functions (https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/functions.scala), you can certainly use groupBy(...).agg(...). I guess that's not your case, and if you implement your own UDAF instead, that's no better than combineByKey unless this business logic is quite common and can be reused for other datasets.
If you need slightly more complicated logic, you can use a window function.
Specify a window spec with Window.partitionBy($"session_id").orderBy($"client_ts".desc) and you can easily implement topN, moving averages, ntile, etc. See https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html. You can also implement a custom window aggregation function yourself.
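For instance, a minimal sketch of using such a window spec for per-session ranking (assuming rawData has the session_id and client_ts columns and spark.implicits._ is in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Events ordered within each session, newest first
val bySession = Window.partitionBy($"session_id").orderBy($"client_ts".desc)

// e.g. keep only the latest three events of every session
val latestThree = rawData
  .withColumn("rank", row_number().over(bySession))
  .filter($"rank" <= 3)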

Related

Slick: update List in db

My table schema in Postgres is the following:
I store a List[String] in the second column, and I wrote a working method that updates this list with the union of a new list and the old list:
def update(userId: Long, unknownWords: List[String]) = db.run {
  for {
    y <- lists.filter(_.userId === userId).result
    words = y.map(_.unknownWords).flatMap(_.union(unknownWords)).distinct.toList
    x <- lists.filter(_.userId === userId).map(_.unknownWords).update(words)
  } yield x
}
Is there any way to write this better? Maybe the question is pretty dumb, but I don't quite understand why I should apply .result to the first line of the for expression; the filter().map() chain on the third line works fine without it, so is there something wrong with the types?
Why .result
The reason you need to apply .result is to do with the difference between queries (Query type) and actions (DBIO) in Slick.
By itself, the lists.filter line is a query. However, the third line (the update) is an action. If you left the .result off, your for comprehension would have a type mismatch between a Query and a DBIO (action).
Because you're going to db.run the result of the for comprehension, the for comprehension needs to produce a DBIO action rather than a query. In other words, putting a .result there is the right thing to do, because you're constructing an action to run in the database (namely, fetching some data for the user).
You're then going to run another action later to update the database. So in all, you're using for to combine two actions (two runnable SQL expressions) into a single DBIO. That's the x you yield, which is executed by db.run.
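As a rough sketch of that distinction (the Lists and ListRow names here are placeholders for your actual table and row types):

// A query: a description of SQL, not yet runnable
val userLists: Query[Lists, ListRow, Seq] = lists.filter(_.userId === userId)

// An action: something db.run can execute
val fetchLists: DBIO[Seq[ListRow]] = userLists.result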
Better?
This is working for you, and that's just fine.
There's a small amount of duplication. You might spot that your query on the first line is very similar to the update query. You could abstract that out into a value:
val userLists = lists.filter(_.userId === userId)
That's a query. In fact, you could go a step further and modify the query to just select the unknownWords column:
val userUnknownWords = lists.filter(_.userId === userId).map(_.unknownWords)
I've not tried to compile this but that would make your code something like:
def update(userId: Long, unknownWords: List[String]) = {
  val userUnknownWords = lists.filter(_.userId === userId).map(_.unknownWords)

  db.run {
    for {
      y <- userUnknownWords.result
      words = y.flatMap(_.union(unknownWords)).distinct.toList
      x <- userUnknownWords.update(words)
    } yield x
  }
}
Given that you're composing two actions (a select and an update), you could use DBIO.flatMap in place of the for comprehension. You might find it clearer. Or not. But here's an example...
The argument to DBIO.flatMap needs to be another action. That is, flatMap is a way to sequence actions. In particular, it's a way to do that while using the value from the database.
So you could replace the for comprehension with:
val action: DBIO[Int] =
  userUnknownWords.result.flatMap { currentWords =>
    userUnknownWords.update(
      currentWords.flatMap(_.union(unknownWords)).distinct.toList
    )
  }
(Again, apologies for not compiling the above: I don't have the details of the types, but hopefully this will give a flavour for how the code could work).
The final action is the one you can pass to db.run. It returns the number of rows changed.
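A hypothetical call site then looks like this; the Future simply carries the number of rows the update touched:

import scala.concurrent.Future

// Executes the composed select-then-update action
val rowsChanged: Future[Int] = db.run(action)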

Data analysis on a subset with scala

I'm new to learning Scala and exploring the ways it can do things, and am now trying to learn to implement some slightly more sophisticated data analysis.
I have weather data for various cities in different countries in a text file loaded into the program. I have so far figured out how to calculate simple things like the average temperature across a country per day, or the average temperature of each city grouped by country across the whole file, using Map/mapValues to bind keys to the values I'm looking for.
Now I would like to be able to specify a time window (say, a week) and, from there, grouped by country, figure out things like the average temperature of each city in that time window. For simplicity, I've made the dates simple Ints rather than going with a MM/DD/YY format.
In another language I would likely reach for loops to do this, but I'm not quite sure of the best "Scala" way. At first I thought maybe "sliding" or "grouped", but I found these would split the list entirely, and therefore I could not specify an arbitrary day to calculate the week from. I've included example code for my method which calculates the average temperature per city over the whole time period:
def citytempaverages(): Map[String, Map[String, Double]] = {
  weatherpatterns.groupBy(_.country)
    .mapValues(_.groupBy(_.city)
      .mapValues(cityavg => cityavg.map(_.temperature).sum / cityavg.length))
}
Does it even still make sense to use maps for this new problem, or perhaps another method in the collections API is more suited?
UPDATE #1: so I've built a collection like so:
def dailycities(): Map[Int, Map[String, Map[String, List[Double]]]] = {
  weatherpatterns.groupBy(_.day)
    .mapValues(_.groupBy(_.country).mapValues(_.groupBy(_.city)
      .mapValues(_.map(_.temperature))))
}
And then I created a new map using filterKeys and a Set to give me back just the days I'm looking for. So I suppose now it's just a matter of formatting to get the averages out correctly.
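Something along these lines, as a sketch (assuming an arbitrary startDay and the dailycities() shape above):

// Keep only the seven days of the chosen window
val window = (startDay until startDay + 7).toSet
val weekData = dailycities().filterKeys(window)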
I wouldn't call it the best way in Scala to do this. Rather, any approach that minimizes iteration is best, in my opinion, in this case:
def averageOfDay(country: String, city: String, day: Int) = {
  val temps = weatherPatterns.collect {
    case WeatherPattern(`day`, `country`, `city`, temp) => temp
  }
  temps.sum / temps.length
}
Edit
I just noticed you mainly need an operation that calculates averages for all cities and countries. In that case, instead of forming the hierarchical relationship of country -> city -> temp in every operation, I'd opt for building the hierarchy once beforehand and then operating on that:
case class DailyTemperature(day: Int, temperature: Double)

object DailyTemperature {
  def sequence(patterns: List[WeatherPattern]): List[DailyTemperature] =
    patterns.map(p => DailyTemperature(p.day, p.temperature))
}

case class CityTempInfo(city: String, dailyTemperatures: List[DailyTemperature])

object CityTempInfo {
  def sequence(patterns: List[WeatherPattern]): List[CityTempInfo] =
    patterns.groupBy(_.city).map {
      case (city, ps) => CityTempInfo(city, DailyTemperature.sequence(ps))
    }.toList
}

case class CountryTempInfo(country: String, citiesInfo: List[CityTempInfo])

object CountryTempInfo {
  def sequence(patterns: List[WeatherPattern]) =
    patterns.groupBy(_.country).map {
      case (country, ps) => CountryTempInfo(country, CityTempInfo.sequence(ps))
    }.toList
}
Now, to have your tree of country -> city -> temp, you call CountryTempInfo.sequence and feed it your list of WeatherPatterns. Any other method you want to operate on DailyTemperature, CityTempInfo, or CountryTempInfo can be defined on their respective classes.
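A hypothetical usage, building the tree once and then deriving, say, per-city averages from it:

// Build the country -> city -> temperature hierarchy a single time
val countries: List[CountryTempInfo] = CountryTempInfo.sequence(weatherPatterns)

// Derive the average temperature per city from the prebuilt tree
val cityAverages: Map[String, Double] = countries.flatMap(_.citiesInfo).map { info =>
  val temps = info.dailyTemperatures.map(_.temperature)
  info.city -> temps.sum / temps.length
}.toMap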
I am not sure what exactly you mean when you say that you use "simple Ints" for the date, but if it is something sensible, like, for instance, "days since epoch", you could fairly easily come up with a grouping function that maps to weeks:
def weekOf(n: Int, start: Int) = (start + n) / 7

patterns
  .groupBy { p => (weekOf(p.day, startDay), p.country, p.city) }
  .mapValues(_.map(_.temperature))
  .mapValues { s => s.sum / s.size }

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
  countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, whereas I need to specify a particular column (i.e. the column holding the "value" of the key-value grouped dataset). Also, when you use typed functions from the typed object, you do not need to specify the column name either, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, it's not possible with the agg transformation, which expects a TypedColumn that is constructed from a Column using the as method, so you need to start from a non-type-safe expression. If somebody knows a solution, I would be interested to see it...
If you need to use type-safe aggregation you can use one of below approaches:
mapGroups - where you can implement a Scala function responsible for aggregating the Iterator of values
implement your custom Aggregator as suggested above
The first approach needs less code, so below I'm showing a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size

ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
In my opinion, the type-safe Spark Dataset API is still much less mature than the non-type-safe DataFrame API. Some time ago I was thinking that it could be a good idea to implement a simple-to-use type-safe aggregation API for Spark Datasets.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}

val colName = "id"

ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))
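If you bind that aggregated DataFrame to a value (say, aggregated), converting back to a typed Dataset is then just a matter of supplying a matching case class (a sketch, relying on the same implicits):

// KeyedCount mirrors the key/cntd columns produced above
case class KeyedCount(key: String, cntd: Long)

val typedResult: Dataset[KeyedCount] = aggregated.as[KeyedCount]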

Applying DataFrame operations to a Single row in mapWithState

I'm on Spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate to solve my problem appears to be mapWithState(), which is defined in PairDStreamFunctions. The DStream to which I am applying mapWithState() is of type DStream[Row]. Before applying mapWithState(), I do this:
dstream.map(row=> (row.get(0), row))
Now my DStream is of type DStream[(Any, Row)]. On this DStream I apply mapWithState, and here's how my update function looks:
def stateUpdateFunction(): (Any, Option[Row], State[Map[String, Any]]) => Option[Row] = {
  (key, newData, stateData) => {
    if (stateData.exists()) {
      val oldState = stateData.get()
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      Some(Row.fromSeq(newData.get.toSeq ++ Seq(oldState.get("count").get, oldState.get("sum").get)))
    } else {
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      Some(Row.fromSeq(newData.get.toSeq ++ Seq[Any](null, null)))
    }
  }
}
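For context, this is roughly how such a function gets wired in (a sketch, where keyedDStream stands for the DStream[(Any, Row)] produced by the map above):

import org.apache.spark.streaming.StateSpec

// Wrap the update function in a StateSpec and apply it to the keyed stream
val stateSpec = StateSpec.function(stateUpdateFunction())
val stateDStream = keyedDStream.mapWithState(stateSpec)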
Right now, the update function only stores two values (per key) in the Map, appends the old values stored against "count" and "sum" to the input Row, and returns it. The state Map gets updated with the values newly passed in via the input Row. My requirement is to be able to do complex operations on the input Row, like we do on a DataFrame, before storing them in the state Map. In other words, I would like to be able to do something like this:
var transformedRow = originalRow.select(concat(upper($"C0"), lit("dummy")), lower($"C1") ...)
In the update function I don't have access to the SparkContext or SparkSession, so I cannot create a single-row DataFrame. If I could do that, applying DataFrame operations would not be difficult; I have all the column expressions defined for the transformed row.
Here's my sequence of operations:
read state -> perform complex DataFrame operations using this state on the input row -> perform more complex DataFrame operations to define new values for the state.
Is it possible to fetch the SparkPlan/logicalPlan corresponding to a DataFrame query/operation and apply it to a single Spark SQL Row? I would very much appreciate any leads here. Please let me know if the question is not clear or if more details are required.
I've found a not-so-efficient solution to the given problem. With the known DataFrame operations we have, we can create an empty DataFrame with an already known schema. This DataFrame can give us the SparkPlan through
DataFrame.queryExecution.sparkPlan
This object is serializable and can be passed to stateUpdateFunction. Inside stateUpdateFunction, we can iterate over the expressions contained in the passed SparkPlan, transforming each one to replace unresolved attributes with the corresponding literals:
sparkPlan.expressions.map(expr => {
  expr.transform {
    case attr: AttributeReference =>
      println(s"Resolving ${attr.name} to ${map.getOrElse(attr.name, null)}, type: ${map.getOrElse(attr.name, null).getClass().toString()}")
      Literal(map.getOrElse(attr.name, null))
    case a => a
  }
})
The map here refers to the Row's column-value pairs. On these transformed expressions we call eval, passing it an empty InternalRow. This gives us a result for every expression. Because this method involves interpreted evaluation and doesn't employ code generation, it will be inefficient in a real-world use case, but I'll dig further to find out how code generation can be leveraged here.
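Concretely, the evaluation step looks roughly like this (a sketch, assuming resolvedExprs is the sequence of literal-substituted expressions produced above):

import org.apache.spark.sql.catalyst.InternalRow

// Every attribute has been replaced by a Literal, so an empty input row suffices
val values: Seq[Any] = resolvedExprs.map(_.eval(InternalRow.empty))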

Batching within an Apache Spark RDD map

I have a situation where an underlying function operates significantly more efficiently when given batches to work on. I have existing code like this:
// subjects: RDD[Subject]
val subjects = Subject.load(job, sparkContext, config)
val classifications = subjects.flatMap(subject => classify(subject)).reduceByKey(_ + _)
classifications.saveAsTextFile(config.output)
The classify method works on single elements but would be more efficient operating on groups of elements. I considered using coalesce to split the RDD into chunks and acting on each chunk as a group; however, there are two problems with this:
I'm not sure how to return the mapped RDD.
classify doesn't know in advance how big the groups should be and it varies based on the contents of the input.
Sample code on how classify could be called in an ideal situation (the output is kludgey since it can't spill for very large inputs):
def classifyRdd(subjects: RDD[Subject]): RDD[(String, Long)] = {
  val classifier = new Classifier
  subjects.foreach(subject => classifier.classifyInBatches(subject))
  classifier.classifyRemaining
  classifier.results
}
This way classifyInBatches can have code like this internally:
def classifyInBatches(subject: Subject) {
  if (!internals.canAdd(subject)) {
    partialResults.add(internals.processExisting)
  }
  internals.add(subject) // Assumption: at least one will fit.
}
What can I do in Apache Spark that will allow behavior somewhat like this?
Try using the mapPartitions method, which allows your map function to consume a partition as an iterator and produce an iterator of output.
You should be able to write something like this:
subjectsRDD.mapPartitions { subjects =>
  val classifier = new Classifier
  subjects.foreach(subject => classifier.classifyInBatches(subject))
  classifier.classifyRemaining
  classifier.results
}
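Tied back into the original job, here is a sketch under the assumption that Classifier accumulates its results internally and can expose them as a collection of (String, Long) pairs:

def classifyRdd(subjects: RDD[Subject]): RDD[(String, Long)] =
  subjects.mapPartitions { iter =>
    val classifier = new Classifier
    iter.foreach(subject => classifier.classifyInBatches(subject))
    classifier.classifyRemaining
    classifier.results.iterator // mapPartitions must return an Iterator
  }

val classifications = classifyRdd(subjects).reduceByKey(_ + _)
classifications.saveAsTextFile(config.output)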