Getting the delta time (minimum value - actual value) of an RDD - scala

I have a cartesian RDD which lets me filter an RDD on a certain time range, but I also need the minimum timestamp in the RDD so that I can calculate the time delta between each record and the entry that occurred first.
I have a case class defined as follows:
case class auction(id: String, prodID: String, timestamp: Long)
and I combine two RDDs, one containing the auction of note and the other containing the auctions that occurred in that time period, as below:
val specificmessages = allauctions.cartesian(winningauction)
  .filter { case (x, y) => x.timestamp > y.timestamp - 10 &&
    x.timestamp < y.timestamp + 10 &&
    x.prodID == y.prodID }
In specificmessages, I would like to add a field containing the delta between each record's timestamp and the minimum auction timestamp.

You can use DataFrames like this:
import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.expressions.Window
// Convert RDDs to DFs
val allDF = allauctions.toDF
val winDF = winningauction.toDF("winId", "winProdId", "winTimestamp")
// Prepare join conditions
val prodCond = $"prodID" === $"winProdID"
val tsCond = f.abs($"timestamp" - $"winTimestamp") < 10
// Create window
val w = Window
  .partitionBy($"id", $"prodID", $"timestamp")
val joined = allDF
  .join(winDF, prodCond && tsCond)
  // Minimum winning timestamp within each window
  .select($"*", f.min($"winTimestamp").over(w).alias("mintimestamp"))
Using plain RDDs
// Create PairRDDs
def allPairs = allauctions.map(a => (a.prodID, a))
def winPairs = winningauction.map(a => (a.prodID, a))
allPairs
  .join(winPairs) // Join by prodId -> RDD[(prodID, (auction, auction))]
  // Filter timestamp
  .filter{case (_, (x, y)) => (x.timestamp - y.timestamp).abs < 10}
  .values // Drop key -> RDD[(auction, auction)]
  .groupByKey // Group by allAuctions -> RDD[(auction, Iterable[auction])]
  .flatMap{ case (k, vals) => {
    val minTs = vals.map(_.timestamp).min // Find min ts from winauction
    vals.map(v => (k, v, minTs))
  }} // -> RDD[(auction, auction, ts)]
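The filter-then-delta idea can also be tried on plain Scala collections before moving it to Spark. The following is only a minimal sketch: the Auction class mirrors the question's case class, the sample records are hypothetical, and the delta is computed as actual value minus minimum (flip the sign if the opposite convention is wanted).

```scala
// Local-collection sketch; Auction mirrors the question's case class.
case class Auction(id: String, prodID: String, timestamp: Long)

// Hypothetical sample data: one winning auction and some candidates.
val winning = Auction("w1", "p1", 100L)
val all = List(
  Auction("a1", "p1", 95L),
  Auction("a2", "p1", 104L),
  Auction("a3", "p1", 250L), // outside the +/- 10 window
  Auction("a4", "p2", 99L)   // different product
)

// Keep auctions for the same product within 10 time units of the winner.
val nearby = all.filter(a =>
  a.prodID == winning.prodID && math.abs(a.timestamp - winning.timestamp) < 10)

// Earliest timestamp among the matches, then each record's delta to it.
val minTs = nearby.map(_.timestamp).min
val withDelta = nearby.map(a => (a, a.timestamp - minTs))

withDelta.foreach { case (a, d) => println(s"${a.id}: delta = $d") }
```

Calling min on an empty collection throws, so a real job would need to handle the case where nothing falls inside the window.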


Getting the mode from an RDD

I would like to get the mode (the most common number) from an rdd using Spark + Scala.
I can get it with the code below, but I think there may be a better way to calculate it. Most importantly, if more than one value has the same highest number of repetitions, I need to return all of them.
Let's see my example code:
val l = List(3,4,4,3,3,7,7,7,9)
val rdd = spark.sparkContext.parallelize(l)
val grouped = rdd.map(e => (e, 1)).groupBy(_._1).map(e => (e._1, e._2.size))
val maxRep = grouped.collect().maxBy(_._2)._2
val mode = grouped.filter(e => e._2 == maxRep).map(e => e._1).collect
And the result is right:
Array[Int] = Array(3, 7)
but is there a better way to do this? I mean considering the performance because the original RDD would be much bigger than this.
This should work and be a little more efficient
(only if you are sure the number of distinct values is small, since countByValue returns a local Map to the driver):
val counted = rdd.countByValue()
val max = counted.valuesIterator.max
val maxElements = counted.collect { case (k, v) if (v == max) => k }
If there could be many distinct values, consider this alternative, which is memory safe:
val counted = rdd.map(x => (x, 1L)).reduceByKey(_ + _).cache()
val max = counted.values.max
val maxElements = counted.map { case (k, v) => (v, k) }.lookup(max)
How about getting the max key-value pair from a double groupBy? This works even better for bigger data sizes.
// res1: (Int, Iterable[(Int, Int)]) = (3,CompactBuffer((3,3), (7,3)))
To get the element
// res4: Iterable[Int] = List(3, 7)
The first groupBy maps each element to its count (element -> count, of type Map[Int, Long]); the second groupBy groups those (element -> count) pairs by count, giving (count -> Iterable((element, count))). Then simply take the max to get the key-value pair with the largest key, which is the count.
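The double-groupBy idea described above can be sketched on a local List. This is not the answer's original Spark code (which is not shown here), only an illustration of the same two grouping steps; the value names are made up for the example.

```scala
// The same values as the question's example list.
val values = List(3, 4, 4, 3, 3, 7, 7, 7, 9)

// First grouping: element -> number of occurrences.
val counts: Map[Int, Int] =
  values.groupBy(identity).map { case (elem, occurrences) => (elem, occurrences.size) }

// Second grouping: occurrence count -> all elements with that count.
val byCount: Map[Int, List[Int]] =
  counts.toList.groupBy(_._2).map { case (count, pairs) => (count, pairs.map(_._1).sorted) }

// The entry with the largest key holds every mode.
val (maxCount, modes) = byCount.maxBy(_._1)

println(s"modes = $modes, each seen $maxCount times")
```

Because ties end up in the same bucket of the second grouping, all modes are returned together.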

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A is repeated 10 times on day 2 and B 10 times on day 3)
I am splitting the original dataset
// rdd stands for the RDD[String] described above
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")))
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.sql.functions._
val rdd = sqlContext.sparkContext.makeRDD(Seq(...))
val keys = Seq("A", "B")
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map{str =>
  val split = str.split(":")
  (s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map{key => {
  key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}}
The processing consists of transforming the data into an RDD to which functions such as finding the maximum of a list are easy to apply.
I will try to explain step by step.
I've used dummy data for the "A" and "B" chars.
The foo RDD is the first step; it gives you an RDD[(String, Array[String])].
Let's count each char in the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we flatMap over the values to expand the RDD by char:
res3.flatMapValues(list => list)
Rearrange each record so it is easier to read:
  .map{case (d, (s, c)) => (s, c, d)}
Now we group by char:
  .groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the record with the maximum count for each char:
  .map{case (s, list) => (s, list.maxBy(_._2))}
I hope this helps.
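The steps above (count chars per day, flatten, group by char, take the maximum) can be sketched end to end on local collections. The input lines below are hypothetical and only follow the question's "day:events" format; the value names are illustrative.

```scala
// Hypothetical input in the question's "day:events" format.
val lines = List("1:AAB", "2:AAAB", "3:ABB")

// Step 1: parse each line into (day, event, count) records.
val records: List[(String, Char, Int)] = lines.flatMap { line =>
  val parts = line.split(":")
  val day = parts(0)
  val events = parts(1)
  events.groupBy(identity).toList.map { case (event, occurrences) =>
    (day, event, occurrences.length)
  }
}

// Step 2: group by event and keep the day with the highest count.
val bestDay: Map[Char, String] =
  records.groupBy(_._2).map { case (event, recs) =>
    (event, "Day" + recs.maxBy(_._3)._1)
  }

println(bestDay.toList.sorted)
```

Note that maxBy keeps only one record when two days tie for the same event; returning every tied day would need an extra filter on the maximum count.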
The previous answers are good, but I prefer this solution:
val data = Seq(...)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
  val parts = s.split(":")
  val charCount = parts(1).groupBy(i => i).mapValues(_.length)
  charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
Result is:

RDD/Scala Get one column from RDD

I have an RDD[Log] file with various fields (username,content,date,bytes) and I want to find different things for each field/column.
For example, I want to get the min, max, and average bytes found in the RDD. When I do:
val q1 = cleanRdd.filter(x => x.bytes != 0)
I get the full lines of the RDD with bytes != 0. But how can I actually sum them, calculate the avg, find the min/max etc? How can I take only one column from my RDD and apply transformations on it?
EDIT: Prasad told me about changing the type to a DataFrame, but he gave no instructions on how to do so, and I can't find a solid answer on the site. Any help would be great.
EDIT: LOG class:
case class Log (username: String, date: String, status: Int, content: Int)
using a cleanRdd.take(5).foreach(println) gives something like this
Log( ,01/Jul/1995:00:00:01 -0400,200,6245)
Log( ,01/Jul/1995:00:00:06 -0400,200,3985)
Log( ,01/Jul/1995:00:00:09 -0400,200,4085)
Log( ,01/Jul/1995:00:00:11 -0400,304,0)
Log( ,01/Jul/1995:00:00:11 -0400,200,4179)
Well... you have a lot of questions.
So... you have the following abstraction of a Log
case class Log (username: String, date: String, status: Int, content: Int, bytes: Int)
Que - How can I take only one column from my RDD?
Ans - RDDs have a map function. For an RDD[A], map takes a transform function of type A => B and turns it into an RDD[B].
val logRdd: RDD[Log] = ...
val byteRdd = logRdd
  .filter(l => l.bytes != 0)
  .map(l => l.bytes)
Que - how can I actually sum them ?
Ans - You can do it by using reduce / fold / aggregate.
val sum = byteRdd.reduce((acc, b) => acc + b)
val sum = byteRdd.fold(0)((acc, b) => acc + b)
val sum = byteRdd.aggregate(0)(
  (acc, b) => acc + b,
  (acc1, acc2) => acc1 + acc2
)
Note: The sum of many Int values can exceed what an Int can hold. So in most real-life cases we should use at least a Long as the accumulator instead of an Int. Because reduce and fold require the accumulator to have the same type as the elements, that rules them out for an RDD[Int] and leaves only aggregate.
val sum = byteRdd.aggregate(0L)(
  (acc, b) => acc + b,
  (acc1, acc2) => acc1 + acc2
)
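To make the overflow concern concrete, here is a small local illustration. It uses foldLeft on a List rather than an RDD, and the two values are hypothetical, picked only so that the Int sum wraps around.

```scala
// Hypothetical byte counts chosen so that the Int sum overflows.
val sizes: List[Int] = List(Int.MaxValue, 1)

// Accumulating in Int wraps around silently.
val intSum: Int = sizes.foldLeft(0)((acc, b) => acc + b)

// Accumulating in Long gives the exact total.
val longSum: Long = sizes.foldLeft(0L)((acc, b) => acc + b)

println(s"Int sum: $intSum, Long sum: $longSum")
```

The Int version produces a negative number without raising any error, which is why a wider accumulator is the safer default.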
Now, if you have to calculate several things such as min, max, and avg, I suggest calculating them in a single aggregate instead of several passes, like this:
// (count, sum, min, max)
val accInit = (0L, 0L, Int.MaxValue, Int.MinValue)
val (count, sum, min, max) = byteRdd.aggregate(accInit)(
  { case ((count, sum, min, max), b) =>
    (count + 1, sum + b, Math.min(min, b), Math.max(max, b)) },
  { case ((count1, sum1, min1, max1), (count2, sum2, min2, max2)) =>
    (count1 + count2, sum1 + sum2, Math.min(min1, min2), Math.max(max1, max2)) }
)
val avg = sum.toDouble / count
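The roles of the two functions passed to aggregate can be seen in a local sketch. It assumes two hypothetical "partitions" (plain Lists) and uses foldLeft in place of Spark; the names Stats, seqOp, and combOp are illustrative and not part of any API.

```scala
// Accumulator: (count, sum, min, max); count and sum are Long to avoid overflow.
type Stats = (Long, Long, Int, Int)
val zero: Stats = (0L, 0L, Int.MaxValue, Int.MinValue)

// Folds one value into an accumulator (the role of aggregate's first function).
def seqOp(acc: Stats, b: Int): Stats =
  (acc._1 + 1, acc._2 + b, math.min(acc._3, b), math.max(acc._4, b))

// Merges two accumulators (the role of aggregate's second function).
def combOp(a: Stats, b: Stats): Stats =
  (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))

// Two hypothetical "partitions" holding the non-zero sizes from the question's sample.
val partitions = List(List(6245, 3985), List(4085, 4179))

// Each partition is folded on its own, then the partial results are merged.
val (count, sum, min, max) =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
val avg = sum.toDouble / count

println(s"count=$count sum=$sum min=$min max=$max avg=$avg")
```

The zero value must be neutral for both functions, which is why min starts at Int.MaxValue and max at Int.MinValue.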
Have a look at the DataFrame API. You need to convert your RDD to a DataFrame, and then you can use the min, max, and avg functions as below:
val rdd = cleanRdd.filter(x => x.bytes != 0)
val df = sparkSession.createDataFrame(rdd)
Assuming you want to operate on the column bytes:
import org.apache.spark.sql.functions._
df.select(min("bytes")).show
df.select(max("bytes")).show
df.select(avg("bytes")).show
I tried the following in spark-shell:
case class Log (username: String, date: String, status: Int, content: Int)
val inputRDD = sc.parallelize(Seq(
  Log("","01/Jul/1995:00:00:01 -0400",200,6245), Log("","01/Jul/1995:00:00:06 -0400",200,3985),
  Log("","01/Jul/1995:00:00:09 -0400",200,4085), Log("","01/Jul/1995:00:00:11 -0400",304,0),
  Log("","01/Jul/1995:00:00:11 -0400",200,4179)))
val rdd = inputRDD.filter(x => x.content != 0)
val df = rdd.toDF("username", "date", "status", "content")
import org.apache.spark.sql.functions._
df.select(min("content")).show
df.select(max("content")).show
df.select(avg("content")).show

Spark: Efficient mass lookup in pair RDD's

In Apache Spark I have two RDDs. The first, data : RDD[(K,V)], contains data in key-value form. The second, pairs : RDD[(K,K)], contains a set of interesting key pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key-tuple and their corresponding values (from data) as the value-tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of pairs is only a constant factor times the size of data: |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  keyPairs map {case (k1,k2) =>
    val v1 : String = data.lookup(k1).head
    val v2 : String = data.lookup(k2).head
    ((k1, k2), (v1,v2))
  }
}
// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  // Construct all possible pairs of values
  val cartesianData = data cartesian data map {case((k1,v1),(k2,v2)) => ((k1,k2),(v1,v2))}
  // Select only the values whose keys are in keyPairs
  keyPairs map {(_,0)} join cartesianData mapValues {_._2}
}
// Example function that finds pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
  val keys = data map (_._1)
  keys cartesian keys filter {case (x,y) => x*y == 12 && x < y}
}
// Example run
val data = sc parallelize(1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)
// Print:
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
Attempt 1
First I tried just using the lookup function on data, but that throws a runtime error when executed. It seems that self is null in the PairRDDFunctions trait.
In addition, I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (that is, inside another transformation).
In lookup 2, I don't think it's necessary to perform a full cartesian product.
You can do it like this:
val firstjoin = pairs.map({case (k1,k2) => (k1, (k1,k2))})
  .join(data)
  .map({case (_, ((k1, k2), v1)) => ((k1, k2), v1)})
val result = firstjoin.map({case ((k1,k2),v1) => (k2, ((k1,k2),v1))})
  .join(data)
  .map({case(_, (((k1,k2), v1), v2))=>((k1, k2), (v1, v2))})
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({case (x,y) => (x._2, (x,y))})
  .join(data).map({case(x, (y, z))=>(y._1, (y._2, z))})
I don't think you can do it more efficiently, but I might be wrong...
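The two-step join can be followed on local collections. In this sketch a Map lookup stands in for each Spark join, and the sample data and pairs are the ones from the question's example run; it shows the shape of the intermediate values rather than Spark's actual execution.

```scala
// Hypothetical local stand-ins for the question's data and pairs RDDs.
val data: Map[Int, String] = (1 to 12).map(k => (k, s"Number $k")).toMap
val pairs: List[(Int, Int)] = List((1, 12), (2, 6), (3, 4))

// Step 1: key each pair by k1 and attach v1 (an inner join on k1).
val firstJoin: List[((Int, Int), String)] =
  pairs.flatMap { case (k1, k2) => data.get(k1).map(v1 => ((k1, k2), v1)) }

// Step 2: key the intermediate result by k2 and attach v2 (an inner join on k2).
val pairsWithData: List[((Int, Int), (String, String))] =
  firstJoin.flatMap { case ((k1, k2), v1) => data.get(k2).map(v2 => ((k1, k2), (v1, v2))) }

pairsWithData.foreach(println)
```

As with an inner join, a pair whose key is missing from data is dropped instead of causing an error.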

Is there a data structure / library to do in memory olap / pivot tables in Java / Scala?

Relevant questions
This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
I would like to create a pivot-table like matrix from a given tabular dataset, in memory
e.g. an age by marital status count (rows are age, columns are marital status).
The input: List of People, with age and some Boolean property (e.g. married),
The desired output: count of People, by age (row) and isMarried (column)
What I've tried (Scala)
case class Person(val age:Int, val isMarried:Boolean)
val people:List[Person] = ... //
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
I managed to do it the naive way: first grouping by age, then mapping each group to a count by marital status, and then using foldRight to aggregate:
TreeMap(peopleByAge.toSeq: _*).map(x => {
  val age = x._1
  val rows = x._2
  val numMarried = rows.count(_.isMarried)
  val numNotMarried = rows.length - numMarried
  (age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]())((row, list) => {
  val cumMarried = row._2 +
    (if (list.isEmpty) 0 else list.last.cumMarried)
  val cumNotMarried = row._3 +
    (if (list.isEmpty) 0 else list.last.cumNotMarried)
  list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
})
I don't like the above code: it's inefficient and hard to read, and I'm sure there is a better way.
The question(s)
How do I groupBy "both"? and how do I do a count for each subgroup, e.g.
How many people are exactly 30 years old and married?
Another question, is how do I do a running total, to answer the question:
How many people above 30 are married?
Thank you for all the great answers.
just to clarify, I would like the output to include a "table" with the following columns
Age (ascending)
Num Married
Num Not Married
Running Total Married
Running Total Not Married
Not only answering those specific queries, but to produce a report that will allow answering all such type of questions.
Here is an option that is a little more verbose, but does this in a generic fashion instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.
/** Creates a new pivot structure by finding correlated values
 * and performing an operation on these values
 * @param accuOp the accumulator function (e.g. sum, max, etc)
 * @param xCol the "x" axis column
 * @param yCol the "y" axis column
 * @param accuCol the column to collect and perform accuOp on
 * @return a new Pivot instance that has been transformed with the accuOp function
 */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))
  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  })
  // get distinct axis values
  val xAxis = data.map(g => {g._1._1}).toList.distinct
  val yAxis = data.map(g => {g._1._2}).toList.distinct
  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x,y), "")
    })
  })
  // collect it with axis labels for results
  new Pivot(List((yCol + "/" + xCol) +: xAxis) ::: newRows.zip(yAxis).map(x => {x._2 +: x._1}))
}
My Pivot type is pretty basic:
class Pivot(val rows: List[List[String]]) {
  val headers = rows.head.zipWithIndex.toMap
  val body = rows.tail
  // doPivot (shown above) is a method of this class
}
And to test it, you could do something like this:
val marriedP = new Pivot(List(
  List("Name", "Age", "Married"),
  List("Bill", "42", "TRUE"),
  List("Heloise", "47", "TRUE"),
  List("Thelma", "34", "FALSE"),
  List("Bridget", "47", "TRUE"),
  List("Robert", "42", "FALSE"),
  List("Eddie", "42", "TRUE")
))
def accum(values: List[String]) = {
  values.map(x => {1}).sum.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
Which yields:
Name Age Married
Bill 42 TRUE
Heloise 47 TRUE
Thelma 34 FALSE
Bridget 47 TRUE
Robert 42 FALSE
Eddie 42 TRUE
Married/Age 47 42 34
TRUE 2 2
The nice thing is that you can use currying to pass in any function for the values like you would in a traditional excel pivot table.
You can
val groups = people.groupBy(p => (p.age, p.isMarried))
and then
val thirty_and_married = groups.getOrElse((30, true), Nil)
val over_thirty_and_married_count =
groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
I think it would be better to use the count method on Lists directly
For question 1
people.count { p => p.age == 30 && p.isMarried }
For question 2
people.count { p => p.age > 30 && p.isMarried }
If you also want the actual groups of people who satisfy those predicates, use filter.
people.filter { p => p.age > 30 && p.isMarried }
You could probably optimise these by doing the traversal only once but is that a requirement?
You can group using a tuple:
val res1 = people.groupBy(p => (p.age, p.isMarried)) //or
val res2 = people.groupBy(p => (p.age, p.isMarried)).mapValues(_.size) // if you don't care about People instances
You can answer both questions like this:
res2.getOrElse((30, true), 0)
res2.filter{case (k, _) => k._1 > 30 && k._2}.values.sum
res2.filterKeys(k => k._1 > 30 && k._2).values.sum // nicer with filterKeys from Rex Kerr's answer
You could answer both questions with a method count on List:
people.count(p => p.age == 30 && p.isMarried)
people.count(p => p.age > 30 && p.isMarried)
Or using filter and size:
people.filter(p => p.age == 30 && p.isMarried).size
people.filter(p => p.age > 30 && p.isMarried).size
A slightly cleaner version of your code:
TreeMap(peopleByAge.toSeq: _*).map {case (age, ps) =>
  val (married, notMarried) = ps.partition(_.isMarried)
  (age, married.size, notMarried.size)
}.foldLeft(List[FinalResult]()) { case (acc, (age, married, notMarried)) =>
  def prevValue(f: (FinalResult) => Int) = acc.headOption.map(f).getOrElse(0)
  new FinalResult(age, married, notMarried, prevValue(_.cumMarried) + married, prevValue(_.cumNotMarried) + notMarried) :: acc
}
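The report requested in the question (per-age counts plus running totals) can also be built by grouping on a tuple and then using scanLeft. This is only a sketch under stated assumptions: the sample people are hypothetical, and AgeRow is an illustrative stand-in for the question's FinalResult class, whose definition is not shown.

```scala
case class Person(age: Int, isMarried: Boolean)
// Illustrative stand-in for the question's FinalResult.
case class AgeRow(age: Int, married: Int, notMarried: Int, cumMarried: Int, cumNotMarried: Int)

// Hypothetical sample data.
val people = List(
  Person(30, true), Person(30, false), Person(30, true),
  Person(25, false), Person(41, true), Person(41, true)
)

// Count by (age, isMarried) in one grouping pass.
val counts: Map[(Int, Boolean), Int] =
  people.groupBy(p => (p.age, p.isMarried)).map { case (key, ps) => (key, ps.size) }

// One (age, married, notMarried) triple per age, in ascending age order.
val perAge: List[(Int, Int, Int)] =
  people.map(_.age).distinct.sorted.map { age =>
    (age, counts.getOrElse((age, true), 0), counts.getOrElse((age, false), 0))
  }

// scanLeft carries the running totals forward; drop the seed row afterwards.
val table: List[AgeRow] =
  perAge.scanLeft(AgeRow(0, 0, 0, 0, 0)) { case (prev, (age, m, n)) =>
    AgeRow(age, m, n, prev.cumMarried + m, prev.cumNotMarried + n)
  }.tail

// The two example queries, answered from the derived structures.
val marriedAt30 = counts.getOrElse((30, true), 0)
val marriedOver30 = table.filter(_.age > 30).map(_.married).sum

table.foreach(println)
```

Keeping the running totals in ascending age order means a question such as "how many people above 30 are married" can be answered from the table alone, without going back to the raw list.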