Unpack list, pair up and calculate difference - scala

I want to calculate the time difference between in and out for each id. The data is in the format:
String,Long,String,List[String]
======================================
in, time0, door1, [id1, id2, id3, id4]
out, time1, door1, [id1, id2, id3]
out, time2, door1, [id4, id5]
In the end, I want key-value pairs like:
{(id1, #time1-time0), (id2, #time1-time0), (id3, #time1-time0), (id4, #time2-time0), (id5, N/A)}
What would be a good approach for solving this problem?
EDIT: I have tried the following.
case class Data(direction: String, time:Long, door:String, ids: List[String])
val data = sc.parallelize(Seq(Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),Data("out", 20, "d1", List("id1", "id2", "id3")), Data("out",50, "d1", List("id4", "id5"))))
data.flatMap(x => (x.ids, x))

scala> case class Data( direction: String, time: Long, door: String, ids: List[ String ] )
defined class Data
scala> val data = sc.parallelize( Seq( Data( "in", 5, "d1", List( "id1", "id2", "id3", "id4" ) ), Data( "out", 20, "d1", List( "id1", "id2", "id3" ) ), Data( "out",50, "d1", List( "id4", "id5" ) ) ) )
data: org.apache.spark.rdd.RDD[Data] = ParallelCollectionRDD[0] at parallelize at <console>:14
// Get an RDD entry for each ( id, data ) pair
scala> data.flatMap( x => x.ids.map( id => ( id, x ) ) )
res0: org.apache.spark.rdd.RDD[(String, Data)] = FlatMappedRDD[1] at flatMap at <console>:17
// group by id to get data's with same id's
scala> res0.groupBy( { case ( id, data ) => id } )
res1: org.apache.spark.rdd.RDD[(String, Iterable[(String, Data)])] = ShuffledRDD[3] at groupBy at <console>:19
// convert Iterable[(String, Data)] to List[Data]
scala> res1.map( { case ( id, iter ) => ( id, iter.toList.map( { case ( i, d ) => d } ) ) } )
res2: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[4] at map at <console>:21
// sort list of data's by data.time
scala> res2.map( { case ( id, list ) => ( id, list.sortBy( d => d.time ) ) } )
res3: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[5] at map at <console>:23
// get the time diff by doing lastData.time - firstData.time for each id
scala> :paste
// Entering paste mode (ctrl-D to finish)
res3.map( { case ( id, list ) => {
  list match {
    case d :: Nil  => ( id, None )
    case d :: tail => ( id, Some( list.last.time - d.time ) )
    case _         => ( id, None )
  }
} } )
// Exiting paste mode, now interpreting.
res6: org.apache.spark.rdd.RDD[(String, Option[Long])] = MappedRDD[7] at map at <console>:25
Now, res6 has your desired data.
Also... I was not sure how you wanted to use direction, so I did not use it. Modify some of the code to get what you want (I think just the last res3 step needs to change a little), or explain it here and maybe I will update the answer. If you have any other doubts... ask away.
It can also be achieved in a more concise way, but that would be harder to understand; that's why I have provided the more verbose, simpler code.
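For reference, one possible concise version chaining the same steps (a sketch, not run here; like the code above it ignores direction):
val diffs = data
  .flatMap( d => d.ids.map( id => ( id, d ) ) )   // one (id, Data) pair per id
  .groupByKey()                                    // gather all events for each id
  .mapValues( _.toList.sortBy( _.time ) )          // order the events by time
  .mapValues {
    case _ :: Nil      => None                     // a single event: nothing to diff
    case first :: rest => Some( rest.last.time - first.time )
    case Nil           => None
  }
diffs is again an RDD[(String, Option[Long])], the same shape as res6.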

Related

spark: join rdd based on sequence of another rdd

I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1 (outer join with the lookup table):
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all counts with the id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item,count)) }.map(row=>(row._1,row._2)).groupByKey()
get(line).map(l=>(l._1,l._2)).mapValues(item_count=>lookup_rdd.leftOuterJoin(item_count))
def get (line: RDD[(String, Iterable[(String, Int)])]) = { for{ (id, item_cnt) <- line i = item_cnt.map(tuple => (tuple._1,tuple._2)) } yield (id,i)
Try the below. Run each step in your local console to understand what's happening in detail.
The idea is to zipWithIndex and form a sequence based on lookup_rdd:
(i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Index of the final result wanted = [delta (length of the lookup_rdd seq) * index of id1..id2] + index of i1...i5
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2)),
and then, based on the key (i1,id1), reduce and calculate the count.
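For reference, the code below assumes arr and cart hold the question's sample and lookup data; a possible (hypothetical) definition would be:
val arr  = Seq( ("id1","item1",1), ("id1","item2",3), ("id1","item3",4), ("id2","item1",3), ("id2","item4",2) )
val cart = Seq( ("item1",0), ("item2",0), ("item3",0), ("item4",0), ("item5",0) )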
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map( x => {
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None     => (x._2._1._1, (x._1, x._2._1._2))
  }
})
val res97 = res91.sortByKey(true)
  .map( x => (x._2._1._2, List(x._2._2)) )
  .reduceByKey(_ ++ _)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))

filtering dataframe in scala

Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame.
id - Type - qt - P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried this way but it did not work. Code snippet:
case class MyClass(id: Int, `type`: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble))
.toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map {
  productId => df.filter({ item: MyClass => (item.id == productId) && (item.`type` == "X") })
}.toList
val yList: List[RDD[MyClass]] = productList.map {
  productId => df.filter({ item: MyClass => (item.id == productId) && (item.`type` == "Y") })
}.toList
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") //Your data here
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()
val productList: List[Int] = df3.map{row => row.getInt(0)}.distinct.collect.toList
println(productList)
productList.foreach(id => {
val sqlDF = df3.filter(df3("id") === id)
sqlDF.show()
})
sqlDF in the loop is the DataFrame with only the relevant data; you can then run your calculations on it.
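If you also need the Type restriction from the question, the same column filter can be extended inside the loop. A sketch (assuming the columns end up named id and type, as in MyClass):
productList.foreach(id => {
  val xDF = df3.filter(df3("id") === id && df3("type") === "X") // rows of type X for this id
  val yDF = df3.filter(df3("id") === id && df3("type") === "Y") // rows of type Y for this id
  xDF.show()
  yDF.show()
})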

How to Find Out Per key Maximum in scala collection

Let us consider I have a collection of employees as a List of tuples, where t._1 represents the department id, t._2 is the salary, and t._3 is the name of the employee:
val employees = List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following but getting an error:
val maxSal1 = emps.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
k._2.foldLeft(0, "dummy")((aa, bb) => {
if (aa._1 > bb._1) aa else bb
})
})
Don't overcomplicate things; avoid doing unnecessary operations and carrying redundant information around. Just be explicit and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))
scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res5: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy is partial: it throws on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
As a side note, I'd use a case class Employee with 3 fields rather than a tuple. I believe it's more readable.
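For illustration, a minimal sketch of that case class variant (the Employee name and field names are my own choice, not from the question):
case class Employee(dept: Int, salary: Int, name: String)

val employees = List(
  Employee(1, 8000, "Sally"), Employee(1, 9999, "Tom"),
  Employee(2, 5000, "Pam"), Employee(4, 500, "NK"), Employee(4, 999, "Robert"))

// per-department maximum by salary, reading just like the tuple version
employees.groupBy(_.dept).values.map(_.maxBy(_.salary))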
I tried this option and it seems to give the result:
val maxsal1 = emps1.map(t => (t._1, t._2, t._3)).groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))

How to implement a dynamic group by in scala?

Suppose I have a List[Map[String, String]] that represents a table in a database, and a List[String] that represents a list of column names. I'd like to implement the equivalent of a GROUP BY clause in an SQL query:
def fun(table: List[Map[String, String]], keys: List[String]): List[List[Map[String, String]]]
For example:
val table = List(
Map("name"->"jade", "job"->"driver", "sex"->"male"),
Map("name"->"mike", "job"->"police", "sex"->"female"),
Map("name"->"jane", "job"->"clerk", "sex"->"female"),
Map("name"->"smith", "job"->"driver", "sex"->"male")
)
val keys = List("job", "sex")
And then fun(table,keys) should be:
List(
List(
Map("name"->"jade", "job"->"driver", "sex"->"male"),
Map("name"->"smith", "job"->"driver", "sex"->"male")
),
List(Map("name"->"mike", "job"->"police", "sex"->"female")),
List(Map("name"->"jane", "job"->"clerk", "sex"->"female"))
)
You're looking for groupBy:
table.groupBy(row => keys.map(key => row(key))) map {
case (group, values) => values
}
Or more concisely:
table.groupBy(keys.map(_)).map(_._2)
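Wrapped into the fun signature from the question, that might look like this (a sketch; a Map[String, String] can be applied as a function from column name to value, and toList converts the grouped values to the requested return type):
def fun(table: List[Map[String, String]], keys: List[String]): List[List[Map[String, String]]] =
  table.groupBy(row => keys.map(row)).values.toList

fun(table, keys) then returns the three groups shown above (the order of the groups is not guaranteed).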

spark join operation based on two columns

I'm trying to join two datasets based on two columns. It works when I join on one column, but it fails with the below error when I use two:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code :
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect
val ranges = partitionSizes.foldLeft(List((0, 0))) { case(accList, count) =>
val start = accList.head._2
val end = start + count
(start, end) :: accList
}.reverse.tail.toArray
rdd.mapPartitionsWithIndex( (index, partition) => {
val start = ranges(index)._1
val end = ranges(index)._2
val indexes = Iterator.range(start, end)
partition.zip(indexes)
})
}
val dimension = sc.
textFile("dimension.txt").
map{ line =>
val parts = line.split("\t")
(parts(0),parts(1),parts(2),parts(3),parts(4),parts(5))
}
val dimensionWithSK =
zipWithIndex(dimension).map { case((nk1,nk2,prop3,prop4,prop5,prop6), idx) => (nk1,nk2,(prop3,prop4,prop5,prop6,idx + nextSurrogateKey)) }
val fact = sc.
textFile("fact.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
(parts(0),parts(1), (parts(2),parts(3), parts(4),parts(5),parts(6).toDouble))
}
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Could someone help me with this?
Thanks,
Sridhar
If you look at the signature of join, it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first 2 elements of the tuple, so you need to map your triple to a pair, where the first element of the pair is itself a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
  case (key1, key2, value) => ((key1, key2), value)
}.join(
  right.map {
    case (key1, key2, value) => ((key1, key2), value)
  })
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
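Applied to the question's fact and dimensionWithSK, that could look like this (a sketch, assuming the tuples as built above, where the surrogate key is the last element of the dimension value tuple):
val finalFact = fact
  .map { case (nk1, nk2, measures) => ((nk1, nk2), measures) }
  .join( dimensionWithSK.map { case (nk1, nk2, dims) => ((nk1, nk2), dims) } )
  .map { case (_, ((_, _, _, _, amount), (_, _, _, _, sk))) => (sk, amount) }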
Note that the three-argument join below, which takes a Seq of column names, is the DataFrame API rather than the RDD one, so this approach assumes the data is loaded as DataFrames.
rdd1 schema:
field1, field2, field3, fieldX, .....
rdd2 schema:
field1, field2, field3, fieldY, .....
val joinResult = rdd1.join(rdd2,
Seq("field1", "field2", "field3"), "outer")
joinResult schema :
field1, field2, field3, fieldX, fieldY, ......
val emp = sc.
textFile("emp.txt").
map { line =>
val parts = line.split("\t")
// key by the composite natural key (parts(0), parts(2))
// so that emp and emp_new can be joined on both columns
((parts(0), parts(2)),parts(1))
}
val emp_new = sc.
textFile("emp_new.txt").
map { line =>
val parts = line.split("\t")
// same composite key so the two datasets line up for the join
((parts(0), parts(2)),parts(1))
}
val finalemp =
emp_new.join(emp).
map { case ((nk1, nk2), (parts1, val1)) => (nk1, parts1, val1) }