I want to calculate the time difference between in and out for each id. The data is in the format:
String,Long,String,List[String]
======================================
in, time0, door1, [id1, id2, id3, id4]
out, time1, door1, [id1, id2, id3]
out, time2, door1, [id4, id5]
In the end, it should produce key-value pairs like:
{(id1, #time1-time0), (id2, #time1-time0), (id3, #time1-time0), (id4, #time2-time0), (id5, N/A)}
What would be a good approach for solving this problem?
EDIT: I have tried the following.
case class Data(direction: String, time:Long, door:String, ids: List[String])
val data = sc.parallelize(Seq(Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),Data("out", 20, "d1", List("id1", "id2", "id3")), Data("out",50, "d1", List("id4", "id5"))))
data.flatMap(x => (x.ids, x))
scala> case class Data( direction: String, time: Long, door: String, ids: List[ String ] )
defined class Data
scala> val data = sc.parallelize( Seq( Data( "in", 5, "d1", List( "id1", "id2", "id3", "id4" ) ), Data( "out", 20, "d1", List( "id1", "id2", "id3" ) ), Data( "out",50, "d1", List( "id4", "id5" ) ) ) )
data: org.apache.spark.rdd.RDD[Data] = ParallelCollectionRDD[0] at parallelize at <console>:14
// Get an RDD entry for each ( id, data ) pair
scala> data.flatMap( x => x.ids.map( id => ( id, x ) ) )
res0: org.apache.spark.rdd.RDD[(String, Data)] = FlatMappedRDD[1] at flatMap at <console>:17
// group by id to collect the entries that share an id
scala> res0.groupBy( { case ( id, data ) => id } )
res1: org.apache.spark.rdd.RDD[(String, Iterable[(String, Data)])] = ShuffledRDD[3] at groupBy at <console>:19
// convert Iterable[(String, Data)] to List[Data]
scala> res1.map( { case ( id, iter ) => ( id, iter.toList.map( { case ( i, d ) => d } ) ) } )
res2: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[4] at map at <console>:21
// sort each id's list of Data by time
res2.map( { case ( id, list ) => ( id, list.sortBy( d => d.time ) ) } )
res3: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[5] at map at <console>:23
// get the time diff by doing lastData.time - firstData.time for each id
scala> :paste
// Entering paste mode (ctrl-D to finish)
res3.map( { case ( id, list ) =>
  list match {
    case d :: Nil  => ( id, None )
    case d :: tail => ( id, Some( list.last.time - d.time ) )
    case _         => ( id, None )
  }
} )
// Exiting paste mode, now interpreting.
res6: org.apache.spark.rdd.RDD[(String, Option[Long])] = MappedRDD[7] at map at <console>:25
Now, res6 has your desired data.
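For the sample data, collecting it should give something like this (exact ordering may vary):
res6.collect
// e.g. Array((id1,Some(15)), (id2,Some(15)), (id3,Some(15)), (id4,Some(45)), (id5,None))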
Also... I was not sure how you wanted to use direction, so I did not use it. Modify some of the code to get what you want (I think only the last res3 step needs to change a little), or explain it here and maybe I will update the answer. If you have any other doubts... ask away.
It can also be achieved in a more concise way, but that would be harder to understand, which is why I have provided more verbose and simpler code.
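For reference, here is a sketch of one such concise variant (my own take; it keeps the per-id minimum and maximum times in a single reduceByKey and assumes the same data RDD as above):
data.flatMap( d => d.ids.map( id => ( id, ( d.time, d.time ) ) ) )
  .reduceByKey { case ( ( min1, max1 ), ( min2, max2 ) ) => ( math.min( min1, min2 ), math.max( max1, max2 ) ) }
  .mapValues { case ( first, last ) => if ( last > first ) Some( last - first ) else None }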
I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with three columns: id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1 (outer join with the lookup table):
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all counts, keyed by the id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item,count)) }.map(row=>(row._1,row._2)).groupByKey()
get(line).map(l=>(l._1,l._2)).mapValues(item_count=>lookup_rdd.leftOuterJoin(item_count))
def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}
Try the steps below. Run each step in your local console to understand what's happening in detail.
The idea is to zipWithIndex and form a sequence based on lookup_rdd.
(i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Index wanted in the final result = [delta (the length of the lookup_rdd seq) * index of id1..id2] + index of i1...i5
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2))
and then reduce on the key (i1, id1) and calculate the count.
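For reference, the sample inputs below could look like this (arr and cart are just placeholder names for the question's sample_rdd and lookup_rdd data):
// hypothetical sample data matching the question
val arr = Seq(("id1","item1",1), ("id1","item2",3), ("id1","item3",4), ("id2","item1",3), ("id2","item4",2))
val cart = Seq(("item1",0), ("item2",0), ("item3",0), ("item4",0), ("item5",0))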
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map( x => {
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None     => (x._2._1._1, (x._1, x._2._1._2))
  }
})
val res97 = res91.sortByKey(true).map( x => (x._2._1._2, List(x._2._2)) ).reduceByKey(_ ++ _)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame:
id - Type - qt - P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried it this way but it did not work. Code snapshot:
case class MyClass(id: Int, type: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)
.toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "X" })}.toList
val yList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "Y" })}.toList
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") //Your data here
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()
val productList: List[Int] = df3.map{row => row.getInt(0)}.distinct.collect.toList
println(productList)
productList.foreach(id => {
val sqlDF = df3.filter(df3("id") === id)
sqlDF.show()
})
sqlDF inside the loop is the DataFrame with only the relevant data; you can then run your calculations on it.
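For example, a simple calculation inside the loop could aggregate the quantities of the current id (the sum is just an illustration of mine, assuming the qt column from the case class):
import org.apache.spark.sql.functions.sum
// hypothetical aggregation: total quantity for the current id
sqlDF.agg(sum("qt")).show()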
Let us consider I have a collection of employees as a List of tuples, where t._1 is the department id, t._2 is the salary and t._3 is the name of the employee:
val employees = List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following, but getting an error:
val maxSal1 = employees.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
k._2.foldLeft(0, "dummy")((aa, bb) => {
if (aa._1 > bb._1) aa else bb
})
})
Don't overcomplicate things; avoid unnecessary operations and carrying redundant information around. Just be explicit, and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))
scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res5: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy is partial: it fails on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
As a side note, I'd use case class Employee with 3 fields rather than a tuple. I believe it's more readable.
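A rough sketch of what that could look like (the Employee field names are my own choice):
case class Employee(dept: Int, salary: Int, name: String)
val employeeRecords = employees.map { case (dept, salary, name) => Employee(dept, salary, name) }
// one Employee with the highest salary per department
employeeRecords.groupBy(_.dept).values.map(_.maxBy(_.salary))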
I tried this option and it seems to give the result:
val maxsal1 = employees.map(t => (t._1, t._2, t._3)).groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
Suppose I have a List[Map[String, String]] that represents a table in a database, and a List[String] that represents a list of column names. I'd like to implement the equivalent of a GROUP BY clause in an SQL query:
def fun(table: List[Map[String, String]], keys: List[String]): List[List[Map[String, String]]]
For example:
val table = List(
Map("name"->"jade", "job"->"driver", "sex"->"male"),
Map("name"->"mike", "job"->"police", "sex"->"female"),
Map("name"->"jane", "job"->"clerk", "sex"->"female"),
Map("name"->"smith", "job"->"driver", "sex"->"male")
)
val keys = List("job", "sex")
And then fun(table,keys) should be:
List(
List(
Map("name"->"jade", "job"->"driver", "sex"->"male"),
Map("name"->"smith", "job"->"driver", "sex"->"male")
),
List(Map("name"->"mike", "job"->"police", "sex"->"female")),
List(Map("name"->"jane", "job"->"clerk", "sex"->"female"))
)
You're looking for groupBy:
table.groupBy(row => keys.map(key => row(key))) map {
case (group, values) => values
}
Or more concisely:
table.groupBy(keys.map(_)).map(_._2)
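Wrapped into the signature from the question, it could look roughly like this:
def fun(table: List[Map[String, String]], keys: List[String]): List[List[Map[String, String]]] =
  table.groupBy(row => keys.map(key => row(key))).values.toList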
I'm trying to join two datasets based on two columns. It works when I join on one column, but fails with the error below:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code :
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect
val ranges = partitionSizes.foldLeft(List((0, 0))) { case(accList, count) =>
val start = accList.head._2
val end = start + count
(start, end) :: accList
}.reverse.tail.toArray
rdd.mapPartitionsWithIndex( (index, partition) => {
val start = ranges(index)._1
val end = ranges(index)._2
val indexes = Iterator.range(start, end)
partition.zip(indexes)
})
}
val dimension = sc.
textFile("dimension.txt").
map{ line =>
val parts = line.split("\t")
(parts(0),parts(1),parts(2),parts(3),parts(4),parts(5))
}
val dimensionWithSK =
zipWithIndex(dimension).map { case((nk1,nk2,prop3,prop4,prop5,prop6), idx) => (nk1,nk2,(prop3,prop4,prop5,prop6,idx + nextSurrogateKey)) }
val fact = sc.
textFile("fact.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
(parts(0),parts(1), (parts(2),parts(3), parts(4),parts(5),parts(6).toDouble))
}
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Could someone help here?
Thanks
Sridhar
If you look at the signature of join it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first two elements of the tuple, so you need to map your triple to a pair, where the first element of the pair is a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
case (key1, key2, value) => ((key1, key2), value)
}
.join(
right.map {
case (key1, key2, value) => ((key1, key2), value)
})
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
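Applied to the fact and dimensionWithSK RDDs from your code, that could look roughly like this (a sketch reusing your field names; the intermediate names are mine, and I'm assuming the surrogate key is the last element of the dimension value, as built above):
val factByKey = fact.map { case (nk1, nk2, facts) => ((nk1, nk2), facts) }
val dimByKey  = dimensionWithSK.map { case (nk1, nk2, dims) => ((nk1, nk2), dims) }
val finalFact = factByKey.join(dimByKey).map {
  case ((nk1, nk2), ((parts1, parts2, parts3, parts4, amount), (prop3, prop4, prop5, prop6, sk))) => (sk, amount)
}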
rdd1 Schema :
field1,field2, field3, fieldX,.....
rdd2 Schema :
field1, field2, field3, fieldY,.....
val joinResult = rdd1.join(rdd2,
Seq("field1", "field2", "field3"), "outer")
joinResult schema :
field1, field2, field3, fieldX, fieldY, ......
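Note that join with a Seq of column names and a join type is the DataFrame API (Spark 1.6+), not the RDD one, so rdd1 and rdd2 have to be converted to DataFrames first. A minimal sketch, assuming the rows are tuples with those four fields and a SQLContext's implicits are in scope:
import sqlContext.implicits._
val df1 = rdd1.toDF("field1", "field2", "field3", "fieldX")
val df2 = rdd2.toDF("field1", "field2", "field3", "fieldY")
// outer join on the three shared key columns
val joinResult = df1.join(df2, Seq("field1", "field2", "field3"), "outer")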
val emp = sc.
textFile("emp.txt").
map { line =>
val parts = line.split("\t")
// key on the composite natural key (parts(0), parts(2)) so that
// the two datasets can be joined on it
((parts(0), parts(2)),parts(1))
}
val emp_new = sc.
textFile("emp_new.txt").
map { line =>
val parts = line.split("\t")
// key on the composite natural key (parts(0), parts(2)) so that
// the two datasets can be joined on it
((parts(0), parts(2)),parts(1))
}
val finalemp =
  emp_new.join(emp).
    map { case ((nk1, nk2), (parts1, val1)) => (nk1, parts1, val1) }