Calculation on consecutive array elements - Scala

I have this:
val myInput: ArrayBuffer[(String, String)] = ArrayBuffer(
  (a, timestampAStr),
  (b, timestampBStr),
  ...
)
I would like to calculate the duration between each pair of consecutive timestamps from myInput and retrieve the results like the following:
val myOutput = ArrayBuffer(
  (a, durationFromTimestampAToTimestampB),
  (b, durationFromTimestampBToTimestampC),
  ...
)
This is a pairwise evaluation, which led me to think something with foldLeft() might do the trick, but after giving it a little more thought, I could not come up with a solution.
I have already put something together with some for loops and .indices, but it does not seem as clean and concise as it could be. I would appreciate it if somebody had a better option.

You can use zip and sliding to achieve what you want. For example, if you have a collection
scala> List(2,3,5,7,11)
res8: List[Int] = List(2, 3, 5, 7, 11)
The list of differences is res8.sliding(2).map { case List(fst, snd) => snd - fst }.toList, which you can zip with the original list. Since the differences list has one element fewer, zip simply drops the final element of the original.
scala> res8.zip(res8.sliding(2).map { case List(fst, snd) => snd - fst }.toList)
res13: List[(Int, Int)] = List((2,1), (3,2), (5,2), (7,4))
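The same idea adapts to the asker's (label, timestampString) pairs. A sketch, assuming the timestamp strings parse as Long epoch values (Scala 2.13 collection syntax):
import scala.collection.mutable.ArrayBuffer

val input = ArrayBuffer(("a", "1000"), ("b", "1500"), ("c", "2500"))

// pair each element with its successor and compute the duration;
// collect skips any window shorter than 2 (e.g. a single-element input)
val output = input
  .sliding(2)
  .collect { case Seq((k1, t1), (_, t2)) => (k1, t2.toLong - t1.toLong) }
  .to(ArrayBuffer)
// ArrayBuffer((a,500), (b,1000))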

You can zip your buffer with itself after dropping its first item, matching each element with its successor, and then map each pair to the calculated result:
val myInput: ArrayBuffer[(String, String)] = ArrayBuffer(
  ("a", "1000"),
  ("b", "1500"),
  ("c", "2500")
)
val result: ArrayBuffer[(String, Int)] = myInput.zip(myInput.drop(1)).map {
  case ((k1, v1), (k2, v2)) => (k1, v2.toInt - v1.toInt)
}
result.foreach(println)
// prints:
// (a,500)
// (b,1000)
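For completeness, the foldLeft the question reaches for can also express this. A sketch, again assuming Long-parsable timestamp strings, threading the previous element through the accumulator:
import scala.collection.mutable.ArrayBuffer

val myOutput = myInput
  .foldLeft((ArrayBuffer.empty[(String, Long)], Option.empty[(String, String)])) {
    // first element: nothing to diff against yet, just remember it
    case ((acc, None), cur) => (acc, Some(cur))
    // later elements: emit (previousKey, duration) and remember the current one
    case ((acc, Some((prevK, prevT))), (k, t)) =>
      (acc += ((prevK, t.toLong - prevT.toLong)), Some((k, t)))
  }._1
// ArrayBuffer((a,500), (b,1000))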

Related

How to get the indices of duplicate pairs in a Scala list

I have a Scala list like the one below:
slist = List("a","b","c","a","d","c","a")
I want to get the index pairs of equal elements in this list.
For example, the result for this slist is
(0,3),(0,6),(3,6),(2,5)
where (0,3) means slist(0)==slist(3),
(0,6) means slist(0)==slist(6),
and so on.
Is there a method to do this in Scala?
Thanks very much.
There are simpler approaches, but starting with zipWithIndex leads down this path. zipWithIndex pairs each letter with its index in a Tuple2. From there we groupBy the letter to get a map from each letter to its (letter, index) pairs, and filter to the letters that occur more than once. That leaves us with this MapLike.DefaultValuesIterable(List((a,0), (a,3), (a,6)), List((c,2), (c,5))),
from which we take the indices and make combinations.
scala> slist.zipWithIndex.groupBy(zipped => zipped._1).filter(t => t._2.size > 1).values.flatMap(xs => xs.map(t => t._2).combinations(2))
res40: Iterable[List[Int]] = List(List(0, 3), List(0, 6), List(3, 6), List(2, 5))
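If tuple pairs are preferred over two-element lists (to match the question's expected output), a final collect can convert them; a small sketch (traversal order of the groups may vary):
slist.zipWithIndex
  .groupBy(_._1)
  .filter(_._2.size > 1)
  .values
  .flatMap(_.map(_._2).combinations(2))
  .collect { case List(i, j) => (i, j) }
// e.g. List((0,3), (0,6), (3,6), (2,5))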
Indexing a List is rather inefficient, so I recommend converting to a Vector first (and back again if needed). Note that indexOf only pairs each element with its next occurrence, so this misses (0,6) from the expected output.
val svec = slist.toVector
svec.indices
  .map(x => (x, svec.indexOf(svec(x), x + 1)))
  .filter(_._2 > 0)
  .toList
//res0: List[(Int, Int)] = List((0,3), (2,5), (3,6))
val v = slist.toVector; val s = v.size
for (i <- 0 to s - 1; j <- 0 to s - 1; if i < j && v(i) == v(j)) yield (i, j)
In the Scala REPL:
scala> for (i <- 0 to s - 1; j <- 0 to s - 1; if i < j && v(i) == v(j)) yield (i, j)
res34: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,3), (0,6), (2,5), (3,6))

Sum of values based on key in Scala

I am new to Scala. I have a list of integer tuples:
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How do I handle N values? What I tried above is not a good way to sum N values based on the key.
I have also tried things like
val sum = list.groupBy(_._1).values.sum // error
val sum = list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) // error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyway. As a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList)                         // convert tuples to lists
  .groupBy(_.head)                         // group by the first element (the key)
  .mapValues(_.map(_.tail).map(_.sum).sum) // sum the elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 7, 1 -> 10)
Since a tuple can't be converted to a List in a type-safe way, you need to add the fields one by one, as in j._2 + j._3.
Using the first element of the tuple as the key and the remaining elements as the values, you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
I know it's a bit dirty to call asInstanceOf[Int], but productIterator gives you an Iterator[Any].
This will work for any tuple size.

SubtractByKey and keep rejected values

I was playing around with Spark and I am stuck on something that seems foolish.
Let's say we have two RDDs:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
If I do rdd1.subtractByKey(rdd2), I will get {(1, 2)}, which is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD. Is there a prebuilt function in Spark or an elegant way to do this?
Please keep in mind that I am new to Spark; any help will be appreciated, thanks.
As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want can be expressed as union minus intersection.
Here is how you can do this in Spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect
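The same filtering idea recovers the rejected pairs by keeping the keys that are in the intersection instead; a sketch:
val common = intersection.collect()                  // Array(3)
val rejected = r1.filter(x => common.contains(x._1))
rejected.collect()                                   // Array((3,4), (3,6))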
I think I can claim this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x=>(x._1, x._2._1))
val discarded = r3.filter(_._2._2.nonEmpty).map(x=>(x._1, x._2._1))
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is noticing that leftOuterJoin produces both the discarded records (those with a matching key in r2) and the remaining ones (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala collections sense of splitting a collection in two based on a predicate), or we could calculate subtracted and discarded in one pass, as in the collections example below.
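For reference, this is the partition meant here, on an ordinary Scala collection; a minimal example:
// splits one collection into two in a single pass, by a predicate
val (evens, odds) = List(1, 2, 3, 4, 5).partition(_ % 2 == 0)
// evens: List(2, 4), odds: List(1, 3, 5)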
You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
This doesn't give you the rejected values as a by-product, though; it just runs another subtraction.
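Applied to the example RDDs, a sketch of this two-subtraction approach:
val rdd3 = rdd1.subtractByKey(rdd2) // collect(): Array((1,2))
val rdd4 = rdd1.subtractByKey(rdd3) // collect(): Array((3,4), (3,6)) -- the rejects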
Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). One way to get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)
// re-pair a key with each of its values from one side of the cogroup
def flatFunc[A, B](key: A, values: Iterable[B]): Iterable[(A, B)] = for { value <- values } yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }  // keys absent from rdd2
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) } // keys present in rdd2
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))
First use subtractByKey() and then subtract:
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)

Combine two lists with the same keys

Here's a fairly simple request: combine two lists as follows:
scala> list1
res17: List[(Int, Double)] = List((1,0.1), (2,0.2), (3,0.3), (4,0.4))
scala> list2
res18: List[(Int, String)] = List((1,aaa), (2,bbb), (3,ccc), (4,ddd))
The desired output is as:
((aaa,0.1),(bbb,0.2),(ccc,0.3),(ddd,0.4))
I tried:
scala> (list1 ++ list2)
res23: List[(Int, Any)] = List((1,0.1), (2,0.2), (3,0.3), (4,0.4),
(1,aaa), (2,bbb), (3,ccc), (4,ddd))
But:
scala> (list1 ++ list2).groupByKey
<console>:10: error: value groupByKey is not a member of List[(Int, Any)]
              (list1 ++ list2).groupByKey
Any hints? Thanks!
The method you're looking for is groupBy:
(list1 ++ list2).groupBy(_._1)
If you know that for each key you have exactly two values, you can join them:
scala> val pairs = List((1, "a1"), (2, "b1"), (1, "a2"), (2, "b2"))
pairs: List[(Int, String)] = List((1,a1), (2,b1), (1,a2), (2,b2))
scala> pairs.groupBy(_._1).values.map {
| case List((_, v1), (_, v2)) => (v1, v2)
| }
res0: Iterable[(String, String)] = List((b1,b2), (a1,a2))
Another approach using zip is possible if the two lists contain the same keys in the same order:
scala> val l1 = List((1, "a1"), (2, "b1"))
l1: List[(Int, String)] = List((1,a1), (2,b1))
scala> val l2 = List((1, "a2"), (2, "b2"))
l2: List[(Int, String)] = List((1,a2), (2,b2))
scala> l1.zip(l2).map { case ((_, v1), (_, v2)) => (v1, v2) }
res1: List[(String, String)] = List((a1,a2), (b1,b2))
Here's a quick one-liner:
scala> list2.map(_._2) zip list1.map(_._2)
res0: List[(String, Double)] = List((aaa,0.1), (bbb,0.2), (ccc,0.3), (ddd,0.4))
If you are unsure why this works then read on! I'll expand it step by step:
list2.map(<function>)
The map method iterates over each value in list2 and applies your function to it. In this case each of the values in list2 is a Tuple2 (a tuple with two values). What you want is to access the second tuple value. To access the first tuple value use the ._1 method, and to access the second tuple value use the ._2 method. Here is an example:
val myTuple = (1.0, "hello") // A Tuple2
println(myTuple._1) // prints "1.0"
println(myTuple._2) // prints "hello"
So what we want is a function literal that takes one parameter (the current value in the list) and returns the second tuple value (._2). We could have written the function literal like this:
list2.map(item => item._2)
We don't need to specify a type for item because the compiler is smart enough to infer it thanks to target typing. A really helpful shortcut is that we can just leave out item altogether and replace it with a single underscore _. So it gets simplified (or cryptified, depending on how you view it) to:
list2.map(_._2)
The other interesting part of this one-liner is the zip method. zip takes two lists and combines them into one list of pairs, just like the zipper on your favorite hoodie does! If the lists have different lengths, the result is truncated to the shorter one.
val a = List("a", "b", "c")
val b = List(1, 2, 3)
a zip b // returns List((a,1), (b,2), (c,3))
Cheers!

What is the inverse of intercalate, and how to implement it?

This question discusses how to interleave two lists in an alternating fashion, i.e. intercalate them.
What is the inverse of "intercalate" called?
Is there an idiomatic way to implement this in Scala?
The topic is discussed on this Haskell IRC session.
Possibilities include "deintercalate", "extracalate", "ubercalate", "outercalate", and "chocolate" ;-)
Assuming we go for "extracalate", it can be implemented as a fold:
def extracalate[A](a: List[A]) =
  a.foldRight((List[A](), List[A]())) { case (b, (a1, a2)) => (b :: a2, a1) }
For example:
val mary = List("Mary", "had", "a", "little", "lamb")
extracalate(mary)
//> (List(Mary, a, lamb),List(had, little))
Note that the original lists can only be reconstructed if either:
the input lists were the same length, or
the first list was 1 longer than the second list
The second case actually turns out to be useful for the geohashing algorithm, where the latitude bits and longitude bits are intercalated, but there may be an odd number of bits.
Note also that the definition of intercalate in the linked question differs from the definition in the Haskell libraries, which intersperses a list between the elements of a list of lists!
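For reference, a sketch of the intercalate being inverted here (the alternating merge from the linked question):
// alternately take elements from a and b, starting with a
def intercalate[A](a: List[A], b: List[A]): List[A] = a match {
  case Nil     => b
  case x :: xs => x :: intercalate(b, xs)
}

intercalate(List("Mary", "a", "lamb"), List("had", "little"))
// List(Mary, had, a, little, lamb)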
Update: As for any fold, we supply a starting value and a function to apply to each value of the input list. This function modifies the starting value and passes it to the next step of the fold.
Here, we start with a pair of empty output lists: (List[A](), List[A]())
Then for each element in the input list, we add it onto the front of one of the output lists using cons ::. However, we also swap the order of the two output lists each time the function is invoked; (a1, a2) becomes (b :: a2, a1). This divides the input list between the two output lists in alternating fashion. Because it's a right fold, we start at the end of the input list, which is necessary to get each output list in the correct order. Proceeding from the starting value to the final value, we would get:
([], [])
([lamb], [])
([little],[lamb])
([a, lamb],[little])
([had, little],[a, lamb])
([Mary, a, lamb],[had, little])
Also, using standard methods
val mary = List("Mary", "had", "a", "little", "lamb")
//> mary : List[String] = List(Mary, had, a, little, lamb)
val (f, s) = mary.zipWithIndex.partition(_._2 % 2 == 0)
//> f : List[(String, Int)] = List((Mary,0), (a,2), (lamb,4))
//| s : List[(String, Int)] = List((had,1), (little,3))
(f.unzip._1, s.unzip._1)
//> res0: (List[String], List[String]) = (List(Mary, a, lamb),List(had, little))
Not really recommending it, though; the fold will beat it hands down on performance.
Skinning the cat another way
val g = mary.zipWithIndex.groupBy(_._2 % 2)
//> g : scala.collection.immutable.Map[Int,List[(String, Int)]] = Map(1 -> List
//| ((had,1), (little,3)), 0 -> List((Mary,0), (a,2), (lamb,4)))
(g(0).unzip._1, g(1).unzip._1)
//> res1: (List[String], List[String]) = (List(Mary, a, lamb),List(had, little))
Also going to be slow
I think it's inferior to @DNA's answer, as it's more code and it requires passing through the list twice.
scala> list
res27: List[Int] = List(1, 2, 3, 4, 5)
scala> val first = list.zipWithIndex.filter(x => x._2 % 2 == 0).map(x => x._1)
first: List[Int] = List(1, 3, 5)
scala> val second = list.zipWithIndex.filter(x => x._2 % 2 == 1).map(x => x._1)
second: List[Int] = List(2, 4)
scala> (first, second)
res28: (List[Int], List[Int]) = (List(1, 3, 5),List(2, 4))