Calculate sliding durations in Scala

I have a list of tuples and a datum, as shown below:
val datum =("R",89)
val dataList = Seq(("R",91),("R",95),("X",96),("S",98))
I want to calculate the duration between elements in the list, starting with the datum, so the result would be:
res0:> Seq(("R",7) , ("X",2)) //R - 96-89 , X - 98-96
The approaches I have tried so far are not functional:
a) I used sliding on the list with a pattern match and an accumulator to hold the values. This relied on a Boolean flag and a ListBuffer to keep appending values to the list.
b) I used a map function with an accumulator tuple and a pattern match on the tuple: compare the _1 values and, when they change, reset the accumulator and collect the result of the subtraction.
I was wondering whether foldLeft or fold could be used to make this more "functional". In that case we would have .foldLeft(List()) as the initial value and then write a function that takes two tuples and compares them manually, possibly with a flag as well.
Any pointers on how this can be made "functional"?

Here's what you can do.
First, create a function that takes the datum, dataList and an empty list (which will become the final list), and recurse through the list applying your logic.
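// x: current reference tuple, y: remaining elements, z: durations collected so far
def func(x: (String, Int), y: Seq[(String, Int)], z: List[(String, Int)]): List[(String, Int)] = y match {
  // Same key as the reference: skip it. Different key: record the duration and make `a` the new reference.
  // `+:` matches any Seq, whereas `::` would only match a List.
  case a +: b => if (x._1 == a._1) func(x, b, z) else func(a, b, z :+ ((x._1, a._2 - x._2)))
  case _ => z // nothing left to process
}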
That's all. Now just call the function:
val finalTuples = func(datum, dataList, List.empty[Tuple2[String, Int]])
The result is finalTuples: List[(String, Int)] = List((R,7), (X,2))
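Since the question asks about foldLeft specifically, the same logic can also be sketched as a fold. This is only one possible formulation, assuming the datum and dataList from the question; the names ref, acc and durations are illustrative. The accumulator pairs the current reference tuple with the durations collected so far, so no flag or mutable buffer is needed.

```scala
val datum = ("R", 89)
val dataList = Seq(("R", 91), ("R", 95), ("X", 96), ("S", 98))

// Accumulator: (current reference tuple, durations collected so far)
val (_, durations) =
  dataList.foldLeft((datum, Vector.empty[(String, Int)])) {
    // Same key as the reference: keep the reference, emit nothing
    case ((ref, acc), next) if ref._1 == next._1 => (ref, acc)
    // Key changed: emit the elapsed duration and switch the reference
    case ((ref, acc), next) => (next, acc :+ ((ref._1, next._2 - ref._2)))
  }

println(durations) // Vector((R,7), (X,2))
```

A Vector is used for the accumulator because appending to it is cheaper than appending to a List.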

Related

Scala: checking whether every element of a vector is visited when joining two vectors

I have 2 vectors as below.
val vecBase21=....sortBy(r=>(r._1,r._2))
vecBase21: scala.collection.immutable.Vector[(String, String, Double)] = Vector((036,20210624 0400,2.0), (036,20210624 0405,2.0), (036,20210624 0410,2.0), (036,20210624 0415,2.0), (036,20210624 0420,2.0),...)
val vecBase22=....sortBy(r=>(r._1,r._2))
vecBase22: scala.collection.immutable.Vector[(String, String, Double)] = Vector((036,20210625 0400,2.0), (036,20210625 0405,2.0), (036,20210625 0410,2.0), (036,20210625 0415,2.0), (036,20210625 0420,2.0),...)
Here x._1 is the ID, x._2 is the date-time, and x._3 is the value. Then I did this to create a third vector:
val vecBase30=vecBase21.map(x=>vecBase22.filter(y=>x._1==y._1 && x._2==y._2).map(y=>(x._1,x._2,x._3,y._3))).flatten
This is essentially a SQL join: a join b on a.id=b.id and a.date_time=b.date_time. For each combination of ID and date_time in vecBase21, it searches vecBase22. As each combination is unique within a vector and both vectors are sorted, I want to find out whether the loop over vecBase22 stops once it finds a match or continues to the end of vecBase22 anyway. I tried this:
val vecBase30=vecBase21.map(x=>vecBase22.filter(y=>x._1==y._1 && x._2==y._2).map{y=>
println("x1="+x._1+" y1="+y._1+" x2="+x._2+" y2="+y._2)
(x._1,x._2,x._3,y._3)}).flatten
But it apparently prints only the matched results. Is there a way to print every combination from the two vectors that is actually evaluated when checking for a match?
Quoting the question: "As each combination is unique in one vector and they are sorted, I want to find out whether the loop in vecBase22 stops once it finds a match or it loops till the end of vecBase22 anyway."
When you call filter on vecBase22 you loop through every element of that collection to see if it matches the predicate. This returns a new collection and passes it to the map function. If you want to short-circuit the filtering process you could consider using the method collectFirst (Scala 2.12):
def collectFirst[B](pf: PartialFunction[A, B]): Option[B]
Finds the first element of the traversable or iterator for which the given partial function is defined, and applies the partial function to it.
Note: may not terminate for infinite-sized collections.
Note: might return different results for different runs, unless the underlying collection type is ordered.
pf: the partial function
returns: an option value containing pf applied to the first value for which it is defined, or None if none exists.
Example:
Seq("a", 1, 5L).collectFirst({ case x: Int => x*10 }) = Some(10)
So you could do something like:
val vecBase30: Vector[(String, String, Double, Double)] = vecBase21
  .flatMap(x => vecBase22.collectFirst {
    // Destructure the tuple rather than matching on its (erased) type
    case (id, time, value) if x._1 == id && x._2 == time => (x._1, x._2, x._3, value)
  })
First off: yes, it loops through all items of vecBase22 for each item of vecBase21. That's what map and filter do.
If the println doesn't work, it is probably because you are executing your code in an interpreter that loses standard output. A notebook, maybe?
Also, if you want it to stop once it finds a match, use Seq.find.
Finally, you can improve readability. Here are a couple of ideas:
use a case class instead of a tuple
add spaces around operators
add a new line before each chained operation if it doesn't fit on one line
use flatMap instead of map followed by flatten
add type annotations to vals (not necessary, but it helps when reading the code)
That gives:
case class Item(id: String, time: String, value: Double)
case class Joint(id: String, time: String, v1: Double, v2: Double)
val vecBase21: Vector[Item] = ....sortBy(item => (item.id, item.time))
val vecBase22: Vector[Item] = ....sortBy(item => (item.id, item.time))
val vecBase30: Vector[Joint] = vecBase21.flatMap( x =>
vecBase22
.filter( y => x.id == y.id && x.time == y.time)
.map( y => Joint(x.id, x.time, x.value, y.value))
)
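To observe the difference between filter and find directly, one option is to count how many times the predicate runs. The sketch below uses small made-up sample vectors (a and b are stand-ins for vecBase21 and vecBase22, and the counters are purely for illustration).

```scala
// Hypothetical sample data standing in for vecBase21 and vecBase22
val a = Vector(("036", "20210624 0400", 2.0), ("036", "20210624 0405", 3.0))
val b = Vector(
  ("036", "20210624 0400", 5.0),
  ("036", "20210624 0405", 7.0),
  ("036", "20210624 0410", 9.0)
)

// filter: the predicate runs for every element of b, for every element of a
var filterChecks = 0
val viaFilter = a.flatMap { x =>
  b.filter { y => filterChecks += 1; x._1 == y._1 && x._2 == y._2 }
    .map(y => (x._1, x._2, x._3, y._3))
}

// find: the predicate stops running at the first match
var findChecks = 0
val viaFind = a.flatMap { x =>
  b.find { y => findChecks += 1; x._1 == y._1 && x._2 == y._2 }
    .map(y => (x._1, x._2, x._3, y._3))
}

println(s"filter evaluated $filterChecks pairs, find evaluated $findChecks pairs")
// filter evaluated 6 pairs, find evaluated 3 pairs
```

Both variants produce the same joined rows here because each (id, time) combination is unique in b. Putting a println inside the predicate, instead of inside the map that follows it, would likewise print every evaluated combination.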

Filtering RDDs based on value of Key

I have two RDDs that wrap the following arrays:
Array((3,Ken), (5,Jonny), (4,Adam), (3,Ben), (6,Rhonda), (5,Johny))
Array((4,Rudy), (7,Micheal), (5,Peter), (5,Shawn), (5,Aaron), (7,Gilbert))
I need to write code such that, if I provide 3 as input, it returns
Array((3,Ken), (3,Ben))
If input is 6, output should be
Array((6,Rhonda))
I tried something like this:
val list3 = list1.union(list2)
list3.reduceByKey(_+_).collect
list3.reduceByKey(6).collect
None of these worked. Can anyone help me out with a solution to this problem?
Given the following, which you would have to define yourself:
// Provide your SparkContext and inputs here
val sc: SparkContext = ???
val array1: Array[(Int, String)] = ???
val array2: Array[(Int, String)] = ???
val n: Int = ???
val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)
You can use union and filter to reach your goal:
rdd1.union(rdd2).filter(_._1 == n)
Since filtering by key is something that you would probably want to do on several occasions, it makes sense to encapsulate this functionality in its own function.
It would also be interesting if we could make sure that this function could work on any type of keys, not only Ints.
You can express this in the old RDD API as follows:
import org.apache.spark.rdd.RDD

def filterByKey[K, V](rdd: RDD[(K, V)], k: K): RDD[(K, V)] =
  rdd.filter(_._1 == k)
You can use it as follows:
val rdd = rdd1.union(rdd2)
val filtered = filterByKey(rdd, n)
Let's look at this method a little bit more in detail.
This method filters by key an RDD that contains generic pairs, where the type of the first item is K and the type of the second item is V (for key and value). It also accepts a key of type K that will be used to filter the RDD.
You then use the filter function, which takes a predicate (a function from the element type, in this case the pair (K, V), to a Boolean) and makes sure that the resulting RDD only contains items that satisfy this predicate.
We could also have written the body of the function as:
rdd.filter(pair => pair._1 == k)
or
rdd.filter { case (key, value) => key == k }
but we took advantage of the _ wildcard to express the fact that we want to act on the first (and only) parameter of this anonymous function.
To use it, you first parallelize your arrays into RDDs, call union on them and then invoke the filterByKey function with the number you want to filter by (as shown in the example).
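For trying the logic out without a Spark cluster, the same generic filter can be sketched on plain Scala collections. This is an analogue for illustration only, not Spark code: filterSeqByKey is a hypothetical helper name, Seq stands in for RDD, and ++ stands in for union.

```scala
// Hypothetical collection-based analogue of filterByKey (Seq instead of RDD)
def filterSeqByKey[K, V](pairs: Seq[(K, V)], k: K): Seq[(K, V)] =
  pairs.filter { case (key, _) => key == k }

// The two arrays from the question
val list1 = Seq((3, "Ken"), (5, "Jonny"), (4, "Adam"), (3, "Ben"), (6, "Rhonda"), (5, "Johny"))
val list2 = Seq((4, "Rudy"), (7, "Micheal"), (5, "Peter"), (5, "Shawn"), (5, "Aaron"), (7, "Gilbert"))

val combined = list1 ++ list2 // plays the role of rdd1.union(rdd2)
val threes = filterSeqByKey(combined, 3)
val sixes = filterSeqByKey(combined, 6)

println(threes) // List((3,Ken), (3,Ben))
println(sixes)  // List((6,Rhonda))
```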

Vector product using map and reduce in scala

I'm trying to calculate the vector product of two vectors using the map and reduce functions.
Let's see what happens in the REPL of Scala:
First of all I define 2 vectors with same length
scala> val v1 = Array(1,4,5,2)
v1: Array[Int] = Array(1, 4, 5, 2)
scala> val v2 = Array (3,5,1,5)
v2: Array[Int] = Array(3, 5, 1, 5)
Now I create a new array vecZip using the zip function
scala> val vecZip = v1 zip v2
vecZip: Array[(Int, Int)] = Array((1,3), (4,5), (5,1), (2,5))
Now I'd like to apply the reduce method (to compute the product of each tuple) to each element of this array.
I thought this:
val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
I want to get a list (vecToSum) to which I can then apply the reduce method to calculate the total result. However, I get this error:
scala> val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
<console>:10: error: value * is not a member of (Int, Int)
val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
^
You just need to call map and multiply the tuples values with each other, like this:
val vecToSum = vecZip.map(x => x._1 * x._2)
vecZip is an Array of tuples, so x is a tuple of type (Int, Int). Therefore, if you call List(x).reduce(...), you're creating a List whose only value is the tuple, so that's not really what you want.
What your code actually does is create a list containing a single tuple and then try to reduce it. It can never work this way, as there is nothing to reduce: the list already holds a single element, a tuple.
Instead you need to map your vecZip array elements (tuples) via multiplying their elements:
vecZip.map { case (x, y) => x * y }
You don't need to reduce here. Reducing an Array[(Int, Int)] would mean performing some associative binary operation on all tuples inside the array. Note that it could apply the operation to the first two tuples, then to that result and the third tuple, then to that result and the fourth tuple, and so on. Due to associativity, it could also apply the operation to the first and second tuples and, simultaneously, to the third and fourth tuples, and then combine their results, which is nice for parallelization (frameworks such as Spark rely on it heavily).
For example you could sum all first elements and all second elements of each tuple:
val reduced = vecZip.reduce((pair1, pair2) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
What you want however is to simply map each tuple into the product of its elements:
val vecToSum = vecZip.map { case (x, y) => x * y }
Note that I used a pattern-matching anonymous function (note the case keyword) in order to destructure the tuple; without it, the code would look like this:
val vecToSum = vecZip.map(tuple => tuple._1 * tuple._2)
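To finish the calculation the question is after, the mapped products can then be summed with reduce. A minimal sketch, reusing the v1 and v2 arrays from the question (the names products, dot and columnSums are only illustrative):

```scala
val v1 = Array(1, 4, 5, 2)
val v2 = Array(3, 5, 1, 5)

// map: multiply the two components of each zipped pair
val products = v1.zip(v2).map { case (x, y) => x * y }

// reduce: combine the products with an associative operation (addition)
val dot = products.reduce(_ + _)

// For comparison, reducing the tuples themselves sums each position separately
val columnSums = v1.zip(v2).reduce((p, q) => (p._1 + q._1, p._2 + q._2))

println(products.toList) // List(3, 20, 5, 10)
println(dot)             // 38
println(columnSums)      // (12,14)
```

Note that reduce throws on an empty array; products.sum avoids that and returns 0 instead.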

How can I functionally iterate over a collection combining elements?

I have a sequence of values of type A that I want to transform to a sequence of type B.
Some of the elements with type A can be converted to a B, however some other elements need to be combined with the immediately previous element to produce a B.
I see it as a small state machine with two states, the first one handling the transformation from A to B when just the current A is needed, or saving A if the next row is needed and going to the second state; the second state combining the saved A with the new A to produce a B and then go back to state 1.
I'm trying to use scalaz's Iteratees but I fear I'm overcomplicating it, and I'm forced to return a dummy B when the input has reached EOF.
What's the most elegant solution to do it?
What about invoking the sliding() method on your sequence?
You might have to put a dummy element at the head of the sequence so that the first element (the real head) is evaluated/converted correctly.
If you map() over the result from sliding(2) then map will "see" every element with its predecessor.
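val input: Seq[A] = ??? // real data here (no dummy values)
val dummy: A = ???      // placeholder so that the real head has a predecessor
// a2b receives a window of 2 elements (or only the dummy if input is empty)
// and returns a Seq of zero or more elements
def a2b(arg: Seq[A]): Seq[B] = ???
val output: Seq[B] = (dummy +: input).sliding(2).flatMap(a2b).toSeq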
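The two-state machine described in the question can also be written as a foldLeft whose accumulator carries the pending element. The sketch below is an assumption-laden illustration: the concrete types Complete, Partial and B, the helper textOf, and combining by string concatenation are all made up to keep the example runnable. An unpaired trailing element is returned as an Option, so no dummy B is needed at end of input.

```scala
// Hypothetical input type: Complete converts on its own, Partial needs the next element
sealed trait A
final case class Complete(text: String) extends A
final case class Partial(text: String) extends A

// Hypothetical output type
final case class B(text: String)

def textOf(a: A): String = a match {
  case Complete(t) => t
  case Partial(t)  => t
}

// Returns the converted values plus any element still waiting for a successor
def convert(input: Seq[A]): (Vector[B], Option[A]) =
  input.foldLeft((Vector.empty[B], Option.empty[A])) {
    // State 2: an element was saved, so combine it with the current one
    case ((out, Some(saved)), current) => (out :+ B(textOf(saved) + textOf(current)), None)
    // State 1: the current element needs its successor, so save it
    case ((out, None), p: Partial) => (out, Some(p))
    // State 1: the current element converts on its own
    case ((out, None), Complete(t)) => (out :+ B(t), None)
  }

val (converted, leftover) =
  convert(Seq(Complete("a"), Partial("b"), Complete("c"), Partial("d")))

println(converted) // Vector(B(a), B(bc))
println(leftover)  // Some(Partial(d))
```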
Taking a stab at it:
Partition your list into two lists. The first is the one you can directly convert and the second is the one that you need to merge.
scala> val l = List("String", 1, 4, "Hello")
l: List[Any] = List(String, 1, 4, Hello)
scala> val (string, int) = l partition { case s:String => true case _ => false}
string: List[Any] = List(String, Hello)
int: List[Any] = List(1, 4)
Replace the logic in the partition block with whatever you need.
Once you have the two lists, you can do whatever you need to with the second one, using something like this:
scala> string ::: int.collect{case i:Integer => i}.sliding(2).collect{
| case List(a, b) => a+b.toString}.toList
res4: List[Any] = List(String, Hello, 14)
You would replace the addition with whatever your aggregate function is.
Hopefully this is helpful.

How to sum the corresponding values in the List into a Tuple?

I have a list, details, of this type:
case class Detail(point: List[Double], cluster: Int)
val details = List(Detail(List(2.0, 10.0),1), Detail(List(2.0, 5.0),3),
Detail(List(8.0, 4.0),2), Detail(List(5.0, 8.0),2))
I want to reduce this list to a tuple that contains the sum of each corresponding point coordinate where the cluster is 2.
So I filter this list:
details.filter(detail => detail.cluster == 2)
which returns :
List(Detail(List(8.0, 4.0),2), Detail(List(5.0, 8.0),2))
It's the summing of the corresponding values I'm having trouble with. In this example the tuple should contain (8+5, 4+8) = (13, 12)
I was thinking of flattening the list and then summing each corresponding value, but
List(details).flatten
just returns the same list.
How can I sum the corresponding values in the list into a tuple?
I could achieve this easily using a for loop and just extract the details I need into a counter, but what is the functional solution?
What do you want to happen if the lists for different Details have different lengths?
Or if they all have the same length, but it is not 2? Tuples are generally only used when you need a number of elements that is fixed in advance; you won't even be able to write a return type if you need tuples of different lengths.
Assuming that all of them are lists of the same length and that getting a list in return is acceptable, something like this should work (untested):
details.filter(_.cluster == 2).map(_.point).transpose.map(_.sum)
I.e. first get all points as a list of lists, transpose it so you get a list for each "coordinate", and sum each of these lists.
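Applied to the sample data from the question, the transpose approach can be checked as follows (a small sketch; the name sums is illustrative):

```scala
case class Detail(point: List[Double], cluster: Int)

val details = List(
  Detail(List(2.0, 10.0), 1), Detail(List(2.0, 5.0), 3),
  Detail(List(8.0, 4.0), 2), Detail(List(5.0, 8.0), 2)
)

val sums = details
  .filter(_.cluster == 2) // keep only cluster 2
  .map(_.point)           // List(List(8.0, 4.0), List(5.0, 8.0))
  .transpose              // List(List(8.0, 5.0), List(4.0, 8.0))
  .map(_.sum)             // one sum per coordinate

println(sums) // List(13.0, 12.0)
```

Note that transpose throws an exception if the inner lists do not all have the same length.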
If you do know that each point has two coordinates, this should probably be reflected in your point type by using (Double, Double) instead of List[Double]; you can then just fold over the list of points, which should be a bit more efficient. Look at the definition of foldLeft and at the standard implementation of sum in terms of foldLeft:
def sum(list: List[Int]): Int = list.foldLeft(0)((acc, x) => acc + x)
and it should be easy to do what you want.
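A sketch of that suggestion, assuming the point is remodelled as a pair. Detail2D and details2D are hypothetical names introduced here so they do not clash with the question's Detail:

```scala
// Hypothetical variant of Detail whose point is a fixed-size pair
case class Detail2D(point: (Double, Double), cluster: Int)

val details2D = List(
  Detail2D((2.0, 10.0), 1), Detail2D((2.0, 5.0), 3),
  Detail2D((8.0, 4.0), 2), Detail2D((5.0, 8.0), 2)
)

// Same shape as sum-via-foldLeft, but the accumulator is a pair
val total = details2D
  .filter(_.cluster == 2)
  .map(_.point)
  .foldLeft((0.0, 0.0)) { case ((accX, accY), (x, y)) => (accX + x, accY + y) }

println(total) // (13.0,12.0)
```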
You can also use a single foldLeft with a pattern-matching function, without a separate filter:
details.foldLeft((0.0,0.0))({
case ((accX, accY), Detail(x :: y :: Nil, 2)) => (accX + x, accY + y)
case (acc, _) => acc
})
res1: (Double, Double) = (13.0,12.0)