Scala extract from Seq of tuples - scala

Here's a seq of tuples in Scala
val t = Seq((1,2,3),(4,5,6))
I would like to extract the first element of each tuple into its own sequence, i.e.,
Seq(1,4)
How do I do this in Scala?

Simply use map and transform each tuple to its first element:
t.map(x => x._1)
Or shorter:
t.map(_._1)

The general form for extracting more than one column:
def extractColumns3[T1, T2, T3](t: Seq[(T1, T2, T3)]): (Seq[T1], Seq[T2], Seq[T3]) =
  t.foldLeft((Seq.empty[T1], Seq.empty[T2], Seq.empty[T3])) { (columns, row) =>
    (columns._1 :+ row._1, columns._2 :+ row._2, columns._3 :+ row._3)
  }
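For three-element tuples specifically, the standard library's unzip3 appears to cover the same ground as a hand-written fold. The following is a minimal sketch rather than a definitive answer; the object name ColumnsDemo is made up for illustration.

```scala
object ColumnsDemo {
  val t = Seq((1, 2, 3), (4, 5, 6))

  // First column only, using a pattern match instead of _._1
  val firsts: Seq[Int] = t.map { case (a, _, _) => a }

  // All three columns at once, via the standard library's unzip3
  val (c1, c2, c3) = t.unzip3
}
```

For tuples of another arity, a fold like the one above (or unzip for pairs) would still be needed.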

Related

Fold function scala's immutable list

I have created an immutable list and am trying to fold it into a map, where each element is mapped to the constant string "abc". I am doing it for practice.
When I do that, I get an error. I am not sure why the map (here e1, which has a mutable map type) is converted to Any.
val l = collection.immutable.List(1,2,3,4)
l.fold (collection.mutable.Map[Int,String]()) ( (e1,e2) => e1 += (e2,"abc") )
l.fold (collection.mutable.Map[Int,String]()) ( (e1,e2) => e1 += (e2,"abc") )
<console>:13: error: value += is not a member of Any
Expression does not convert to assignment because receiver is not assignable.
l.fold (collection.mutable.Map[Int,String]()) ( (e1,e2) => e1 += (e2,"abc") )
At least three different problem sources here:
Map[...] is not a supertype of Int, so you probably want foldLeft, not fold (the fold acts more like the "banana brackets", it expects the first argument to act like some kind of "zero", and the binary operation as some kind of "addition" - this does not apply to mutable maps and integers).
The binary operation must return something, both for fold and foldLeft. In this case, you probably want to return the modified map. This is why you need ; m (last expression is what gets returned from the closure).
The m += (k, v) is not what you think it is. It attempts to invoke a method += with two separate arguments. What you need is to invoke it with a single pair. Try m += ((k, v)) instead (yes, those problems with arity are annoying).
Putting it all together:
l.foldLeft(collection.mutable.Map[Int, String]()){ (m, e) => m += ((e, "abc")); m }
But since you are using a mutable map anyway:
val l = (1 to 4).toList
val m = collection.mutable.Map[Int, String]()
for (e <- l) m(e) = "abc"
This looks arguably clearer to me, to be honest. In a foldLeft, I wouldn't expect the map to be mutated.
Folding is all about combining a sequence of input elements into a single output element. With fold, the output type must be the element type or a supertype of it. Here is the definition of fold:
def fold[A1 >: A](z: A1)(op: (A1, A1) => A1): A1
In your case the element type A is Int, but the initial value is a mutable.Map, so A1 is inferred as their common supertype, Any. That is why the compiler reports that += is not a member of Any. If you want to build a Map through iteration, use foldLeft or another alternative that allows different input and output types. Here is the definition of foldLeft:
def foldLeft[B](z: B)(op: (B, A) => B): B
Solution:
val l = collection.immutable.List(1, 2, 3, 4)
l.foldLeft(collection.immutable.Map.empty[Int, String]) { (e1, e2) =>
  e1 + (e2 -> "abc")
}
Note: I'm not using a mutable Map.
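To make the difference between the two signatures concrete, here is a small sketch that puts fold and foldLeft side by side. The object and value names (FoldDemo, total, viaFoldLeft, viaToMap) are invented for this example, and the toMap variant is just one possible alternative, not something the answers above prescribe.

```scala
object FoldDemo {
  val l = List(1, 2, 3, 4)

  // fold: the accumulator and the elements share one type (Int here)
  val total: Int = l.fold(0)(_ + _)

  // foldLeft: the accumulator (an immutable Map) differs from the element type
  val viaFoldLeft: Map[Int, String] =
    l.foldLeft(Map.empty[Int, String]) { (acc, e) => acc + (e -> "abc") }

  // Same map without an explicit fold
  val viaToMap: Map[Int, String] = l.map(e => e -> "abc").toMap
}
```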

Calculate sliding durations in scala

I have a list of Tuples and a datum like below
val datum =("R",89)
val dataList = Seq(("R",91),("R",95),("X",96),("S",98))
I want to calculate the duration between elements in the list, starting with the datum, so the result would be
res0:> Seq(("R",7) , ("X",2)) //R - 96-89 , X - 98-96
The things I have tried are not functional:
a) I used sliding on the list and a pattern match with an accumulator to hold the values. This used a Boolean and a ListBuffer to keep adding values to the list.
b) I used a map function with an accumulator tuple and a pattern match on the tuple: compare the _1 values and, when they change, reset the accumulator and collect the result of the subtraction.
I was wondering whether foldLeft or fold could be used to make it more "functional". In that case we would have .foldLeft(List()) as the initial value and then write a function that takes two tuples and compares them manually, possibly with a flag as well.
Any pointers as to how this can be made "functional"?
Here's what you can do.
First create a function that takes the datum, the dataList and an empty list (which will become the final list), and recurse through dataList applying your logic.
def func(x: (String, Int), y: Seq[(String, Int)], z: List[(String, Int)]): List[(String, Int)] = y match {
  // +: matches any non-empty Seq, not only List
  case a +: b => if (x._1 == a._1) func(x, b, z) else func(a, b, z :+ ((x._1, a._2 - x._2)))
  case _ => z
}
That's all. Now just call the function:
val finalTuples = func(datum, dataList, List.empty[(String, Int)])
The result is finalTuples: List[(String, Int)] = List((R,7), (X,2))
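Since the question asks about foldLeft, the same logic can also be sketched as a fold whose accumulator carries the current reference tuple together with the results so far. This is only one possible formulation; the names SlidingDurations and durations are made up for the example.

```scala
object SlidingDurations {
  def durations(datum: (String, Int), data: Seq[(String, Int)]): List[(String, Int)] = {
    // Accumulator: (tuple we are currently measuring from, results in reverse order)
    val start = (datum, List.empty[(String, Int)])
    val (_, reversed) = data.foldLeft(start) {
      case ((current, out), next) =>
        if (current._1 == next._1) (current, out) // same key: keep the earlier tuple
        else (next, (current._1, next._2 - current._2) :: out) // key changed: emit a duration
    }
    reversed.reverse
  }
}
```

Calling SlidingDurations.durations(datum, dataList) with the values from the question should give the same list as the recursive version.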

How to sum on a groupBy with an iterator?

Given an Iterator[(String, Int)], I would like to group by the String, sum the Int values, and return the result as a Map[String, Int].
You can convert it to a list or other strict structure:
iter.toList.groupBy(_._1).mapValues(_.map(_._2).sum)
If you don't want to convert to a strict structure (which forces all of the entries into memory), you can foldLeft and build the map as you go:
(Map.empty[String,Int] /: iter) { case (acc, (k, v)) =>
  acc + (k -> acc.get(k).map(_ + v).getOrElse(v))
}
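The same single-pass idea can be written with the foldLeft method name instead of the /: operator (which newer Scala versions deprecate). A minimal sketch, with SumByKey and sumByKey as invented names:

```scala
object SumByKey {
  // Consumes the iterator once, keeping only one running total per key in memory
  def sumByKey(iter: Iterator[(String, Int)]): Map[String, Int] =
    iter.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
}
```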

How can I functionally iterate over a collection combining elements?

I have a sequence of values of type A that I want to transform to a sequence of type B.
Some of the elements with type A can be converted to a B, however some other elements need to be combined with the immediately previous element to produce a B.
I see it as a small state machine with two states, the first one handling the transformation from A to B when just the current A is needed, or saving A if the next row is needed and going to the second state; the second state combining the saved A with the new A to produce a B and then go back to state 1.
I'm trying to use scalaz's Iteratees but I fear I'm overcomplicating it, and I'm forced to return a dummy B when the input has reached EOF.
What's the most elegant solution to do it?
What about invoking the sliding() method on your sequence?
You might have to put a dummy element at the head of the sequence so that the first element (the real head) is evaluated/converted correctly.
If you map() over the result from sliding(2) then map will "see" every element with its predecessor.
val input: Seq[A] = ??? // real data here (no dummy values)
val output: Seq[B] = (dummy +: input).sliding(2).flatMap(a2b).toSeq
def a2b(arg: Seq[A]): Seq[B] = {
  // arg holds 2 elements: the previous element and the current one
  // return a Seq() of zero or more elements
  ???
}
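The two-state machine described in the question can also be sketched directly as a foldLeft, with an Option holding the saved element. The concrete rule below is purely hypothetical (A and B are both String, and a row ending in a backslash must be joined with the next row); it only stands in for whatever real condition decides that the next row is needed.

```scala
object CombineRows {
  // Hypothetical rule for illustration: a row ending in "\" is incomplete
  // and has to be combined with the row that follows it.
  def combine(rows: Seq[String]): Vector[String] =
    rows.foldLeft((Option.empty[String], Vector.empty[String])) {
      // State 1, row needs its successor: save it and switch to state 2
      case ((None, out), row) if row.endsWith("\\") => (Some(row.dropRight(1)), out)
      // State 1, row is complete: convert it directly
      case ((None, out), row) => (None, out :+ row)
      // State 2: combine the saved row with the new one, back to state 1
      case ((Some(saved), out), row) => (None, out :+ (saved + row))
    }._2 // assumption: a row still pending at end of input is dropped
}
```

No dummy element is needed here, because the pending state lives in the accumulator rather than in the output.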
Taking a stab at it:
Partition your list into two lists. The first is the one you can directly convert and the second is the one that you need to merge.
scala> val l = List("String", 1, 4, "Hello")
l: List[Any] = List(String, 1, 4, Hello)
scala> val (string, int) = l partition { case s:String => true case _ => false}
string: List[Any] = List(String, Hello)
int: List[Any] = List(1, 4)
Replace the logic in the partition block with whatever you need.
After you have the two lists, you can do whatever you need to with your second using something like this
scala> string ::: int.collect{case i:Integer => i}.sliding(2).collect{
| case List(a, b) => a+b.toString}.toList
res4: List[Any] = List(String, Hello, 14)
You would replace the addition with whatever your aggregate function is.
Hopefully this is helpful.

Summing items within a Tuple

Below is a data structure, a List of tuples, of type List[(String, String, Int)]
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id. So the above data structure should be converted to List((id1,a,3) , (id2,a,1))
This is what I have come up with but I'm unsure how to group similar items within a Tuple :
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD. I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update : based on following code provided by maasg :
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
.RDD[(String, Seq[(String, Int)])]
which corresponds to .RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is slightly better when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K: the type of keys returned by the discriminator function.
f: the discriminator function.
Returns: a map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = {
  tuple._1
}
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
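The iteration step described above could look roughly like this on a local List. It is a sketch under the assumption that grouping by id alone is what is wanted; CountById, idOf and sums are names made up for the example.

```scala
object CountById {
  val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))

  // Discriminator function: group rows by their id (first tuple element)
  def idOf(row: (String, String, Int)): String = row._1

  // For each id, sum the Int field of the rows in that group
  val sums: Map[String, Int] =
    data3.groupBy(idOf).map { case (id, rows) => id -> rows.map(_._3).sum }
}
```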
The following is the most readable, efficient and scalable:
data.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which gives an RDD[((String, String), Int)]; add a final map if you need an RDD[(String, String, Int)]. Using reduceByKey means the summation is parallelized: partial sums are computed on the map side before the shuffle, so even very large groups are distributed. Think about the case where there are only 10 groups but billions of records: grouping first and then calling .sum won't scale, as the work can only be distributed across 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)) followed by .values can take the id and name from the group key instead, as in .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is needlessly verbose when, as shown above, .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
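For local testing without Spark, the nested shape from the question's update, (id, Seq[(name, count)]), can be approximated on a plain List. This is a sketch using only the standard collections; NestedCounts and nested are hypothetical names, and it is not meant as a replacement for the reduceByKey approach on a real RDD.

```scala
object NestedCounts {
  def nested(rows: List[(String, String, Int)]): Map[String, Seq[(String, Int)]] =
    rows
      .groupBy { case (id, name, _) => (id, name) } // group by the composite key
      .toSeq
      .map { case ((id, name), group) => (id, (name, group.map(_._3).sum)) } // sum per (id, name)
      .groupBy(_._1) // regroup by id only
      .map { case (id, pairs) => id -> pairs.map(_._2) }
}
```

The order of the (name, count) pairs inside each Seq is not guaranteed, since it depends on the iteration order of the intermediate Map.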