Scala zip with id on sorted collections - scala

I have two collections in scala with objects containing id. And I want to zip them together by id. So in such example:
case class A(id: Long)
case class B(id: Long)
val col1 = A(1) :: A(2) :: A(5) :: Nil
val col2 = B(2) :: B(2) :: B(5) :: Nil
I would expect as result:
List(
(A(1), List()),
(A(2), List(B(2), B(2)),
(A(5), List(B(5))
)
How to do it the easy way?
If I know col1 and col2 are already sorted by id would it help somehow?

I can't figure out a good way to do it without an intermediate variable, but how about something like this:
val map = col2.groupBy(_.id).withDefault(_ => List.empty)
col1.map { a => a -> map(a.id) }
For 3-element arrays it doesn't matter, but note that the main difference from the other answer is that this is linear time.

One way would be to map the first collection and inside the map do a filter on the second and build a tuple:
scala> col1.map { c1 =>
| (c1, col2.filter(_.id == c1.id))
| }
res0: List[(A, List[B])] = List((A(1),List()), (A(2),List(B(2), B(2))), (A(5),List(B(5))))

Related

How to read an element from a Scala HList?

There is very few readable documentation about HLists, and the answers I can find on SO come from outer space for a humble Scala beginner.
I encountered HLists because Slick can auto-generate some to represent database rows. They are slick.collection.heterogeneous.HList (not shapeless').
Example:
type MyRow = HCons[Int,HCons[String,HCons[Option[String],HCons[Int,HCons[String,HCons[Int,HCons[Int,HCons[Option[Int],HCons[Option[Float],HCons[Option[Float],HCons[Option[String],HCons[Option[String],HCons[Boolean,HCons[Option[String],HCons[Option[String],HCons[Option[String],HCons[Option[String],HCons[Option[String],HCons[Option[Int],HCons[Option[Float],HCons[Option[Float],HCons[Option[Float],HCons[Option[String],HCons[Option[String],HNil]]]]]]]]]]]]]]]]]]]]]]]]
def MyRow(a, b, c, ...): MyRow = a :: b :: c :: ... :: HNil
Now given one of these rows, I'd need to read one element, typed if possible. I just can't do that. I tried
row(4) // error
row._4 // error
row.toList // elements are inferred as Any
row match { case a :: b :: c :: x :: rest => x } // "Pattern type is incompatible. Expected MyRow."
row match { case MyRow(_,_,_,_,_,x,...) => x } // is not a case class like other rows
row match { HCons[Int,HCons[String,HCons[Option[String],HCons[Int,HCons[String, x]]]]] => x.head } // error
row.tail.tail.tail.tail.head // well, is that really the way??
Could somebody please explain how I can extract a specific value from that dinosaur?
I'd expect your row(0) lookup to work based on the HList API doc for apply. Here's an example I tried with Slick 3.1.1:
scala> import slick.collection.heterogeneous._
import slick.collection.heterogeneous._
scala> import slick.collection.heterogeneous.syntax._
import slick.collection.heterogeneous.syntax._
scala> type MyRow = Int :: String :: HNil
defined type alias MyRow
scala> val row: MyRow = 1 :: "a" :: HNil
row: MyRow = 1 :: a :: HNil
scala> row(0) + 99
res1: Int = 100
scala> val a: String = row(1)
a: String = a
Just one thing... if it is not too important than just stick to HList as the type. Do not alias it to MyRow unless necessary.
So.. you had
val row = a :: b :: c :: ... :: HNil
How about this ?
val yourX = row match { case a :: b :: c :: x ::: rest => x }
notice that ::: instead of :: at the end.
Or... how about this,
val yourX = row.tail.tail.tail.head
// this may change a little if you had,
def MyRow(a, b, c, ...): MyRow = a :: b :: c :: ... :: HNil
val row = MyRow(a, b, c, ...)
val yourX = row.asInstanceOf[HList].tail.tail.tail.head

Scala: How to map a subset of a seq to a shorter seq

I am trying to map a subset of a sequence using another (shorter) sequence while preserving the elements that are not in the subset. A toy example below tries to give a flower to females only:
def giveFemalesFlowers(people: Seq[Person], flowers: Seq[Flower]): Seq[Person] = {
require(people.count(_.isFemale) == flowers.length)
magic(people, flowers)(_.isFemale)((p, f) => p.withFlower(f))
}
def magic(people: Seq[Person], flowers: Seq[Flower])(predicate: Person => Boolean)
(mapping: (Person, Flower) => Person): Seq[Person] = ???
Is there an elegant way to implement the magic?
Use an iterator over flowers, consume one each time the predicate holds; the code would look like this,
val it = flowers.iterator
people.map ( p => if (predicate(p)) p.withFlowers(it.next) else p )
What about zip (aka zipWith) ?
scala> val people = List("m","m","m","f","f","m","f")
people: List[String] = List(m, m, m, f, f, m, f)
scala> val flowers = List("f1","f2","f3")
flowers: List[String] = List(f1, f2, f3)
scala> def comb(xs:List[String],ys:List[String]):List[String] = (xs,ys) match {
| case (x :: xs, y :: ys) if x=="f" => (x+y) :: comb(xs,ys)
| case (x :: xs,ys) => x :: comb(xs,ys)
| case (Nil,Nil) => Nil
| }
scala> comb(people, flowers)
res1: List[String] = List(m, m, m, ff1, ff2, m, ff3)
If the order is not important, you can get this elegant code:
scala> val (men,women) = people.partition(_=="m")
men: List[String] = List(m, m, m, m)
women: List[String] = List(f, f, f)
scala> men ++ (women,flowers).zipped.map(_+_)
res2: List[String] = List(m, m, m, m, ff1, ff2, ff3)
I am going to presume you want to retain all the starting people (not simply filter out the females and lose the males), and in the original order, too.
Hmm, bit ugly, but what I came up with was:
def giveFemalesFlowers(people: Seq[Person], flowers: Seq[Flower]): Seq[Person] = {
require(people.count(_.isFemale) == flowers.length)
people.foldLeft((List[Person]() -> flowers)){ (acc, p) => p match {
case pp: Person if pp.isFemale => ( (pp.withFlower(acc._2.head) :: acc._1) -> acc._2.tail)
case pp: Person => ( (pp :: acc._1) -> acc._2)
} }._1.reverse
}
Basically, a fold-left, initialising the 'accumulator' with a pair made up of an empty list of people and the full list of flowers, then cycling through the people passed in.
If the current person is female, pass it the head of the current list of flowers (field 2 of the 'accumulator'), then set the updated accumulator to be the updated person prepended to the (growing) list of processed people, and the tail of the (shrinking) list of flowers.
If male, just prepend to the list of processed people, leaving the flowers unchanged.
By the end of the fold, field 2 of the 'accumulator' (the flowers) should be an empty list, while field one holds all the people (with any females having each received their own flower), in reverse order, so finish with ._1.reverse
Edit: attempt to clarify the code (and substitute a test more akin to #elm's to replace the match, too) - hope that makes it clearer what is going on, #Felix! (and no, no offence taken):
def giveFemalesFlowers(people: Seq[Person], flowers: Seq[Flower]): Seq[Person] = {
require(people.count(_.isFemale) == flowers.length)
val start: (List[Person], Seq[Flower]) = (List[Person](), flowers)
val result: (List[Person], Seq[Flower]) = people.foldLeft(start){ (acc, p) =>
val (pList, fList) = acc
if (p.isFemale) {
(p.withFlower(fList.head) :: pList, fList.tail)
} else {
(p :: pList, fList)
}
}
result._1.reverse
}
I'm obviously missing something but isn't it just
people map {
case p if p.isFemale => p.withFlower(f)
case p => p
}

RDD Product without repeat some tuples

I got the following RDD[String]
TTT
SSS
AAA
and I am having problems to get the following tuples
(TTT, SSS)
(TTT, AAA)
(SSS, AAA)
I was doing:
val res = input.cartesian(input).filter{ case (a,b) => a != b }
But the result is:
(TTT,SSS)
(TTT,AAA)
(SSS,TTT)
(SSS,AAA)
(AAA,TTT)
(AAA,SSS)
What is the best way to do that? please
You could impose an order in the tuple to obtain the combinations:
val res = input.cartesian(input).filter{ case (a,b) => a < b }

Summing items within a Tuple

Below is a data structure of List of tuples, ot type List[(String, String, Int)]
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurences of each Int value associated with each id. So above data structure should be converted to List((id1,a,3) , (id2,a,1))
This is what I have come up with but I'm unsure how to group similar items within a Tuple :
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is of type spark obj RDD , I'm using a List in this example for testing but same solution should be compatible with an RDD . I'm using a List for local testing purposes.
Update : based on following code provided by maasg :
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend slightly to get into format I expect which is of type
.RDD[(String, Seq[(String, Int)])]
which corresponds to .RDD[(id, Seq[(name, count-of-names)])]
:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupedByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,value.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets as it does not replicate the id's in the cummulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will >always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the below function, when used with groupBy will give you a list with keys being the ids.
(Sorry, I don't have access to an Scala compiler, so I can't test)
def f(tupule: A) :String = {
return tupule._1
}
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
The following is the most readable, efficient and scalable
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give a RDD[(String, String, Int)]. By using reduceByKey it means the summation will paralellize, i.e. for very large groups it will be distributed and summation will happen on the map side. Think about the case where there are only 10 groups but billions of records, using .sum won't scale as it will only be able to distribute to 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)) can just use .mapValues(v =>(v_1, v._2, v.map(_._3).sum))
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,value.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

Map a variable of type of Pair -- impossible

This seems not logical for me:
scala> val a = Map((1, "111"), (2, "222"))
a: scala.collection.immutable.Map[Int,String] = Map(1 -> 111, 2 -> 222)
scala> val b = a.map((key, value) => value)
<console>:8: error: wrong number of parameters; expected = 1
val b = a.map((key, value) => value)
^
scala> val c = a.map(x => x._2)
c: scala.collection.immutable.Iterable[String] = List(111, 222)
I know that I can say val d = a.map({ case(key, value) => value })
But why isn't it possible to say a.map((key, value) => value) ? There is only one argument there of type Tuple2[Int, String] or Pair of Int, String. What's the difference between a.map((key, value) => value) and a.map(x => x._2) ?
UPDATE:
val myTuple2 = (1, 2) -- this is one variable, correct?
for ( (k, v) <- a ) yield v -- (k, v) is also only one variable, correct?
map((key, value) => value) -- 2 variables. weird.
So how do I specify a variable of type Tuple2 (or any other type) in map without using case?
UPDATE2:
What's wrong with that?
Map((1, "111"), (2, "222")).map( ((x,y):Tuple2[Int, String]) => y) -- wrong
Map((1, "111"), (2, "222")).map( ((x):Tuple2[Int, String]) => x._2) -- ok
Okay, you still not convinced. In cases like this it is pretty reasonable to fallback to the source of the truth (well, kinda): The Holy Specification (aka, Scala Language Specification).
So, in anonymous function parameters are treated on individual basis, not as a whole tuple band (and it is pretty smart, otherwise, how would you call the anonymous function with 2, ... n parameters?).
At the same time
val x = (1, 2)
is a single item of type Tiple2[Int,Int] (if you're interested you may find corresponding section of spec as well).
for ( (k, v) <- a ) yield v
In this case you have one variable unpacked to two variables. It is similar to
val x = (1, 2) // one variable -- tuple
val (y,z) = x // two integer variables unpacked from one
Some call this destructuring assignment and this is a particular case of pattern matching. And you've already provided another example of pattern matching in action:
a.map({ case(key, value) => value })
Which we can read as map accepts a function produced by a partial function literal, which enables use of pattern matching.
You're basically asking this same questions:
Scala - can a lambda parameter match a tuple?
You've already listed most of the options they listed there, including the accepted answer of using a PartialFunction.
However, since you're using your lambda in a map function, you could use a for comprehension instead:
for ( (k, v) <- a ) yield v
Alternatively, you can use the Function2.tupled method to fix your lambda's type:
scala> val a = Map((1, "111"), (2, "222"))
a: scala.collection.immutable.Map[Int,String] = Map(1 -> 111, 2 -> 222)
scala> a.map( ((k:Int,v:String) => v).tupled )
res1: scala.collection.immutable.Iterable[String] = List(111, 222)
To answer your question in your thread with om-nom-nom above, look at this output:
scala> ( (x:Int,y:String) => y ).getClass.getSuperclass
res0: Class[?0] forSome { type ?0 >: ?0; type ?0 <: (Int, String) => String } = class scala.runtime.AbstractFunction2
Notice that the superclass of the anonymous function (x:Int,y:String) => y is Function2[Int, String, String], not Function1[(Int, String), String].
You can use pattern matching (or partial function, in this instance this is the same), notice angular brackets:
val b = a.map{ case (key, value) => value }