How to find unique elements from a list of tuples based on some elements using Scala?

I have following list
val a = List(("name1","add1","city1",10),("name1","add1","city1",10),
("name2","add2","city2",10),("name2","add2","city2",20),("name3","add3","city3",20))
I want the distinct elements of the above list based on the first three values of each tuple. The fourth value should not be considered when finding distinct elements.
I want the following output:
val output = List(("name1","add1","city1",10),("name2","add2","city2",10),
("name3","add3","city3",20))
Is it possible to get the above output?
As far as I know, distinct only works if the whole tuple/value is duplicated. I tried distinct as follows:
val b = List(("name1","add1","city1",10),("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20)).distinct
but it gives this output:
List(("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20))
Any alternative approach would also be appreciated.

Use groupBy like this
a.groupBy( v => (v._1,v._2,v._3)).keys.toList
This constructs a Map in which each key is, by definition, a unique triplet as produced by the lambda function above.
If the result should also include the last element of the tuple, take the first element of each group, like this:
a.groupBy( v => (v._1,v._2,v._3)).mapValues(_.head)

If the order of the output list isn't important (i.e. you are happy to get List(("name3","add3","city3",20),("name1","add1","city1",10),("name2","add2","city2",10))), the following works as specified:
a.groupBy(v => (v._1,v._2,v._3)).values.map(_.head).toList
(Due to Scala collections design, you'll see the order kept for output lists up to 4 elements, but above that size HashMap will be used.) If you do need to keep the order, you can do something like (generalizing a bit)
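import scala.collection.mutable.LinkedHashMap
// Keeps the first element seen for each key, in input order.
// The second parameter list lets Scala 2 infer A before type-checking f.
def distinctBy[A, B](xs: Seq[A])(f: A => B): List[A] = {
  val seen = LinkedHashMap.empty[B, A]
  xs.foreach { x =>
    val key = f(x)
    if (!seen.contains(key)) seen.update(key, x)
  }
  seen.values.toList
}
distinctBy(a)(v => (v._1, v._2, v._3))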
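Since Scala 2.13, the standard collections provide a distinctBy method that keeps the first element seen for each key and preserves order, so the helper above may not be needed on newer versions. A minimal sketch, assuming Scala 2.13 or later:

```scala
// Assumes Scala 2.13+, where distinctBy is part of the standard collections.
val a = List(
  ("name1", "add1", "city1", 10), ("name1", "add1", "city1", 10),
  ("name2", "add2", "city2", 10), ("name2", "add2", "city2", 20),
  ("name3", "add3", "city3", 20)
)

// Key on the first three fields; the first tuple seen for each key is kept.
val output = a.distinctBy { case (name, add, city, _) => (name, add, city) }
// output == List(("name1","add1","city1",10), ("name2","add2","city2",10), ("name3","add3","city3",20))
```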

You could try
a.map { case x @ (name, add, city, _) => (name, add, city) -> x }.toMap.values.toList

To make sure the first occurrence in the list is kept:
type String3 = (String, String, String)
type String3Int = (String, String, String, Int)
a.foldLeft(collection.immutable.ListMap.empty[String3, String3Int]) {
  case (acc, t) =>
    val key = (t._1, t._2, t._3)
    // keep the existing entry if this key was already seen
    if (acc.contains(key)) acc else acc + (key -> t)
}.values.toList

One simple solution would be to convert the List to a Set. Sets don't contain duplicates: check the documentation.
val setOfTuples = a.toSet
println(setOfTuples)
Output: Set((1,1), (1,2), (1,3), (2,1))
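Note that a Set compares whole tuples, so on the list a from the question, toSet behaves like distinct and still keeps both name2 entries. A quick check, as a sketch (asSet and keysOnly are illustrative names, not from the original answer):

```scala
val a = List(
  ("name1", "add1", "city1", 10), ("name1", "add1", "city1", 10),
  ("name2", "add2", "city2", 10), ("name2", "add2", "city2", 20),
  ("name3", "add3", "city3", 20)
)

// Whole-tuple comparison: only the exact duplicate of name1 is dropped.
val asSet = a.toSet

// Projecting to the first three fields first gives three distinct keys,
// but the fourth value is lost in the process.
val keysOnly = a.map { case (name, add, city, _) => (name, add, city) }.toSet
```

Here asSet has four elements and keysOnly has three, so a Set alone does not produce the output requested in the question.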

Related

Access a tuple inside a tuple for an anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used as the answer below.
I have an RDD that contains tuples of type (String, String). Since they aren't unique, I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me tuples of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run a word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this produces is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the line above generates mom->1, dad->1 even if the tuples coming out of the flatMap include (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I convert the Tuple2 sequence to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue: if pairs is large, you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs (as key or as value):
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD, I used this code:
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram is the (String, String) RDD.
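The toMap behaviour described in the question can be reproduced without Spark: when a key occurs more than once, toMap keeps only the last value, so the counts have to be summed per key first. A small sketch on plain Scala collections, using made-up counts (pairCount here is illustrative data, not the output of the original job):

```scala
// Illustrative data only: ((first, second), count) as countByValue would produce.
val pairCount: Seq[((String, String), Long)] = Seq(
  (("mom", "dad"), 30L),
  (("dad", "granny"), 59L),
  (("mom", "granny"), 2L),
  (("dad", "mom"), 14L)
)

// Keep the first string of each pair together with the pair's count.
val firstWithCount: Seq[(String, Long)] =
  pairCount.map { case ((x, _), n) => x -> n }

// toMap silently keeps the last value for a repeated key.
val lastWins: Map[String, Long] = firstWithCount.toMap

// Summing per key gives the marginal counts for the first string;
// this is the local-collection counterpart of reduceByKey(_ + _).
val summed: Map[String, Long] =
  firstWithCount.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
```

On an RDD, the same aggregation would be done with reduceByKey on the pair RDD rather than by converting to a Map first.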

How to find tuple with different value in a list using scala?

I have following list:
val list = List(("name1",20),("name2",20),("name1",30),("name2",30),
("name3",40),("name3",30),("name3",20))
I want following output:
List(("name3",40))
I tried following:
val distElements = list.map(_._2).distinct
list.groupBy(_._1).map{ case(k,v) =>
val h = v.map(_._2)
if(distElements.equals(h)) List.empty else distElements.diff(h)
}.flatten
But this is not what I am looking for.
Can anybody give me an answer or a hint to get the expected output?
I understand the question as looking for the element of the list whose _2 (number) occurs only once.
val list = List(("name1",20),("name2",20),("name1",30),("name2",30),
("name3",40),("name3",30),("name3",20))
First you group by the _2 element, which gives you a map whose values are lists of all elements with the same _2:
val g = list.groupBy(_._2) // Map[Int, List[(String, Int)]]
Now you can filter those entries that consist of only one element:
val opt = g.collectFirst { // Option[(String, Int)]
case (_, single :: Nil) => single
}
Or (if you are expecting possibly more than one distinct value)
val col = g.collect { // Map[String, Int]
case (_, single :: Nil) => single
}
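Putting the two steps together, a minimal end-to-end sketch that returns a List, as in the expected output (uniqueByNumber is an illustrative name):

```scala
val list = List(("name1", 20), ("name2", 20), ("name1", 30), ("name2", 30),
  ("name3", 40), ("name3", 30), ("name3", 20))

// Group by the number, then keep only the groups that contain exactly one tuple.
val uniqueByNumber: List[(String, Int)] =
  list.groupBy(_._2).values.collect { case single :: Nil => single }.toList
// uniqueByNumber == List(("name3", 40))
```

If several numbers occur exactly once, the order of the resulting list follows the Map's iteration order and should not be relied on.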
Seems to me that you're looking to match against both the value of the left hand and the right hand at the same time while also preserving the type of collection you're looking at, a List. I would use collect:
val out = myList.collect{
case item @ ("name3", 40) => item
}
which combines a PartialFunction with filter- and map-like qualities. In this case, it filters out any value for which the PartialFunction is not defined while mapping the values that match. Here, I've only allowed for a single match.

How to extract elements from 4 lists in scala?

case class TargetClass(key: Any, value: Number, lowerBound: Double, upperBound: Double)
val keys: List[Any] = List("key1", "key2", "key3")
val values: List[Number] = List(1,2,3);
val lowerBounds: List[Double] = List(0.1, 0.2, 0.3)
val upperBounds: List[Double] = List(0.5, 0.6, 0.7)
Now I want to construct a List[TargetClass] to hold the 4 lists. Does anyone know how to do it efficiently? Is using a for loop to add elements one by one very inefficient?
I tried to use zipped, but it seems that it only works for combining up to 3 lists.
Thank you very much!
One approach:
keys.zipWithIndex.map {
case (item,i)=> TargetClass(item,values(i),lowerBounds(i),upperBounds(i))
}
You may want to consider using the lift method to deal with the case of lists of unequal lengths (and thereby provide a default if keys is longer than any of the other lists).
I realise this doesn't address your question about efficiency. You could fairly easily run some tests on the different approaches.
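The lift idea mentioned above could look roughly like this. It is only a sketch: it drops rows for which any list is too short instead of supplying a default, and lowerBounds is deliberately shortened here to show the effect.

```scala
case class TargetClass(key: Any, value: Number, lowerBound: Double, upperBound: Double)

val keys: List[Any] = List("key1", "key2", "key3")
val values: List[Number] = List(1, 2, 3)
val lowerBounds: List[Double] = List(0.1, 0.2) // one element short on purpose
val upperBounds: List[Double] = List(0.5, 0.6, 0.7)

// lift(i) returns Some(element) if index i exists and None otherwise,
// so a missing entry skips the row instead of throwing an exception.
val targets: List[TargetClass] = keys.zipWithIndex.flatMap { case (k, i) =>
  for {
    v <- values.lift(i)
    l <- lowerBounds.lift(i)
    u <- upperBounds.lift(i)
  } yield TargetClass(k, v, l, u)
}
```

Indexed access on a List is linear in the index, so this approach is quadratic for long lists; converting the inputs to Vector first, or zipping them, avoids that.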
You can apply zipped to the first two lists, to the last two lists, then to the results of the previous zips, then map to your class, like so:
val z12 = (keys, values).zipped
val z34 = (lowerBounds, upperBounds).zipped
val z1234 = (z12.toList, z34.toList).zipped
val targs = z1234.map { case ((k,v),(l,u)) => TargetClass(k,v,l,u) }
// targs = List(TargetClass(key1,1,0.1,0.5), TargetClass(key2,2,0.2,0.6), TargetClass(key3,3,0.3,0.7))
How about:
keys zip values zip lowerBounds zip upperBounds map {
case (((k, v), l), u) => TargetClass(k, v, l, u)
}
Example:
scala> val zipped = keys zip values zip lowerBounds zip upperBounds
zipped: List[(((Any, Number), Double), Double)] = List((((key1,1),0.1),0.5), (((key2,2),0.2),0.6), (((key3,3),0.3),0.7))
scala> zipped map { case (((k, v), l), u) => TargetClass(k, v, l, u) }
res6: List[TargetClass] = List(TargetClass(key1,1,0.1,0.5), TargetClass(key2,2,0.2,0.6), TargetClass(key3,3,0.3,0.7))
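On Scala 2.13 and later, lazyZip can combine up to four collections without building the nested intermediate tuples. A minimal sketch, assuming Scala 2.13+ and the definitions from the question:

```scala
// Assumes Scala 2.13+, where Iterable has lazyZip (supports up to four collections).
case class TargetClass(key: Any, value: Number, lowerBound: Double, upperBound: Double)

val keys: List[Any] = List("key1", "key2", "key3")
val values: List[Number] = List(1, 2, 3)
val lowerBounds: List[Double] = List(0.1, 0.2, 0.3)
val upperBounds: List[Double] = List(0.5, 0.6, 0.7)

// The four lists are traversed together in a single pass when map is called.
val targets: List[TargetClass] =
  keys.lazyZip(values).lazyZip(lowerBounds).lazyZip(upperBounds)
    .map((k, v, l, u) => TargetClass(k, v, l, u))
```

As with zip, the result is truncated to the length of the shortest input.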
It would be nice if .transpose worked on a Tuple of Lists.
for (List(k, v:Number, l:Double, u:Double) <-
List(keys, values, lowerBounds, upperBounds).transpose)
yield TargetClass(k,v,l,u)
I think that, from an efficiency point of view, you will have to traverse each list whatever you use. The only question is whether you do it yourself or, for the sake of readability, use Scala idioms and let Scala do the dirty work for you :)
Other approaches are not necessarily more efficient. You can change the order of zipping and the order of assembling the return value of the map function as you like.
Here is a more functional way, but I am not sure it will be more efficient. See the comments on @wwkudu's (zip with index) answer.
val res1 = keys zip lowerBounds zip values zip upperBounds
res1.map {
x=> (x._1._1._1,x._1._1._2, x._1._2, x._2)
// Of course, you can return an instance of TargetClass
// here instead of the tuple I am returning.
}
I am curious: why do you need a "TargetClass"? Would a tuple work?

Scala: Grouping list of tuples

I need to group a list of tuples in a particular way.
For example, if I have
val l = List((1,2,3),(4,2,5),(2,3,3),(10,3,2))
Then I should group the list by the second value and map each group to the set of its first values.
So the result should be
Map(2 -> Set(1,4), 3 -> Set(2,10))
So far, I have come up with this:
l groupBy { p => p._2 } mapValues { v => (v map { vv => vv._1 }).toSet }
This works, but I believe there should be a much more efficient way...
This is similar to this question. Basically, as @serejja said, your approach is correct and also the most concise one. You could use collection.breakOut as the builder factory argument to the last map and thereby save the additional iteration needed to get the Set type:
l.groupBy(_._2).mapValues(_.map(_._1)(collection.breakOut): Set[Int])
You probably shouldn't go beyond this unless you really need to squeeze out performance.
Otherwise, this is what a general toMultiMap function could look like, which lets you control the collection type of the values:
import collection.generic.CanBuildFrom
import collection.mutable
def toMultiMap[A, K, V, Values](xs: TraversableOnce[A])
(key: A => K)(value: A => V)
(implicit cbfv: CanBuildFrom[Nothing, V, Values]): Map[K, Values] = {
val b = mutable.Map.empty[K, mutable.Builder[V, Values]]
xs.foreach { elem =>
b.getOrElseUpdate(key(elem), cbfv()) += value(elem)
}
b.map { case (k, vb) => (k, vb.result()) } (collection.breakOut)
}
What it does is, it uses a mutable Map during building stage, and values gathered in a mutable Builder first (the builder is provided by the CanBuildFrom instance). After the iteration over all input elements has completed, that mutable map of builder values is converted into an immutable map of the values collection type (again using the collection.breakOut trick to get the desired output collection straight away).
Ex:
val l = List((1,2,3),(4,2,5),(2,3,3),(10,3,2))
val v = toMultiMap(l)(_._2)(_._1) // uses Vector for values
val s: Map[Int, Set[Int]] = toMultiMap(l)(_._2)(_._1) // uses Set for values
So your annotated result type directs the type inference of the values type. If you do not annotate the result, Scala will pick Vector as default collection type.
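For comparison, the same grouping can be done in a single pass with foldLeft, which does not depend on collection.breakOut or CanBuildFrom (both were removed in Scala 2.13). This is only a sketch of an alternative, not a replacement for the answer above; on 2.13+ the built-in groupMap is another option.

```scala
val l = List((1, 2, 3), (4, 2, 5), (2, 3, 3), (10, 3, 2))

// Fold once over the list, adding each first value to the Set stored
// under its second value; the third value is ignored.
val grouped: Map[Int, Set[Int]] =
  l.foldLeft(Map.empty[Int, Set[Int]]) { case (acc, (first, second, _)) =>
    acc.updated(second, acc.getOrElse(second, Set.empty[Int]) + first)
  }
// grouped == Map(2 -> Set(1, 4), 3 -> Set(2, 10))
```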

Iterate Over a tuple

I need to implement a generic method that takes a tuple and returns a Map
Example :
val tuple=((1,2),(("A","B"),("C",3)),4)
I have been trying to break this tuple into a list :
val list=tuple.productIterator.toList
Scala>list: List[Any] = List((1,2), ((A,B),(C,3)), 4)
But this returns a List[Any].
I am now trying to find out how to iterate over the following tuple, for example:
((1,2),(("A","B"),("C",3)),4)
in order to loop over each element 1, 2, "A", "B", etc. How could I do this kind of iteration over the tuple?
What about:
def flatProduct(t: Product): Iterator[Any] = t.productIterator.flatMap {
case p: Product => flatProduct(p)
case x => Iterator(x)
}
val tuple = ((1,2),(("A","B"),("C",3)),4)
flatProduct(tuple).mkString(",") // 1,2,A,B,C,3,4
OK, the Any problem remains. That is due to the return type of productIterator.
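One way to get typed values back out of the flattened Iterator[Any] is to pattern match on the runtime type with collect. A small sketch building on flatProduct (ints and strings are illustrative names):

```scala
def flatProduct(t: Product): Iterator[Any] = t.productIterator.flatMap {
  case p: Product => flatProduct(p)
  case x          => Iterator(x)
}

val tuple = ((1, 2), (("A", "B"), ("C", 3)), 4)

// collect keeps only the elements matching the pattern, with a precise type.
val ints: List[Int] = flatProduct(tuple).collect { case i: Int => i }.toList
val strings: List[String] = flatProduct(tuple).collect { case s: String => s }.toList
// ints == List(1, 2, 3, 4), strings == List("A", "B", "C")
```

Keep in mind that every case class is a Product, including Some and non-empty Lists, so flatProduct would descend into those as well.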
Instead of tuples, use Shapeless data structures like HList. You can have generic processing, and also don't lose type information.
The only problem is that documentation isn't very comprehensive.
// Iterator.map is lazy, so use foreach to actually run the side effects
tuple.productIterator foreach {
  case (a, b) => println((a, b))
  case a => println(a)
}
This works for me. transform is a tuple consisting of DataFrames.
def apply_function(a: DataFrame) = a.write.format("parquet").save("..." + a + ".parquet")
transform.productIterator.map(_.asInstanceOf[DataFrame]).foreach(a => apply_function(a))