Scala Pairs: how to count the number of the ocurrences in the value (list of numbers) [duplicate] - scala

This question already has answers here:
Scala how can I count the number of occurrences in a list
(17 answers)
Closed 3 years ago.
I have a RDD[(Int, ListBuffer[Byte])] and I like to perform a "wordcount" but for each number in the List.
For instance, the RDD is:
(31000,ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1))
(21010,ListBuffer(0, 0, 0))
(23000,ListBuffer(1, 1, 1, 1, 1))
(01000,ListBuffer(1, 1))
(34000,ListBuffer(0))
And I want to get this:
(31000,(0,2),(1,7)) // this could be a Map[0=>2, 1=>7]
(21010,(0,3))
(23000,(1,5))
(01000,(1,2))
(34000,(0,3))
Any guidance? Thank you in advance
Edit: someone suggested my question was duplicated, but the thing is the suggested post was about only a List, but I wanted to apply on a Pair (Int, List).

The most idiomatic way to get a histogram of values in a Scala collection is to use groupBy followed by a map that takes the size of each resulting group:
scala> import collection.mutable.ListBuffer
import collection.mutable.ListBuffer
scala> val values = ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1)
values: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 1, 0, 1, 0, 1, 1, 1, 1)
scala> values.groupBy(identity).mapValues(_.size)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 7, 0 -> 2)
In your case that part is completely independent from the Spark part—you just happen to be performing this operation on values in an RDD, but the complete solution would look like this:
scala> val counts = myRdd.mapValues(_.groupBy(identity).mapValues(_.size))
counts: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Map[Int,Int])] = MapPartitionsRDD[1] at mapValues at <console>:26
scala> counts.foreach(println)
(1000,Map(1 -> 2))
(21010,Map(0 -> 3))
(23000,Map(1 -> 5))
(34000,Map(0 -> 1))
(31000,Map(1 -> 7, 0 -> 2))
It's worth noting that the mapValues on Scala collections is lazy, which means that every time you use the maps in the RDD the values will be recomputed. This is probably fine, but if you're concerned, you can replace it with something like this:
values.groupBy(identity).map { case (k, v) => k -> v.size }
…which will return a strictly-evaluated map.

Related

How to find duplicates in a list in Scala?

I have a list of unsorted integers and I want to find the elements which are duplicated.
val dup = List(1|1|1|2|3|4|5|5|6|100|101|101|102)
I have to find the list of unique elements and also how many times each element is repeated.
I know I can find it with below code :
val ans2 = dup.groupBy(identity).map(t => (t._1, t._2.size))
But I am not able to split the above list on "|" . I tried converting to a String then using split but I got the result below:
L
i
s
t
(
1
0
3
)
I am not sure why I am getting this result.
Reference: How to find duplicates in a list?
The symbol | is a function in scala. You can check the API here
|(x: Int): Int
Returns the bitwise OR of this value and x.
So you don't have a List, you have a single Integer (103) which is the result of operating | with all the integers in your pretended List.
Your code is fine, if you want to make a proper List you should separate its elements by commas
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
If you want to convert your given String before having it on a List you can do:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").toList
Even easier, convert the list of duplicates into a set - a set is a data structure that by default does not have any duplicates.
scala> val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup: List[Int] = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)
scala> val noDup = dup.toSet
res0: scala.collection.immutable.Set[Int] = Set(101, 5, 1, 6, 102, 2, 3, 4, 100)
To count the elements, just call the method sizeon the resulting set:
scala> noDup.size
res3: Int = 9
Another way to solve the problem
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\|").groupBy(x => x).mapValues(_.size)
res0: scala.collection.immutable.Map[String,Int] = Map(100 -> 1, 4 -> 1, 5 -> 2, 6 -> 1, 1 -> 3, 102 -> 1, 2 -> 1, 101 -> 2, 3 -> 1)

Scala hash of stacks has only one stack for all the keys

I have the following hashmap, where each element should be mapped to a stack:
var pos = new HashMap[Int, Stack[Int]] withDefaultValue Stack.empty[Int]
for(i <- a.length - 1 to 0 by -1) {
pos(a(i)).push(i)
}
If a will have elements {4, 6, 6, 4, 6, 6},
and if I add the following lines after the code above:
println("pos(0) is " + pos(0))
println("pos(4) is " + pos(4))
The output will be:
pos(0) is Stack(0, 1, 2, 3, 4, 5)
pos(4) is Stack(0, 1, 2, 3, 4, 5)
Why is this happening?
I don't want to add elements to pos(0), but only to pos(4) and pos(6) (the lements of a).
It looks like there is only one stack mapped to all the keys. I want a stack for each key.
Check the docs:
http://www.scala-lang.org/api/current/index.html#scala.collection.mutable.HashMap
method withDefaultValue takes this value as a regular parameter, it won't be recalculated so all your entries share the same mutable stack.
def withDefaultValue(d: B): Map[A, B]
You should use withDefault method instead.
val pos = new HashMap[Int, Stack[Int]] withDefault (_ => Stack.empty[Int])
Edit
Above solution doesn't seem to work, I get empty stacks. Checking with sources shows that the default value is returned but never put into map
override def apply(key: A): B = {
val result = findEntry(key)
if (result eq null) default(key)
else result.value
}
I guess one solution might be to override apply or default method to add the entry to map before returning it. Example for default method:
val pos = new mutable.HashMap[Int, mutable.Stack[Int]]() {
override def default(key: Int) = {
val newValue = mutable.Stack.empty[Int]
this += key -> newValue
newValue
}
}
Btw. that is punishment for being mutable, I encourage you to use immutable data structures.
If you are looking for a more idiomatic Scala, functional-style solution without those mutable collections, consider this:
scala> val a = List(4, 6, 6, 4, 6, 6)
a: List[Int] = List(4, 6, 6, 4, 6, 6)
scala> val pos = a.zipWithIndex groupBy {_._1} mapValues { _ map (_._2) }
pos: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(0, 3), 6 -> List(1, 2, 4, 5))
It may look confusing at first, but if you break it down, zipWithIndex gets pairs of values and their positions, groupBy makes a map from each value to a list of entries, and mapValues is then used to turn the lists of (value, position) pairs into just lists of positions.
scala> val pairs = a.zipWithIndex
pairs: List[(Int, Int)] = List((4,0), (6,1), (6,2), (4,3), (6,4), (6,5))
scala> val pairsByValue = pairs groupBy (_._1)
pairsByValue: scala.collection.immutable.Map[Int,List[(Int, Int)]] = Map(4 -> List((4,0), (4,3)), 6 -> List((6,1), (6,2), (6,4), (6,5)))
scala> val pos = pairsByValue mapValues (_ map (_._2))
pos: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(0, 3), 6 -> List(1, 2, 4, 5))

Operate on neighbor elements in RDD in Spark

As I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to make it sorted as:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by compute 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on like this:
for(i <- 0 to list.length -2) yield {
list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want to make the next element minus the current element.
But how to implement this in RDD on Spark?
I know for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
There will be some partitions of the collection, each partition is ordered, can I do the similar operation on each partition and collect results on each partition together?
The most efficient solution is to use sliding method:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(Seq(1, 3,-1, 0, 2, -4, 6))
.sortBy(identity)
.sliding(2)
.map{case Array(x, y) => y - x}
Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's create first collection with index as key like Ton Torres suggested
val original = seq.zipWithIndex.map(_.swap)
Now we can build collection shifted by one element.
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next we can calculate needed differences ordered by index descending
val diffs = original.join(shifted)
.sortBy(_._1, ascending = false)
.map { case (idx, (v1, v2)) => v2 - v1 }
So
println(diffs.collect.toSeq)
shows
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for local collection this could be computed much more simple like:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But in case of RDD the zip method requires each collection should contain equal element count for each partition. So if you would implement tail like
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
tail.zip(seq) would not work since both collection needs equal count of elements for each partition and we have one element for each partition that should travel to previous partition.

How to concatenate lists that are values of map?

Given:
scala> var a = Map.empty[String, List[Int]]
a: scala.collection.immutable.Map[String,List[Int]] = Map()
scala> a += ("AAA" -> List[Int](1,3,4))
scala> a += ("BBB" -> List[Int](4,1,4))
scala> a
res0: scala.collection.immutable.Map[String,List[Int]] = Map(AAA -> List(1, 3, 4), BBB -> List(4, 1, 4))
How to concatenate the values to a single iterable collection (to be sorted)?
List(1, 3, 4, 4, 1, 4)
How should I end this code?
a.values.[???].sorted
You should end it with:
a.values.flatten
Result:
scala> Map("AAA" -> List(1, 3, 4), "BBB" -> List(4, 1, 4))
res50: scala.collection.immutable.Map[String,List[Int]] = Map(AAA -> List(1, 3, 4), BBB -> List(4, 1, 4))
scala> res50.values.flatten
res51: Iterable[Int] = List(1, 3, 4, 4, 1, 4)
Updated:
For your specific case it's:
(for(vs <- a.asScala.values; v <- vs.asScala) yield v.asInstanceOf[TargetType]).sorted
This will work
a.values.flatten
//> res0: Iterable[Int] = List(1, 3, 4, 4, 1, 4)
Consider
a.flatMap(_._2)
which flattens up the second element of each tuple (each value in the map).
Equivalent in this case is also
a.values.flatMap(identity)
My appreciation of all answers I have received. Finally good points led to really working code. Below is real code fragment and x here is org.apache.hadoop.hbase.client.Put which makes all the 'devil in the details'. I needed HBase Put to be converted into list of appropriate data cells (accessible from puts through org.apache.hadoop.hbase.Cell interface) but yet I need disclosure of the fact they are indeed implemented as KeyValue (org.apache.hadoop.hbase.KeyValue).
val a : Put ...
a.getFamilyCellMap.asScala
.flatMap(
_._2.asScala.flatMap(
x => List[KeyValue](x.asInstanceOf[KeyValue]) )
).toList.sorted
Why so complex?
Put is Java type to represent 'write' operation content and we can get its cells only through map of cells families elements of which are lists. Of course they all are Java.
I have only access to interface (Cell) but I need implementation (KeyValue) so downcast is required. I have guarantee nothing else is present.
The most funny thing after all of this I decided to drop standard Put and encapsulate data into different container (which is my custom class) on earlier stage and this made things much more simple.
So more generic answer for this case where a is java.util.Map[?] with values of java.util.List[?] and elements of list are of BaseType but you need `TargetType is probably:
a.asScala.flatMap(
_._2.asScala.flatMap(
x => List[TargetType](x.asInstanceOf[TargetType]) )
).toList.sorted

How can I find repeated items in a Scala List?

I have a Scala List that contains some repeated numbers. I want to count the number of times a specific number will repeat itself. For example:
val list = List(1,2,3,3,4,2,8,4,3,3,5)
val repeats = list.takeWhile(_ == List(3,3)).size
And the val repeats would equal 2.
Obviously the above is pseudo-code and takeWhile will not find two repeated 3s since _ represents an integer. I tried mixing both takeWhile and take(2) but with little success. I also referred code from How to find count of repeatable elements in scala list but it appears the author is looking to achieve something different.
Thanks for your help.
This will work in this case:
val repeats = list.sliding(2).count(_.forall(_ == 3))
The sliding(2) method gives you an iterator of lists of elements and successors and then we just count where these two are equal to 3.
Question is if it creates the correct result to List(3, 3, 3)? Do you want that to be 2 or just 1 repeat.
val repeats = list.sliding(2).toList.count(_==List(3,3))
and more generally the following code returns tuples of element and repeats value for all elements:
scala> list.distinct.map(x=>(x,list.sliding(2).toList.count(_.forall(_==x))))
res27: List[(Int, Int)] = List((1,0), (2,0), (3,2), (4,0), (8,0), (5,0))
which means that the element '3' repeats 2 times consecutively at 2 places and all others 0 times.
and also if we want element repeats 3 times consecutively we just need to modify the code as follows:
list.distinct.map(x=>(x,list.sliding(3).toList.count(_.forall(_==x))))
in SCALA REPL:
scala> val list = List(1,2,3,3,3,4,2,8,4,3,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 3, 4, 2, 8, 4, 3, 3, 3, 5)
scala> list.distinct.map(x=>(x,list.sliding(3).toList.count(_==List(x,x,x))))
res29: List[(Int, Int)] = List((1,0), (2,0), (3,2), (4,0), (8,0), (5,0))
Even sliding value can be varied by defining a function as:
def repeatsByTimes(list:List[Int],n:Int) =
list.distinct.map(x=>(x,list.sliding(n).toList.count(_.forall(_==x))))
Now in REPL:
scala> val list = List(1,2,3,3,4,2,8,4,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 4, 2, 8, 4, 3, 3, 5)
scala> repeatsByTimes(list,2)
res33: List[(Int, Int)] = List((1,0), (2,0), (3,2), (4,0), (8,0), (5,0))
scala> val list = List(1,2,3,3,3,4,2,8,4,3,3,3,2,4,3,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 3, 4, 2, 8, 4, 3, 3, 3, 2, 4, 3, 3, 3, 5)
scala> repeatsByTimes(list,3)
res34: List[(Int, Int)] = List((1,0), (2,0), (3,3), (4,0), (8,0), (5,0))
scala>
We can go still further like given a list of integers and given a maximum number
of consecutive repetitions that any of the element can occur in the list, we may need a list of 3-tuples representing (the element, number of repetitions of this element, at how many places this repetition occurred). this is more exhaustive information than the above. Can be achieved by writing a function like this:
def repeats(list:List[Int],maxRep:Int) =
{ var v:List[(Int,Int,Int)] = List();
for(i<- 1 to maxRep)
v = v ++ list.distinct.map(x=>
(x,i,list.sliding(i).toList.count(_.forall(_==x))))
v.sortBy(_._1) }
in SCALA REPL:
scala> val list = List(1,2,3,3,3,4,2,8,4,3,3,3,2,4,3,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 3, 4, 2, 8, 4, 3, 3, 3, 2, 4, 3, 3, 3, 5)
scala> repeats(list,3)
res38: List[(Int, Int, Int)] = List((1,1,1), (1,2,0), (1,3,0), (2,1,3),
(2,2,0), (2,3,0), (3,1,9), (3,2,6), (3,3,3), (4,1,3), (4,2,0), (4,3,0),
(5,1,1), (5,2,0), (5,3,0), (8,1,1), (8,2,0), (8,3,0))
scala>
These results can be understood as follows:
1 times the element '1' occurred at 1 places.
2 times the element '1' occurred at 0 places.
............................................
............................................
.............................................
2 times the element '3' occurred at 6 places..
.............................................
3 times the element '3' occurred at 3 places...
............................................and so on.
Thanks to Luigi Plinge I was able to use methods in run-length encoding to group together items in a list that repeat. I used some snippets from this page here: http://aperiodic.net/phil/scala/s-99/
var n = 0
runLengthEncode(totalFrequencies).foreach{ o =>
if(o._1 > 1 && o._2==subjectNumber) n+=1
}
n
The method runLengthEncode is as follows:
private def pack[A](ls: List[A]): List[List[A]] = {
if (ls.isEmpty) List(List())
else {
val (packed, next) = ls span { _ == ls.head }
if (next == Nil) List(packed)
else packed :: pack(next)
}
}
private def runLengthEncode[A](ls: List[A]): List[(Int, A)] =
pack(ls) map { e => (e.length, e.head) }
I'm not entirely satisfied that I needed to use the mutable var n to count the number of occurrences but it did the trick. This will count the number of times a number repeats itself no matter how many times it is repeated.
If you knew your list was not very long you could do it with Strings.
val list = List(1,2,3,3,4,2,8,4,3,3,5)
val matchList = List(3,3)
(matchList.mkString(",")).r.findAllMatchIn(list.mkString(",")).length
From you pseudocode I got this working:
val pairs = list.sliding(2).toList //create pairs of consecutive elements
val result = pairs.groupBy(x => x).map{ case(x,y) => (x,y.size); //group pairs and retain the size, which is the number of occurrences.
result will be a Map[List[Int], Int] so you can the count number like:
result(List(3,3)) // will return 2
I couldn't understand if you also want to check lists of several sizes, then you would need to change the parameter to sliding to the desired size.
def pack[A](ls: List[A]): List[List[A]] = {
if (ls.isEmpty) List(List())
else {
val (packed, next) = ls span { _ == ls.head }
if (next == Nil) List(packed)
else packed :: pack(next)
}
}
def encode[A](ls: List[A]): List[(Int, A)] = pack(ls) map { e => (e.length, e.head) }
val numberOfNs = list.distinct.map{ n =>
(n -> list.count(_ == n))
}.toMap
val runLengthPerN = runLengthEncode(list).map{ t => t._2 -> t._1}.toMap
val nRepeatedMostInSuccession = runLengthPerN.toList.sortWith(_._2 <= _._2).head._1
Where runLength is defined as below from scala's 99 problems problem 9 and scala's 99 problems problem 10.
Since numberOfNs and runLengthPerN are Maps, you can get the population count of any number in the list with numberOfNs(number) and the length of the longest repitition in succession with runLengthPerN(number). To get the runLength, just compute as above with runLength(list).map{ t => t._2 -> t._1 }.