Spark - Combinations without repetition - scala

I am trying to do all lines combinations without repetition of a text file.
Example:
1
2
2
1
1
Result:
Line 1 with line 2 = (1,2)
Line 1 with line 3 = (1,2)
Line 1 with line 4 = (1,1)
Line 1 with line 5 = (1,1)
Line 2 with line 3 = (2,2)
Line 2 with line 4 = (2,1)
Line 2 with line 5 = (2,1)
Line 3 with line 4 = (2,1)
Line 3 with line 5 = (2,1)
Line 4 with line 5 = (1,1)
or
Considering (x,y), if (x != y) 0 else 1:
0
0
1
1
1
0
0
0
0
1
I have the following code:
def processCombinations(rdd: RDD[String]) = {
rdd.mapPartitions({ partition => {
var previous: String = null;
if (partition.hasNext)
previous = partition.next
for (element <- partition) yield {
if (previous == element)
"1"
else
"0"
}
}
})
}
The piece of code above is doing the combinations of the first element of my RDD, in other words: (1,2) (1,2) (1,1) (1,1).
The problem is: This code ONLY works with ONE PARTITION. I'd like to make this work with many partitions, how could I do that?

It's not very clear exactly what you want as output, but this reproduces your first example, and translates directly to Spark. It generates combinations, but only where the index of the first element in the original list is less than the index of the second, which is I think what you're asking for.
val r = List(1,2,2,1,1)
val z = r zipWithIndex
z.flatMap(x=>z.map(y=>(x,y))).collect{case(x,y) if x._2 < y._2 => (x._1, y._1)}
//List((1,2), (1,2), (1,1), (1,1), (2,2), (2,1), (2,1), (2,1), (2,1), (1,1))
or, as a for-comprehension
for (x<-z; y<-z; if x._2 < y._2) yield (x._1, y._1)

This code calculate the combinations without repetitions by using recursion. It gets 2 arguments: number of elements for the combination and the list of elements.
It works in the following way: for the given list: 1, 2, 3, 4, 5 => It takes the 4 first elements for the first combination. Then It generates other combination with 5, the last element of the list. When there are not more elements left in the list, It moves one position back (third position) and takes the next element to generates more combinations from there: 1, 2, "4", 5. This operation is done recursively with all of elements of the list.
def combinator[A](n: Int, list: List[A], acc: List[A]): List[List[A]] = {
if (n == 0)
List(acc.reverse)
else if (list == Nil)
List()
else
combinator(n - 1, list.tail, list.head :: acc) ::: combinator(n, list.tail, acc)
}
combinator(4, List(1, 2, 3, 4, 5), List()).foreach(println)
// List(1, 2, 3, 4)
// List(1, 2, 3, 5)
// List(1, 2, 4, 5)
// List(1, 3, 4, 5)
// List(2, 3, 4, 5)

Related

How to compare two integer array in Scala?

Find a number available in first array with numbers in second. If number not found get the immediate lower.
val a = List(1,2,3,4,5,6,7,8,9)
val b = List(1,5,10)
expected output after comparing a with b
1 --> 1
2 --> 1
3 --> 1
4 --> 1
5 --> 5
6 --> 5
7 --> 5
8 --> 5
9 --> 5
Thanks
You can use TreeSet's to() and lastOption methods as follows:
val a = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
val b = List(1, 5, 10)
import scala.collection.immutable.TreeSet
// Convert list `b` to TreeSet
val bs = TreeSet(b.toSeq: _*)
a.map( x => (x, bs.to(x).lastOption.getOrElse(Int.MinValue)) ).toMap
// res1: scala.collection.immutable.Map[Int,Int] = Map(
// 5 -> 5, 1 -> 1, 6 -> 5, 9 -> 5, 2 -> 1, 7 -> 5, 3 -> 1, 8 -> 5, 4 -> 1
// )
Note that neither list a or b needs to be ordered.
UPDATE:
Starting Scala 2.13, methods to for TreeSet is replaced with rangeTo.
Here is another approach using collect function
val a = List(1,2,3,4,5,6,7,8,9)
val b = List(1,5,10)
val result = a.collect{
case e if(b.filter(_<=e).size>0) => e -> b.filter(_<=e).reverse.head
}
//result: List[(Int, Int)] = List((1,1), (2,1), (3,1), (4,1), (5,5), (6,5), (7,5), (8,5), (9,5))
Here for every element in a check if there is a number in b i.e. which is greater than or equal to it and reverse the filter list and get its head to make it a pair.

if condition for partial argument in map

I understand how to use if in map. For example, val result = list.map(x => if (x % 2 == 0) x * 2 else x / 2).
However, I want to use if for only part of the arguments.
val inputColumns = List(
List(1, 2, 3, 4, 5, 6), // first "column"
List(4, 6, 5, 7, 12, 15) // second "column"
)
inputColumns.zipWithIndex.map{ case (col, idx) => if (idx == 0) col * 2 else col / 10}
<console>:1: error: ';' expected but integer literal found.
inputColumns.zipWithIndex
res4: List[(List[Int], Int)] = List((List(1, 2, 3, 4, 5, 6),0), (List(4, 6, 5, 7, 12, 15),1))
I have searched the error info but have not found a solution.
Why my code is not 'legal' in Scala? Is there a better way to write it? Basically, I want to do a pattern matching and then do something on other arguments.
To explain your problem another way, inputColumns has type List[List[Int]]. You can verify this in the Scala REPL:
$ scala
Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161).
Type in expressions for evaluation. Or try :help.
scala> val inputColumns = List(
| List(1, 2, 3, 4, 5, 6), // first "column"
| List(4, 6, 5, 7, 12, 15) // second "column"
| )
inputColumns: List[List[Int]] = List(List(1, 2, 3, 4, 5, 6), List(4, 6, 5, 7, 12, 15))
Now, when you call .zipWithIndex on that list, you end up with a List[(List[Int], Int)] - that is, a list of a tuple, in which the first tuple type is a List[Int] (the column) and the second is an Int (the index):
scala> inputColumns.zipWithIndex
res0: List[(List[Int], Int)] = List((List(1, 2, 3, 4, 5, 6),0), (List(4, 6, 5, 7, 12, 15),1))
Consequently, when you try to apply a map function to this list, col is a List[Int] and not an Int, and so col * 2 makes no sense - you're multiplying a List[Int] by 2. You then also try to divide the list by 10, obviously.
scala> inputColumns.zipWithIndex.map{ case(col, idx) => if(idx == 0) col * 2 else col / 10 }
<console>:13: error: value * is not a member of List[Int]
inputColumns.zipWithIndex.map{ case(col, idx) => if(idx == 0) col * 2 else col / 10 }
^
<console>:13: error: value / is not a member of List[Int]
inputColumns.zipWithIndex.map{ case(col, idx) => if(idx == 0) col * 2 else col / 10 }
^
In order to resolve this, it depends what you're trying to achieve. If you want a single list of integers, and then zip those so that each value has an associated index, you should call flatten on inputColumns before calling zipWithIndex. This will result in List[(Int, Int)], where the first value in the tuple is the column value, and the second is the index. Your map function will then work correctly without modification:
scala> inputColumns.flatten.zipWithIndex.map{ case(col, idx) => if(idx == 0) col * 2 else col / 10 }
res3: List[Int] = List(2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
Of course, you no longer have separate columns.
If you wish each value in each list to have an associated index, you need to firstly map inputColumns into two zipped lists, using inputColumns.map(_.zipWithIndex) to create a List[List[(Int, Int)]] - a list of a list of (Int, Int) tuples:
scala> inputColumns.map(_.zipWithIndex)
res4: List[List[(Int, Int)]] = List(List((1,0), (2,1), (3,2), (4,3), (5,4), (6,5)), List((4,0), (6,1), (5,2), (7,3), (12,4), (15,5)))
We can now apply your original map function to the result of the zipWithIndex operation:
scala> inputColumns.map(_.zipWithIndex.map { case (col, idx) => if(idx == 0) col * 2 else col / 10 })
res5: List[List[Int]] = List(List(2, 0, 0, 0, 0, 0), List(8, 0, 0, 0, 1, 1))
The result is another List[List[Int]] with each internal list being the results of your map operation on the original two input columns.
On the other hand, if idx is meant to be the index of the column, and not of each value, and you want to multiply all of the values in the first column by 2 and divide all of the values in the other columns by 10, then you need to change your original map function to map across each column, as follows:
scala> inputColumns.zipWithIndex.map {
| case (col, idx) => {
| if(idx == 0) col.map(_ * 2) // Multiply values in first column by 1
| else col.map(_ / 10) // Divide values in all other columns by 10
| }
| }
res5: List[List[Int]] = List(List(2, 4, 6, 8, 10, 12), List(0, 0, 0, 0, 1, 1))
Let me know if you require any further clarification...
UPDATE:
The use of case in map is a common Scala shorthand. If a higher-order function takes a single argument, something such as this:
def someHOF[A, B](x: A => B) = //...
and you call that function like this (with what Scala terms a partial function - a function consisting solely of a list of case statements):
someHOF {
case expr1 => //...
case expr2 => //...
...
}
then Scala treats it as a kind-of shorthand for:
someHOF {a =>
a match {
case expr1 => //...
case expr2 => //...
...
}
}
or, being slightly more terse,
someHOF {
_ match {
case expr1 => //...
case expr2 => //...
...
}
}
For a List, for example, you can use it with functions such as map, flatMap, filter, etc.
In the case of your map function, the sole argument is a tuple, and the sole case statement acts to break open the tuple and expose its contents. That is:
val l = List((1, 2), (3, 4), (5, 6))
l.map { case(a, b) => println(s"First is $a, second is $b") }
is equivalent to:
l.map {x =>
x match {
case (a, b) => println(s"First is $a, second is $b")
}
}
and both will output:
First is 1, second is 2
First is 3, second is 4
First is 5, second is 6
Note: This latter is a bit of a dumb example, since map is supposed to map (i.e. change) the values in the list into new values in a new list. If all you were doing was printing the values, this would be better:
val l = List((1, 2), (3, 4), (5, 6))
l.foreach { case(a, b) => println(s"First is $a, second is $b") }
You are trying to multiply a list by 2 when you do col * 2 as col is List(1, 2, 3, 4, 5, 6) when idx is 0, which is not possible and similar is the case with else part col / 10
If you are trying to multiply the elements of first list by 2 and devide the elements of rest of the list by 10 then you should be doing the following
inputColumns.zipWithIndex.map{ case (col, idx) => if (idx == 0) col.map(_*2) else col.map(_/10)}
Even better approach would be to use match case
inputColumns.zipWithIndex.map(x => x._2 match {
case 0 => x._1.map(_*2)
case _ => x._1.map(_/10)
})

Attacking grouping sets of values functionally

Given a map associating indices to values, how do I create a separate map that accumulates the values that are above a particular threshold when the number of values that can be grouped together cannot exceed some limiting value?
For example, given a mapping like this:
val raw = Map(0 -> 2, 1 -> 1, 2 -> 2, 3 -> 0, 4 -> 1, 5 -> 2)
Group those values over 2 together but each grouping can only contain at most the sum of 2 values such that if the first value is >= 2 then the grouping would contain a single value. In contrast, if the 1st value is less than 2, the grouping will be of size 2 with a value consisting of the same of the 1st value and the second value.
Executing that on the mapping above would yield a map of the group's index to the value, e.g.,
Map(0 -> 2, 1 -> 3, 2 -> 1, 3 -> 2) // Result
Obviously the way to do this in a non-functional way would be like this:
var c = 0
var sortedIndex = 0
var acc: Map[Int, Int] = Map() // Result accumulator
val limit = 2 // Anything larger will be forced into the next group
while (c < raw.size) {
if (raw(c) >= limit) {
acc = acc ++ Map(sortedIndex -> raw(c))
c = c + 1
} else {
acc = acc ++ Map(sortedIndex -> raw(c) + raw(c + i)
c = c + 2
}
sortedIndex = sortedIndex + 1
}
acc
How would I do this functionally? I.e., immutable states, reducing my use of loops. (I understand that loops are not "dead" in FP, just trying to reinforce a use case where I can get away with NOT using loops.)
I do not think you need to work with Map for this problem. Since the key of the map is simple index. Any case the following work for your problem:
val testLimit = 2 // Update the constants as required
val takeUpto = 2
def accumulator(input: List[Int], output: List[Int] = List.empty[Int]): List[Int] = {
input match {
case Nil => output // We have reached at the end of the input
case head :: tail if head >= testLimit => accumulator(tail, output :+ head)
case m =>
val (toSum, next) = m.splitAt(takeUpto)
accumulator(next, output :+ toSum.sum)
}
}
// Map(0 -> 2, 1 -> 3, 3 -> 1, 4 -> 2) // Result
// val raw = Map(0 -> 2, 1 -> 1, 2 -> 2, 3 -> 0, 4 -> 1, 5 -> 2) equivalent is List(2, 1, 2, 0, 1, 2)
println(accumulator(List(2, 1, 2, 0, 1, 2)))

How can I find repeated items in a Scala List?

I have a Scala List that contains some repeated numbers. I want to count the number of times a specific number will repeat itself. For example:
val list = List(1,2,3,3,4,2,8,4,3,3,5)
val repeats = list.takeWhile(_ == List(3,3)).size
And the val repeats would equal 2.
Obviously the above is pseudo-code and takeWhile will not find two repeated 3s since _ represents an integer. I tried mixing both takeWhile and take(2) but with little success. I also referred code from How to find count of repeatable elements in scala list but it appears the author is looking to achieve something different.
Thanks for your help.
This will work in this case:
val repeats = list.sliding(2).count(_.forall(_ == 3))
The sliding(2) method gives you an iterator of lists of elements and successors and then we just count where these two are equal to 3.
Question is if it creates the correct result to List(3, 3, 3)? Do you want that to be 2 or just 1 repeat.
val repeats = list.sliding(2).toList.count(_==List(3,3))
and more generally the following code returns tuples of element and repeats value for all elements:
scala> list.distinct.map(x=>(x,list.sliding(2).toList.count(_.forall(_==x))))
res27: List[(Int, Int)] = List((1,0), (2,0), (3,2), (4,0), (8,0), (5,0))
which means that the element '3' repeats 2 times consecutively at 2 places and all others 0 times.
and also if we want element repeats 3 times consecutively we just need to modify the code as follows:
list.distinct.map(x=>(x,list.sliding(3).toList.count(_.forall(_==x))))
in SCALA REPL:
scala> val list = List(1,2,3,3,3,4,2,8,4,3,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 3, 4, 2, 8, 4, 3, 3, 3, 5)
scala> list.distinct.map(x=>(x,list.sliding(3).toList.count(_==List(x,x,x))))
res29: List[(Int, Int)] = List((1,0), (2,0), (3,2), (4,0), (8,0), (5,0))
Even sliding value can be varied by defining a function as:
def repeatsByTimes(list:List[Int],n:Int) =
list.distinct.map(x=>(x,list.sliding(n).toList.count(_.forall(_==x))))
Now in REPL:
scala> val list = List(1,2,3,3,4,2,8,4,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 4, 2, 8, 4, 3, 3, 5)
scala> repeatsByTimes(list,2)
res33: List[(Int, Int)] = List((1,0), (2,0), (3,2), (4,0), (8,0), (5,0))
scala> val list = List(1,2,3,3,3,4,2,8,4,3,3,3,2,4,3,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 3, 4, 2, 8, 4, 3, 3, 3, 2, 4, 3, 3, 3, 5)
scala> repeatsByTimes(list,3)
res34: List[(Int, Int)] = List((1,0), (2,0), (3,3), (4,0), (8,0), (5,0))
scala>
We can go still further like given a list of integers and given a maximum number
of consecutive repetitions that any of the element can occur in the list, we may need a list of 3-tuples representing (the element, number of repetitions of this element, at how many places this repetition occurred). this is more exhaustive information than the above. Can be achieved by writing a function like this:
def repeats(list:List[Int],maxRep:Int) =
{ var v:List[(Int,Int,Int)] = List();
for(i<- 1 to maxRep)
v = v ++ list.distinct.map(x=>
(x,i,list.sliding(i).toList.count(_.forall(_==x))))
v.sortBy(_._1) }
in SCALA REPL:
scala> val list = List(1,2,3,3,3,4,2,8,4,3,3,3,2,4,3,3,3,5)
list: List[Int] = List(1, 2, 3, 3, 3, 4, 2, 8, 4, 3, 3, 3, 2, 4, 3, 3, 3, 5)
scala> repeats(list,3)
res38: List[(Int, Int, Int)] = List((1,1,1), (1,2,0), (1,3,0), (2,1,3),
(2,2,0), (2,3,0), (3,1,9), (3,2,6), (3,3,3), (4,1,3), (4,2,0), (4,3,0),
(5,1,1), (5,2,0), (5,3,0), (8,1,1), (8,2,0), (8,3,0))
scala>
These results can be understood as follows:
1 times the element '1' occurred at 1 places.
2 times the element '1' occurred at 0 places.
............................................
............................................
.............................................
2 times the element '3' occurred at 6 places..
.............................................
3 times the element '3' occurred at 3 places...
............................................and so on.
Thanks to Luigi Plinge I was able to use methods in run-length encoding to group together items in a list that repeat. I used some snippets from this page here: http://aperiodic.net/phil/scala/s-99/
var n = 0
runLengthEncode(totalFrequencies).foreach{ o =>
if(o._1 > 1 && o._2==subjectNumber) n+=1
}
n
The method runLengthEncode is as follows:
private def pack[A](ls: List[A]): List[List[A]] = {
if (ls.isEmpty) List(List())
else {
val (packed, next) = ls span { _ == ls.head }
if (next == Nil) List(packed)
else packed :: pack(next)
}
}
private def runLengthEncode[A](ls: List[A]): List[(Int, A)] =
pack(ls) map { e => (e.length, e.head) }
I'm not entirely satisfied that I needed to use the mutable var n to count the number of occurrences but it did the trick. This will count the number of times a number repeats itself no matter how many times it is repeated.
If you knew your list was not very long you could do it with Strings.
val list = List(1,2,3,3,4,2,8,4,3,3,5)
val matchList = List(3,3)
(matchList.mkString(",")).r.findAllMatchIn(list.mkString(",")).length
From you pseudocode I got this working:
val pairs = list.sliding(2).toList //create pairs of consecutive elements
val result = pairs.groupBy(x => x).map{ case(x,y) => (x,y.size); //group pairs and retain the size, which is the number of occurrences.
result will be a Map[List[Int], Int] so you can the count number like:
result(List(3,3)) // will return 2
I couldn't understand if you also want to check lists of several sizes, then you would need to change the parameter to sliding to the desired size.
def pack[A](ls: List[A]): List[List[A]] = {
if (ls.isEmpty) List(List())
else {
val (packed, next) = ls span { _ == ls.head }
if (next == Nil) List(packed)
else packed :: pack(next)
}
}
def encode[A](ls: List[A]): List[(Int, A)] = pack(ls) map { e => (e.length, e.head) }
val numberOfNs = list.distinct.map{ n =>
(n -> list.count(_ == n))
}.toMap
val runLengthPerN = runLengthEncode(list).map{ t => t._2 -> t._1}.toMap
val nRepeatedMostInSuccession = runLengthPerN.toList.sortWith(_._2 <= _._2).head._1
Where runLength is defined as below from scala's 99 problems problem 9 and scala's 99 problems problem 10.
Since numberOfNs and runLengthPerN are Maps, you can get the population count of any number in the list with numberOfNs(number) and the length of the longest repitition in succession with runLengthPerN(number). To get the runLength, just compute as above with runLength(list).map{ t => t._2 -> t._1 }.

Comparing items in two lists

I have two lists : List(1,1,1) , List(1,0,1)
I want to get the following :
A count of every element that contains a 1 in first list and a 0 in the corresponding list at same position and vice versa.
In above example this would be 1 , 0 since the first list contains a 1 at middle position and second list contains a 0 at same position (middle).
A count of every element where 1 is in first list and 1 is also in second list.
In above example this is two since there are two 1's in each corresponding list. I can get this using the intersect method of class List.
I am just looking an answer to point 1 above. I could use an iterative a approach to count the items but is there a more functional method ?
Here is the entire code :
class Similarity {
def getSimilarity(number1: List[Int], number2: List[Int]) = {
val num: List[Int] = number1.intersect(number2)
println("P is " + num.length)
}
}
object HelloWorld {
def main(args: Array[String]) {
val s = new Similarity
s.getSimilarity(List(1, 1, 1), List(1, 0, 1))
}
}
For the first one:
scala> val a = List(1,1,1)
a: List[Int] = List(1, 1, 1)
scala> val b = List(1,0,1)
b: List[Int] = List(1, 0, 1)
scala> a.zip(b).filter(x => x._1==1 && x._2==0).size
res7: Int = 1
For the second:
scala> a.zip(b).filter(x => x._1==1 && x._2==1).size
res7: Int = 2
You can count all combinations easily and have it in a map with
def getSimilarity(number1 : List[Int] , number2 : List[Int]) = {
//sorry for the 1-liner, explanation follows
val countMap = (number1 zip number2) groupBy (identity) mapValues {_.length}
}
/*
* Example
* number1 = List(1,1,0,1,0,0,1)
* number2 = List(0,1,1,1,0,1,1)
*
* countMap = Map((1,0) -> 1, (1,1) -> 3, (0,1) -> 2, (0,0) -> 1)
*/
The trick is a common one
// zip the elements pairwise
(number1 zip number2)
/* List((1,0), (1,1), (0,1), (1,1), (0,0), (0,1), (1,1))
*
* then group together with the identity function, so pairs
* with the same elements are grouped together and the key is the pair itself
*/
.groupBy(identity)
/* Map( (1,0) -> List((1,0)),
* (1,1) -> List((1,1), (1,1), (1,1)),
* (0,1) -> List((0,1), (0,1)),
* (0,0) -> List((0,0))
* )
*
* finally you count the pairs mapping the values to the length of each list
*/
.mapValues(_.length)
/* Map( (1,0) -> 1,
* (1,1) -> 3,
* (0,1) -> 2,
* (0,0) -> 1
* )
Then all you need to do is lookup on the map
a.zip(b).filter(x => x._1 != x._2).size
Almost the same solution that was proposed by Jatin, except that you can useList.countfor a better lisibility:
def getSimilarity(l1: List[Int], l2: List[Int]) =
l1.zip(l2).count({case (x,y) => x != y})
You can also use foldLeft. Assuming there are no non-negative numbers:
a.zip(b).foldLeft(0)( (x,y) => if (y._1 + y._2 == 1) x + 1 else x )
1) You could zip 2 lists to get list of (Int, Int), collect only pairs (1, 0) and (0, 1), replace (1, 0) with 1 and (0, 1) with -1 and get sum. If count of (1, 0) and count of (0, 1) are the same the sum would be equal 0:
val (l1, l2) = (List(1,1,1) , List(1,0,1))
(l1 zip l2).collect{
case (1, 0) => 1
case (0, 1) => -1
}.sum == 0
You could use view method to prevent creation intermediate collections.
2) You could use filter and length to get count of elements with some condition:
(l1 zip l2).filter{ _ == (1, 1) }.length
(l1 zip l2).collect{ case (1, 1) => () }.length