RDD intersections - scala

I have a query about intersection between two RDDs.
My first RDD has a list of elements like this:
A = List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)
And the second RDD is like this:
B = (1,2,3,4,5,6,8,9)
(I could store B in memory as a Set but not the first one.)
I would like to do an intersection of each element in A with B
List(1,2,3,4).intersect((1,2,3,4,5,6,8,9))
List(4,5,6).intersect((1,2,3,4,5,6,8,9))
List(8,3,1).intersect((1,2,3,4,5,6,8,9))
List(1,6,8,9,2).intersect((1,2,3,4,5,6,8,9))
How can I do this in Scala?

val result = rdd.map( x => x.intersect(B))
B must be a Seq whose element type matches the elements of rdd (here B is a List[Int] and rdd is an RDD[List[Int]]). Also, note that if B is big but still fits in memory, you will probably want to broadcast it, as described in the Spark documentation on broadcast variables.
scala> val B = List(1,2,3,4,5,6,8,9)
B: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
scala> val rdd = sc.parallelize(Seq(List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)))
rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> rdd.map( x => x.intersect(B)).collect.mkString("\n")
res3: String =
List(1, 2, 3, 4)
List(4, 5, 6)
List(8, 3, 1)
List(1, 6, 8, 9, 2)
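Since the question mentions keeping B in memory as a Set, here is a minimal plain-Scala sketch (no Spark needed) of how the per-element function could look in that case. The names bList, bSet and rows are made up for illustration, and the extra row List(7, 1, 1) is added only to show the difference between the two approaches. List.intersect expects a Seq and performs a multiset intersection, whereas filtering with a Set keeps every element that is a member of the Set.

```scala
// Hypothetical sample data mirroring the question (the last row is extra).
val bList: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
val bSet: Set[Int] = bList.toSet

val rows: List[List[Int]] =
  List(List(1, 2, 3, 4), List(4, 5, 6), List(8, 3, 1), List(1, 6, 8, 9, 2), List(7, 1, 1))

// intersect takes a Seq and treats both sides as multisets:
// an element is kept at most as many times as it occurs in bList.
val viaIntersect: List[List[Int]] = rows.map(_.intersect(bList))

// A Set[Int] is also a function Int => Boolean, so it can be passed to filter directly.
val viaFilter: List[List[Int]] = rows.map(_.filter(bSet))

println(viaIntersect.last) // List(1)
println(viaFilter.last)    // List(1, 1)
```

The same function bodies should work inside rdd.map. With a large B, one would presumably wrap the Set with sc.broadcast and read it through .value inside the closure.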

Related

partition an Array with offset

In Clojure I can partition a vector with an offset step like this:
(partition 2 1 [1 2 3 4])
This returns a sequence of lists of n items each, at offsets step apart.
For example, the call above returns
((1 2) (2 3) (3 4))
I wonder how I can achieve the same in Scala.
Use sliding: Array(1, 2, 3, 4).sliding(2). This gives you an Iterator; you can call e.g. toArray on it to get an Array[Array[Int]] whose inner arrays are the desired windows.
The standard library has a function, sliding, for this purpose:
scala> val x = Array(1, 2, 3).sliding(2, 1)
x: Iterator[Array[Int]] = non-empty iterator
scala> x.next
res8: Array[Int] = Array(1, 2)
scala> x.next
res9: Array[Int] = Array(2, 3)
scala> val l = List(1, 2, 3, 4, 5)
l: List[Int] = List(1, 2, 3, 4, 5)
scala> l.sliding(2).toList
res0: List[List[Int]] = List(List(1, 2), List(2, 3), List(3, 4), List(4, 5))
I think this does what you need:
List(1,2,3,4).sliding(2,1).toList
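One difference from Clojure is worth a quick sketch: when the step does not line up with the length, sliding keeps a shorter final window, while Clojure's partition drops it. The helper below, partitionExact, is a made-up name (not a standard library method) that filters out incomplete windows to mimic the Clojure behaviour.

```scala
// partitionExact is a hypothetical helper, not part of the standard library.
// It keeps only windows that have exactly n elements.
def partitionExact[A](n: Int, step: Int, xs: List[A]): List[List[A]] =
  xs.sliding(n, step).filter(_.size == n).toList

// Plain sliding keeps the trailing partial window.
val withPartial = List(1, 2, 3, 4, 5).sliding(2, 2).toList
println(withPartial) // List(List(1, 2), List(3, 4), List(5))

// Filtering on the window size drops it.
val exact = partitionExact(2, 2, List(1, 2, 3, 4, 5))
println(exact) // List(List(1, 2), List(3, 4))
```

For a step of 1 and a list at least n long, both variants give the same result, so the filter only matters for larger steps or very short inputs.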

Scala collection: select elements until the first one that meets a requirement

For example, I have the following Scala list, and I want to get the sublist of elements that come before the first one meeting a requirement.
val list = Seq(1,2,3,4,5,5,4,1,2,5)
The requirement is the number is 5, so I want the result as:
Seq(1,2,3,4)
Currently I combine indexWhere and splitAt, which returns:
list.splitAt(list.indexWhere(x => x == 5))
(Seq(1,2,3,4), Seq(5,5,4,1,2,5))
Is there a better way to achieve this with a Scala collection method I have overlooked?
You can use takeWhile:
scala> val list = Seq(1,2,3,4,5,5,4,1,2,5)
list: Seq[Int] = List(1, 2, 3, 4, 5, 5, 4, 1, 2, 5)
scala> list.takeWhile(_ != 5)
res30: Seq[Int] = List(1, 2, 3, 4)
Use span like this,
val (l,r) = list.span(_ != 5)
which delivers
l: List(1, 2, 3, 4)
r: List(5, 5, 4, 1, 2, 5)
Alternatively, you can write
val l = list.span(_ != 5)._1
to access only the first element of the resulting tuple.
This bisects the list at the first element that does not satisfy the condition.
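The approaches differ when no element meets the requirement, which the following sketch tries to show. The sequence noFive is made-up sample data. indexWhere returns -1 when nothing matches, so splitAt puts everything in the second half, while takeWhile and span keep everything in the first.

```scala
// Hypothetical input that contains no 5 at all.
val noFive = Seq(1, 2, 3, 4)

// takeWhile keeps the whole sequence because the predicate never fails.
val viaTakeWhile = noFive.takeWhile(_ != 5)

// span returns the whole sequence as the prefix and an empty remainder.
val (before, after) = noFive.span(_ != 5)

// indexWhere yields -1 here, and splitAt(-1) returns an empty prefix.
val viaSplitAt = noFive.splitAt(noFive.indexWhere(_ == 5))

println(viaTakeWhile) // List(1, 2, 3, 4)
println(viaSplitAt)   // (List(),List(1, 2, 3, 4))
```

So if "no match" should mean "take everything", takeWhile or span seems the safer choice.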

Complement method for .last when working with List objects?

Working with Lists in Scala I would like a simple way to get all elements but the last element. Is there a complementary method for .last similar to .head/.tail complement?
I'd rather not dirty up code with something like:
val x: List[String] = List("abc", "def", "ghi")
val allButLast: List[String] = x.reverse.tail.reverse
// List(abc, def)
Thanks.
init selects all elements but the last one.
List API for init.
scala> List(1,2,3,4,5)
res0: List[Int] = List(1, 2, 3, 4, 5)
scala> res0.init
res1: List[Int] = List(1, 2, 3, 4)
The 4 related methods here are head, tail, init, and last.
head and last get the first and final member, whereas
tail and init exclude the first and final members.
scala> val list = (0 to 10).toList
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> list.head
res0: Int = 0
scala> list.tail
res1: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> list.init
res2: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> list.last
res3: Int = 10
You should also take care, because all 4 of them are unsafe on the empty list and will throw exceptions.
These methods are defined on GenTraversableLike, which List implements.
That's init.
link to Scaladoc: http://www.scala-lang.org/api/2.11.5/index.html#scala.collection.immutable.List#init:Repr
def init: List[A]
Selects all elements except the last.
Also, note that it's defined in GenTraversableLike, so pretty much any Scala collection has this method.
For dropping off any number of items from the end of a list consider dropRight,
val xs = (1 to 5).toList
xs.dropRight(1)
List(1, 2, 3, 4)
xs.dropRight(2)
List(1, 2, 3)
xs.dropRight(10)
List()
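Regarding the warning above about empty lists, here is a small sketch of alternatives that do not throw. The value name emptyList is just for illustration. headOption, lastOption, drop and dropRight are standard collection methods, and Try is used only to show that init does fail on an empty list.

```scala
import scala.util.Try

val emptyList = List.empty[Int]

// Option-returning counterparts of head and last.
val safeHead: Option[Int] = emptyList.headOption // None
val safeLast: Option[Int] = emptyList.lastOption // None

// drop(1) and dropRight(1) behave like tail and init but return an empty list instead of throwing.
val safeTail: List[Int] = emptyList.drop(1)      // List()
val safeInit: List[Int] = emptyList.dropRight(1) // List()

// init itself throws on an empty list, so the Try is a Failure.
val initFails: Boolean = Try(emptyList.init).isFailure
println(initFails) // true
```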

takeWhile: also need the first element that fails the condition in Scala

scala> List(1,2,3,4,5,6,7).takeWhile(i=>i<5)
res1: List[Int] = List(1, 2, 3, 4)
What if I also need to include 5 in the result?
Assuming that the function you will be using is more complicated than taking the first 5 elements, you can do:
scala> List(1,2,3,4,5,6,7)
res5: List[Int] = List(1, 2, 3, 4, 5, 6, 7)
scala> res5.takeWhile(_<5) ++ res5.dropWhile(_<5).take(1)
res7: List[Int] = List(1, 2, 3, 4, 5)
Also
scala> res5.span(_<5)
res8: (List[Int], List[Int]) = (List(1, 2, 3, 4),List(5, 6, 7))
scala> res8._1 ++ res8._2.take(1)
res10: List[Int] = List(1, 2, 3, 4, 5)
Also
scala> res5.take(res5.segmentLength(_<5, 0) + 1)
res17: List[Int] = List(1, 2, 3, 4, 5)
scala> res5.take(res5.indexWhere(_>=5) + 1)
res18: List[Int] = List(1, 2, 3, 4, 5)
Edit
If this is not a parallelized computation and you won't be using the parallel collections:
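var firstFailing: Option[Int] = None
val rem = myList.takeWhile { x =>
  val ok = x < 5
  if (!ok) firstFailing = Some(x)
  ok
}
rem ++ firstFailing.toList
The anonymous function passed to takeWhile closes over firstFailing and records the first element that fails the predicate, so it can be appended to the result.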
Previous Answer
I'd settle for the far less complicated:
.takeWhile(_ <= 5)
wherein you're just using the less than or equal operator.
List(1,2,3,4,5,6,7,8).takeWhile(_ <= 5)
This is the simplest solution when the boundary value (here 5) is known in advance and the predicate can be relaxed to include it.
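For an arbitrary predicate, the span-based answer above can be wrapped in a small reusable function. This is only a sketch; takeThrough is a made-up name, not a standard library method.

```scala
// takeThrough is a hypothetical helper: it returns the longest prefix satisfying p,
// plus the first element that fails p (if there is one).
def takeThrough[A](xs: List[A])(p: A => Boolean): List[A] = {
  val (prefix, rest) = xs.span(p)
  prefix ++ rest.take(1)
}

println(takeThrough(List(1, 2, 3, 4, 5, 6, 7))(_ < 5)) // List(1, 2, 3, 4, 5)
println(takeThrough(List(1, 2, 3))(_ < 5))             // List(1, 2, 3)
```

Because rest.take(1) is empty when every element passes, nothing extra is appended in that case, and an empty input simply yields an empty list.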

How can I sort List[Int] objects?

What I want to do is sort List objects in Scala, not sort the elements in the list. For example, if I have two lists of Ints:
val l1 = List(1, 2, 3, 7)
val l2 = List(1, 2, 3, 4, 10)
I want to be able to put them in order where l1 > l2.
I have created a class that does what I need it to, but the problem is that when I use it none of my other methods work. Do I need to implement all the other methods in the class, i.e. flatten, sortWith etc.?
My class code looks like this:
class ItemSet(itemSet: List[Int]) extends Ordered[ItemSet] {
  val iSet: List[Int] = itemSet
  def compare(that: ItemSet) = {
    val thisSize = this.iSet.size
    val thatSize = that.iSet.size
    val hint = List(thisSize, thatSize).min
    var result = 0
    var loop = 0
    val ths = this.iSet.toArray
    val tht = that.iSet.toArray
    while (loop < hint && result == 0) {
      result = ths(loop).compare(tht(loop))
      loop += 1
    }
    if (loop == hint && result == 0 && thisSize != thatSize) {
      thisSize.compare(thatSize)
    } else
      result
  }
}
Now if I create an Array of ItemSets I can sort it:
val is1 = new ItemSet(List(1, 2, 5, 8))
val is2 = new ItemSet(List(1, 2, 5, 6))
val is3 = new ItemSet(List(1, 2, 3, 7, 10))
Array(is1, is2, is3).sorted.foreach(i => println(i.iSet))
List(1, 2, 3, 7, 10)
List(1, 2, 5, 6)
List(1, 2, 5, 8)
The two methods that are giving me problems are:
def itemFrequencies(transDB: Array[ItemSet]): Map[Int, Int] = transDB.flatten.groupBy(x => x).mapValues(_.size)
The error I get is:
Expression of type Map[Nothing, Int] doesn't conform to expected type Map[Int, Int]
And for this one:
def sortListAscFreq(transDB: Array[ItemSet], itemFreq: Map[Int, Int]): Array[List[Int]] = {
for (l <- transDB) yield
l.sortWith(itemFreq(_) < itemFreq(_))
}
I get:
Cannot resolve symbol sortWith.
Is there a way I can just extend List[Int] so that I can sort a collection of lists without losing the functionality of other methods?
The standard library provides a lexicographic ordering for collections of ordered things. You can put it into scope and you're done:
scala> import scala.math.Ordering.Implicits._
import scala.math.Ordering.Implicits._
scala> val is1 = List(1, 2, 5, 8)
is1: List[Int] = List(1, 2, 5, 8)
scala> val is2 = List(1, 2, 5, 6)
is2: List[Int] = List(1, 2, 5, 6)
scala> val is3 = List(1, 2, 3, 7, 10)
is3: List[Int] = List(1, 2, 3, 7, 10)
scala> Array(is1, is2, is3).sorted foreach println
List(1, 2, 3, 7, 10)
List(1, 2, 5, 6)
List(1, 2, 5, 8)
The Ordering type class is often more convenient than Ordered in Scala—it allows you to specify how some existing type should be ordered without having to change its code or create a proxy class that extends Ordered[Whatever], which as you've seen can get messy very quickly.
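To tie this back to the two methods from the question, here is a rough sketch of how they might look once the data is kept as plain List[Int] values instead of ItemSet. The signatures are adapted from the question (Array[ItemSet] becomes Array[List[Int]]), and transDB is sample data reusing the lists above. Since the elements are ordinary lists, flatten and sortWith are available again.

```scala
import scala.math.Ordering.Implicits._

// Sample data: the three lists from the answer above.
val transDB: Array[List[Int]] =
  Array(List(1, 2, 5, 8), List(1, 2, 5, 6), List(1, 2, 3, 7, 10))

// Lexicographic sort, using the Ordering brought in by the import.
val sortedDB: Array[List[Int]] = transDB.sorted

// Count how often each item occurs across all lists.
def itemFrequencies(db: Array[List[Int]]): Map[Int, Int] =
  db.flatten.groupBy(identity).map { case (item, occurrences) => item -> occurrences.length }

// Sort the items inside each list by ascending frequency.
def sortListAscFreq(db: Array[List[Int]], itemFreq: Map[Int, Int]): Array[List[Int]] =
  db.map(l => l.sortWith((a, b) => itemFreq(a) < itemFreq(b)))

val freq = itemFrequencies(transDB)
sortedDB.foreach(println)
println(freq(1)) // 3
```

The map over the grouped result is used here instead of mapValues only to keep the return type a strict Map across Scala versions.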