Spark Scala - Split columns into multiple rows

Following up on the question I posted here:
Spark Mllib - Scala
I have another doubt... Is it possible to transform a dataset like this:
2,1,3
1
3,6,8
Into this:
2,1
2,3
1,3
1
3,6
3,8
6,8
Basically I want to discover all the relationships between the movies. Is it possible to do this?
My current code is:
val input = sc.textFile("PATH")
val raw = input.map(_.split(","))
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)

Assuming that input is a multi-line string:
scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))
The following approach discards one-element arrays (the 1 in your example).
scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))
It can be fixed by appending the filtered raw collection.
scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))
The order of the combinations is not relevant, I believe.
Update
SparkContext.textFile returns an RDD of lines, so it could be plugged in as:
val raw = rdd.map(_.split(","))
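For a quick sanity check, the same pipeline can be run without Spark on plain collections (a sketch using the sample input from the question):

```scala
// Pure-Scala sketch of the pipeline, no Spark required.
val lines = Seq("2,1,3", "1", "3,6,8")
val raw = lines.map(_.split(",").toList)
// All unordered pairs within each line...
val pairs = raw.flatMap(_.combinations(2))
// ...plus the one-element lines that combinations(2) drops.
val result = pairs ++ raw.filter(_.length == 1)
// result: List(List(2, 1), List(2, 3), List(1, 3),
//              List(3, 6), List(3, 8), List(6, 8), List(1))
```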

Related

finding least subset in an rdd in spark

I have an RDD of BitSets like
Array[scala.collection.mutable.BitSet] = Array(BitSet(1, 2), BitSet(1, 7), BitSet(8, 9, 10, 11), BitSet(1, 2, 3, 4), BitSet(8, 9, 10), BitSet(1, 2, 3))
I want an RDD of (BitSet(1, 2), BitSet(1, 7), BitSet(8, 9, 10)),
that is, I need the least subsets: the BitSets that do not contain any other BitSet in the collection as a subset.
Explanation:
Among (1, 2), (1, 2, 3) and (1, 2, 3, 4), the least subset is (1, 2); and
(1, 7) is not comparable to any other BitSet, so (1, 7) is also present in the result.
This problem can be solved with an aggregation:
import scala.collection.mutable.BitSet

val rdd = sc.parallelize(
  Seq(BitSet(1, 2),
      BitSet(1, 7),
      BitSet(8, 9, 10, 11),
      BitSet(1, 2, 3, 4),
      BitSet(8, 9, 10),
      BitSet(1, 2, 3)))
val result = rdd.aggregate(Seq[BitSet]())(accumulate, merge)
println(result)
output:
List(BitSet(8, 9, 10), BitSet(1, 7), BitSet(1, 2))
accumulate is a function that collects data on each Spark partition:
def accumulate(acc: Seq[BitSet], elem: BitSet): Seq[BitSet] = {
val subsets = acc.filterNot(elem.subsetOf)
if (notSuperSet(subsets)(elem)) elem +: subsets else subsets
}
merge combines the results of the partitions:
def merge(left: Seq[BitSet], right: Seq[BitSet]): Seq[BitSet] = {
val leftSubsets = left.filter(notSuperSet(right))
val rightSubsets = right.filter(notSuperSet(leftSubsets))
leftSubsets ++ rightSubsets
}
and a helper predicate notSuperSet that checks that a new set is not a superset of any of the given sets:
def notSuperSet(subsets: Seq[BitSet])(set: BitSet): Boolean =
subsets.forall(!_.subsetOf(set))
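The accumulate/merge logic itself is plain Scala, so it can be exercised without Spark by folding over an ordinary collection; aggregate applies the same functions per partition and then merges. A sketch (using immutable.BitSet for convenience):

```scala
import scala.collection.immutable.BitSet

def notSuperSet(subsets: Seq[BitSet])(set: BitSet): Boolean =
  subsets.forall(!_.subsetOf(set))

def accumulate(acc: Seq[BitSet], elem: BitSet): Seq[BitSet] = {
  val subsets = acc.filterNot(elem.subsetOf) // drop supersets of elem
  if (notSuperSet(subsets)(elem)) elem +: subsets else subsets
}

val data = Seq(BitSet(1, 2), BitSet(1, 7), BitSet(8, 9, 10, 11),
               BitSet(1, 2, 3, 4), BitSet(8, 9, 10), BitSet(1, 2, 3))
val minimal = data.foldLeft(Seq.empty[BitSet])(accumulate)
// minimal contains BitSet(8, 9, 10), BitSet(1, 7) and BitSet(1, 2)
```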

partition an Array with offset

In Clojure I can partition a vector with an offset step, like
(partition 2 1 [1 2 3 4])
This returns a sequence of lists of n items each, at offsets step apart.
For example, the call above returns
((1 2) (2 3) (3 4))
I just wonder how I can achieve the same in Scala.
Use sliding: Array(1, 2, 3, 4).sliding(2). This gives you an Iterator; calling e.g. toArray yields an Array[Array[Int]] with the internals as desired.
There is a function in the standard library, sliding, for this purpose:
scala> val x = Array(1, 2, 3).sliding(2, 1)
x: Iterator[Array[Int]] = non-empty iterator
scala> x.next
res8: Array[Int] = Array(1, 2)
scala> x.next
res9: Array[Int] = Array(2, 3)
scala> val l = List(1, 2, 3, 4, 5)
l: List[Int] = List(1, 2, 3, 4, 5)
scala> l.sliding(2).toList
res0: List[List[Int]] = List(List(1, 2), List(2, 3), List(3, 4), List(4, 5))
I think this does what you need:
List(1,2,3,4).sliding(2,1).toList
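sliding's second argument generalizes to other steps too, matching Clojure's (partition size step coll); a quick sketch:

```scala
val xs = List(1, 2, 3, 4, 5, 6)
// step 1: overlapping windows, like (partition 2 1 xs)
val overlapping = xs.sliding(2, 1).toList
// step 2: disjoint pairs, like (partition 2 2 xs)
val disjoint = xs.sliding(2, 2).toList
// overlapping: List(List(1, 2), List(2, 3), List(3, 4), List(4, 5), List(5, 6))
// disjoint:    List(List(1, 2), List(3, 4), List(5, 6))
```

One difference to be aware of: when the input length is not a multiple of the step, Scala's sliding keeps a shorter trailing group, whereas Clojure's partition drops incomplete groups.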

Is there an extended version of unzip in Scala which works for any List of n-tuples instead of just a List of pairs?

If I have a list of 3-tuples, I want three separate lists. Is there a better way than this:
val (listA, listB, listC) = (list.map(_._1), list.map(_._2), list.map(_._3))
which would work for any n-tuple?
EDIT: Although unzip3 exists for triples (of which I was unaware while writing this question), is there a way to write a function that yields n lists in general?
How about this?
scala> val array = Array((1, 2, 3), (4, 5, 6), (7, 8, 9))
array: Array[(Int, Int, Int)] = Array((1,2,3), (4,5,6), (7,8,9))
scala> val tripleArray = array.unzip3
tripleArray: (Array[Int], Array[Int], Array[Int]) = (Array(1, 4, 7), Array(2, 5, 8), Array(3, 6, 9))
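The standard library stops at unzip3. For higher arities, one option, if losing the element types is acceptable, is to go through productIterator; the unzipN below is a hypothetical helper, not a library function:

```scala
// Transpose a list of same-arity tuples into a list of columns.
// The static types are lost: each column is a List[Any].
def unzipN(rows: List[Product]): List[List[Any]] =
  rows.map(_.productIterator.toList).transpose

val cols = unzipN(List((1, "a", true), (2, "b", false)))
// cols: List(List(1, 2), List(a, b), List(true, false))
```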

RDD intersections

I have a question about the intersection of two RDDs.
My first RDD contains lists of elements like this:
A = List(1,2,3,4), List(4,5,6), List(8,3,1), List(1,6,8,9,2)
And the second RDD is like this:
B = (1,2,3,4,5,6,8,9)
(I could store B in memory as a Set, but not the first one.)
I would like to intersect each element of A with B:
List(1,2,3,4).intersect((1,2,3,4,5,6,8,9))
List(4,5,6).intersect((1,2,3,4,5,6,8,9))
List(8,3,1).intersect((1,2,3,4,5,6,8,9))
List(1,6,8,9,2).intersect((1,2,3,4,5,6,8,9))
How can I do this in Scala?
val result = rdd.map( x => x.intersect(B))
Both B and the elements of rdd have to have the same type (in this case List[Int]). Also note that if B is big but still fits in memory, you would probably want to broadcast it, as documented here.
scala> val B = List(1,2,3,4,5,6,8,9)
B: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
scala> val rdd = sc.parallelize(Seq(List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)))
rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> rdd.map( x => x.intersect(B)).collect.mkString("\n")
res3: String =
List(1, 2, 3, 4)
List(4, 5, 6)
List(8, 3, 1)
List(1, 6, 8, 9, 2)
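A side note on performance: List.intersect scans B for every element, so for a large, duplicate-free B it can pay off to convert it to a Set first and filter, which gives constant-time membership tests. A sketch on plain collections (the same map body applies unchanged to the RDD):

```scala
val B = List(1, 2, 3, 4, 5, 6, 8, 9)
val bSet = B.toSet // O(1) membership tests
val A = Seq(List(1, 2, 3, 4), List(4, 5, 6), List(8, 3, 1), List(1, 6, 8, 9, 2))
// Equivalent to x.intersect(B) for these inputs, since neither
// B nor the lists in A contain duplicates.
val result = A.map(_.filter(bSet.contains))
```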

Scala - read a file columnwise

I am new to Scala. I have a file with the following data:
1:2:3
2:4:5
8:9:6
I need to read this file column-wise, as (1:2:8)(2:4:9)(3:5:6), to find the minimum of each column. How can I read it column-wise and put the columns into separate arrays?
Read the file row by row, split the rows into columns, then use transpose:
scala> val file="1:2:3 2:4:5 8:9:6"
file: String = 1:2:3 2:4:5 8:9:6
scala> file.split(" ")
res1: Array[String] = Array(1:2:3, 2:4:5, 8:9:6)
scala> file.split(" ").map(_.split(":"))
res2: Array[Array[String]] = Array(Array(1, 2, 3), Array(2, 4, 5), Array(8, 9, 6))
scala> file.split(" ").map(_.split(":")).transpose
res3: Array[Array[String]] = Array(Array(1, 2, 8), Array(2, 4, 9), Array(3, 5, 6))
scala> file.split(" ").map(_.split(":")).transpose.map(_.min)
res4: Array[String] = Array(1, 2, 3)
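One caveat: the session above compares the column values as strings, so the minima are lexicographic; with multi-digit numbers that goes wrong (as a string, "10" is smaller than "9"). Converting to Int first is safer; a sketch using an embedded multi-line string in place of the file:

```scala
val file = "1:2:3\n2:4:5\n8:9:6"
// Split into rows, then columns, convert to Int, and transpose.
val columns = file.split("\n").map(_.split(":").map(_.toInt)).transpose
// columns: Array(Array(1, 2, 8), Array(2, 4, 9), Array(3, 5, 6))
val mins = columns.map(_.min)
// mins: Array(1, 2, 3)
```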