Scala - read a file column-wise

I am new to Scala. I have a file with the following data:
1:2:3
2:4:5
8:9:6
I need to read this file column-wise, i.e. as (1:2:8)(2:4:9)(3:5:6), to find the minimum of each column. How can I read it column-wise and put the columns into separate arrays?

Read the file row-wise, split each row into columns, then use transpose:
scala> val file="1:2:3 2:4:5 8:9:6"
file: String = 1:2:3 2:4:5 8:9:6
scala> file.split(" ")
res1: Array[String] = Array(1:2:3, 2:4:5, 8:9:6)
scala> file.split(" ").map(_.split(":"))
res2: Array[Array[String]] = Array(Array(1, 2, 3), Array(2, 4, 5), Array(8, 9, 6))
scala> file.split(" ").map(_.split(":")).transpose
res3: Array[Array[String]] = Array(Array(1, 2, 8), Array(2, 4, 9), Array(3, 5, 6))
scala> file.split(" ").map(_.split(":")).transpose.map(_.min)
res4: Array[String] = Array(1, 2, 3)
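Note that the REPL demo above keeps the values as Strings, so min compares them lexicographically (fine for single digits, wrong for e.g. "10" vs "9"). Here is a sketch that reads from an actual file and converts to Int before taking the minimum; the temp file created below just stands in for your real path.

```scala
import java.nio.file.Files
import scala.io.Source

// A temp file stands in for the question's input file (assumed path).
val path = Files.createTempFile("columns", ".txt")
Files.write(path, "1:2:3\n2:4:5\n8:9:6".getBytes("UTF-8"))

// Read line by line, split each row on ':', convert to Int, then transpose
// so each inner list is one column.
val src = Source.fromFile(path.toFile)
val columns: List[List[Int]] =
  try src.getLines().map(_.split(":").map(_.toInt).toList).toList.transpose
  finally src.close()
// columns == List(List(1, 2, 8), List(2, 4, 9), List(3, 5, 6))

val mins = columns.map(_.min) // minimum of each column
// mins == List(1, 2, 3)
```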

Related

Scala -- List into List of Count and List of Element

Let's say I have Scala list like this:
val mylist = List(4,2,5,6,4,4,2,6,5,6,6,2,5,4,4)
How can I transform it into a list of counts and a list of elements? For example, I want to convert mylist into:
val count = List(3,5,3,4)
val elements = List(2,4,5,6)
Which means that in mylist I have three occurrences of 2, five occurrences of 4, etc.
In a procedural style this is easy: I can just make two empty lists (for counts and elements) and fill them while iterating. However, I have no idea how to achieve this in Scala.
Arguably the shortest version:
val elements = mylist.distinct
val count = elements map (e => mylist.count(_ == e))
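Run on the sample list, that approach gives the following (a sketch for illustration; note that distinct keeps first-appearance order rather than sorted order, and that count rescans the whole list once per distinct element, so this is quadratic in the worst case):

```scala
val mylist = List(4, 2, 5, 6, 4, 4, 2, 6, 5, 6, 6, 2, 5, 4, 4)

// distinct preserves the order of first appearance, not sorted order.
val elements = mylist.distinct

// One full pass over mylist per distinct element.
val count = elements.map(e => mylist.count(_ == e))
// elements == List(4, 2, 5, 6)
// count    == List(5, 3, 3, 4)
```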
Use .groupBy(identity) to create a Map grouping each element with its occurrences:
scala> val mylist = List(4,2,5,6,4,4,2,6,5,6,6,2,5,4,4)
mylist: List[Int] = List(4, 2, 5, 6, 4, 4, 2, 6, 5, 6, 6, 2, 5, 4, 4)
scala> mylist.groupBy(identity)
res0: scala.collection.immutable.Map[Int,List[Int]] = Map(2 -> List(2, 2, 2), 5 -> List(5, 5, 5), 4 -> List(4, 4, 4, 4, 4), 6 -> List(6, 6, 6, 6))
Then you can use .mapValues(_.length) to change the 'value' part of the map to the size of the list:
scala> mylist.groupBy(identity).mapValues(_.length)
res1: scala.collection.immutable.Map[Int,Int] = Map(2 -> 3, 5 -> 3, 4 -> 5, 6 -> 4)
If you want to get 2 lists out of this you can use .unzip, which returns a tuple: the first part holds the keys (i.e. the elements), the second the values (i.e. the number of occurrences of each element in the original list):
scala> val (elements, counts) = mylist.groupBy(identity).mapValues(_.length).unzip
elements: scala.collection.immutable.Iterable[Int] = List(2, 5, 4, 6)
counts: scala.collection.immutable.Iterable[Int] = List(3, 3, 5, 4)
One way would be to use groupBy and then check the size of each "group":
val withSizes = mylist.groupBy(identity).toList.map { case (v, l) => (v, l.size) }
val count = withSizes.map(_._2)
val elements = withSizes.map(_._1)
Here is an alternative, step-by-step way of doing the same thing.
Step 1
scala> val mylist = List(4,2,5,6,4,4,2,6,5,6,6,2,5,4,4)
mylist: List[Int] = List(4, 2, 5, 6, 4, 4, 2, 6, 5, 6, 6, 2, 5, 4, 4)
Step 2 - groupBy(x => x) returns a Map[Int, List[Int]]
scala> mylist.groupBy(x => (x))
res0: scala.collection.immutable.Map[Int,List[Int]] = Map(2 -> List(2, 2, 2), 5 -> List(5, 5, 5), 4 -> List(4, 4, 4, 4, 4), 6 -> List(6, 6, 6, 6))
Step 3 - map each (num, occurrences) pair to (num, size) and convert to a List
scala> mylist.groupBy(x => (x)).map{case(num,times) =>(num,times.size)}.toList
res1: List[(Int, Int)] = List((2,3), (5,3), (4,5), (6,4))
Step 4 - sort by num
scala> mylist.groupBy(x => (x)).map{case(num,times) =>(num,times.size)}.toList.sortBy(_._1)
res2: List[(Int, Int)] = List((2,3), (4,5), (5,3), (6,4))
Step 5 - unzip to break it into two lists (unzip returns a tuple)
scala> mylist.groupBy(x => (x)).map{case(num,times) =>(num,times.size)}.toList.sortBy(_._1).unzip
res3: (List[Int], List[Int]) = (List(2, 4, 5, 6),List(3, 5, 3, 4))

Scala grouping list into list tuples with one shared element

What would be a short, functional way to split the list
List(1, 2, 3, 4, 5) into List((1,2), (2, 3), (3, 4), (4, 5))
(assuming you don't care that your nested pairs are Lists and not Tuples)
Scala collections have a sliding window function:
# val lazyWindow = List(1, 2, 3, 4, 5).sliding(2)
lazyWindow: Iterator[List[Int]] = non-empty iterator
To realize the collection:
# lazyWindow.toList
res1: List[List[Int]] = List(List(1, 2), List(2, 3), List(3, 4), List(4, 5))
You can even do fancier windows, e.g. of length 3 with step 2:
# List(1, 2, 3, 4, 5).sliding(3,2).toList
res2: List[List[Int]] = List(List(1, 2, 3), List(3, 4, 5))
You can zip the list with its tail:
val list = List(1, 2, 3, 4, 5)
// list: List[Int] = List(1, 2, 3, 4, 5)
list zip list.tail
// res6: List[(Int, Int)] = List((1,2), (2,3), (3,4), (4,5))
I have always been a big fan of pattern matching. So you could also do:
val list = List(1, 2, 3, 4, 5, 6)
@annotation.tailrec
def splitList(list: List[Int], result: List[(Int, Int)] = List()): List[(Int, Int)] =
  list match {
    case Nil => result
    case _ :: Nil => result
    case x1 :: x2 :: rest => splitList(x2 :: rest, result :+ ((x1, x2)))
  }
splitList(list)
//List((1,2), (2,3), (3,4), (4,5), (5,6))
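As a sanity check, a self-contained copy of the recursive approach agrees with the zip-with-tail one-liner from the answer above:

```scala
// Tail-recursive pairing of each element with its successor.
@annotation.tailrec
def splitList(list: List[Int], result: List[(Int, Int)] = List()): List[(Int, Int)] =
  list match {
    case Nil => result
    case _ :: Nil => result
    case x1 :: x2 :: rest => splitList(x2 :: rest, result :+ ((x1, x2)))
  }

val list = List(1, 2, 3, 4, 5, 6)
val recursive = splitList(list)
val zipped = list zip list.tail
// Both yield List((1,2), (2,3), (3,4), (4,5), (5,6))
```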

partition an Array with offset

In Clojure I can partition a vector with an offset step, like
(partition 2 1 [1 2 3 4])
this returns a sequence of lists of n items each at offsets step apart.
for example the previous method returns
((1 2) (2 3) (3 4))
I just wonder how I can achieve the same in Scala.
Use sliding: Array(1, 2, 3, 4).sliding(2). This gives you an Iterator, and you can just call e.g. toArray to get an Array[Array[Int]] whose contents are as desired.
There is a function in the standard library, sliding, for exactly this purpose:
scala> val x = Array(1, 2, 3).sliding(2, 1)
x: Iterator[Array[Int]] = non-empty iterator
scala> x.next
res8: Array[Int] = Array(1, 2)
scala> x.next
res9: Array[Int] = Array(2, 3)
scala> val l = List(1, 2, 3, 4, 5)
l: List[Int] = List(1, 2, 3, 4, 5)
scala> l.sliding(2).toList
res0: List[List[Int]] = List(List(1, 2), List(2, 3), List(3, 4), List(4, 5))
I think this does what you need:
List(1,2,3,4).sliding(2,1).toList
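One behavioral difference from Clojure worth noting (an observation based on the standard library's sliding, worth double-checking on your Scala version): when the step overshoots the end of the collection, Scala's sliding keeps the shorter trailing group, while Clojure's (partition 3 2 [1 2 3 4]) drops incomplete groups. Filtering by window size restores Clojure's semantics:

```scala
// Scala keeps the partial trailing window...
val scalaStyle = List(1, 2, 3, 4).sliding(3, 2).toList
// scalaStyle == List(List(1, 2, 3), List(3, 4))

// ...so to mimic Clojure's (partition 3 2 [1 2 3 4]), which yields ((1 2 3)),
// drop any group shorter than the window:
val clojureStyle = List(1, 2, 3, 4).sliding(3, 2).filter(_.size == 3).toList
// clojureStyle == List(List(1, 2, 3))
```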

Spark Scala - Split columns into multiple rows

Following on from the question I posted here:
Spark Mllib - Scala
I have another doubt... Is it possible to transform a dataset like this:
2,1,3
1
3,6,8
Into this:
2,1
2,3
1,3
1
3,6
3,8
6,8
Basically I want to discover all the relationships between the movies. Is it possible to do this?
My current code is:
val input = sc.textFile("PATH")
val raw = input.map(_.split(","))
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)
Given that input is a multi-line string.
scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))
The following approach discards one-element arrays (the 1 in your example):
scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))
It can be fixed by appending the filtered raw collection:
scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))
The order within each combination is not relevant, I believe.
Update
SparkContext.textFile returns an RDD of lines, so it can be plugged in as:
val raw = rdd.map(_.split(","))
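Put together on plain collections, the pipeline looks like this (a sketch; the same map/flatMap/filter/++ calls exist on RDDs, so it carries over unchanged once raw is an RDD):

```scala
// Sample rows as sc.textFile would deliver them, one string per line.
val lines = List("2,1,3", "1", "3,6,8")

val raw = lines.map(_.split(",").toList)

// All unordered pairs within each row; combinations(2) drops one-element rows...
val pairs = raw.flatMap(_.combinations(2))

// ...so append the single-element rows back explicitly.
val result = pairs ++ raw.filter(_.length == 1)
// result: List(List(2, 1), List(2, 3), List(1, 3),
//              List(3, 6), List(3, 8), List(6, 8), List(1))
```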

RDD intersections

I have a query about intersection between two RDDs.
My first RDD contains lists of elements like this:
A = List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)
And the second RDD is like this:
B = (1,2,3,4,5,6,8,9)
(I could store B in memory as a Set but not the first one.)
I would like to intersect each element of A with B:
List(1,2,3,4).intersect((1,2,3,4,5,6,8,9))
List(4,5,6).intersect((1,2,3,4,5,6,8,9))
List(8,3,1).intersect((1,2,3,4,5,6,8,9))
List(1,6,8,9,2).intersect((1,2,3,4,5,6,8,9))
How can I do this in Scala?
val result = rdd.map( x => x.intersect(B))
Both B and the elements of rdd have to have the same type (in this case List[Int]). Also note that if B is big but still fits in memory, you would probably want to broadcast it, as documented here.
scala> val B = List(1,2,3,4,5,6,8,9)
B: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
scala> val rdd = sc.parallelize(Seq(List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)))
rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> rdd.map( x => x.intersect(B)).collect.mkString("\n")
res3: String =
List(1, 2, 3, 4)
List(4, 5, 6)
List(8, 3, 1)
List(1, 6, 8, 9, 2)
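Since B fits in memory, a variation worth considering (a sketch, not part of the answer above): keep B as a Set for constant-time membership tests and filter each list, instead of having intersect rescan the whole of B for every element. In Spark you would wrap the set with sc.broadcast and use the broadcast's .value inside the closure.

```scala
val A = List(List(1, 2, 3, 4), List(4, 5, 6), List(8, 3, 1), List(1, 6, 8, 9, 2))
val B = List(1, 2, 3, 4, 5, 6, 8, 9)

// Set membership is O(1); List#intersect is linear in B per element.
val bSet = B.toSet

// filter preserves each list's order, matching what intersect returns here.
val result = A.map(_.filter(bSet.contains))
// Every element of A happens to appear in B, so nothing is dropped:
// result == A
```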