Need some clarity on for loop usage in Spark Scala

I am trying to run the code below to create pairs using a Spark RDD. When I run the code for only one mapping it works fine, but when I use a for loop to iterate over all the elements I do not get the expected output.
val file = sc.textFile("filepath")
file.collect.foreach(println)
1,Abc,300
2,Def,200
3,Xyz,400
file.map(x => x.split(",")).map(x => (x(0)->x(1))).collect.foreach(println)
The output is as expected:
(1,Abc)
(2,Def)
(3,Xyz)
Using a for loop:
file.map(x => x.split(",")).map(x => {
for(i <- 0 to 2){
x(0) -> x(i)
}
}).collect.foreach(println)
The output is the following, which is not what I expected:
()
()
()
The expected output is:
(1,1)
(2,2)
(3,3)
(1,Abc)
(2,Def)
(3,Xyz)
(1,300)
(2,200)
(3,400)
I tried using yield in the for loop but got syntax errors.

First, let me explain the output you obtain. A for loop without yield always evaluates to a value of type Unit, regardless of what is in its body. Here is a way to verify that using the REPL:
scala> val test = for(i<- 0 to 2) { i }
test: Unit = ()
NB: () is the only value of type Unit.
If you want to change that, you need to use yield, as you suggest. Here is an example:
scala> val test = for(i<- 0 to 2) yield { i }
test: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 1, 2)
That's more like it.
In your case, adding yield is not enough. With map, it would yield one collection of tuples per input line, like this:
Vector((1,1), (1,Abc), (1,300))
Vector((2,2), (2,Def), (2,200))
Vector((3,3), (3,Xyz), (3,400))
What you need is the flatMap function, which flattens the collections (i.e. it transforms an RDD of collections of elements into an RDD of elements).
file.map(x => x.split(",")).flatMap(x => {
for(i <- 0 to 2) yield {
x(0) -> x(i)
}
}).collect.foreach(println)
which gives you what you expect:
(1,1)
(1,Abc)
(1,300)
(2,2)
(2,Def)
(2,200)
(3,3)
(3,Xyz)
(3,400)
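The difference between map and flatMap here can be reproduced without a cluster. The sketch below is only an illustration: it assumes plain Scala collections as a stand-in for the RDD (so no SparkContext is needed), and the names lines, rows, nested and flat are made up for the example.

```scala
// Assumption: a plain Seq stands in for the RDD returned by sc.textFile.
val lines = Seq("1,Abc,300", "2,Def,200", "3,Xyz,400")
val rows = lines.map(_.split(","))

// map + yield: one collection of pairs per input line (3 nested collections).
val nested = rows.map(x => for (i <- 0 to 2) yield x(0) -> x(i))

// flatMap + yield: the per-line collections are flattened into 9 pairs.
val flat = rows.flatMap(x => for (i <- 0 to 2) yield x(0) -> x(i))

flat.foreach(println)
```

Note that the pairs come out grouped by input line, as in the answer's output, not grouped by column as in the question's listing.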

Related

How to skip appending some condition in map scala?

I have a situation in which I need to create a map from a collection by applying some filter inside like given in the code below:
//Say I have a list
//I don't have to apply filter function ...
val myList = List(2,3,4,5)
val evenList = myList.map(x => {
if (x % 2 == 0) x
else 0
})
//And the output is : List(2,0,4,0)
//The output actually needed was List(2,4) without applying filter on top like myList.filter
//I have objects of a case class instead of numbers, so the output becomes : List(object1, None, object2, None)
But the actual output needed was : List(object1,object2)
//The updated scenario
val basket = List(2,4,5,6)
case class Apple(name:Option[String],size:Option[Int])
val listApples: List[Apple] = basket.map(x=>{
val r = new scala.util.Random
val size = r.nextInt(10)
if(x%2!=0){
Apple(None,None)
}
else Apple(Some("my-apple"),Some(size))
})
Current Output :
Apple(Some(my-apple),Some(2))
Apple(Some(my-apple),Some(0))
Apple(None,None)
Apple(Some(my-apple),Some(4))
Expected was :
Apple(Some(my-apple),Some(2))
Apple(Some(my-apple),Some(0))
Apple(Some(my-apple),Some(4))
I believe collect best suits your case. It takes a partial function as an argument, and only the elements for which that function is defined are transformed and added to the result:
val myList = List(2,3,4,5)
case class Wrapper(i: Int)
val evenList = myList.collect{
case x if x % 2 == 0 => Wrapper(x)
}
In this case only 2 and 4 will be wrapped inside Wrapper:
List(Wrapper(2), Wrapper(4))
I'm not sure if I understand you correctly, but why not just use a filter directly:
val myList = List(2,3,4,5)
myList.filter(_ % 2 == 0)
If you want to have the Filter as a function:
def even(n:Int) = n % 2 == 0
myList.filter(even)
After the question update, here is the difference between filter and collect:
Filter:
myList
.filter(even)
.map(s => Apple(Some("my-apple"),Some(s)))
Collect:
myList
.collect{ case s if(even(s)) => Apple(Some("my-apple"),Some(s))}
Both return List(Apple(Some(my-apple),Some(2)), Apple(Some(my-apple),Some(4)))
So the only difference is that collect does both steps at once.
However, I usually find it more readable to keep the two steps separate.
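For the updated Apple scenario, where the decision to keep or drop an element is made inside the function body, another option is to return an Option and let flatMap discard the None results. This is a minimal sketch, assuming the element value itself is used as the size (instead of the random number in the question) so that the result is deterministic; the name apples is made up for the example.

```scala
case class Apple(name: Option[String], size: Option[Int])

val basket = List(2, 4, 5, 6)

// Return Some(apple) to keep an element and None to drop it;
// flatMap flattens the Options, so no placeholder Apple(None, None) remains.
val apples: List[Apple] = basket.flatMap { x =>
  if (x % 2 == 0) Some(Apple(Some("my-apple"), Some(x)))
  else None
}

apples.foreach(println)
```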

Scala - conditional product/join of two arrays with default values using for comprehensions

I have two Sequences, say:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
How do I get a product such that each element of the first array is joined with every element of the second array that startsWith it, and that also yields a default result when no element in the second array meets the condition?
Effectively to get an Output:
expectedProductArray = Array("B-B25", "B-B80", "B-B50", "L-Default", "T-T70")
I tried doing,
val myProductArray: Array[String] = for {
f <- first
s <- second if s.startsWith(f)
} yield s"""$f-$s"""
and I get:
myProductArray = Array("B-B25", "B-B80", "B-B50", "T-T70")
Is there an idiomatic way of adding a default value for elements of the first sequence that have no corresponding value in the second sequence under the given criterion? I'd appreciate your thoughts.
Here's one approach: turn the array second into a Map keyed by first character, then look up each element of the array first with getOrElse:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
val m = second.groupBy(_(0).toString)
// m: scala.collection.immutable.Map[String,Array[String]] =
// Map(M -> Array(M100), A -> Array(A50), B -> Array(B25, B80, B50), T -> Array(T70))
first.flatMap(x => m.getOrElse(x, Array("Default")).map(x + "-" + _))
// res1: Array[String] = Array(B-B25, B-B80, B-B50, L-Default, T-T70)
In case you prefer using for-comprehension:
for {
x <- first
y <- m.getOrElse(x, Array("Default"))
} yield s"$x-$y"
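Because groupBy(_(0).toString) keys the Map on the first character only, the lookup above matches one-character prefixes. If the prefixes can be longer, a sketch along the following lines applies startsWith directly, at the cost of scanning second once per element of first. The name joined is made up for the example.

```scala
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")

val joined: Array[String] = first.flatMap { f =>
  // All elements of second that start with the current prefix, in original order.
  val matches = second.filter(_.startsWith(f))
  // Fall back to a single default entry when nothing matches.
  val values = if (matches.isEmpty) Array("Default") else matches
  values.map(s => s"$f-$s")
}

println(joined.mkString(", "))
```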

Why does iterating over multiple streams only iterate over the first element?

I've recently run into a bug in my code, in which iterating over multiple streams causes them to iterate only through the first item. I converted my streams to buffers (I wasn't even aware that the implementation of the function I was calling returns a stream) and the problem was fixed. I found this hard to believe, so I created a minimal verifiable example:
def f(as: Seq[String], bs: Seq[String]): Unit =
for {
a <- as
b <- bs
} yield println((a, b))
val seq = Seq(1, 2, 3).map(_.toString)
f(seq, seq)
println()
val stream = Stream.iterate(1)(_ + 1).map(_.toString).take(3)
f(stream, stream)
This is a function that prints every combination of its inputs; it is invoked first with the Seq [1, 2, 3] and then with the Stream [1, 2, 3].
The result with the seq is:
(1,1)
(1,2)
(1,3)
(2,1)
(2,2)
(2,3)
(3,1)
(3,2)
(3,3)
And the result with the stream is:
(1,1)
I've only been able to replicate this when iterating through multiple generators; iterating through a single stream seems to work fine.
So my questions are: why does this happen, and how can I avoid this kind of glitch? That is, short of using .toBuffer or .to[Vector] before every multi-generator iteration?
Thanks.
The manner in which you're using the for-comprehension (with the println in the yield) is a bit strange and probably not what you want to do. If you really just want to print out the entries, use a for loop without yield, which is translated to foreach and therefore forces lazy sequences like Stream, i.e.
def f_strict(as: Seq[String], bs: Seq[String]): Unit = {
for {
a <- as
b <- bs
} println((a, b))
}
The reason you're getting the strange behavior with your f is that Streams are lazy, and elements are only computed (and then memoized) as needed. Since you never use the Stream created by f (necessarily, because your f returns Unit), only the head ever gets computed, which is why you're seeing the single (1,1). If you were instead to have it return the sequence it generated (which will have type Seq[Unit]), i.e.
def f_new(as: Seq[String], bs: Seq[String]): Seq[Unit] = {
for {
a <- as
b <- bs
} yield println((a, b))
}
Then you'll get the following behavior which should hopefully help to elucidate what's going on:
val xs = Stream(1, 2, 3)
val result = f_new(xs.map(_.toString), xs.map(_.toString))
//prints out (1,1) as a result of evaluating the head of the resulting Stream
result.foreach(aUnit => {})
//prints out the other elements as the rest of the entries of Stream are computed, i.e.
//(1,2)
//(1,3)
//(2,1)
//...
result.foreach(aUnit => {})
//prints nothing: the elements of the Stream have already been computed and memoized,
//so the println calls are not executed again.
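To avoid depending on when a lazy sequence happens to be forced, one option is to keep the for-comprehension free of printing and print the result afterwards. The sketch below also records each evaluation in a buffer so the laziness becomes visible. The names log, pairs, afterBuild and afterForce are made up for the example, and Stream is used as in the question (it is deprecated in favour of LazyList from Scala 2.13 on).

```scala
import scala.collection.mutable.ArrayBuffer

// Records every pair at the moment the yield body is actually evaluated.
val log = ArrayBuffer.empty[(String, String)]

def pairs(as: Seq[String], bs: Seq[String]): Seq[(String, String)] =
  for {
    a <- as
    b <- bs
  } yield { log += ((a, b)); (a, b) }

val s = Stream("1", "2", "3")
val lazyPairs = pairs(s, s)
val afterBuild = log.size   // only the head pair has been computed so far

val all = lazyPairs.toList  // forces the remaining elements
val afterForce = log.size   // all 9 combinations have now been computed

lazyPairs.toList            // memoized: nothing is recomputed
all.foreach(println)
```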

How to optionally return value from map function

The map function on collections requires a value to be returned for each iteration. But I'm trying to find a way to return a value not for every iteration, but only for the initial values that match some predicate.
What I want looks something like this:
(1 to 10).map { x =>
val res: Option[Int] = service.getById(x)
if (res.isDefined) Pair(x, res.get )// no else part
}
I think something like the .collect function could do it, but it seems that with collect I need to write a lot of code in guard blocks (case x if {...// too much code here})
If you are returning an Option you can flatMap it and get only the values that are present (that is, are not None).
(1 to 10).flatMap { x =>
val res: Option[Int] = service.getById(x)
res.map{y => Pair(x, y) }
}
As you suggest, an alternative way to combine map and filter is to use collect with a partial function. Here is a simplified example:
(1 to 10).collect{ case x if x > 5 => x*2 }
res0: scala.collection.immutable.IndexedSeq[Int] = Vector(12, 14, 16, 18, 20)
You can use the collect function to do exactly what you want. Your example would then look like:
(1 to 10) map (x => (x, service.getById(x))) collect {
case (x, Some(res)) => Pair(x, res)
}
Using a for comprehension, like this,
for ( x <- 1 to 10; res <- service.getById(x) ) yield Pair(x, res)
This yields pairs only for those x where service.getById(x) is not None. Note that res is already the unwrapped value here, so no .get is needed.
Getting the first element:
(1 to 10).flatMap { x =>
val res: Option[Int] = service.getById(x)
res.map{y => Pair(x, y) }
}.head
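The variants above can be tried out with a stub in place of the real service. In this sketch the service object and its getById behaviour (only even ids exist) are hypothetical, and plain tuples are used instead of Pair, which was deprecated in Scala 2.11. It also uses headOption, since head throws on an empty result.

```scala
// Hypothetical stand-in for the question's service: only even ids exist.
object service {
  def getById(id: Int): Option[Int] =
    if (id % 2 == 0) Some(id * 10) else None
}

// flatMap keeps only the ids for which getById returns Some.
val viaFlatMap = (1 to 10).flatMap(x => service.getById(x).map(y => (x, y)))

// The equivalent for comprehension; y is already the unwrapped value.
val viaFor = for {
  x <- 1 to 10
  y <- service.getById(x)
} yield (x, y)

// headOption returns None instead of throwing when nothing matched.
val firstMatch = viaFlatMap.headOption

println(viaFlatMap)
println(firstMatch)
```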

Scala for comprehension of sequence inside a Try

I am writing a Scala program in which there is an operation that creates a sequence. The operation might fail, so I enclose it inside a Try. I want to do sequence creation and enumeration inside a for comprehension, so that a successfully-created sequence yields a sequence of tuples where the first element is the sequence and the second is an element of it.
To simplify the problem, make my sequence a Range of integers and define a createRange function that fails if it is asked to create a range of an odd length. Here is a simple for comprehension that does what I want.
import scala.util.Try
def createRange(n: Int): Try[Range] = {
Try {
if (n % 2 == 1) throw new Exception
else Range(0, n)
}
}
def rangeElements(n: Int) {
for {
r <- createRange(n)
x <- r
} println(s"$r\t$x")
}
def main(args: Array[String]) {
println("Range length 3")
rangeElements(3)
println("Range length 4")
rangeElements(4)
}
If you run this it correctly prints.
Range length 3
Range length 4
Range(0, 1, 2, 3) 0
Range(0, 1, 2, 3) 1
Range(0, 1, 2, 3) 2
Range(0, 1, 2, 3) 3
Now I would like to rewrite my rangeElements function so that instead of printing as a side effect it returns a sequence of (range, element) tuples, where the sequence is empty if the range was not created. What I want to write is this.
def rangeElements(n: Int):Seq[(Range,Int)] = {
for {
r <- createRange(n)
x <- r
} yield (r, x)
}
// rangeElements(3) returns an empty sequence
// rangeElements(4) returns the sequence (Range(0,1,2,3), 0), (Range(0,1,2,3), 1) etc.
This gives me two type mismatch compiler errors. The r <- createRange(n) line required Seq[Int] but found scala.util.Try[Nothing]. The x <- r line required scala.util.Try[?] but found scala.collection.immutable.IndexedSeq[Int].
Presumably there is some type erasure with the Try that is messing me up, but I can't figure out what it is. I've tried various toOption and toSeq qualifiers on the lines in the for comprehension to no avail.
If I only needed to yield the range elements I could explicitly handle the Success and Failure conditions of createRange myself as suggested by the first two answers below. However, I need access to both the range and its individual elements.
I realize this is a strange-sounding example. The real problem I am trying to solve is a complicated recursive search, but I don't want to add in all its details because that would just confuse the issue here.
How do I write rangeElements to yield the desired sequences?
The problem becomes clear if you translate the for comprehension to its map/flatMap implementation (as described in the Scala Language Spec 6.19). The flatMap has the result type Try[U] but your function expects Seq[Int].
for {
r <- createRange(n)
x <- r
} yield x
// is translated to:
createRange(n).flatMap {
case r => r.map {
case x => x
}
}
Is there any reason why you don't use the getOrElse method?
def rangeElements(n: Int):Seq[Int] =
createRange(n) getOrElse Seq.empty
The Try will be a Success containing a Range when n is even, or a Failure containing an Exception when n is odd. In rangeElements, match on the result and extract those values: for a Success, pair the Range with each of its elements; for a Failure, return an empty Seq instead of the Exception.
import scala.util.{Try, Success, Failure}
def createRange(n: Int): Try[Range] = {
Try {
if (n % 2 == 1) throw new Exception
else Range(0, n)
}
}
def rangeElements(n: Int):Seq[Tuple2[Range, Int]] = createRange(n) match {
case Success(s) => s.map(xs => (s, xs))
case Failure(f) => Seq()
}
scala> rangeElements(3)
res35: Seq[(Range, Int)] = List()
scala> rangeElements(4)
res36: Seq[(Range, Int)] = Vector((Range(0, 1, 2, 3),0), (Range(0, 1, 2, 3),1), (Range(0, 1, 2, 3),2), (Range(0, 1, 2, 3),3))
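If you would rather keep the for comprehension from the question, one possible approach is to make both generators sequences by converting the Try first. This is a sketch, not the only way to do it: toOption.toList turns a Failure into an empty list and a Success into a one-element list, so the comprehension desugars to flatMap and map on sequences only. The exception message is made up for the example.

```scala
import scala.util.Try

def createRange(n: Int): Try[Range] =
  Try {
    if (n % 2 == 1) throw new Exception("odd length")
    else Range(0, n)
  }

def rangeElements(n: Int): Seq[(Range, Int)] =
  for {
    r <- createRange(n).toOption.toList // empty list on Failure
    x <- r
  } yield (r, x)

println(rangeElements(3)) // empty
println(rangeElements(4)) // four (range, element) tuples
```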