Spark : Split input into multiple Arrays - scala

I am new to Spark and Scala. How can I achieve this?
Input: val x = sc.parallelize(1 to 10, 3)
Output after collecting:
Array[Int] = Array(1,2,3,4,5)
Array[Int] = Array(2,4,6,8,10)

I guess this is what you are looking for:
val evens = x.filter(value => value % 2 == 0)
val odds = x.filter(value => value % 2 != 0)
Output (foreach(println) on an RDD prints in no guaranteed order):
scala> evens.foreach(println)
4
6
2
8
10
scala> odds.foreach(println)
5
1
3
7
9

Going by your question, I think this will help:
val spark =
SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(1 to 10, 3)
data.filter(_ < 6).collect().foreach(println)
Output
1
2
3
4
5
data.filter(_%2 == 0).collect().foreach(println)
Output
2
4
6
8
10
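For anyone who wants to try the two predicates without a Spark setup, here is a minimal sketch on a plain Scala array. It is not Spark code: input is just a stand-in for the collected RDD, and the names firstFive and evens are mine. On a real RDD, the same filter calls followed by collect() should give the same two arrays.

```scala
// Stand-in for sc.parallelize(1 to 10, 3).collect(); no Spark needed here.
val input: Array[Int] = (1 to 10).toArray

// First requested array: the values below 6.
val firstFive: Array[Int] = input.filter(_ < 6)

// Second requested array: the even values.
val evens: Array[Int] = input.filter(_ % 2 == 0)

println(firstFive.mkString("Array(", ", ", ")")) // Array(1, 2, 3, 4, 5)
println(evens.mkString("Array(", ", ", ")"))     // Array(2, 4, 6, 8, 10)
```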

Related

How to convert Flink DataSet tuple to one column

I have graph data like:
1 2
1 4
4 1
4 2
4 3
3 2
2 3
But I couldn't find a way to convert it to a one-column dataset like:
1
2
1
4
4
1
...
Here is my code. I used a Scala ListBuffer, but couldn't find a way to do it with a Flink DataSet:
val params: ParameterTool = ParameterTool.fromArgs(args)
val env = ExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
val text = env.readTextFile(params.get("input"))
val tupleText = text.map { line =>
val arr = line.split(" ")
(arr(0), arr(1))
}
var x: Seq[(String, String)] = tupleText.collect()
var tempList = new ListBuffer[String]
x.foreach(line => {
tempList += line._1
tempList += line._2
})
tempList.foreach(println)
You can do that with flatMap:
// get some input
val input: DataSet[(Int, Int)] = env.fromElements((1, 2), (2, 3), (3, 4))
// emit every tuple element as own record
val output: DataSet[Int] = input.flatMap( (t, out) => {
out.collect(t._1)
out.collect(t._2)
})
// print result
output.print()
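The flatMap idea can be tried without Flink on an ordinary Scala collection. This is only a sketch on a Seq, not the DataSet API; edges and column are names I made up. As far as I know the DataSet flatMap also has an overload that takes a function returning a collection, but treat that as an assumption and check the API docs.

```scala
// The first three edges from the question, as plain tuples.
val edges: Seq[(Int, Int)] = Seq((1, 2), (1, 4), (4, 1))

// Emit both elements of every tuple as separate records.
val column: Seq[Int] = edges.flatMap { case (src, dst) => Seq(src, dst) }

column.foreach(println) // prints 1, 2, 1, 4, 4, 1 on separate lines
```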

How to compare two integer arrays in Scala?

For each number in the first list, find it among the numbers in the second list. If the number is not found, take the nearest lower one.
val a = List(1,2,3,4,5,6,7,8,9)
val b = List(1,5,10)
expected output after comparing a with b
1 --> 1
2 --> 1
3 --> 1
4 --> 1
5 --> 5
6 --> 5
7 --> 5
8 --> 5
9 --> 5
You can use TreeSet's to() and lastOption methods as follows:
val a = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
val b = List(1, 5, 10)
import scala.collection.immutable.TreeSet
// Convert list `b` to TreeSet
val bs = TreeSet(b.toSeq: _*)
a.map( x => (x, bs.to(x).lastOption.getOrElse(Int.MinValue)) ).toMap
// res1: scala.collection.immutable.Map[Int,Int] = Map(
// 5 -> 5, 1 -> 1, 6 -> 5, 9 -> 5, 2 -> 1, 7 -> 5, 3 -> 1, 8 -> 5, 4 -> 1
// )
Note that neither list a nor list b needs to be ordered.
UPDATE:
Starting with Scala 2.13, the to method of TreeSet is replaced by rangeTo.
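For reference, a sketch of the same lookup written with rangeTo, assuming Scala 2.13 or later (it will not compile on 2.12). I keep the result as a List of pairs instead of a Map so the order of a is preserved; that is my choice, not part of the original answer.

```scala
import scala.collection.immutable.TreeSet

val a = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
val b = List(1, 5, 10)

// Sorted set of candidates; b itself does not need to be sorted.
val bs = TreeSet(b: _*)

// rangeTo(x) keeps the elements <= x; lastOption picks the largest of them.
val result: List[(Int, Int)] =
  a.map(x => (x, bs.rangeTo(x).lastOption.getOrElse(Int.MinValue)))

println(result)
// List((1,1), (2,1), (3,1), (4,1), (5,5), (6,5), (7,5), (8,5), (9,5))
```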
Here is another approach using the collect function:
val a = List(1,2,3,4,5,6,7,8,9)
val b = List(1,5,10)
val result = a.collect{
case e if(b.filter(_<=e).size>0) => e -> b.filter(_<=e).reverse.head
}
//result: List[(Int, Int)] = List((1,1), (2,1), (3,1), (4,1), (5,5), (6,5), (7,5), (8,5), (9,5))
Here, for every element e in a, we check whether b contains a number less than or equal to e; if so, we reverse the filtered list and take its head to form the pair. This relies on b being sorted in ascending order.

For loop to create tuples of adjacent elements

I have an array:
[1,2,2,3,4,6,2,4,6,8,2,3,5]
I want to iterate over this array using a for loop to get a collection of tuples of adjacent elements. How should I code this in Scala?
Expected output:
1-2|2-2|2-3|3-4|4-6|6-2|2-4|4-6|6-8|8-2|2-3|3-5
If you want output like 1-2|2-2|2-3|3-4|... as you mentioned in your comment, you can try the following:
val arr = Array(1,2,2,3,4,6,2,4,6,8,2,3,5)
// first join the elements of each pair with -, then join all pairs with |
val str = arr.sliding(2).map(_.mkString("-")).mkString("|")
print(str)
//output
//1-2|2-2|2-3|3-4|4-6|6-2|2-4|4-6|6-8|8-2|2-3|3-5
Scala has the sliding function for that.
scala> val arr = Array(1,2,2,3,4,6,2,4,6,8,2,3,5)
arr: Array[Int] = Array(1, 2, 2, 3, 4, 6, 2, 4, 6, 8, 2, 3, 5)
scala> arr.sliding(2).foreach(tuple => println(tuple.mkString(" ")))
1 2
2 2
2 3
3 4
4 6
6 2
2 4
4 6
6 8
8 2
2 3
3 5
scala> arr.sliding(2).map(tuple => tuple.mkString("-")).mkString("|")
res10: String = 1-2|2-2|2-3|3-4|4-6|6-2|2-4|4-6|6-8|8-2|2-3|3-5
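Since the question asks for actual tuples, and sliding(2) yields small arrays, here is a sketch of an alternative that zips the array with its own tail. It assumes the array is non-empty (tail throws on an empty array); the names pairs and rendered are mine.

```scala
val arr = Array(1, 2, 2, 3, 4, 6, 2, 4, 6, 8, 2, 3, 5)

// Pair each element with its successor; zip stops at the shorter side,
// so the last element only ever appears as the second half of a pair.
val pairs: Array[(Int, Int)] = arr.zip(arr.tail)

// Render the tuples in the requested format.
val rendered: String = pairs.map { case (x, y) => s"$x-$y" }.mkString("|")

println(rendered) // 1-2|2-2|2-3|3-4|4-6|6-2|2-4|4-6|6-8|8-2|2-3|3-5
```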

Scala stream behaves counterintuitively

I am playing with Scala's streams and I'm not sure I have grasped the idea.
Consider the following code:
def fun(s: Stream[Int]): Stream[Int] = Stream.cons(s.head, fun(s.tail))
Executing this
val f = fun(Stream.from(7))
f take 14 foreach println
results in
7 8 9 10 ... up to 20
Let's say I understand this.
Now, changing the code slightly (adding 2 to the head)
def fun(s: Stream[Int]): Stream[Int] = Stream.cons(s.head + 2, fun(s.tail))
results in
9 10 11 ... up to 22
Again, I think I understand. The problems start with the next example (dividing the head by 2)
def fun(s: Stream[Int]): Stream[Int] = Stream.cons(s.head / 2, fun(s.tail))
which results in
3 4 4 5 5 6 6 7 7 8 8 9 9 10
This I do not get. Please explain why it results in this.
Similarly, subtracting does not behave as I expect
def fun(s: Stream[Int]): Stream[Int] = Stream.cons(s.head - 2, fun(s.tail))
Output
5 6 7 8 9 10 ... up to 18
Given your "take": 7 8 9 10 ... up to 20,
what happens when you + 2 on each element?
what happens when you / 2 on each element (int arithmetic)?
what happens when you - 2 on each element?
Is it more intuitive if you think of it as mapping the Stream?
scala> val s1 = Stream.from(10)
s1: scala.collection.immutable.Stream[Int] = Stream(10, ?)
scala> val s2 = s1 map (_ * 2)
s2: scala.collection.immutable.Stream[Int] = Stream(20, ?)
scala> s2.take(5).toList
res0: List[Int] = List(20, 22, 24, 26, 28)
scala> val s3 = s1 map (_ / 2)
s3: scala.collection.immutable.Stream[Int] = Stream(5, ?)
scala> s3.take(5).toList
res1: List[Int] = List(5, 5, 6, 6, 7)
scala> val s4 = s1 map (_ - 2)
s4: scala.collection.immutable.Stream[Int] = Stream(8, ?)
scala> s4.take(5).toList
res2: List[Int] = List(8, 9, 10, 11, 12)
Ok, let's try and break it down...
def fun(s: Stream[Int]): Stream[Int] = Stream.cons(s.head, fun(s.tail))
is a function that takes a Stream and separates its head and tail, applies itself recursively on the tail, and then recombines the two results with the cons operator.
Since the head is not touched during this operation, the Stream is rebuilt element by element as it was before.
val f = fun(Stream.from(7))
f is the same as Stream.from(7) [i.e. an infinite sequence of increasing integers starting from 7]
Printing f take 14 in fact shows that we have the first 14 numbers starting from 7 [i.e. 7,8,9,...,20]
What happens next is that, while rebuilding the stream with the cons, each element is modified in some way
def fun(s: Stream[Int]): Stream[Int] = Stream.cons(s.head + 2, fun(s.tail))
This adds 2 to the head before recombining it with the modified tail. The tail is modified in the same way: 2 is added to its first element, which is then recombined with its own tail, and so on.
If we assume again that s contains the numbers from 7 on, what happens looks like
fun(s) = cons(7 + 2, cons(8 + 2, cons(9 + 2, ... ad infinitum ... )))
This is the same as adding 2 to each and every element of the stream s.
The code confirms that by printing "9 to 22", which is exactly "7 to 20" with 2 added to every element.
The other examples are analogous:
the stream with each element divided by 2 (the fractional part is discarded, since the Stream holds Int values and Int division truncates)
the stream where each element is decremented by 2
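The division case can be checked in isolation with a short sketch. I use an Iterator here rather than Stream, only because Stream is deprecated since Scala 2.13 in favour of LazyList and an Iterator works the same way on every version; the name halved is mine.

```scala
// Lazily generate 7, 8, 9, ... and halve each value with Int division.
val halved: List[Int] = Iterator.from(7).map(_ / 2).take(14).toList

println(halved.mkString(" ")) // 3 4 4 5 5 6 6 7 7 8 8 9 9 10
```

7 / 2 is 3 and both 8 / 2 and 9 / 2 are 4, which is why every value from 4 on appears twice.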

Get list of elements that are divisible by 3 or 5 from 1 - 1000

I'm trying to write a functional approach in scala to get a list of all numbers between 1 & 1000 that are divisible by 3 or 5
Here is what I have so far :
def getListOfElements(): List[Int] = {
val list = List()
for (i <- 0 until 1000) {
//list.
}
list match {
case Nil => 0
}
list
}
The for loop seems like an imperative approach and I'm not sure what to match on in the case clause. Some guidance, please?
Here's how I would do it with a for expression.
for( i <- 1 to 1000 if i % 3 == 0 || i % 5 == 0) yield i
This gives:
scala.collection.immutable.IndexedSeq[Int] = Vector(3, 5, 6, 9, 10, 12, 15, 18, 20, 21...
Here's another approach filtering on a Range of numbers.
scala> 1 to 1000
res0: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
scala> res0.filter(x => x % 3 == 0 || x % 5 == 0)
res1: scala.collection.immutable.IndexedSeq[Int] = Vector(3, 5, 6, 9, 10, 12, 15, 18, 20, 21...
If you really want a List as the return value, use toList, e.g. res1.toList.
(Range(3, 1000, 3) ++ Range(5, 1000, 5)).toSet.toList.sorted
The sorted call can be omitted if the order does not matter.
Another approach:
(1 to 1000).filter(i => i % 3 == 0 || i % 5 == 0)
Looks like Brian beat me to it :)
Just thought I'd mention that a Stream might be preferable here for better performance:
val x = (1 until 1000).toStream //> x : scala.collection.immutable.Stream[Int] = Stream(1, ?)
x filter (t=>(t%3==0)||(t%5==0)) //> res0: scala.collection.immutable.Stream[Int] = Stream(3, ?)
The problem from projecteuler.net also wants a sum of those numbers at the end.
"Find the sum of all the multiples of 3 or 5 below 1000."
object prb1 {
def main(args: Array[String]): Unit = {
val retval = for{ a <- 1 to 999
if a % 3 == 0 || a % 5 == 0
} yield a
val sum = retval.reduceLeft[Int](_+_)
println("The sum of all multiples of 3 or 5 below 1000 is " + sum)
}
}
The correct answer should be 233168
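A quick way to double-check that number is to compute it twice: once by filtering and summing, and once with the arithmetic-series formula plus inclusion-exclusion. The closed form is a different technique from the answer above, and sumOfMultiples is a helper name I made up for this sketch.

```scala
// Direct approach: filter the range and sum it.
val bySum: Int = (1 until 1000).filter(i => i % 3 == 0 || i % 5 == 0).sum

// Closed form: the multiples of k below limit are k * 1, k * 2, ..., k * n,
// so their sum is k * n * (n + 1) / 2, where n = (limit - 1) / k.
def sumOfMultiples(k: Int, limit: Int): Int = {
  val n = (limit - 1) / k
  k * n * (n + 1) / 2
}

// Inclusion-exclusion: multiples of 15 are counted twice, so subtract them once.
val byFormula: Int =
  sumOfMultiples(3, 1000) + sumOfMultiples(5, 1000) - sumOfMultiples(15, 1000)

println(bySum)     // 233168
println(byFormula) // 233168
```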
None of the answers so far avoids division or rebuilding the list, and none uses recursion.
Also, has anyone done any benchmarking?
@scala.annotation.tailrec
def div3or5(list: Range, result: List[Int]): List[Int] = {
var acc = result
var tailList = list
try {
acc = list.drop(2).head :: acc // drop 1 2 save 3
acc = list.drop(4).head :: acc // drop 3 4 save 5
acc = list.drop(5).head :: acc // drop 5 save 6
acc = list.drop(8).head :: acc // drop 6 7 8 save 9
acc = list.drop(9).head :: acc // drop 9 save 10
acc = list.drop(11).head :: acc // drop 10 11 save 12
acc = list.drop(14).head :: acc // drop 12 13 14 save 15
tailList = list.drop(15) // drop 15
} catch {
case e: NoSuchElementException => return acc // found
}
div3or5(tailList, acc) // continue search
}
div3or5(Range(1, 1001), Nil)
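The same idea (walk the numbers in blocks of 15 and pick fixed positions, no division) can be sketched without the try/catch and without mutable variables. This is my own variation, not a tested replacement: the names multiples and offsets are hypothetical, and unlike div3or5 it returns the list in ascending order.

```scala
import scala.annotation.tailrec

// Positions of the multiples of 3 or 5 inside one block of 15 consecutive integers.
val offsets = List(3, 5, 6, 9, 10, 12, 15)

// Collects the multiples of 3 or 5 in 1..limit, one block of 15 per recursive step.
@tailrec
def multiples(base: Int, limit: Int, acc: List[Int]): List[Int] =
  if (base >= limit) acc.reverse
  else {
    // Shift the offsets into the current block and cut off anything past the limit.
    val block = offsets.map(_ + base).filter(_ <= limit)
    multiples(base + 15, limit, block.reverse ::: acc)
  }

val found = multiples(0, 1000, Nil)
println(found.take(7)) // List(3, 5, 6, 9, 10, 12, 15)
```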
EDIT
scala> val t0 = System.nanoTime; div3or5(Range(1, 10000001), Nil).toList;
(System.nanoTime - t0) / 1000000000.0
t0: Long = 1355346955285989000
res20: Double = 6.218004
One of the answers that looks good to me:
scala> val t0 = System.nanoTime; Range(1, 10000001).filter(i =>
i % 3 == 0 || i % 5 == 0).toList; (System.nanoTime - t0) / 1000000000.0
java.lang.OutOfMemoryError: Java heap space
Another one:
scala> val t0 = System.nanoTime; (Range(1, 10000001).toStream filter (
(t: Int)=>(t%3==0)||(t%5==0))).toList ; (System.nanoTime - t0) / 1000000000.0
java.lang.OutOfMemoryError: Java heap space
First one:
scala> val t0 = System.nanoTime; (for( i <- 1 to 10000000 if i % 3 == 0 ||
i % 5 == 0) yield i).toList; (System.nanoTime - t0) / 1000000000.0
java.lang.OutOfMemoryError: Java heap space
Why doesn't Scala optimize, for example, Vector -> List?