Unexpected behavior inside the foreachPartition method of a RDD - scala

I evaluated through the spark-shell the following lines of scala codes:
val a = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
val b = a.coalesce(1)
b.foreachPartition { p =>
p.map(_ + 1).foreach(println)
p.map(_ * 2).foreach(println)
}
The output is the following:
2
3
4
5
6
7
8
9
10
11
Why the partition p becomes empty after the first map?

It does not look strange to me since p is Iterator, when you walk through it with map, it has no more values, and taking into account that length is shortcut for size which is implemented like this:
def size: Int = {
var result = 0
for (x <- self) result += 1
result
}
you get 0.

The answer is in the scala doc http://www.scala-lang.org/api/2.11.8/#scala.collection.Iterator. It explicitely states that an iterator (p is an iterator) must be discarded after calling on it the map method.

Related

loop inside spark RDD filter

I am new to Spark and am trying to code in scala. I have an RDD which consists of data in the form :
1: 2 3 5
2: 5 6 7
3: 1 8 9
4: 1 2 4
and another list in the form [1,4,8,9]
I need to filter the RDD such that it takes those lines in which either the value before ':' is present in the list or if any of the values after ':' are present in the list.
I have written the following code:
val links = linksFile.filter(t => {
val l = t.split(": ")
root.contains(l(0).toInt) ||
for(x<-l(0).split(" ")){
root.contains(x.toInt)
}
})
linksFile is the RDD and root is the list.
But this doesn't work. any suggestions??
You're close: the for-loop just doesn't actually use the value computed inside it. You should use the exists method instead. Also I think you want l(1), not l(0) for the second check:
val links = linksFile.filter(t => {
val l = t.split(": ")
root.contains(l(0).toInt) ||
l(1).split(" ").exists { x =>
root.contains(x.toInt)
}
})
For-comprehension without a yield doesn't ... well ... yield :)
But you don't really need for-comprehension (or any "loop" for that matter) here.
Something like this:
linksFile.map(
_.split(": ").map(_.toInt)
).filter(_.exits(list.toSet))
.map(_.mkString)
should do it.

Scala comprehension from input

I am new to Scala and I am having troubles constructing a Map from inputs.
Here is my problem :
I am getting an input for elevators information. It consists of n lines, each one has the elevatorFloor number and the elevatorPosition on the floor.
Example:
0 5
1 3
4 5
So here I have 3 elevators, first one is on floor 0 at position 5, second one at floor 1 position 3 etc..
Is there a way in Scala to put it in a Map without using var ?
What I get so far is a Vector of all the elevators' information :
val elevators = {
for{i <- 0 until n
j <- readLine split " "
} yield j.toInt
}
I would like to be able split the lines in two variables "elevatorFloor" and "elevatorPos" and group them in a data structure (my guess is Map would be the appropriate choice) I would like to get something looking like:
elevators: SomeDataStructure[Int,Int] = ( 0->5, 1 -> 3, 4 -> 5)
I would like to clarify that I know I could write Javaish code, initialise a Map and then add the values to it, but I am trying to keep as close to functionnal programming as possible.
Thanks for the help or comments
You can do:
val res: Map[Int, Int] =
Source.fromFile("myfile.txt")
.getLines
.map { line =>
Array(floor, position) = line.split(' ')
(floor.toInt -> position.toInt)
}.toMap

understanding aggregate in Scala

I am trying to understand aggregate in Scala and with one example, i understood the logic, but the result of second one i tried confused me.
Please let me know, where i went wrong.
Code:
val list1 = List("This", "is", "an", "example");
val b = list1.aggregate(1)(_ * _.length(), _ * _)
1 * "This".length = 4
1 * "is".length = 2
1 * "an".length = 2
1 * "example".length = 7
4 * 2 = 8 , 2 * 7 = 14
8 * 14 = 112
the output also came as 112.
but for the below,
val c = list1.aggregate(1)(_ * _.length(), _ + _)
I Thought it will be like this.
4, 2, 2, 7
4 + 2 = 6
2 + 7 = 9
6 + 9 = 15,
but the output still came as 112.
It is ideally doing whatever the operation i mentioned at seqop, here _ * _.length
Could you please explain or correct me where i went wrong.?
aggregate should be used to compute only associative and commutative operations. Let's look at the signature of the function :
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
B can be seen as an accumulator (and will be your output). You give an initial output value, then the first function is how to add a value A to this accumulator and the second is how to merge 2 accumulators. Scala "chooses" a way to aggregate your collection but if your aggregation is not associative and commutative the output is not deterministic because the order matter. Look at this example :
val l = List(1, 2, 3, 4)
l.aggregate(0)(_ + _, _ * _)
If we create one accumulator and then aggregate all the values we get 1 + 2 + 3 + 4 = 10 but if we decide to parallelize the process by splitting the list in halves we could have (1 + 2) * (3 + 4) = 21.
So now what happens in reality is that for List aggregate is the same as foldLeft which explains why changing your second function didn't change the output. But where aggregate can be useful is in Spark for example or other distributed environments where it may be useful to do the folding on each partition independently and then combine the results with the second function.

transforming from native matrix format, scalding

So this question is related to question Transforming matrix format, scalding
But now, I want to make the back operation. So i can make it in a such way:
Tsv(in, ('row, 'col, 'v))
.read
.groupBy('row) { _.sortBy('col).mkString('v, "\t") }
.mapTo(('row, 'v) -> ('c)) { res : (Long, String) =>
val (row, v) = res
v }
.write(Tsv(out))
But, there, we got problem with zeros. As we know, scalding skips zero values fields. So for example we got matrix:
1 0 8
4 5 6
0 8 9
In scalding format is is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using my function I wrote above we can only get:
1 8
4 5 6
8 9
And that's incorrect. So, how can i deal with it? I see two possible variants:
To find way, to add zeros (actually, dunno how to insert data)
To write own operations on own matrix format (it is unpreferable, cause I'm interested in Scalding matrix operations, and dont want to write all of them my own)
Mb there r some methods, and I can avoid skipping zeros in matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type
val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)
rows.crossWithTiny(cols)
.leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
.map('val1 -> 'val1) { v: V =>
if(v == null) // this value should be 0 in your type:
zero
else
v
}
.groupBy('row) {
_.toList[(Int, V)](('col, 'val1) -> 'cols)
}
.map('cols -> 'cols) { cols: List[(Int, V)] =>
cols.sortBy(_._1).map(_._2).mkString("\t")
}
.write(TypedTsv[(Int, String)]("output"))

Incrementing the for loop (loop variable) in scala by power of 5

I had asked this question on Javaranch, but couldn't get a response there. So posting it here as well:
I have this particular requirement where the increment in the loop variable is to be done by multiplying it with 5 after each iteration. In Java we could implement it this way:
for(int i=1;i<100;i=i*5){}
In scala I was trying the following code-
var j=1
for(i<-1.to(100).by(scala.math.pow(5,j).toInt))
{
println(i+" "+j)
j=j+1
}
But its printing the following output:
1 1
6 2
11 3
16 4
21 5
26 6
31 7
36 8
....
....
Its incrementing by 5 always. So how do I got about actually multiplying the increment by 5 instead of adding it.
Let's first explain the problem. This code:
var j=1
for(i<-1.to(100).by(scala.math.pow(5,j).toInt))
{
println(i+" "+j)
j=j+1
}
is equivalent to this:
var j = 1
val range: Range = Predef.intWrapper(1).to(100)
val increment: Int = scala.math.pow(5, j).toInt
val byRange: Range = range.by(increment)
byRange.foreach {
println(i+" "+j)
j=j+1
}
So, by the time you get to mutate j, increment and byRange have already been computed. And Range is an immutable object -- you can't change it. Even if you produced new ranges while you did the foreach, the object doing the foreach would still be the same.
Now, to the solution. Simply put, Range is not adequate for your needs. You want a geometric progression, not an arithmetic one. To me (and pretty much everyone else answering, it seems), the natural solution would be to use a Stream or Iterator created with iterate, which computes the next value based on the previous one.
for(i <- Iterator.iterate(1)(_ * 5) takeWhile (_ < 100)) {
println(i)
}
EDIT: About Stream vs Iterator
Stream and Iterator are very different data structures, that share the property of being non-strict. This property is what enables iterate to even exist, since this method is creating an infinite collection1, from which takeWhile will create a new2 collection which is finite. Let's see here:
val s1 = Stream.iterate(1)(_ * 5) // s1 is infinite
val s2 = s1.takeWhile(_ < 100) // s2 is finite
val i1 = Iterator.iterate(1)(_ * 5) // i1 is infinite
val i2 = i1.takeWhile(_ < 100) // i2 is finite
These infinite collections are possible because the collection is not pre-computed. On a List, all elements inside the list are actually stored somewhere by the time the list has been created. On the above examples, however, only the first element of each collection is known in advance. All others will only be computed if and when required.
As I mentioned, though, these are very different collections in other respects. Stream is an immutable data structure. For instance, you can print the contents of s2 as many times as you wish, and it will show the same output every time. On the other hand, Iterator is a mutable data structure. Once you used a value, that value will be forever gone. Print the contents of i2 twice, and it will be empty the second time around:
scala> s2 foreach println
1
5
25
scala> s2 foreach println
1
5
25
scala> i2 foreach println
1
5
25
scala> i2 foreach println
scala>
Stream, on the other hand, is a lazy collection. Once a value has been computed, it will stay computed, instead of being discarded or recomputed every time. See below one example of that behavior in action:
scala> val s2 = s1.takeWhile(_ < 100) // s2 is finite
s2: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> println(s2)
Stream(1, ?)
scala> s2 foreach println
1
5
25
scala> println(s2)
Stream(1, 5, 25)
So Stream can actually fill up the memory if one is not careful, whereas Iterator occupies constant space. On the other hand, one can be surprised by Iterator, because of its side effects.
(1) As a matter of fact, Iterator is not a collection at all, even though it shares a lot of the methods provided by collections. On the other hand, from the problem description you gave, you are not really interested in having a collection of numbers, just in iterating through them.
(2) Actually, though takeWhile will create a new Iterator on Scala 2.8.0, this new iterator will still be linked to the old one, and changes in one have side effects on the other. This is subject to discussion, and they might end up being truly independent in the future.
In a more functional style:
scala> Stream.iterate(1)(i => i * 5).takeWhile(i => i < 100).toList
res0: List[Int] = List(1, 5, 25)
And with more syntactic sugar:
scala> Stream.iterate(1)(_ * 5).takeWhile(_ < 100).toList
res1: List[Int] = List(1, 5, 25)
Maybe a simple while-loop would do?
var i=1;
while (i < 100)
{
println(i);
i*=5;
}
or if you want to also print the number of iterations
var i=1;
var j=1;
while (i < 100)
{
println(j + " : " + i);
i*=5;
j+=1;
}
it seems you guys likes functional so how about a recursive solution?
#tailrec def quints(n:Int): Unit = {
println(n);
if (n*5<100) quints(n*5);
}
Update: Thanks for spotting the error... it should of course be power, not multiply:
Annoyingly, there doesn't seem to be an integer pow function in the standard library!
Try this:
def pow5(i:Int) = math.pow(5,i).toInt
Iterator from 1 map pow5 takeWhile (100>=) toList
Or if you want to use it in-place:
Iterator from 1 map pow5 takeWhile (100>=) foreach {
j => println("number:" + j)
}
and with the indices:
val iter = Iterator from 1 map pow5 takeWhile (100>=)
iter.zipWithIndex foreach { case (j, i) => println(i + " = " + j) }
(0 to 2).map (math.pow (5, _).toInt).zipWithIndex
res25: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,0), (5,1), (25,2))
produces a Vector, with i,j in reversed order.