Why in an RDD, map gives NotSerializableException while foreach doesn't? - scala

I understand the basic difference between map and foreach (lazy vs. eager), and I also understand why this code snippet
sc.makeRDD(Seq("a", "b")).map(s => new java.io.ByteArrayInputStream(s.getBytes)).collect
should give
java.io.NotSerializableException: java.io.ByteArrayInputStream
And I thought the following code snippet should, too:
sc.makeRDD(Seq("a", "b")).foreach(s => {
val is = new java.io.ByteArrayInputStream(s.getBytes)
println("is = " + is)
})
But this code runs fine. Why so?

Actually, the fundamental difference between map and foreach is not the evaluation strategy. Let's take a look at the signatures (I've omitted the implicit part of map for brevity):
def map[U](f: (T) ⇒ U): RDD[U]
def foreach(f: (T) ⇒ Unit): Unit
map takes a function from T to U, applies it to each element of the existing RDD[T], and returns an RDD[U]. To allow operations like shuffling, U has to be serializable.
foreach takes a function from T to Unit (which is analogous to Java void) and by itself returns nothing. Everything happens locally on each executor; there is no network traffic involved, so there is no need for serialization. Unlike map, foreach should be used when you want some kind of side effect, like in your previous question.
On a side note, these two functions are actually different. The anonymous function you use in map is a function:
(s: String) => java.io.ByteArrayInputStream
and the one you use in foreach is like this:
(s: String) => Unit
If you use the second function with map, your code will compile, although the result will be far from what you want (an RDD[Unit]).
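To make this concrete, here is a minimal sketch (assuming the same spark-shell session, so sc is in scope): the foreach-style closure compiles under map too, but yields an RDD[Unit].
// The foreach-style closure also compiles under map,
// but the element type of the resulting RDD is Unit:
val units: org.apache.spark.rdd.RDD[Unit] =
  sc.makeRDD(Seq("a", "b")).map { s =>
    val is = new java.io.ByteArrayInputStream(s.getBytes)
    println("is = " + is)
  }
units.count // runs fine: no ByteArrayInputStream ever leaves the executors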

The collect call after map is causing the issue.
Below are the results of my testing in spark-shell (these snippets assume import org.apache.hadoop.io.NullWritable is in scope).
The call below passes, as no data has to leave the executors:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).count
The calls below fail, as the map output has to be serialized and sent back to the driver:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).first
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).collect
Repartitioning forces the data to be shuffled between nodes, which fails:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).repartition(2).saveAsTextFile("/tmp/NWRepart")
Without the repartition, the call below passes:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).saveAsTextFile("/tmp/NW")
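A hedged workaround sketch (not from the original answers): keep the RDD's element type serializable, and build the non-serializable object only where it is consumed, e.g. on the driver after collect.
// Array[Byte] is serializable, so it is safe as the RDD element type...
val bytes = sc.makeRDD(Seq("a", "b")).map(_.getBytes).collect
// ...and the non-serializable streams are constructed on the driver:
val streams = bytes.map(new java.io.ByteArrayInputStream(_))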

Related

Scala - how to read method signatures

I am new to Scala and trying to understand the following method:
def method1 = {
  val key = "k1"
  val value = "v1"
  basicSetup() { (a, b, c) =>
    val json = s"""{"field1":"$value"}"""
    someMethodTest.send(a, b, json, c)
  } { (record, avroObject, schema) =>
    if (avroObject.get("field1").toString != value) {
      failure("failed")
    } else {
      success
    }
  }
}
So far I have only worked with simple methods, where the call and return are easy to follow, but this one looks like it has several things bundled into it.
I need help understanding how to read it, starting from the basicSetup line (just the general flow, signature, and return).
E.g., why are there two blocks of code here: basicSetup() { ... } { ... } (and how is it executed)?
private def basicSetup()
    (run: (Producer, String, Schema) => Unit)
    (verify: (ProducerRecord[String, Array[Byte]], GenericRecord, Schema) => Result) = {
...
...
}
Thanks
It would be helpful to look at the definition of basicSetup, but it looks like a method with three parameter groups, the last two of which are themselves functions (making basicSetup a higher-order function).
The first parameter group is empty ().
The second and third are two "closures" or blocks of code or anonymous functions.
You could rewrite this as
// give names to these blocks (the parameter types are placeholders here)
def doSomethingWithABC(a: A, b: B, c: C) = ???
def doSomethingWithAvro(record: R, avro: O, schema: S) = ???
basicSetup()(doSomethingWithABC)(doSomethingWithAvro)
Why are there two blocks of code?
This is syntactic sugar to make function calls (especially higher-order function calls) look more like "built-in" constructs, so you can roll your own control-flow methods. The keyword here is DSL.
These two blocks are parameters to basicSetup. They can appear as bare blocks (without any parameter parentheses) to make the call more concise (and natural, once you get used to it).
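For instance, here is a hedged illustration (a hypothetical helper, not from the question) of rolling your own control construct with this block syntax:
// A home-grown control structure: retry a by-name block up to n times.
def retry[A](n: Int)(body: => A): A =
  try body catch { case _: Exception if n > 1 => retry(n - 1)(body) }

retry(3) {
  println("trying...") // the trailing block reads like a built-in construct
}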
Update (now that we have the signature):
private def basicSetup()
(run: (Producer, String, Schema) => Unit)
(verify: (ProducerRecord[String, Array[Byte]], GenericRecord, Schema) => Result) = {
Indeed. The function takes three parameter groups.
The first one is actually empty, so you just call it with (). But it could take parameters, even optional ones, perhaps for configuration.
The second one is your "callback" to run (after the basic setup has completed). It is itself a function that will be called with three parameters: a Producer, a String, and a Schema.
The third one is your code to "verify" the results of all that. It takes three parameters and returns a Result (presumably indicating that all is good, or what went wrong).
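As a minimal, self-contained sketch (all names invented, not the actual basicSetup), this is how multiple parameter lists enable the two-block call syntax:
// A toy higher-order method with three parameter groups:
def setup()(run: (Int, String) => Unit)(verify: String => Boolean): Boolean = {
  run(42, "topic") // invoke the first callback
  verify("result") // then the verification callback; its Boolean is returned
}

// Each trailing { ... } block is simply the next parameter group:
setup() { (n, topic) =>
  println(s"running with $n on $topic")
} { result =>
  result.nonEmpty
}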

Mapping RDD to function does not invoke the function

I am using the Scala Spark API. In my code, I have an RDD of the following structure:
Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]
I need to process (perform validations on and modify values in) the second element of the RDD. I am using the map function to do that:
myRDD.map(line => mappingFunction(line))
Unfortunately, the mappingFunction is not invoked. This is the code of the mapping function:
def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
  println("Inside mappingFunction")
  line
}
When my program ends, there are no printed messages in stdout.
In order to investigate the problem, I implemented a code snippet that worked:
val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))
And the following mapping function was invoked:
def callInt(i: Int) = {
  println("Inside callInt")
}
Please assist in getting the RDD mapping function mappingFunction invoked. Thank you.
x is a List, so there is no laziness there; that's why your function is invoked even though you never call an action.
myRDD is an RDD, and RDDs are lazy: transformations (map, flatMap, filter) are not actually executed until they need to be.
That means you are not running your map function until you perform an action. An action is an operation that triggers the preceding operations (called transformations) to be executed.
Some examples of actions are collect and count.
If you do this:
myRDD.map(line => mappingFunction(line)).count()
You'll see your prints. Anyway, there is no problem with your code at all; you just need to take the lazy nature of RDDs into consideration.
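A quick sketch of the contrast (assuming a spark-shell session, so sc is in scope):
List(1, 2, 3).map { i => println(s"List: $i"); i } // eager: prints immediately
val rdd = sc.makeRDD(Seq(1, 2, 3)).map { i => println(s"RDD: $i"); i } // lazy: prints nothing yet
rdd.count() // action: now the printlns run (on the executors)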
There is a good answer about this topic here.
You can also find more info and a full list of transformations and actions here.

Type mismatch mapping Future[Seq[model]] using Play with Scala

I still struggle sometimes to map Futures using Play with Scala...
I am trying to pass my entire Supply DB table (i.e. all the supplies in the DB) to a view.
I have tried two distinct ways but both have failed...
Below are the methods I've tried and the errors I get.
Can someone please help me solve this, and also explain why both attempts failed?
Thank you in advance!
Note: Calling supplyService.all returns a Future[Seq[Supply]].
First attempt
def index = SecuredAction.async { implicit request =>
  supplyService.all map { supplies =>
    Future.successful(Ok(views.html.supplies.index(request.identity, SupplyForm.form, supplies)))
  }
}
Second attempt
def index = SecuredAction.async { implicit request =>
  val supplies = supplyService.all
  Future.successful(Ok(views.html.supplies.index(request.identity, SupplyForm.form, supplies)))
}
Use the first variant without Future.successful (your first attempt wraps the Result in an extra Future, producing a Future[Future[Result]] where async expects a Future[Result]; your second attempt passes the unresolved Future[Seq[Supply]] itself to a view that expects a Seq[Supply]):
supplyService.all.map( supplies => Ok(views.html.supplies.index(request.identity, SupplyForm.form, supplies)) )
Since you can construct a function Seq[Supply] => Result, you can easily map
Future[Seq[Supply]] to Future[Result] via the functor interface.
Future is also a monad, so you could instead use a Seq[Supply] => Future[Result] with the flatMap method.
But Future.successful is the monad's unit, and, as for many monads, for Future it holds that
mx.flatMap(f andThen unit) = mx.map(f)
so your
ms.flatMap(supplies => Future.successful(f(supplies))) =
ms.flatMap(f andThen Future.successful) =
ms.map(f)
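A self-contained sketch of that identity with plain Futures (the names here are illustrative, not from the question):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val ms: Future[Seq[Int]] = Future.successful(Seq(1, 2, 3))
def f(xs: Seq[Int]): Int = xs.sum

val viaFlatMap = ms.flatMap(xs => Future.successful(f(xs)))
val viaMap = ms.map(f)
assert(Await.result(viaFlatMap, 1.second) == Await.result(viaMap, 1.second)) // both are 6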

iterative lookup from within rdd.map in scala

def retrieveindex(stringlist: List[String], lookuplist: List[String]) =
  stringlist.foreach(y => lookuplist.indexOf(y))
is my function.
I am trying to use this within an rdd like this:
val libsvm = libsvmlabel.map(x =>
  Array(x._2._2, retrieveindex(x._2._1.toList, featureSet.toList)))
However, the output I get is empty. There is no error, but the output from retrieveindex is empty. When I use println to check whether I am retrieving correctly, I do see the indices printed. Is there any way to do this? Should I first 'distribute' the function to all the workers? I am a newbie.
retrieveindex has a return type of Unit (because foreach just applies a function (String) ⇒ Unit to each element and discards the result). Therefore it does not map to anything.
You probably want it to return the list of indices, like:
def retrieveindex(stringlist: List[String], lookuplist: List[String]): List[Int] =
  stringlist.map(y => lookuplist.indexOf(y))
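A quick check of the corrected function:
retrieveindex(List("b", "c"), List("a", "b", "c")) // List(1, 2)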

Iteratees in Scala that use lazy evaluation or fusion?

I have heard that iteratees are lazy, but how lazy exactly are they? Alternatively, can iteratees be fused with a postprocessing function, so that an intermediate data structure does not have to be built?
For example, can I build a 1-million-item Stream[Option[String]] from a java.io.BufferedReader in my iteratee, and then subsequently filter out the Nones, in a compositional way, without requiring the entire Stream to be held in memory? And at the same time guarantee that I don't blow the stack? Or something like that - it doesn't have to use a Stream.
I'm currently using Scalaz 6 but if other iteratee implementations are able to do this in a better way, I'd be interested to know.
Please provide a complete solution, including closing the BufferedReader and calling unsafePerformIO, if applicable.
Here's a quick iteratee example using the Scalaz 7 library that demonstrates the properties you're interested in: constant memory and stack usage.
The problem
First assume we've got a big text file with a string of decimal digits on each line, and we want to find all the lines that contain at least twenty zeros. We can generate some sample data like this:
val w = new java.io.PrintWriter("numbers.txt")
val r = new scala.util.Random(0)
(1 to 1000000).foreach(_ =>
  w.println((1 to 100).map(_ => r.nextInt(10)).mkString)
)
w.close()
Now we've got a file named numbers.txt. Let's open it with a BufferedReader:
val reader = new java.io.BufferedReader(new java.io.FileReader("numbers.txt"))
It's not excessively large (~97 megabytes), but it's big enough for us to see easily whether our memory use is actually staying constant while we process it.
Setting up our enumerator
First for some imports:
import scalaz._, Scalaz._, effect.IO, iteratee.{ Iteratee => I }
And an enumerator (note that I'm changing the IoExceptionOrs into Options for the sake of convenience):
val enum = I.enumReader(reader).map(_.toOption)
Scalaz 7 doesn't currently provide a nice way to enumerate a file's lines, so we're chunking through the file one character at a time. This will of course be painfully slow, but I'm not going to worry about that here, since the goal of this demo is to show that we can process this large-ish file in constant memory and without blowing the stack. The final section of this answer gives an approach with better performance, but here we'll just split on line breaks:
val split = I.splitOn[Option[Char], List, IO](_.cata(_ != '\n', false))
And if the fact that splitOn takes a predicate that specifies where not to split confuses you, you're not alone. split is our first example of an enumeratee. We'll go ahead and wrap our enumerator in it:
val lines = split.run(enum).map(_.sequence.map(_.mkString))
Now we've got an enumerator of Option[String]s in the IO monad.
Filtering the file with an enumeratee
Next for our predicate—remember that we said we wanted lines with at least twenty zeros:
val pred = (_: String).count(_ == '0') >= 20
We can turn this into a filtering enumeratee and wrap our enumerator in that:
val filtered = I.filter[Option[String], IO](_.cata(pred, true)).run(lines)
We'll set up a simple action that just prints everything that makes it through this filter:
val printAction = (I.putStrTo[Option[String]](System.out) &= filtered).run
Of course we've not actually read anything yet. To do that we use unsafePerformIO:
printAction.unsafePerformIO()
Now we can watch the Some("0946943140969200621607610...")s slowly scroll by while our memory usage remains constant. It's slow and the error handling and output are a little clunky, but not too bad I think for about nine lines of code.
Getting output from an iteratee
That was the foreach-ish usage. We can also create an iteratee that works more like a fold—for example gathering up the elements that make it through the filter and returning them in a list. Just repeat everything above up until the printAction definition, and then write this instead:
val gatherAction = (I.consume[Option[String], IO, List] &= filtered).run
Kick that action off:
val xs: Option[List[String]] = gatherAction.unsafePerformIO().sequence
Now go get a coffee (it might need to be pretty far away). When you come back you'll either have a None (in the case of an IOException somewhere along the way) or a Some containing a list of 1,943 strings.
Complete (faster) example that automatically closes the file
To answer your question about closing the reader, here's a complete working example that's roughly equivalent to the second program above, but with an enumerator that takes responsibility for opening and closing the reader. It's also much, much faster, since it reads lines, not characters. First for imports and a couple of helper methods:
import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect._, iteratee.{ Iteratee => I, _ }
// Lift an IO action into an iteratee, capturing any exception as a Left:
def tryIO[A, B](action: IO[B]) = I.iterateeT[A, IO, Either[Throwable, B]](
  action.catchLeft.map(
    r => I.sdone(r, r.fold(_ => I.eofInput, _ => I.emptyInput))
  )
)

// Enumerate the lines of an already-open reader, one input element per line:
def enumBuffered(r: => BufferedReader) =
  new EnumeratorT[Either[Throwable, String], IO] {
    lazy val reader = r
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(reader.readLine())).flatMap {
          case Right(null) => s.pointI
          case Right(line) => k(I.elInput(Right(line))) >>== apply[A]
          case e => k(I.elInput(e))
        }
    )
  }
And now the enumerator:
// Open the file, delegate to enumBuffered, and ensure the reader is closed:
def enumFile(f: File): EnumeratorT[Either[Throwable, String], IO] =
  new EnumeratorT[Either[Throwable, String], IO] {
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(new BufferedReader(new FileReader(f)))).flatMap {
          case Right(reader) => I.iterateeT(
            enumBuffered(reader).apply(s).value.ensuring(IO(reader.close()))
          )
          case Left(e) => k(I.elInput(Left(e)))
        }
    )
  }
And we're ready to go:
val action = (
  I.consume[Either[Throwable, String], IO, List] %=
    I.filter(_.fold(_ => true, _.count(_ == '0') >= 20)) &=
    enumFile(new File("numbers.txt"))
).run
Now the reader will be closed when the processing is done.
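As with the earlier examples, nothing is read until the IO action is run; here is a usage sketch consistent with the definitions above:
// Running the action yields the gathered lines; read failures show up as Lefts:
val results: List[Either[Throwable, String]] = action.unsafePerformIO()
results.count(_.isRight) // the number of matching lines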
I should have read a little bit further... this is precisely what enumeratees are for. Enumeratees are defined in Scalaz 7 and Play 2, but not in Scalaz 6.
Enumeratees are for "vertical" composition (in the sense of a "vertically integrated industry"), while ordinary iteratees compose monadically in a "horizontal" way. In the complete example above, %= is the vertical composition, wiring the filter enumeratee into the consuming iteratee, and &= then feeds the enumerator into the result.