RDD constructed from parsed case class: serialization fails - Scala

I'm trying to understand how serialization works with a self-constructed case class and a parser in a separate object -- and I'm failing.
I tried to boil down the problem to:
parsing a string into case classes
constructing an RDD from those
taking the first element in order to print it
case class article(title: String, text: String) extends Serializable {
  override def toString = title + "/" + text
}

object parser {
  def parse(line: String): article = {
    val subs = "</end>"
    val i = line.indexOf(subs)
    val title = line.substring(6, i)
    val text = line.substring(i + subs.length, line.length)
    article(title, text)
  }
}
val text = """"<beg>Title1</end>Text 1"
"<beg>Title2</end>Text 2"
"""
val lines = text.split('\n')
val res = lines.map( line => parser.parse(line) )
val rdd = sc.parallelize(res)
rdd.take(1).map( println )
I get a
Job aborted due to stage failure: Failed to serialize task, not attempting to retry it. Exception during serialization: java.io.NotSerializableException
Could a gifted Scala expert please help me -- I just want to understand how serialization between workers and master comes into play here -- and how to fix the parser / article interaction so that serialization works?
Thank you very much.

In your map function, lines.map( line => parser.parse(line) ), you call parser.parse, and parser is your object, which is not serializable. Spark internally splits data into partitions which are spread across the cluster, and the map function is invoked on each partition. Because the partitions are not in the same JVM process, the function that is called on each partition needs to be serializable, which is why your parser object has to obey that rule.
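One way to apply that, as a minimal sketch (reusing the article case class and sc from the question, and parsing on the executors rather than on the driver), is to mark the helper object Serializable:

object parser extends Serializable {
  def parse(line: String): article = {
    val subs = "</end>"
    val i = line.indexOf(subs)
    article(line.substring(6, i), line.substring(i + subs.length))
  }
}

// parallelize the raw lines first, then parse inside the RDD on the executors
val rdd = sc.parallelize(lines).map(parser.parse)
rdd.take(1).foreach(println)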

Related

org.apache.spark.SparkException: Task not serializable -- Scala

I am reading a fixed-width text file that I need to convert to CSV. My program works fine on my local machine, but when I run it on a cluster it throws a "Task not serializable" exception.
I tried to solve the same problem with both map and mapPartitions.
It works fine when using toLocalIterator on the RDD, but that doesn't work with large files (I have files of 8 GB).
Below is the code using mapPartitions that I recently tried:
//reading source file and creating RDD
def main() {
  var inpData = sc.textFile(s3File)
  LOG.info(s"\n inpData >>>>>>>>>>>>>>> [${inpData.count()}]")

  val rowRDD = inpData.mapPartitions(iter => {
    var listOfRow = new ListBuffer[Row]
    while (iter.hasNext) {
      var line = iter.next()
      if (line.length() >= maxIndex) {
        listOfRow += getRow(line, indexList)
      } else {
        counter += 1
      }
    }
    listOfRow.toIterator
  })

  rowRDD.foreach(println)
}
case class StartEnd(startingPosition: Int, endingPosition: Int) extends Serializable

def getRow(x: String, inst: List[StartEnd]): Row = {
  val columnArray = new Array[String](inst.size)
  for (f <- 0 to inst.size - 1) {
    columnArray(f) = x.substring(inst(f).startingPosition, inst(f).endingPosition)
  }
  Row.fromSeq(columnArray)
}
// Note: for your reference, I created indexList using the StartEnd case class; after creation it looks like this:
[List(StartEnd(0,4), StartEnd(4,10), StartEnd(7,12), StartEnd(10,14))]
This program works fine on my local machine, but when I run it on a cluster (AWS) it throws the exception shown below.
>>>>>>>>Map(ResultantDF -> [], ExceptionString ->
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable)
[Driver] TRACE reflection.ReflectionInvoker$.invokeDTMethod - Exit
I am not able to understand what's wrong here, what exactly is not serializable, and why it is throwing this exception.
Any help is appreciated.
Thanks in advance!
You call the getRow method inside a Spark mapPartitions transformation. That forces Spark to pass an instance of your main class to the workers. The main class contains LOG as a field, and it seems that this logger is not serialization-friendly.
You can
a) move getRow (and LOG) out of the main class into a separate object (the general way to solve such issues; see the sketch below)
b) make LOG a lazy val
c) use another logging library
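A minimal sketch of option (a), assuming an slf4j-style logger (the object name RowUtils and the logging setup are illustrative, not from the question):

import org.apache.spark.sql.Row
import org.slf4j.LoggerFactory

// Standalone object: the closure passed to mapPartitions now only references
// RowUtils, not the main class that carries the non-serializable logger.
object RowUtils extends Serializable {
  @transient lazy val LOG = LoggerFactory.getLogger(getClass)

  def getRow(x: String, inst: List[StartEnd]): Row = {
    val columnArray = new Array[String](inst.size)
    for (f <- inst.indices) {
      columnArray(f) = x.substring(inst(f).startingPosition, inst(f).endingPosition)
    }
    Row.fromSeq(columnArray)
  }
}

Inside mapPartitions you would then call RowUtils.getRow(line, indexList) instead of the method on the main class.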

Processing an akka stream asynchronously and writing to a file sink

I am trying to write a piece of code that would consume a stream of tickers (stock exchange symbol of a company) and fetch company information from a REST API for each ticker.
I want to fetch information for multiple companies asynchronously.
I would like to save the results to a file in a continuous manner as the entire data set might not fit into memory.
Following the akka-streams documentation and the resources I was able to google on this subject, I have come up with the following piece of code (some parts are omitted for brevity):
implicit val actorSystem: ActorSystem = ActorSystem("stock-fetcher-system")
implicit val materializer: ActorMaterializer = ActorMaterializer(None, Some("StockFetcher"))(actorSystem)
implicit val context = actorSystem.dispatcher

import CompanyJsonMarshaller._

val parallelism = 10
val connectionPool = Http().cachedHostConnectionPoolHttps[String]("api.iextrading.com")
val listOfSymbols = symbols.toList
val outputPath = "out.txt"

Source(listOfSymbols)
  .mapAsync(parallelism) { stockSymbol =>
    Future((HttpRequest(uri = s"https://api.iextrading.com/1.0/stock/${stockSymbol.symbol}/company"), stockSymbol.symbol))
  }
  .via(connectionPool)
  .map {
    case (Success(response), _) => Unmarshal(response.entity).to[Company]
    case (Failure(ex), symbol) =>
      println(s"Unable to fetch char data for $symbol")
      "x"
  }
  .runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))
  .onComplete { _ =>
    bufferedSource.close
    actorSystem.terminate()
  }
This is the problematic line:
.runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))
which doesn't compile and the compiler gives me this mysteriously looking error:
Type mismatch, expected: Graph[SinkShape[Any], NotInferedMat2], actual: Sink[ByteString, Future[IOResult]]
If I change the sink to Sink.ignore or println(_) it works.
I'd appreciate some more detailed explanation.
As the compiler is indicating, the types don't match. In the call to .map...
.map {
case (Success(response), _) =>
Unmarshal(response.entity).to[Company]
case (Failure(ex), symbol) =>
println(s"Unable to fetch char data for $symbol")
"x"
}
...you're returning either a Future[Company] (the result of Unmarshal) or a String, so the compiler infers the closest common supertype (the "least upper bound") to be Any. The Sink, however, expects input elements of type ByteString, not Any.
One approach is to send the response to the file sink without unmarshalling the response:
Source(listOfSymbols)
  .mapAsync(parallelism) {
    ...
  }
  .via(connectionPool)
  .map { case (Success(response), _) => response.entity.dataBytes } // entity.dataBytes is a Source[ByteString, _]
  .flatMapConcat(identity)
  .runWith(FileIO.toPath(...))
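If you do want to keep the unmarshalling, a hedged alternative sketch (reusing the imports and types from the question; the error handling and the toString rendering are placeholders) is to resolve the Future with mapAsync and turn each Company into a ByteString before the file sink:

Source(listOfSymbols)
  .mapAsync(parallelism) {
    ...
  }
  .via(connectionPool)
  .mapAsync(parallelism) {
    case (Success(response), _) => Unmarshal(response.entity).to[Company]
    case (Failure(ex), symbol)  =>
      // failing the Future here fails the whole stream; recover instead if that is not desired
      Future.failed(new RuntimeException(s"Unable to fetch char data for $symbol", ex))
  }
  .map(company => ByteString(company.toString + "\n")) // replace toString with a proper JSON/CSV rendering
  .runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))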

Akka streams: dealing with futures within graph stage

Within an akka-stream stage FlowShape[A, B], part of the processing I need to do on the A's is to save to / query a datastore, with a query built from the A's data. But the datastore driver gives me back a Future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)

class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
  val dao = new DbDao // some dao with an async driver
  override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
    setHandlers(in, out, new InHandler with OutHandler {
      override def onPush() = {
        val foo = grab(in)
        val add = foo.record.value()
        val result: Future[String] = dao.saveAndGetRecord(add.myobject) // saves and returns the id as a string

        // the naive approach
        val record = Await.result(result, Duration.Inf)
        push(out, Bar(record)) // ***tests pass every time

        // mapping the future approach
        result.map { x =>
          push(out, Bar(x))
        } // ***tests fail every time
The next stage depends on the id of the db record returned from the query, but I want to avoid Await. I am not sure why the mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId))
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r # Bar(str) if str == expected => r
}
}
When run with mapping future, I get failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so it doesn't seem to apply to my case.
I agree with Ramon. Constructing a new FlowShape is not necessary in this case, and it is too complicated. It is much more convenient to use the mapAsync method here.
Here is a code snippet to utilize mapAsync:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object MapAsyncExample {

  val numOfParallelism: Int = 10

  def main(args: Array[String]): Unit = {
    // an actor system and materializer are needed to run the stream
    implicit val system: ActorSystem = ActorSystem("map-async-example")
    implicit val materializer: ActorMaterializer = ActorMaterializer()

    Source.repeat(5)
      .mapAsync(numOfParallelism)(x => asyncSquare(x))
      .runWith(Sink.foreach(println)) // the sink prints the output of the previous stage
  }

  // This method returns a Future
  // You can replace this part with your database operations
  def asyncSquare(value: Int): Future[Int] = Future {
    value * value
  }
}
In the snippet above, Source.repeat(5) is a dummy source that emits 5 indefinitely. The sample function asyncSquare takes an integer and calculates its square inside a Future. The .mapAsync(numOfParallelism)(x => asyncSquare(x)) line applies that function and emits the result of each Future to the next stage. In this snippet, the next stage is a sink that prints every item.
numOfParallelism is the maximum number of asyncSquare calls that may be running concurrently.
I think your GraphStage is unnecessarily overcomplicated. The below Flow performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 // number of parallel db queries

val SaveAndGetId: Flow[Foo, Bar, _] =
  Flow[Foo]
    .map(foo => foo.record.value().myobject)
    .mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
    .map(Bar.apply)
I generally treat GraphStage as a last resort; there is almost always an idiomatic way of getting the same Flow by using the combinators the akka-stream library already provides.
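As to why the future-mapping approach in the question fails: push, fail and the other stage methods may only be called from the stage's own callbacks, while the Future's map runs on an arbitrary execution context. If a custom GraphStage really is needed, the completion has to be routed back in via getAsyncCallback. A rough sketch under that assumption (DbDao, Foo and Bar are the types from the question; completion while a save is still in flight is ignored for brevity):

import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

class SaveAndGetIdAsync extends GraphStage[FlowShape[Foo, Bar]] {
  val in: Inlet[Foo]   = Inlet("SaveAndGetIdAsync.in")
  val out: Outlet[Bar] = Outlet("SaveAndGetIdAsync.out")
  override val shape: FlowShape[Foo, Bar] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      val dao = new DbDao

      // getAsyncCallback returns a handle that may safely be invoked from
      // another thread, here the Future's execution context
      private val onSaved = getAsyncCallback[Try[String]] {
        case Success(id) => push(out, Bar(id))
        case Failure(ex) => failStage(ex)
      }

      setHandlers(in, out, new InHandler with OutHandler {
        override def onPush(): Unit = {
          val foo = grab(in)
          dao.saveAndGetRecord(foo.myobject).onComplete(onSaved.invoke)
        }
        override def onPull(): Unit = pull(in)
      })
    }
}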

Trying to understand Scala enumerator/iteratees

I am new to Scala and Play!, but have a reasonable amount of experience of building webapps with Django and Python and of programming in general.
I've been doing an exercise of my own to try to improve my understanding: simply pull some records from a database and output them as a JSON array. I'm trying to use the Enumerator/Iteratee functionality to do this.
My code follows:
TestObjectController.scala:
def index = Action {
  db.withConnection { conn =>
    val stmt = conn.createStatement()
    val result = stmt.executeQuery("select * from datatable")
    logger.debug(result.toString)

    val resultEnum: Enumerator[TestDataObject] = Enumerator.generateM {
      logger.debug("called enumerator")
      result.next() match {
        case true =>
          val obj = TestDataObject(result.getString("name"), result.getString("object_type"),
            result.getString("quantity").toInt, result.getString("cost").toFloat)
          logger.info(obj.toJsonString)
          Future(Some(obj))
        case false =>
          logger.warn("reached end of iteration")
          stmt.close()
          null
      }
    }

    val consume: Iteratee[TestDataObject, Seq[TestDataObject]] =
      Iteratee.fold[TestDataObject, Seq[TestDataObject]](Seq.empty[TestDataObject]) { (result, chunk) => result :+ chunk }

    val newIteree = Iteratee.flatten(resultEnum(consume))
    val eventuallyResult: Future[Seq[TestDataObject]] = newIteree.run
    eventuallyResult.onSuccess { case x => println(x) }
    Ok("")
  }
}
TestDataObject.scala:
package models

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class TestDataObject(name: String, objtype: String, quantity: Int, cost: Float) {
  def toJsonString: String = {
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    mapper.writeValueAsString(this)
  }
}
I have two main questions:
How do I signal that the input is complete from the Enumerator callback? The documentation says "this method takes a callback function e: => Future[Option[E]] that will be called each time the iteratee this Enumerator is applied to is ready to take some input.", but I am unable to pass any kind of EOF that I've found because it's the wrong type. Wrapping it in a Future does not help, but instinctively I am not sure that's the right approach.
How do I get the final result out of the Future to return from the controller? My understanding is that I would effectively need to pause the main thread to wait for the subthreads to complete, but the only thing I've found in the Future class is the onSuccess callback - how can I then return that from the controller? Does Iteratee.run block until all input has been consumed?
A couple of sub-questions as well, to help my understanding:
Why do I need to wrap my object in Some() when it's already in a Future? What exactly does Some() represent?
When I run the code for the first time, I get a single record logged from logger.info and then it reports "reached end of iteration". Subsequent runs in the same session log nothing. I am closing the statement, though, so why do I get no results the second time around? I was expecting it to loop indefinitely, since I don't know how to signal the correct termination for the loop.
Thanks a lot in advance for any answers, I thought I was getting the hang of this but obviously not yet!
How do I signal that the input is complete from the Enumerator callback?
You return a Future(None).
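Concretely, a minimal sketch of the enumerator from the question with that termination branch (a completed Future of None instead of null):

val resultEnum: Enumerator[TestDataObject] = Enumerator.generateM {
  result.next() match {
    case true =>
      val obj = TestDataObject(result.getString("name"), result.getString("object_type"),
        result.getString("quantity").toInt, result.getString("cost").toFloat)
      Future.successful(Some(obj))
    case false =>
      stmt.close()
      Future.successful(None) // signals the end of the enumeration
  }
}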
How do I get the final result out of the Future to return from the controller view?
You can use Action.async (doc):
def index = Action.async {
  db.withConnection { conn =>
    ...
    val eventuallyResult: Future[Seq[TestDataObject]] = newIteree.run
    eventuallyResult map { data =>
      Ok(...)
    }
  }
}
Why do I need to wrap my object in Some() when it's already in a Future? What exactly does Some() represent?
The Future represents the (potentially asynchronous) processing to obtain the next element. The Option represents the availability of the next element: Some(x) if another element is available, None if the enumeration is completed.

Streaming a CSV file to the browser in Spray

In one part of my application I have to send a CSV file back to the browser. I have an actor that replies with a Stream[String]; each element of the Stream is one line of the file that should be sent to the user.
This is what I currently have. Note that I'm currently returning MediaType text/plain for debugging purposes, it will be text/csv in the final version:
trait TripService extends HttpService
  with Matchers
  with CSVTripDefinitions {

  implicit def actorRefFactory: ActorRefFactory
  implicit val executionContext = actorRefFactory.dispatcher

  lazy val csvActor = Ridespark.system.actorSelection(s"/user/listeners/csv").resolveOne()(1000)

  implicit val stringStreamMarshaller = Marshaller.of[Stream[String]](MediaTypes.`text/plain`) { (v, ct, ctx) =>
    ctx.marshalTo(HttpEntity(ct, v.mkString("\n")))
  }

  val route = {
    get {
      path("trips" / Segment / DateSegment) { (network, date) =>
        respondWithMediaType(MediaTypes.`text/plain`) {
          complete(askTripActor(date, network))
        }
      }
    }
  }

  def askTripActor(date: LocalDate, network: String)
                  (implicit timeout: Timeout = Timeout(1000, TimeUnit.MILLISECONDS)): Future[Stream[String]] = {
    csvActor flatMap { actor => (actor ? GetTripsForDate(date, Some(network))).mapTo[Stream[String]] }
  }
}

class TripApi(info: AuthInfo)(implicit val actorRefFactory: ActorRefFactory) extends TripService
My problem with this is that it will load the whole CSV into memory (due to the mkString in stringStreamMarshaller). Is there a way of not doing this, streaming the lines of the CSV as they are generated?
You don't need the stringStreamMarshaller. Using spray's built-in stream marshaller should already do what you want, just without adding the extra line breaks. However, adding the line breaks shouldn't be harder than using streamOfStrings.map(_ + "\n"), or simply adding the line breaks when rendering the lines in the csv-actor in the first place. A sketch of the first option follows.
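A hedged sketch of that suggestion, assuming spray's built-in Stream marshaller is in scope (everything else is taken from the route above): the newline is appended per element, so nothing forces the whole Stream into a single string in memory.

val route = {
  get {
    path("trips" / Segment / DateSegment) { (network, date) =>
      respondWithMediaType(MediaTypes.`text/plain`) {
        // each line gets its newline here; the Stream is then marshalled
        // element by element instead of being mkString'd into one big string
        complete(askTripActor(date, network).map(_.map(_ + "\n")))
      }
    }
  }
}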