ReactiveMongo w/ Play Scala

I'm new to Play, Scala, and ReactiveMongo, and the documentation is not very beginner friendly.
I see the Bulk Insert section of the documentation (See Bulk Insert), but I don't understand why they aren't showing it contained in a method.
I am expecting a request with JSON data containing multiple objects in it. How do I set up a bulk insert that handles multiple inserts and can return errors?
For example, my single insert method is as follows:
def createFromJson = Action(parse.json) { request =>
  try {
    val person = request.body.validate[Person].get
    val mongoResult = Await.result(collection.insert(person), Duration(20, "seconds"))
    if (mongoResult.hasErrors) throw new Exception(mongoResult.errmsg.getOrElse("something unknown"))
    Created(Json.toJson(person))
  }
  catch {
    case e: Exception => BadRequest(e.getMessage)
  }
}

Here is a full example of how you can do it:
class ExampleController @Inject()(database: DefaultDB) extends Controller {

  case class Person(firstName: String, lastName: String)

  val personCollection: BSONCollection = database.collection("persons")

  implicit val PersonJsonReader: Reads[Person] = Json.reads[Person]
  implicit val PersonSeqJsonReader: Reads[Seq[Person]] = Reads.seq(PersonJsonReader)
  implicit val PersonJsonWriter: Writes[Person] = Json.writes[Person]
  implicit val PersonSeqJsonWriter: Writes[Seq[Person]] = Writes.seq(PersonJsonWriter)
  implicit val PersonBsonWriter = Macros.writer[Person]

  def insertMultiple = Action.async(parse.json) { implicit request =>
    val validationResult: JsResult[Seq[Person]] = request.body.validate[Seq[Person]]
    validationResult.fold(
      invalidValidationResult => Future.successful(BadRequest),
      // [1]
      validValidationResult => {
        val bulkDocs = validValidationResult.
          map(implicitly[personCollection.ImplicitlyDocumentProducer](_))
        personCollection.bulkInsert(ordered = true)(bulkDocs: _*).map {
          case insertResult if insertResult.ok =>
            Created(Json.toJson(validationResult.get))
          case insertResult =>
            InternalServerError
        }
      }
    )
  }
}
The meat of it all sits in the lines after [1]. validValidationResult is a variable of type Seq[Person] and contains valid data at this point. That's what we want to insert into the database.
To do that, we need to prepare the documents by mapping each document through the ImplicitlyDocumentProducer of your target collection (here personCollection). That leaves you with bulkDocs of type Seq[personCollection.ImplicitlyDocumentProducer]. You can just use bulkInsert() with that:
personCollection.bulkInsert(ordered = true)(bulkDocs: _*)
We use _* here to splat the Seq, since bulkInsert() expects varargs and not a Seq. See this thread for more info about it. And that's basically it already.
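If the splat syntax is new to you, here is a minimal standalone illustration (the sum function is made up for the example):
def sum(xs: Int*): Int = xs.sum // varargs parameter

val nums = Seq(1, 2, 3)
sum(nums: _*) // _* passes the Seq's elements as individual varargs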
The remaining code handles Play results and validates the received request body to make sure it contains valid data.
Here are a few general tips to work with play/reactivemongo/scala/futures:
Avoid Await.result. You basically never need it in production code. The idea behind futures is to perform non-blocking operations; making them blocking again with Await.result defeats the purpose. It can be useful for debugging or test code, but even then there are usually better ways to go about things. Scala futures (unlike Java ones) are very powerful and you can do a lot with them, see e.g. flatMap/map/filter/foreach/... in the Future scaladoc.
The above code makes use of exactly that: it uses Action.async instead of Action at the controller method, which means it has to return a Future[Result] instead of a Result. This is great, because ReactiveMongo returns Futures for all of its operations. So all you have to do is execute bulkInsert, which returns a Future, and use map() to turn the returned Future[MultiBulkWriteResult] into a Future[Result]. Nothing blocks, and Play can work with the returned future just fine.
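To make the contrast concrete, a minimal sketch reusing collection and Person from the question (error handling omitted):
// Blocking (avoid): ties up the request thread for up to 20 seconds
val result = Await.result(collection.insert(person), Duration(20, "seconds"))

// Non-blocking (preferred): map the Future into a Future[Result]
def create = Action.async(parse.json) { request =>
  val person = request.body.validate[Person].get
  collection.insert(person).map(_ => Created(Json.toJson(person)))
}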
Of course the above example can be improved a bit; I tried to keep it simple.
For instance, you should return proper error messages when returning BadRequest (request body validation failed) or InternalServerError (database write failed). You can get more info about the errors from invalidValidationResult and insertResult. And you could use Formats instead of that many Reads/Writes (and also use them for ReactiveMongo). Check the Play JSON documentation as well as the ReactiveMongo documentation for more info on that.
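As a sketch, a single Format can replace each Reads/Writes pair above:
implicit val personFormat: Format[Person] = Json.format[Person]
implicit val personSeqFormat: Format[Seq[Person]] =
  Format(Reads.seq(personFormat), Writes.seq(personFormat))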

Although the previous answer is correct, we can reduce the boilerplate by using a JSONCollection:
package controllers

import javax.inject._

import play.api.libs.json._
import play.api.mvc._
import play.modules.reactivemongo._
import reactivemongo.play.json.collection.{JSONCollection, _}
import utils.Errors

import scala.concurrent.{ExecutionContext, Future}

case class Person(name: String, age: Int)

object Person {
  implicit val formatter = Json.format[Person]
}

@Singleton
class PersonBulkController @Inject()(val reactiveMongoApi: ReactiveMongoApi)(implicit exec: ExecutionContext) extends Controller with MongoController with ReactiveMongoComponents {

  val persons: JSONCollection = db.collection[JSONCollection]("person")

  def createBulkFromJson = Action.async(parse.json) { request =>
    Json.fromJson[Seq[Person]](request.body) match {
      case JsSuccess(newPersons, _) =>
        val documents = newPersons.map(implicitly[persons.ImplicitlyDocumentProducer](_))
        persons.bulkInsert(ordered = true)(documents: _*).map { multiResult =>
          Created(s"Created ${multiResult.n} persons")
        }
      case JsError(errors) =>
        Future.successful(BadRequest("Could not build an array of persons from the json provided. " + errors))
    }
  }
}
In build.sbt:
libraryDependencies ++= Seq(
  "org.reactivemongo" %% "play2-reactivemongo" % "0.11.12"
)
Tested with Play 2.5.1, although it should compile in previous versions of Play.

FYI, as previous answers said, there are two ways to manipulate JSON data: use ReactiveMongo module + Play JSON library, or use ReactiveMongo's BSON library.
The documentation of ReactiveMongo module for Play Framework is available online. You can find code examples there.

Related

No instance of play.api.libs.json.Format is available for akka.actor.typed.ActorRef[org.knoldus.eventSourcing.UserState.Confirmation]

No instance of play.api.libs.json.Format is available for akka.actor.typed.ActorRef[org.knoldus.eventSourcing.UserState.Confirmation] in the implicit scope (Hint: if declared in the same file, make sure it's declared before)
[error] implicit val userCommand: Format[AddUserCommand] = Json.format
I am getting this error even though I have defined an implicit Json Format instance for AddUserCommand.
Here is my code:
trait UserCommand extends CommandSerializable

object AddUserCommand {
  implicit val format: Format[AddUserCommand] = Json.format[AddUserCommand]
}

final case class AddUserCommand(user: User, reply: ActorRef[Confirmation]) extends UserCommand
Can anyone please help me with this error and how to solve it?
As Gael noted, you need to provide a Format for ActorRef[Confirmation]. The complication is that the natural serialization, using the ActorRefResolver, requires an ExtendedActorSystem to be present, which means that the usual approach of defining a Format in a companion object won't quite work.
Note that because of the way Lagom does dependency injection, this approach doesn't really work in Lagom: commands in Lagom basically can't use Play JSON.
import akka.actor.typed.scaladsl.adapter.ClassicActorSystemOps
import play.api.libs.json._

class PlayJsonActorRefFormat(system: ExtendedActorSystem) {
  def reads[A] = new Reads[ActorRef[A]] {
    def reads(jsv: JsValue): JsResult[ActorRef[A]] =
      jsv match {
        case JsString(s) => JsSuccess(ActorRefResolver(system.toTyped).resolveActorRef(s))
        case _ => JsError(Seq(JsPath() -> Seq(JsonValidationError(Seq("ActorRefs are strings"))))) // hopefully parenthesized that right...
      }
  }

  def writes[A] = new Writes[ActorRef[A]] {
    def writes(a: ActorRef[A]): JsValue = JsString(ActorRefResolver(system.toTyped).toSerializationFormat(a))
  }

  def format[A] = Format[ActorRef[A]](reads, writes)
}
You can then define a format for AddUserCommand as
object AddUserCommand {
  def format(arf: PlayJsonActorRefFormat): Format[AddUserCommand] = {
    implicit def arfmt[A]: Format[ActorRef[A]] = arf.format
    Json.format[AddUserCommand]
  }
}
Since you're presumably using JSON to serialize the messages sent around a cluster (otherwise, the ActorRef shouldn't be leaking out like this), you would then construct an instance of the format in your Akka Serializer implementation.
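As a sketch of how that might look (the serializer class name and identifier are made up here, and the serializer would still need to be bound in the Akka configuration):
import akka.actor.ExtendedActorSystem
import akka.serialization.SerializerWithStringManifest
import play.api.libs.json._

class AddUserCommandSerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest {
  // build the ActorRef format from the actor system the serializer receives
  private val refFormat = new PlayJsonActorRefFormat(system)
  private implicit val format: Format[AddUserCommand] = AddUserCommand.format(refFormat)

  override def identifier: Int = 424242 // must be unique among serializers
  override def manifest(o: AnyRef): String = o.getClass.getName

  override def toBinary(o: AnyRef): Array[Byte] = o match {
    case cmd: AddUserCommand => Json.stringify(Json.toJson(cmd)).getBytes("UTF-8")
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef =
    Json.parse(new String(bytes, "UTF-8")).as[AddUserCommand]
}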
(NB: I've only done this with Circe, not Play JSON, but the basic approach is common)
The error says that it cannot construct a Format for AddUserCommand because there's no Format for ActorRef[Confirmation].
When using Json.format[X], all the members of the case class X must have a Format defined.
In your case, you probably don't want to define a formatter for this case class (serializing an ActorRef doesn't make much sense) but rather build another case class with data only.
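For example, a data-only counterpart (the name AddUserPayload is made up, and a Format[User] is assumed to be in scope):
final case class AddUserPayload(user: User)

object AddUserPayload {
  // derivable because every member has a Format
  implicit val format: Format[AddUserPayload] = Json.format[AddUserPayload]
}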
Edit: See Levi's answer on how to provide a formatter for ActorRef if you really want to send out there the actor reference.

Transforming Scala case class into JSON

I have two case classes. The main one, Request, contains two maps.
The first map has a string for both key and value.
The second map has a string key and a value which is an instance of the second case class, KVMapList.
case class Request(
  var parameters: MutableMap[String, String] = MutableMap[String, String](),
  var deps: MutableMap[String, KVMapList] = MutableMap[String, KVMapList]()
)

case class KVMapList(kvMap: MutableMap[String, String], list: ListBuffer[MutableMap[String, String]])
The requirement is to transform Request into a JSON representation.
The following code is trying to do this:
import java.io.IOException

import com.fasterxml.jackson.annotation.PropertyAccessor
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.annotation.JsonAutoDetect.Visibility
import com.fasterxml.jackson.databind.ObjectMapper

def test(req: Request): String = {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.setVisibility(PropertyAccessor.ALL, Visibility.ANY)
  var jsonInString: String = null
  try {
    jsonInString = mapper.writeValueAsString(req)
  }
  catch {
    case e: IOException => e.printStackTrace()
  }
  jsonInString
}
This however is not working. Even when the Request class is populated, the output is :
{"parameters":{"underlying":{"some-value":""},"empty":false,"traversableAgain":true},"deps":{"sizeMapDefined":false,"empty":false,"traversableAgain":true}}
Using the JSON object mapper with corresponding Java classes is straightforward, but I have not yet got it working in Scala. Any assistance is very much appreciated.
Jackson is more of a bad old memory in Scala to some degree. You should use a native Scala library for JSON processing, particularly one that is really good at compile-time derivation of JSON serializers, such as circe.
I'm aware this doesn't directly answer your question, but after using circe I would never go back to anything else.
import io.circe.generic.auto._
import io.circe.parser._
import io.circe.syntax._
val req = new Request(...)
val json = req.asJson.noSpaces
val reparsed = decode[Request](json)
On a different note, using mutable maps inside case classes is as non-idiomatic as it gets, and it should be quite trivial to implement immutable ops for your maps using the auto-generated copy method.
case class Request(parameters: Map[String, String]) {
  def +(key: String, value: String): Request =
    this.copy(parameters = parameters + (key -> value))
}
You should really avoid mutability wherever possible, and it looks like avoiding it here wouldn't be much work at all.
I am not sure what this ScalaObjectMapper does; it doesn't look like it is useful.
If you add mapper.registerModule(DefaultScalaModule) in the beginning, it should work ... assuming that by MutableMap you mean mutable.Map, and not some sort of home-made class (because if you do, you'd have to provide a serializer for it yourself).
(DefaultScalaModule is in jackson-module-scala library. Just add it to your build if you don't already have it).
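Putting that together, a minimal sketch of the fixed setup from the question's test method:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule) // teaches Jackson about Scala collections and case classes
val jsonInString = mapper.writeValueAsString(req)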

The most trivial self-hosted REST-like web service in Scala?

I have a function of the form:
(inp:CaseClass) => SomeOtherCaseClass
In other words, it is a function with a single argument whose type is a case class. It returns a single value whose type is a different case class.
Just about every public function in my application will be of this form.
I'd like to expose this as a web-service so that a client can use this function by HTTP-Posting some JSON. The client will receive the response as a JSON encoded document.
An ideal solution will be:
Simple (e.g. few lines of code, using mostly non-obscure language features)
Should automatically marshal and unmarshal JSON - I don't want to have to write manual converters.
As easy to use as Flask (a popular Python micro-framework for web-stuff).
Things I don't yet care about:
High performance
Authentication / Encryption
I do have a working implementation based on Scalatra. It's OK-ish, but not particularly pretty, because it includes quite a bit of boilerplate code just to start a server. I'm wondering if I can go for something even more minimal.
That solution was based on some 3-year-old code samples I found at work. I'm sure there must be something more appropriate that has been developed in the last few years?
Have a look at akka-http. It's relatively new (a rewrite of the now-obsolete spray library) and has a large support community. It's easy to get started with for your simple use case, and if you ever need more advanced features in the future, they're probably supported. JSON de-/serialization is done using the spray-json adapter; you may also use other libraries, like circe, with a minimal amount of boilerplate. Here's a simple web server implementation accepting POST requests (copied from here):
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.stream.ActorMaterializer
import akka.Done
import akka.http.scaladsl.server.Route
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import spray.json.DefaultJsonProtocol._

import scala.io.StdIn
import scala.concurrent.Future

object WebServer {

  // domain model
  final case class Item(name: String, id: Long)
  final case class Order(items: List[Item])

  // formats for unmarshalling and marshalling
  implicit val itemFormat = jsonFormat2(Item)
  implicit val orderFormat = jsonFormat1(Order)

  // (fake) async database query api
  def fetchItem(itemId: Long): Future[Option[Item]] = ???
  def saveOrder(order: Order): Future[Done] = ???

  def main(args: Array[String]) {
    // needed to run the route
    implicit val system = ActorSystem()
    implicit val materializer = ActorMaterializer()
    // needed for the future map/flatmap in the end
    implicit val executionContext = system.dispatcher

    val route: Route =
      get {
        pathPrefix("item" / LongNumber) { id =>
          // there might be no item for a given id
          val maybeItem: Future[Option[Item]] = fetchItem(id)
          onSuccess(maybeItem) {
            case Some(item) => complete(item)
            case None       => complete(StatusCodes.NotFound)
          }
        }
      } ~
      post {
        path("create-order") {
          entity(as[Order]) { order =>
            val saved: Future[Done] = saveOrder(order)
            onComplete(saved) { done =>
              complete("order created")
            }
          }
        }
      }

    val bindingFuture = Http().bindAndHandle(route, "localhost", 8080)
    println(s"Server online at http://localhost:8080/\nPress RETURN to stop...")
    StdIn.readLine() // let it run until user presses return
    bindingFuture
      .flatMap(_.unbind())                 // trigger unbinding from the port
      .onComplete(_ => system.terminate()) // and shutdown when done
  }
}
Does this suit your needs?

Play 2.5 and ReactiveMongo 0.11 compilation error: value find is not a member of scala.concurrent.Future

I have created a simple app with Play 2.5 where data from a form is written to the database, for which I use ReactiveMongo 0.11.
Currently I am trying to get the index page to display all the entries in the DB.
My code:
def index = Action.async {
  val cursor: Cursor[User] = collectionF.find(Json.obj()).cursor[User]
  val futureUserList: Future[List[User]] = cursor.collect[List]()
  val futureUserListJsonArray: Future[JsArray] = futureUserList.map { user =>
    Json.arr(user)
  }
  futureUserListJsonArray.map { user =>
    Ok(user)
  }
}
I receive the following compilation error:
value find is not a member of scala.concurrent.Future[reactivemongo.play.json.collection.JSONCollection]
Where the find refers to the collectionF.find() on the second line.
This is what I have imported with regards to concurrency:
import scala.concurrent.Future
import play.api.libs.concurrent.Execution.Implicits.defaultContext
Thanks
As cchantep already pointed out in a comment, you are trying to call find on a Future, not on ReactiveMongo's JSONCollection, and you need a way to access the JSONCollection instead.
However, I'm wondering how you ended up with a Future[JSONCollection] in the first place. Not only can I not imagine a viable reason to do that; it's also cumbersome to use that construct afterwards. Yes, you could use e.g. map to get access to the JSONCollection, but this means that the returned Cursor is also a Future (which subsequently forces you to flatMap any operation you do with the Cursor).
val cursor: Future[Cursor[User]] = collectionF.map(_.find(...).cursor[User])
val users = cursor.flatMap(...)
Have a look at the documentation of ReactiveMongo's play plugin, especially at the examples at the bottom of the page:
class Application @Inject() (val reactiveMongoApi: ReactiveMongoApi)
  extends Controller with MongoController with ReactiveMongoComponents {

  // no Future here:
  def collection: JSONCollection = db.collection[JSONCollection]("users")
}
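With a plain JSONCollection like that, the index action from the question works without the extra Future layer. A sketch, assuming an implicit OFormat[User] is in scope:
def index = Action.async {
  val cursor: Cursor[User] = collection.find(Json.obj()).cursor[User]
  cursor.collect[List]().map(users => Ok(Json.toJson(users)))
}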

Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

I'm getting strange behavior when calling a function outside of a closure:
when the function is in an object, everything works
when the function is in a class, I get:
Task not serializable: java.io.NotSerializableException: testing
The problem is that I need my code in a class and not an object. Any idea why this is happening? Is a Scala object serialized by default?
This is a working code example:
object working extends App {
  val list = List(1,2,3)

  val rddList = Spark.ctx.parallelize(list)
  //calling function outside closure
  val after = rddList.map(someFunc(_))

  def someFunc(a: Int) = a + 1

  after.collect().map(println(_))
}
This is the non-working example:
object NOTworking extends App {
  new testing().doIT
}

//adding extends Serializable won't help
class testing {
  val list = List(1,2,3)
  val rddList = Spark.ctx.parallelize(list)

  def doIT = {
    //again calling the function someFunc
    val after = rddList.map(someFunc(_))
    //this will crash (spark lazy)
    after.collect().map(println(_))
  }

  def someFunc(a: Int) = a + 1
}
RDDs extend the Serializable interface, so that is not what's causing your task to fail. Now, this doesn't mean that you can serialize an RDD with Spark and avoid NotSerializableException.
Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.
Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:
serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes
You can of course run this locally (as in your example), but all those phases (apart from shipping over the network) still occur. [This lets you catch any bugs even before deploying to production.]
What happens in your second case is that you are calling a method defined in class testing from inside the map function. Spark sees that, and since methods cannot be serialized on their own, Spark tries to serialize the whole testing class, so that the code will still work when executed in another JVM. You have two possibilities:
Either you make class testing serializable, so the whole class can be serialized by Spark:
import org.apache.spark.{SparkContext, SparkConf}

object Spark {
  val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}

object NOTworking extends App {
  new Test().doIT
}

class Test extends java.io.Serializable {
  val rddList = Spark.ctx.parallelize(List(1,2,3))

  def doIT() = {
    val after = rddList.map(someFunc)
    after.collect().foreach(println)
  }

  def someFunc(a: Int) = a + 1
}
or you make someFunc a function instead of a method (functions are objects in Scala), so that Spark will be able to serialize it:
import org.apache.spark.{SparkContext, SparkConf}

object Spark {
  val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}

object NOTworking extends App {
  new Test().doIT
}

class Test {
  val rddList = Spark.ctx.parallelize(List(1,2,3))

  def doIT() = {
    val after = rddList.map(someFunc)
    after.collect().foreach(println)
  }

  val someFunc = (a: Int) => a + 1
}
A similar, but not the same, problem with class serialization may also be of interest; you can read about it in this Spark Summit 2013 presentation.
As a side note, you can rewrite rddList.map(someFunc(_)) to rddList.map(someFunc); they are exactly the same. Usually the second is preferred, as it's less verbose and cleaner to read.
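The two forms side by side:
val after1 = rddList.map(someFunc(_)) // anonymous function that calls the method
val after2 = rddList.map(someFunc)    // eta-expansion lifts the method to a function value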
EDIT (2015-03-15): SPARK-5307 introduced SerializationDebugger, and Spark 1.3.0 is the first version to use it. It adds the serialization path to a NotSerializableException: when a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find the object.
In OP's case, this is what gets printed to stdout:
Serialization stack:
    - object not serializable (class: testing, value: testing@2dfe2f00)
    - field (class: testing$$anonfun$1, name: $outer, type: class testing)
    - object (class testing$$anonfun$1, <function1>)
Grega's answer is great in explaining why the original code does not work and two ways to fix the issue. However, this solution is not very flexible; consider the case where your closure includes a method call on a non-Serializable class that you have no control over. You can neither add the Serializable tag to this class nor change the underlying implementation to change the method into a function.
Nilesh presents a great workaround for this, but the solution can be made both more concise and general:
def genMapper[A, B](f: A => B): A => B = {
  val locker = com.twitter.chill.MeatLocker(f)
  x => locker.get.apply(x)
}
This function-serializer can then be used to automatically wrap closures and method calls:
rdd map genMapper(someFunc)
This technique also has the benefit of not requiring the additional Shark dependencies in order to access KryoSerializationWrapper, since Twitter's Chill is already pulled in by core Spark.
A complete talk fully explaining the problem, which proposes a great paradigm-shifting way to avoid these serialization problems: https://github.com/samthebest/dump/blob/master/sams-scala-tutorial/serialization-exceptions-and-memory-leaks-no-ws.md
The top voted answer is basically suggesting throwing away an entire language feature - that is no longer using methods and only using functions. Indeed in functional programming methods in classes should be avoided, but turning them into functions isn't solving the design issue here (see above link).
As a quick fix in this particular situation you could just use the @transient annotation to tell it not to try to serialise the offending value (here, Spark.ctx is a custom class, not Spark's one, following OP's naming):
@transient
val rddList = Spark.ctx.parallelize(list)
You can also restructure code so that rddList lives somewhere else, but that is also nasty.
The Future is Probably Spores
In the future, Scala will include these things called "spores" that should allow us fine-grained control over what does and does not get pulled in by a closure. Furthermore, this should turn all mistakes of accidentally pulling in non-serializable types (or any unwanted values) into compile errors, rather than what we have now, which is horrible runtime exceptions / memory leaks.
http://docs.scala-lang.org/sips/pending/spores.html
A tip on Kryo serialization
When using Kryo, make it so that registration is necessary; this will mean you get errors instead of memory leaks:
"Finally, I know that kryo has kryo.setRegistrationOptional(true) but I am having a very difficult time trying to figure out how to use it. When this option is turned on, kryo still seems to throw exceptions if I haven't registered classes."
Strategy for registering classes with kryo
Of course this only gives you type-level control not value-level control.
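A sketch of what requiring registration can look like via the Spark configuration (the class to register is made up for the example):
val conf = new org.apache.spark.SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true") // unregistered classes now fail fast
  .registerKryoClasses(Array(classOf[MyRecord])) // MyRecord is hypothetical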
... more ideas to come.
I faced a similar issue, and what I understand from Grega's answer is:
object NOTworking extends App {
  new testing().doIT
}

//adding extends Serializable won't help
class testing {
  val list = List(1,2,3)
  val rddList = Spark.ctx.parallelize(list)

  def doIT = {
    //again calling the function someFunc
    val after = rddList.map(someFunc(_))
    //this will crash (spark lazy)
    after.collect().map(println(_))
  }

  def someFunc(a: Int) = a + 1
}
Your doIT method is trying to serialize the someFunc(_) method, but since methods are not serializable, Spark tries to serialize the whole testing class, which is again not serializable.
To make your code work, you should define someFunc inside the doIT method. For example:
def doIT = {
  //function definition
  def someFunc(a: Int) = a + 1
  val after = rddList.map(someFunc(_))
  after.collect().map(println(_))
}
And if there are multiple functions coming into the picture, then all those functions should be available to the parent context.
I solved this problem using a different approach. You simply need to serialize the objects before passing them through the closure, and de-serialize them afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)
Here's an example of how I did it:
def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
             (foo: Foo): Bar = {
  kryoWrapper.value.apply(foo)
}

val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
rdd.flatMap(mapper).collectAsMap()

class Blah(abc: ABC) extends (Foo => Bar) {
  def apply(foo: Foo): Bar = ??? // This is the real function
}
Feel free to make Blah as complicated as you want: class, companion object, nested classes, references to multiple 3rd-party libs.
KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala
I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException by refactoring my code so that the closure did not access a non-serializable final field.
Scala methods defined in a class are not serializable; methods can be converted into functions to resolve the serialization issue.
Method syntax:
def func_name(x: String): String = {
  ...
  return x
}
Function syntax:
val func_name = { (x: String) =>
  ...
  x
}
FYI, in Spark 2.4 a lot of you will probably encounter this issue. Kryo serialization has gotten better, but in many cases you cannot use spark.kryo.unsafe=true or the naive Kryo serializer.
For a quick fix, try changing the following in your Spark configuration:
spark.kryo.unsafe="false"
OR
spark.serializer="org.apache.spark.serializer.JavaSerializer"
I modify custom RDD transformations that I encounter or personally write by using explicit broadcast variables and the new inbuilt twitter-chill API, converting them from rdd.map(row => ...) functions to rdd.mapPartitions(partition => { ... }) functions.
Example
Old (not-great) Way
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val outputRDD = rdd.map(row => {
  val value = sampleMap.get(row._1)
  value
})
Alternative (better) Way
import com.twitter.chill.MeatLocker

val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))

rdd.mapPartitions(partition => {
  val deSerSampleMap = brdSerSampleMap.value.get
  partition.map(row => {
    val value = deSerSampleMap.get(row._1) // look up in the deserialized broadcast copy
    value
  }).toIterator
})
This new way will only deserialize the broadcast variable once per partition, which is better. You will still need to use Java serialization if you do not register classes.
I had a similar experience.
The error was triggered when I initialized a variable on the driver (master) but then tried to use it on one of the workers.
When that happens, Spark Streaming will try to serialize the object to send it over to the worker, and fail if the object is not serializable.
I solved the error by making the variable static.
Previous non-working code
private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Working code
private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Credits:
https://learn.microsoft.com/en-us/answers/questions/35812/sparkexception-job-aborted-due-to-stage-failure-ta.html ( The answer of pradeepcheekatla-msft)
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
Another example, this time hitting the error through a Spark UDF:
def upper(name: String): String = {
  val uppercased: String = name.toUpperCase()
  uppercased
}

val toUpperName = udf { (EmpName: String) => upper(EmpName) }
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
This will give the error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
Solution:
object obj_upper extends Serializable {
  def upper(name: String): String = {
    val uppercased: String = name.toUpperCase()
    uppercased
  }
  val toUpperName = udf { (EmpName: String) => upper(EmpName) }
}

val df_upperName =
  df_name.withColumn("name", obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
My solution was to add a companion object that handles all methods that are not serializable within the class.
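A minimal sketch of that pattern (names made up):
import org.apache.spark.rdd.RDD

class Pipeline {
  // the closure references only the companion object, so the Pipeline
  // instance itself never needs to be serialized
  def run(rdd: RDD[Int]): RDD[Int] = rdd.map(Pipeline.addOne)
}

object Pipeline extends Serializable {
  def addOne(x: Int): Int = x + 1
}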