Strange "Task not serializable" with Spark - scala

In my program, I have a method which returns some RDD, let's call it myMethod which takes a non-serializable parameter and let the RDD be of the type Long (my real RDD is a Tuple type but only contains primitive types).
When I try something like this:
val x: NonSerializableThing = ...
val l: Long = ...
myMethod(x, l).map(res => res + l) // myMethod's RDD does NOT include the NonSerializableThing
I get Task not serializable.
When I replace res + l by res + 1L (i.e., some constant), it runs.
From the serialization trace, it tries to serialize the NonSerializableThing and chokes there, but I double-checked my method and this object never appears in an RDD.
When I try to collect output of myMethod directly, i.e. with
myMethod(x, l).take(1) foreach println
I also get no problems.
The method uses the NonSerializableThing to get a (local) Seq of values on which multiple Cassandra queries are made (this is needed because I need to construct the partition keys to query for), like this:
def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
  val someParam1: String = x.someProperty
  // One Cassandra query (one RDD) per combination of partition-key values
  x.getSomeSeq.flatMap { (y: OtherNonSerializableThing) =>
    val someParam2: String = y.someOtherProperty
    y.someOtherSeq.map { (someParam3: String) =>
      sc.cassandraTable("fooKeyspace", "fooTable").
        select("foo").
        where("bar=? and quux=? and baz=? and l=?", someParam1, someParam2, someParam3, l).
        map(_.getLong(0))
    }
  }.reduce((a, b) => a.union(b))
}
The getSomeSeq and someOtherSeq return plain non-spark Seqs
What I want to achieve is to "union" multiple Cassandra queries.
What could be the problem here?
EDIT, Addendum, as requested by Jem Tucker:
What I have in my class is something like this:
implicit class MySparkExtension(sc: SparkContext) {
  def getThing(/* some parameters */): NonSerializableThing = { ... }
  def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
    val someParam1: String = x.someProperty
    // One Cassandra query (one RDD) per combination of partition-key values
    x.getSomeSeq.flatMap { (y: OtherNonSerializableThing) =>
      val someParam2: String = y.someOtherProperty
      y.someOtherSeq.map { (someParam3: String) =>
        sc.cassandraTable("fooKeyspace", "fooTable").
          select("foo").
          where("bar=? and quux=? and baz=? and l=?", someParam1, someParam2, someParam3, l).
          map(_.getLong(0))
      }
    }.reduce((a, b) => a.union(b))
  }
}
This is declared in a package object. The problem occurs here:
// SparkContext is already declared as sc
import my.pkg.with.extension._
val thing = sc.getThing(/* parameters */)
val l = 42L
val rdd = sc.myMethod(thing, l)
// until now, everything is OK.
// The following still works:
rdd.take(5) foreach println
// The following causes the exception:
rdd.map(x => x >= l).take(5) foreach println
// While the following works:
rdd.map(x => x >= 42L).take(5) foreach println
I tested this entered "live" into a Spark shell as well as in an algorithm submitted via spark-submit.
What I now want to try (as per my last comment) is the following:
implicit class MySparkExtension(sc: SparkContext) {
def getThing(/* some parameters */): NonSerializableThing = { ... }
def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
val param1 = x.someProperty
val partitionKeys =
x.getSomeSeq.flatMap(y => {
val param2 = y.someOtherProperty
y.someOtherSeq.map(param3 => (param1, param2, param3, l))
})
queryTheDatabase(partitionKeys)
}
private def queryTheDatabase(partitionKeys: Seq[(String, String, String, Long)]): RDD[Long] = {
partitionKeys.map(k =>
sc.cassandraTable("fooKeyspace", "fooTable").
select("foo").
where("bar=? and quux=? and baz=? and l=?", k._1, k._2, k._3, k._4).
map(_.getLong(0))
).reduce((a, b) => a.union(b))
}
}
I believe this could work because the RDD is constructed in the method queryTheDatabase now, where no NonSerializableThing exists.
Another option might be: The NonSerializableThing would indeed be serializable, but I pass in the SparkContext in there as an implicit constructor parameter. I think if I make this transient, it would (uselessly) get serialized but not cause any problems.

When you replace l with 1L, Spark no longer tries to serialize the enclosing class that holds the method and variables, so the error is not thrown.
You should be able to fix this by marking val x: NonSerializableThing = ... as transient, e.g.
@transient
val x: NonSerializableThing = ...
This means that when the class is serialized, this variable is ignored.
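Another commonly used workaround, sketched here without Spark so that it runs on its own, is to copy the value into a local val before building the closure, so that the closure captures only the Long and not the enclosing object. Holder, NonSerializableThing and isSerializable are hypothetical stand-ins, and plain Java serialization is only a rough proxy for what Spark's closure serializer checks:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the real non-serializable type.
class NonSerializableThing

// Hypothetical enclosing class (not Serializable), playing the role of
// the shell wrapper or class that holds both `x` and `l`.
class Holder(val x: NonSerializableThing, val l: Long) {
  // `l` is a field, so the lambda reads this.l and therefore captures `this`.
  def capturesThis: Long => Long = res => res + l

  // Copy the field to a local val first; the lambda then captures only the Long.
  def capturesLocal: Long => Long = {
    val localL = l
    res => res + localL
  }
}

object ClosureDemo {
  // Rough proxy for Spark's check: attempt plain Java serialization.
  def isSerializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With a Holder h, I would expect ClosureDemo.isSerializable(h.capturesThis) to be false and ClosureDemo.isSerializable(h.capturesLocal) to be true, which mirrors why res + 1L works while res + l does not.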

Related

request timeout from flatMapping over cats.effect.IO

I am attempting to transform some data that is encapsulated in cats.effect.IO with a Map that also is in an IO monad. I'm using http4s with blaze server and when I use the following code the request times out:
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
implicit val shiftJsonReader = new Reader[ShiftJson] {
def read(value: JValue): ShiftJson = value.extract[ShiftJson]
}
implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
// get the shifts
var getDbShifts: IO[List[Shift]] = shiftModel.findByUserId(userId)
// use the userRoleId to get the RoleId then get the tasks for this role
val taskMap : IO[Map[String, Double]] = taskModel.findByUserId(userId).flatMap {
case tskLst: List[Task] => IO(tskLst.map((task: Task) => (task.name -> task.standard)).toMap)
}
val traversed: IO[List[Shift]] = for {
shifts <- getDbShifts
traversed <- shifts.traverse((shift: Shift) => {
val lstShiftJson: IO[List[ShiftJson]] = read[List[ShiftJson]](shift.roleTasks)
.map((sj: ShiftJson) =>
taskMap.flatMap((tm: Map[String, Double]) =>
IO(ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / tm.get(sj.name).get)))
).sequence
//TODO: this flatMap is bricking my request
lstShiftJson.flatMap((sjLst: List[ShiftJson]) => {
IO(Shift(shift.id, shift.shiftDate, shift.shiftStart, shift.shiftEnd,
shift.lunchDuration, shift.shiftDuration, shift.breakOffProd, shift.systemDownOffProd,
shift.meetingOffProd, shift.trainingOffProd, shift.projectOffProd, shift.miscOffProd,
write[List[ShiftJson]](sjLst), shift.userRoleId, shift.isApproved, shift.score, shift.comments
))
})
})
} yield traversed
traversed.flatMap((sLst: List[Shift]) => Ok(write[List[Shift]](sLst)))
}
As the TODO comment shows, I've narrowed the problem down to the flatMap just below it. If I remove that flatMap and merely return "IO(shift)" to the traversed variable, the request does not time out. However, that doesn't help me much because I need to make use of the lstShiftJson variable, which has my transformed JSON.
My intuition tells me I'm abusing the IO monad somehow, but I'm not quite sure how.
Thank you for your time in reading this!
So, with the guidance of Luis's comment, I refactored my code to the following. I don't think it is optimal (the flatMap at the end seems unnecessary, but I couldn't figure out how to remove it), but it's the best I've got.
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
implicit val shiftJsonReader = new Reader[ShiftJson] {
def read(value: JValue): ShiftJson = value.extract[ShiftJson]
}
implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
// FOR EACH SHIFT
// - read the shift.roleTasks into a ShiftJson object
// - divide each task value by the task.standard where task.name = shiftJson.name
// - write the list of shiftJson back to a string
val traversed = for {
taskMap <- taskModel.findByUserId(userId).map((tList: List[Task]) => tList.map((task: Task) => (task.name -> task.standard)).toMap)
shifts <- shiftModel.findByUserId(userId)
traversed <- shifts.traverse((shift: Shift) => {
val lstShiftJson: List[ShiftJson] = read[List[ShiftJson]](shift.roleTasks)
.map((sj: ShiftJson) => ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap.get(sj.name).get ))
shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
IO(shift)
})
} yield traversed
traversed.flatMap((t: List[Shift]) => Ok(write[List[Shift]](t)))
}
Luis mentioned that mapping my List[Task] to a Map[String, Double] is a pure operation, so we want to use map instead of flatMap.
He also mentioned that I was wrapping every operation that comes from the database in IO, which was causing a great deal of recomputation (including DB transactions).
To solve this issue, I moved all of the database operations inside my for comprehension. Using the "<-" operator to flatMap each of the returned values lets the variables reside within the IO monad, which prevents the recomputation experienced before.
I do think there must be a better way of producing my return value. flatMapping the "traversed" variable to get back inside the IO monad seems like unnecessary recomputation, so please correct me if you know a better way.

Convert Seq[Try[Option(String, Any)]] into Try[Option[Map[String, Any]]]

How to conveniently convert Seq[Try[Option[String, Any]]] into Try[Option[Map[String, Any]]].
If any Try in the input is a Failure, the converted Try should be a Failure as well.
Assuming that the input type has a tuple inside the Option then this should give you the result you want:
val in: Seq[Try[Option[(String, Any)]]] = ???
val out: Try[Option[Map[String,Any]]] = Try(Some(in.flatMap(_.get).toMap))
If any of the Trys is Failure then the outer Try will catch the exception raised by the get and return Failure
The Some is there to give the correct return type
The get extracts the Option from the Try (or raises an exception)
Using flatMap rather than map removes the Option wrapper, keeping all Some values and discarding None values, giving Seq[(String, Any)]
The toMap call converts the Seq to a Map
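To make those steps concrete, here is a small worked sketch of the same one-liner wrapped in a function. The names TryOptionDemo, convert, allGood and oneBad are made up for illustration, and it assumes Scala 2.13, where Option can be used directly in flatMap:

```scala
import scala.util.{Failure, Success, Try}

object TryOptionDemo {
  type In = Seq[Try[Option[(String, Any)]]]

  // Same expression as above: `get` rethrows on a Failure, and the outer
  // Try turns that exception back into a Failure.
  def convert(in: In): Try[Option[Map[String, Any]]] =
    Try(Some(in.flatMap(_.get).toMap))

  // Hypothetical inputs: one with a None to be dropped, one with a Failure.
  val allGood: In = Seq(Success(Some(("A", 1))), Success(None), Success(Some(("B", "two"))))
  val oneBad: In  = Seq(Success(Some(("A", 1))), Failure(new Exception("bad")))
}
```

Here convert(allGood) should keep the two pairs and drop the None, while convert(oneBad) should come back as a Failure carrying the "bad" exception.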
Here is something that's not very clean but may help get you started. It assumes Option[(String,Any)], returns the first Failure if there are any in the input Seq and just drops None elements.
foo.scala
package foo
import scala.util.{Try,Success,Failure}
object foo {
val x0 = Seq[Try[Option[(String, Any)]]]()
val x1 = Seq[Try[Option[(String, Any)]]](Success(Some(("A",1))), Success(None))
val x2 = Seq[Try[Option[(String, Any)]]](Success(Some(("A",1))), Success(Some(("B","two"))))
val x3 = Seq[Try[Option[(String, Any)]]](Success(Some(("A",1))), Success(Some(("B","two"))), Failure(new Exception("bad")))
def f(x: Seq[Try[Option[(String, Any)]]]) =
x.find( _.isFailure ).getOrElse( Success(Some(x.map( _.get ).filterNot( _.isEmpty ).map( _.get ).toMap)) )
}
Example session
bash-3.2$ scalac foo.scala
bash-3.2$ scala -classpath .
Welcome to Scala 2.13.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions for evaluation. Or try :help.
scala> import foo.foo._
import foo.foo._
scala> f(x0)
res0: scala.util.Try[Option[Equals]] = Success(Some(Map()))
scala> f(x1)
res1: scala.util.Try[Option[Equals]] = Success(Some(Map(A -> 1)))
scala> f(x2)
res2: scala.util.Try[Option[Equals]] = Success(Some(Map(A -> 1, B -> two)))
scala> f(x3)
res3: scala.util.Try[Option[Equals]] = Failure(java.lang.Exception: bad)
scala> :quit
If you're willing to use a functional support library like Cats then there are two tricks that can help this along:
Many things like List and Try are traversable, which means that (if Cats's implicits are in scope) they have a sequence method that can swap two types, for example converting List[Try[T]] to Try[List[T]] (failing if any of the items in the list are failure).
Almost all of the container types support a map method that can operate on the contents of a container, so if you have a function from A to B then map can convert a Try[A] to a Try[B]. (In Cats language they are functors but the container-like types in the standard library generally have map already.)
Cats doesn't directly support Seq, so this answer is mostly in terms of List instead.
Given that type signature, you can iteratively sequence the item you have to in effect push the list type down one level in the type chain, then map over that container to work on its contents. That can look like:
import cats.implicits._
import scala.util._
def convert(listTryOptionPair: List[Try[Option[(String, Any)]]]): Try[
Option[Map[String, Any]]
] = {
val tryListOptionPair = listTryOptionPair.sequence
tryListOptionPair.map { listOptionPair =>
val optionListPair = listOptionPair.sequence
optionListPair.map { listPair =>
Map.from(listPair)
}
}
}
https://scastie.scala-lang.org/xbQ8ZbkoRSCXGDJX0PgJAQ has a slightly more complete example.
One way to approach this is by using a foldLeft:
// Let's say this is the object you're trying to convert
val seq: Seq[Try[Option[(String, Any)]]] = ???
seq.foldLeft(Try(Option(Map.empty[String, Any]))) {
case (acc, e) =>
for {
accOption <- acc
elemOption <- e
} yield elemOption match {
case Some(value) => accOption.map(_ + value)
case None => accOption
}
}
You start off with an empty Map. You then use a for comprehension to go through the current map and element, and finally you add a new tuple to the map if one is present.
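As a rough usage sketch, the fold can be wrapped in a function (FoldDemo and convert are names invented here) to see how it behaves with failures:

```scala
import scala.util.{Failure, Success, Try}

object FoldDemo {
  def convert(seq: Seq[Try[Option[(String, Any)]]]): Try[Option[Map[String, Any]]] =
    seq.foldLeft(Try(Option(Map.empty[String, Any]))) { case (acc, e) =>
      for {
        accOption  <- acc // once acc is a Failure, it is passed along unchanged
        elemOption <- e   // a Failure element becomes the new accumulator
      } yield elemOption match {
        case Some(value) => accOption.map(_ + value)
        case None        => accOption
      }
    }
}
```

Note that foldLeft still visits every element after a Failure has been seen; it just carries that first Failure through to the end, so the result is the first failure in the input rather than the last.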
The following solutions are based on this answer, to the point that it almost makes the question a duplicate.
Method 1: Using recursion
def trySeqToMap1[X,Y](trySeq : Seq[Try[Option[(X, Y)]]]) : Try[Option[Map[X,Y]]] = {
def helper(it : Iterator[Try[Option[(X,Y)]]], m : Map[X,Y] = Map()) : Try[Option[Map[X,Y]]] = {
if(it.hasNext) {
val x = it.next()
if(x.isFailure)
Failure(x.failed.get)
else if(x.get.isDefined)
helper(it, m + (x.get.get._1-> x.get.get._2))
else
helper(it, m)
} else Success(Some(m))
}
helper(trySeq.iterator)
}
Method 2: directly pattern matching in case you are able to get a stream or a List instead:
def trySeqToMap2[X,Y](trySeq : LazyList[Try[Option[(X, Y)]]], m : Map[X,Y]= Map.empty[X,Y]) : Try[Option[Map[X,Y]]] =
trySeq match {
case Success(Some(h)) #:: tail => trySeqToMap2(tail, m + (h._1 -> h._2))
case Success(None) #:: tail => trySeqToMap2(tail, m)
case Failure(f) #:: _ => Failure(f)
case _ => Success(Some(m))
}
note: this answer was previously using different method signatures. It has been updated to conform to the signature given in the question.

Scala Future Sequence Mapping: finding length?

I want to return both a Future[Seq[String]] from a method and the length of that Seq[String] as well. Currently I'm building the Future[Seq[String]] using a mapping function from another Future[T].
Is there any way to do this without waiting for the Future?
You can map over the current Future to create a new one with the new data added to the type.
val fss: Future[Seq[String]] = Future(Seq("a","b","c"))
val x: Future[(Seq[String],Int)] = fss.map(ss => (ss, ss.length))
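For completeness, here is a self-contained sketch of the same idea with the imports it needs. FutureLengthDemo and withLength are illustrative names, and the global execution context is an assumption; the Await appears only in main so the demo can print something:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureLengthDemo {
  // Pair the eventual Seq with its length without blocking.
  def withLength(fss: Future[Seq[String]]): Future[(Seq[String], Int)] =
    fss.map(ss => (ss, ss.length))

  def main(args: Array[String]): Unit = {
    val fss = Future(Seq("a", "b", "c"))
    // Callers can keep working inside the Future by mapping over the pair.
    val summary: Future[String] =
      withLength(fss).map { case (ss, n) => s"$n items: ${ss.mkString(",")}" }
    // Blocking here is for demonstration only.
    println(Await.result(summary, 5.seconds))
  }
}
```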
If you somehow know what the length of the Seq will be without actually waiting for it, then you could do something like this:
val t: Future[T] = ???
def foo: (Int, Future[Seq[String]]) = {
val length = 42 // ???
val fut: Future[Seq[String]] = t map { v =>
genSeqOfLength42(v)
}
(length, fut)
}
If you don't, then you will have to return a Future of a tuple such as Future[(Seq[String], Int)], as jwvh said, or you can easily get the length later in the calling function.

Cats Writer Vector is empty

I wrote this simple program in my attempt to learn how Cats Writer works
import cats.data.Writer
import cats.syntax.applicative._
import cats.syntax.writer._
import cats.instances.vector._
object WriterTest extends App {
type Logged2[A] = Writer[Vector[String], A]
Vector("started the program").tell
val output1 = calculate1(10)
val foo = new Foo()
val output2 = foo.calculate2(20)
val (log, sum) = (output1 + output2).pure[Logged2].run
println(log)
println(sum)
def calculate1(x : Int) : Int = {
Vector("came inside calculate1").tell
val output = 10 + x
Vector(s"Calculated value ${output}").tell
output
}
}
class Foo {
def calculate2(x: Int) : Int = {
Vector("came inside calculate 2").tell
val output = 10 + x
Vector(s"calculated ${output}").tell
output
}
}
The program works and the output is
> run-main WriterTest
[info] Compiling 1 Scala source to /Users/Cats/target/scala-2.11/classes...
[info] Running WriterTest
Vector()
50
[success] Total time: 1 s, completed Jan 21, 2017 8:14:19 AM
But why is the vector empty? Shouldn't it contain all the strings on which I used the "tell" method?
When you call tell on your Vectors, each time you create a Writer[Vector[String], Unit]. However, you never actually do anything with your Writers, you just discard them. Further, you call pure to create your final Writer, which simply creates a Writer with an empty Vector. You have to combine the writers together in a chain that carries your value and message around.
type Logged[A] = Writer[Vector[String], A]
val (log, sum) = (for {
_ <- Vector("started the program").tell
output1 <- calculate1(10)
foo = new Foo()
output2 <- foo.calculate2(20)
} yield output1 + output2).run
def calculate1(x: Int): Logged[Int] = for {
_ <- Vector("came inside calculate1").tell
output = 10 + x
_ <- Vector(s"Calculated value ${output}").tell
} yield output
class Foo {
def calculate2(x: Int): Logged[Int] = for {
_ <- Vector("came inside calculate2").tell
output = 10 + x
_ <- Vector(s"calculated ${output}").tell
} yield output
}
Note the use of for notation. The definition of calculate1 is really
def calculate1(x: Int): Logged[Int] = Vector("came inside calculate1").tell.flatMap { _ =>
val output = 10 + x
Vector(s"Calculated value ${output}").tell.map { _ => output }
}
flatMap is the monadic bind operation, which means it understands how to take two monadic values (in this case Writer) and join them together to get a new one. In this case, it makes a Writer containing the concatenation of the logs and the value of the one on the right.
Note how there are no side effects. There is no global state by which Writer can remember all your calls to tell. You instead make many Writers and join them together with flatMap to get one big one at the end.
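To see what flatMap is doing with the logs, here is a deliberately tiny hand-rolled writer. This is not Cats's Writer; MiniWriter and its tell helper are made up for illustration and only mimic the behaviour described above:

```scala
// A minimal stand-in for Writer[Vector[String], A] (not the Cats type).
final case class MiniWriter[A](log: Vector[String], value: A) {
  def map[B](f: A => B): MiniWriter[B] = MiniWriter(log, f(value))

  // Run the next step, then concatenate this log with the next step's log
  // and keep the value of the step on the right.
  def flatMap[B](f: A => MiniWriter[B]): MiniWriter[B] = {
    val next = f(value)
    MiniWriter(log ++ next.log, next.value)
  }
}

object MiniWriter {
  // Like `tell`: a log entry with no interesting value.
  def tell(msg: String): MiniWriter[Unit] = MiniWriter(Vector(msg), ())
}

object MiniWriterDemo {
  import MiniWriter.tell

  def calculate1(x: Int): MiniWriter[Int] = for {
    _ <- tell("came inside calculate1")
    output = 10 + x
    _ <- tell(s"Calculated value $output")
  } yield output
}
```

Calling MiniWriterDemo.calculate1(10) yields a single value that holds both log lines and the result 20. Nothing is recorded unless each tell is chained into the result with flatMap.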
The problem with your example code is that you're not using the result of the tell method.
If you take a look at its signature, you'll see this:
final class WriterIdSyntax[A](val a: A) extends AnyVal {
def tell: Writer[A, Unit] = Writer(a, ())
}
it is clear that tell returns a Writer[A, Unit] result which is immediately discarded because you didn't assign it to a value.
The proper way to use a Writer (and any monad in Scala) is through its flatMap method. It would look similar to this:
println(
Vector("started the program").tell.flatMap { _ =>
15.pure[Logged2].flatMap { i =>
Writer(Vector("ended program"), i)
}
}
)
The code above, when executed will give you this:
WriterT((Vector(started the program, ended program),15))
As you can see, both messages and the int are stored in the result.
Now this is a bit ugly, and Scala actually provides a better way to do this: for-comprehensions. For-comprehension are a bit of syntactic sugar that allows us to write the same code in this way:
println(
for {
_ <- Vector("started the program").tell
i <- 15.pure[Logged2]
_ <- Vector("ended program").tell
} yield i
)
Now going back to your example, what I would recommend is for you to change the return type of calculate1 and calculate2 to be Writer[Vector[String], Int] and then try to make your application compile using what I wrote above.

Allocation of Function Literals in Scala

I have a class that represents sales orders:
class SalesOrder(val f01:String, val f02:Int, ..., val f50:Date)
The fXX fields are of various types. I am faced with the problem of creating an audit trail of my orders. Given two instances of the class, I have to determine which fields have changed. I have come up with the following:
class SalesOrder(val f01:String, val f02:Int, ..., val f50:Date){
def auditDifferences(that:SalesOrder): List[String] = {
def diff[A](fieldName:String, getField: SalesOrder => A) =
if(getField(this) != getField(that)) Some(fieldName) else None
val diffList = diff("f01", _.f01) :: diff("f02", _.f02) :: ...
:: diff("f50", _.f50) :: Nil
diffList.flatten
}
}
I was wondering what the compiler does with all the _.fXX functions: are they instantiated just once (statically), so that they can be shared by all instances of my class, or will they be instantiated every time I create an instance of my class?
My worry is that, since I will use a lot of SalesOrder instances, it may create a lot of garbage. Should I use a different approach?
One clean way of solving this problem would be to use the standard library's Ordering type class. For example:
class SalesOrder(val f01: String, val f02: Int, val f03: Char) {
def diff(that: SalesOrder) = SalesOrder.fieldOrderings.collect {
case (name, ord) if !ord.equiv(this, that) => name
}
}
object SalesOrder {
val fieldOrderings: List[(String, Ordering[SalesOrder])] = List(
"f01" -> Ordering.by(_.f01),
"f02" -> Ordering.by(_.f02),
"f03" -> Ordering.by(_.f03)
)
}
And then:
scala> val orderA = new SalesOrder("a", 1, 'a')
orderA: SalesOrder = SalesOrder@5827384f
scala> val orderB = new SalesOrder("b", 1, 'b')
orderB: SalesOrder = SalesOrder@3bf2e1c7
scala> orderA diff orderB
res0: List[String] = List(f01, f03)
You almost certainly don't need to worry about the performance of your original formulation, but this version is (arguably) nicer for unrelated reasons.
Yes, that creates 50 short lived functions. I don't think you should be worried unless you have manifest evidence that that causes a performance problem in your case.
But I would define a method that transforms SalesOrder into a Map[String, Any], then you would just have
trait SalesOrder {
def fields: Map[String, Any]
}
def diff(a: SalesOrder, b: SalesOrder): Iterable[String] = {
val af = a.fields
val bf = b.fields
af.collect { case (key, value) if bf(key) != value => key }
}
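As a rough sketch of how that could look with a concrete implementation (SimpleOrder with three fields and the DiffDemo wrapper are hypothetical, standing in for the real 50-field class):

```scala
trait SalesOrder {
  def fields: Map[String, Any]
}

// Hypothetical three-field order used only for this illustration.
final case class SimpleOrder(f01: String, f02: Int, f03: Char) extends SalesOrder {
  def fields: Map[String, Any] = Map("f01" -> f01, "f02" -> f02, "f03" -> f03)
}

object DiffDemo {
  // Assumes both orders expose the same keys; bf(key) throws if one is missing.
  def diff(a: SalesOrder, b: SalesOrder): Iterable[String] = {
    val af = a.fields
    val bf = b.fields
    af.collect { case (key, value) if bf(key) != value => key }
  }
}
```

For example, DiffDemo.diff(SimpleOrder("a", 1, 'a'), SimpleOrder("b", 1, 'b')) should report f01 and f03. The trade-off is that each call to fields builds a new Map, so this swaps the function allocations for a map allocation.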
If the field names are indeed just incremental numbers, you could simplify
trait SalesOrder {
def fields: Iterable[Any]
}
def diff(a: SalesOrder, b: SalesOrder): Iterable[String] =
(a.fields zip b.fields).zipWithIndex.collect {
case ((av, bv), idx) if av != bv => f"f${idx + 1}%02d"
}