I am trying to build a dataset using a case class in Scala (I would prefer case classes over tuples because I want to join fields by name).
Here is one iteration of a join I am working on:
case class TestTarget(tacticId: String, partnerId: Long)

campaignPartners.join(partnerInput).where(1).equalTo("id") {
  (target, partnerInfo, out: Collector[TestTarget]) => {
    partnerInfo.partner_pricing match {
      case Some(pricing) =>
        out.collect(TestTarget(target._1, partnerInfo.partner_id))
      case None => ()
    }
  }
}
Obviously this throws the error:
org.apache.flink.api.common.InvalidProgramException: Task not serializable
    at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
    at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
    at org.apache.flink.api.scala.DataSet.clean(DataSet.scala:121)
    at org.apache.flink.api.scala.JoinDataSet$$anon$2.<init>(joinDataSet.scala:108)
    at org.apache.flink.api.scala.JoinDataSet.apply(joinDataSet.scala:107)
    at com.adfin.dataimport.vendors.dbm.Job.calculateVendorFees(Job.scala:84)
I have seen the docs here that state that I need to make the class serializable. As far as I can tell, in new versions of Scala there is no good way to automatically serialize case classes. (I looked into manual serialization, but I think I would need to do some extra work with Flink for that to work.)
Edit:
As per Till Rohrmann's suggestion, I tried to reproduce this error with a small test case. The example below worked, so I failed to reproduce the error. I also tried putting Option cases everywhere, but that didn't cause the job to fail either.
val text = env.fromElements("To be, or not to be,--that is the question:--")
val words = text.flatMap { _.toLowerCase.split("\\W+") }.map(x => (1,x))
val nums = env.fromElements(List(1,2,3,4,5)).flatMap(x => x).map(x => First(1,x))
val counts = words.join(nums).where(0).equalTo("a") {
  (a, b, out: Collector[TestTarget]) => {
    b.b match {
      case 2 => ()
      case _ => out.collect(TestTarget(a._2, b.b))
    }
  }
}
The definition of my program used a class:
class Job(conf: AdfinConfig)(implicit env: ExecutionEnvironment)
  extends DspJob(conf) {
  ...
  case class TestTarget(tacticId: String, partnerId: Long)
  campaignPartners.join(partnerInput).where(1).equalTo("id") {
    ...
  }
}
Since TestTarget was an inner class, it carried a reference to the enclosing Job instance and wasn't being serialized automatically.
If you move the class out so it is no longer an inner class, everything works out:
case class TestTarget(tacticId: String, partnerId: Long)

class Job(conf: AdfinConfig)(implicit env: ExecutionEnvironment)
  extends DspJob(conf) {
  ...
  words.join( .... )
  ...
}
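Alternatively, if you would rather keep TestTarget next to the job, moving it to a companion object also takes it out of instance scope (a sketch, not from the original answer; members of an object carry no reference to a Job instance):

object Job {
  // Defined on the companion object, so instances do not capture a Job instance
  case class TestTarget(tacticId: String, partnerId: Long)
}

class Job(conf: AdfinConfig)(implicit env: ExecutionEnvironment) extends DspJob(conf) {
  import Job.TestTarget
  // ... the join from the question can then stay unchanged
}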
Related
We are using Spark to parse a big CSV file, which may contain invalid data.
We want to save the valid data into the data store, and also report how many records were valid and how many were invalid.
I am wondering how we can do this in Spark. What's the standard approach when reading data?
My current approach uses an Accumulator, but it's not accurate: Spark may re-execute tasks (for example on retries or when a stage is recomputed), so accumulator updates made inside transformations can be counted more than once.
// we define case class CSVInputData: all fields are defined as strings
import java.util.{Date, UUID}
import scala.util.{Success, Try}

val csvInput = spark.read.option("header", "true").csv(csvFile).as[CSVInputData]
val newDS = csvInput
  .flatMap { row =>
    Try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      data
    } match {
      case Success(map) => Seq(map)
      case _ =>
        errorAcc.add(1) // errorAcc: an accumulator registered earlier
        Seq()
    }
  }
I tried to use Either, but it failed with the exception:
java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with scala.util.Either[xx.CSVInputData,xx.DomainData] found
Update
I think Either doesn't work with the Spark 2.0 Dataset API:
spark.read.option("header", "true").csv("any.csv").map { row =>
try {
Right("")
} catch { case e: Throwable => Left(""); }
}
If we switch to sc (the RDD API), it works:
sc.parallelize('a' to 'z').map { row =>
  try {
    Right("")
  } catch { case e: Throwable => Left("") }
}.collect()
In the current Scala version (2.11, http://www.scala-lang.org/api/2.11.x/index.html#scala.util.Either), Either doesn't implement the Serializable trait:
sealed abstract class Either[+A, +B] extends AnyRef
In the upcoming 2.12 (http://www.scala-lang.org/api/2.12.x/scala/util/Either.html), it does:
sealed abstract class Either[+A, +B] extends Product with Serializable
Update 2 with a workaround
More info at Spark ETL: Using Either to handle invalid data
Since the Spark Dataset doesn't work with Either, the workaround is to call ds.rdd and then use try/Left/Right to capture both the valid and the invalid data; a fuller sketch follows the snippet below.
spark.read.option("header", "true").csv("/Users/yyuan/jyuan/1.csv").rdd.map ( { row =>
try {
Right("")
} catch { case e: Throwable => Left(""); }
}).collect()
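Putting the workaround together with the original mapping, a sketch (reusing CSVInputData, DomainData, Util.parseDate and csvFile from the question, and counting both sides without an accumulator) could look like this:

import java.util.{Date, UUID}
import scala.util.{Failure, Success, Try}

// Drop to the RDD API so Either (Left = invalid row, Right = valid record) can be used.
val parsed = spark.read.option("header", "true").csv(csvFile).as[CSVInputData].rdd
  .map { row =>
    Try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString)
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      data
    } match {
      case Success(data) => Right(data)
      case Failure(_)    => Left(row)
    }
  }
  .cache() // traversed more than once below

val invalidCount = parsed.filter(_.isLeft).count()
val validData    = parsed.collect { case Right(data) => data } // RDD of records to save
val validCount   = validData.count()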
Have you considered using an Either?
val counts = csvInput
.map { row =>
try {
val data = new DomainData()
data.setScore(row.score.trim.toDouble)
data.setId(UUID.randomUUID().toString())
data.setDate(Util.parseDate(row.eventTime.trim))
data.setUpdateDate(new Date())
Right(data)
} catch {
case e: Throwable => Left(row)
}
}
val failedCount  = counts.filter(_.isLeft).count()
val successCount = counts.filter(_.isRight).count()
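One small follow-up to this approach (my addition, not part of the original answer): counts is traversed twice, once per count, so if the CSV is expensive to re-read it may be worth caching the intermediate result:

counts.cache() // keep the parsed Eithers around instead of re-running the whole pipeline per count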
Did you try Spark DDQ? It has most of the data quality rules that you will need, and you can even extend and customize it.
Link: https://github.com/FRosner/drunken-data-quality
I have a simple flash implementation for use with Jersey that looks like this:
@PostConstruct def before { flash.rotateIn }
@PreDestroy def after { flash.rotateOut }
object flash {
val KeyNow = "local.flash.now"
val KeyNext = "local.flash.next"
// case class Wrapper(wrapped: Map[String, Seq[String]])
case class Wrapper(wrapped: String)
def rotateIn {
for {
session <- Option(request.getSession(false))
obj <- Option(session.getAttribute(KeyNext))
} {
request.setAttribute(KeyNow, obj)
session.removeAttribute(KeyNext)
}
}
def rotateOut {
for (obj <- Option(request.getAttribute(KeyNext))) {
request.getSession.setAttribute(KeyNext, obj)
}
}
def now = Option(request.getAttribute(KeyNow)) match {
case Some(x: Wrapper) => x.wrapped
case Some(x) if x.isInstanceOf[Wrapper] => "WHAT"
case _ => "NOPE"
}
def next(value: String) {
request.setAttribute(KeyNext, Wrapper(value))
}
}
I have simplified it here somewhat, but it lets me set a value for flash with flash.next and read the current flash value with flash.now.
The weird thing is that my now value is always "WHAT". If I do something similar in my REPL, I don't have the same issues:
val req = new org.springframework.mock.web.MockHttpServletRequest
val res = req.getSession
res.setAttribute("foo", Wrapper("foo"))
req.setAttribute("foo", res.getAttribute("foo"))
// Is not None
Option(req.getAttribute("foo")).collect { case x: Wrapper => x }
Am I missing something obvious?
EDIT
I've added a minimal example webapp replicating this issue at https://github.com/kardeiz/sc-issue-20160229.
I tried your example. Check my answer to your other question for details on how pattern matching works in this case.
In short, since your Wrapper is an inner class, pattern matching also checks the "outer class" reference. It seems that, depending on the application server implementation, Router.flash can be a different instance for each request, so the pattern match fails.
The simple fix is to make Wrapper a top-level class, so it doesn't hold a reference to any other class.
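A minimal sketch of that fix (assuming the rest of the flash object stays as in the question):

// Top level: no hidden outer-instance reference, so the pattern match
// `case Some(x: Wrapper)` only has to check the type, and now succeeds.
case class Wrapper(wrapped: String)

object flash {
  // ... unchanged: rotateIn, rotateOut, now and next all keep using Wrapper
}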
Is it prohibited by design or am I doing something wrong?
val ext = 1
class Test extends Actor {
def receive = { case _ => println(ext) }
}
try {
val sys = ActorSystem("lol")
sys.actorOf(Props[Test], "test") ! true
} catch {
case e: Throwable => println(e)
}
Once I send a message to Test, I get an exception: java.lang.IllegalArgumentException: no matching constructor found on class HelloAkkaScala$Test$1 for arguments [].
I don't have this exception if I don't reference ext inside of Test class.
I'm using Akka 2.3.4
There is nothing wrong with accessing a val (as opposed to a var or any other mutable state) defined outside of the actor.
I just tried to run your example code in Akka 2.3.5 and it works fine. You probably have a typo somewhere in your original code.
UPDATE
Looking closer at the error you are getting, it seems like you've defined the Test class inside some other class.
In this case the inner class (Test) receives a reference to the outer class behind the scenes, so that it can close over the outer class's members (ext).
This also means that the constructor of the inner class takes a reference to the outer class (syntactic sugar hides this).
That explains the error: you pass zero arguments to the constructor, but there are actually hidden ones that you are not supplying.
Here is the example that reproduces this:
class Boot {
val ext = 1
class Test extends Actor {
def receive = { case _ => println(ext) }
}
try {
val sys = ActorSystem("lol")
sys.actorOf(Props[Test], "test") ! true
} catch {
case e: Throwable => println(e)
}
}
object Boot extends App {
new Boot
}
... and here is a quick workaround (in this case I make Test and ext "static" by moving them to the companion object, but you can achieve similar results by referencing ext as a member of some other instance passed to the constructor; a sketch of that variant follows the example below):
class Boot {
import Boot.Test
try {
val sys = ActorSystem("lol")
sys.actorOf(Props[Test], "test") ! true
} catch {
case e: Throwable => println(e)
}
}
object Boot extends App {
val ext = 1
class Test extends Actor {
def receive = { case _ => println(ext) }
}
new Boot
}
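And a sketch of the other variant mentioned above, passing the value in through the actor's constructor instead of relying on any enclosing scope (my example, not from the original answer):

import akka.actor.{Actor, ActorSystem, Props}

// The actor takes what it needs as a constructor argument,
// so there is no hidden reference to an outer instance.
class Test(ext: Int) extends Actor {
  def receive = { case _ => println(ext) }
}

object Boot extends App {
  val ext = 1
  val sys = ActorSystem("lol")
  sys.actorOf(Props(new Test(ext)), "test") ! true
}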
In my Scala code, I have some nested Try() match {} blocks, which look ugly:
import scala.util._

Try(convertJsonToObject[User]) match {
  case Success(userJsonObj) =>
    Try(saveToDb(userJsonObj.id)) match {
      case Success(user) => Created("User saved")
      case _             => InternalServerError("database error")
    }
  case _ => BadRequest("bad input")
}
Is there any better way of writing such code?
There are a bunch of ways to solve this problem; I'll give you one possibility. Consider this cleaned-up version of your code:
import scala.util.Try

trait Result
case class BadRequest(message: String) extends Result
case class InternalServerError(message: String) extends Result
case class Created(message: String) extends Result

def processRequest(json: String): Result = {
  val result =
    for {
      user      <- Try(parseJson(json))
      savedUser <- Try(saveToDb(user))
    } yield Created("saved")

  result.recover {
    case jp: JsonParsingException => BadRequest(jp.getMessage)
    case other                    => InternalServerError(other.getMessage)
  }.get
}

def parseJson(json: String): User = ...
def saveToDb(user: User): User = ...
The caveat to this code is that it assumes that you can differentiate the json parsing failure from the db failure by the exception each might yield. Not a bad assumption to make though. This code is very similar to a java try/catch block that catches different exception types and returns different results based on catching those different types.
One other nice thing about this approach is that you could just define a standard recovery Partial Function for all kinds of possible exceptions and use it throughout your controllers (which I'm assuming this code is) to eliminate duplicate code. Something like this:
object ExceptionHandling{
val StandardRecovery:PartialFunction[Throwable,Result] = {
case jp:JsonParsingException => BadRequest(jp.getMessage)
case sql:SQLException => InternalServerError(sql.getMessage)
case other => InternalServerError(other.getMessage)
}
}
And then in your controller:
import ExceptionHandling._
result.recover(StandardRecovery).get
Another approach is to define an implicit Reads for User (if you're using the Play Framework) and then do something like the following (a sketch of the Reads itself follows the snippet):
someData.validate[User].map { user =>
saveToDb(user.id) match { // you can return Try from saveToDb
case Success(savedUser) => Created("User saved")
case Failure(exception) => InternalServerError("Database Error")
}
}.recoverTotal {
e => BadRequest(JsError.toFlatJson(e))
}
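For completeness, the implicit Reads that validate[User] relies on could be derived with Play's JSON macro; a sketch (the User fields here are made up for illustration):

import play.api.libs.json._

case class User(id: Long, name: String) // hypothetical fields

object User {
  // Macro-generated Reads so someData.validate[User] can type-check
  implicit val userReads: Reads[User] = Json.reads[User]
}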
Try(convertJsonToObject[User]).map([your code]).toOption.getOrElse(fallback)
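Spelled out against the question's code, that one-liner could look roughly like this (a sketch reusing convertJsonToObject and saveToDb from the question; note that, unlike the nested match, it collapses both failure kinds into a single fallback):

import scala.util.Try

// Flatten the two Trys; any failure (bad JSON or DB error) ends up in the fallback.
val response =
  Try(convertJsonToObject[User])
    .flatMap(user => Try(saveToDb(user.id)))
    .map(_ => Created("User saved"))
    .getOrElse(InternalServerError("could not save user"))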
I am in the process of converting Akka UntypedActors in Java code to their Scala equivalent.
However, I am having trouble understanding how to correctly implement the receive() abstract method. The ScalaDoc is a little confusing and most of the examples I see just involve String messages!
My Actor can support multiple message types and this is my solution so far:
override def receive = {
case message if message.isInstanceOf[ClassOne] => {
// do something after message.asInstanceOf[ClassOne]
}
case message if message.isInstanceOf[ClassTwo] => {
// do something after message.asInstanceOf[ClassTwo]
}
case message => unhandled(message)
}
Is there a better way to achieve the above?
override def receive = {
case c: ClassOne =>
// do something after message.asInstanceOf[ClassOne]
case c: ClassTwo =>
// do something after message.asInstanceOf[ClassTwo]
case message => unhandled(message)
}
If you're using case classes, you can get more sophisticated.
case class ClassOne(x: Int, y: String)
case class ClassTwo(a: Int, b: Option[ClassOne])

override def receive = {
  case c @ ClassOne(_, "foo") =>
    // only matches when y == "foo"; c is bound to the ClassOne instance
    // (this case must come before the general ClassOne case or it is unreachable)
  case ClassOne(x, y) =>
    println(s"Received $x and $y")
  case ClassTwo(a, Some(ClassOne(x, y))) if a == 42 =>
    // do something
  case ClassTwo(a, None) =>
    // do nothing for this shape
}
All sorts of fun stuff.
receive's type is really just a PartialFunction[Any, Unit], which means you can use Scala's pattern match expressions; in fact, you're already doing it, just not entirely succinctly. A terser equivalent that also gives each case a typed handle on the matched message:
def receive = {
case classOneMessage : ClassOne => {
// do something
}
case classTwoMessage : ClassTwo => {
// do something
}
case _ => someCustomLogicHereOtherWiseThereWillBeASilentFailure
//you can, but you don't really need to define this case - in Akka
//the standard way to go if you want to process unknown messages
//within *this* actor, is to override the Actor#unhandled(Any)
//method instead
}
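As the comment above suggests, overriding unhandled is the usual place for that fallback; a minimal sketch (ClassOne and ClassTwo as in the question, logging added via ActorLogging):

import akka.actor.{Actor, ActorLogging}

class MyActor extends Actor with ActorLogging {
  def receive = {
    case c: ClassOne => // handle ClassOne here
    case c: ClassTwo => // handle ClassTwo here
  }

  // Called by Akka for every message the receive block did not match
  override def unhandled(message: Any): Unit = {
    log.warning("Unexpected message: {}", message)
    super.unhandled(message) // still publishes akka.actor.UnhandledMessage to the event stream
  }
}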
Read the tour article, and the already-linked tutorial for more info on pattern matching, especially in the context of using the feature together with case classes - this pattern is applied regularly when working with Akka, for example here in the Akka manual when handling the ActorIdentity case class.
receive is a regular partial function in Scala. You can write something like this in your case:
case class Example(number: Int, text: String)

override def receive = {
  case message: ClassOne =>
    // do something with the ClassOne instance
  case message: ClassTwo =>
    // do something with the ClassTwo instance
  case Example(n, t) if n > 10 =>
    // the guarded case has to come before the unguarded one, or it is never reached
    println("special case")
  case Example(n, t) =>
    println(t * n)
}
You don't have to include a special case for unhandled messages unless your application logic requires you to handle all possible messages.
The first two cases match purely on the message type (subtypes are matched as well). The Example cases not only match on the type but also "deconstruct" the message using pattern matching, with the guarded case tried first.