Flink DataSet API not compatible with Scala case classes

I am learning Flink with the Scala API and I do not quite understand why I get a task serialization error.
I am using the DataSet API and trying to simply convert the following into Scala case classes.
case class User(name: String, country: String, city: String)
val userDataSet: DataSet[String] = env.fromCollection(List(
  "Peter,Germany,Berlin",
  "James,UK,London",
  "Tom,America,NewYork"
))
If I try to map it to case classes, I get the following error:
val resultSet = userDataSet.map { text =>
  val fieldArr: Array[String] = text.split(",")
  User(fieldArr(0), fieldArr(1), fieldArr(2))
}
resultSet.print()
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Task not serializable
Why is this the case? Case classes in Scala extend Serializable and should not be an issue. Is it related to this ticket?
https://issues.apache.org/jira/browse/FLINK-16969
val resultSet1 = userDataSet.map { text =>
  lazy val fieldArr: Array[String] = text.split(",")
  Tuple3(fieldArr(0), fieldArr(1), fieldArr(2))
}
This works just fine; however, it is not clear why one cannot use case classes.
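For reference, one arrangement that is commonly suggested for this kind of "Task not serializable" error (treat this as an assumption, not something confirmed in this thread) is to define the case class at the top level of the file, outside any enclosing class or method, so the map lambda does not capture a non-serializable enclosing instance. A minimal sketch:

import org.apache.flink.api.scala._

// Sketch only: the case class lives at the top level,
// not inside the class or method that builds the job.
case class User(name: String, country: String, city: String)

object UserJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val userDataSet: DataSet[String] = env.fromCollection(List(
      "Peter,Germany,Berlin",
      "James,UK,London",
      "Tom,America,NewYork"
    ))
    val resultSet: DataSet[User] = userDataSet.map { text =>
      val fieldArr = text.split(",")
      User(fieldArr(0), fieldArr(1), fieldArr(2))
    }
    resultSet.print()
  }
}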

Related

Serializing a case class with a lazy val causes a StackOverflow

Say I define the following case class:
case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)
}
And then try to serialize it to json:
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
val out = new StringWriter
mapper.writeValue(out, C(4))
val json = out.toString()
println("Json is: " + json)
It will throw the following exception:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]->C["incremented"]-
...
I don't know why it is trying to serialize the lazy val by default in the first place; that does not seem like the logical approach to me.
And can I disable this feature?
This happens because Jackson is designed for Java. Specifically, note that:
- Java has no idea of a lazy val
- Java's normal semantics around fields and constructors don't allow the partitioning of fields into "needed for construction" and "derived after construction" (neither of those is a technical term) that Scala provides through the combination of vals in the default constructor (implicitly present in a case class) and vals in a class's body
The consequence of the second is that (except for beans, sometimes) Java-oriented serialization approaches tend to assume that anything which is a field in the object (including private fields, since the Java idiom is to make fields private by default) needs to be serialized, with the ability to opt out through @transient annotations.
The first, in turn, means that lazy vals are implemented by the compiler in a way that includes a private field.
Thus, to a Java-oriented serializer like Jackson, a lazy val without a @transient annotation gets serialized.
Scala-oriented serialization approaches (e.g. circe, play-json, etc.) tend to serialize case classes by only serializing the constructor parameters.
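To illustrate that last point, here is a minimal sketch using circe's generic derivation (imports and version are assumptions): derivation only looks at the constructor parameters, so the lazy val never appears in the output.

import io.circe.Encoder
import io.circe.generic.semiauto.deriveEncoder
import io.circe.syntax._

case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)
}

// Generic derivation only sees the constructor parameters of the case class,
// so the lazy val is never considered.
implicit val cEncoder: Encoder[C] = deriveEncoder[C]

println(C(4).asJson.noSpaces) // prints {"i":4}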
The solution I found was to use json4s for my serialization rather than Jackson databind. My issue arose using Akka Cluster, so I had to add a custom serializer to my project. For reference, here is my complete implementation:
import java.nio.charset.StandardCharsets

import akka.actor.ExtendedActorSystem
import akka.actor.typed.{ActorRef, ActorRefResolver}
import akka.actor.typed.scaladsl.adapter._
import akka.serialization.Serializer
import org.json4s.{CustomSerializer, DefaultFormats, JString}
import org.json4s.jackson.Serialization.{read, write} // or org.json4s.native.Serialization
import scala.reflect.ManifestFactory

class Json4sSerializer(system: ExtendedActorSystem) extends Serializer {
  private val actorRefResolver = ActorRefResolver(system.toTyped)

  object ActorRefSerializer extends CustomSerializer[ActorRef[_]](format => (
    {
      case JString(str) =>
        actorRefResolver.resolveActorRef[AnyRef](str)
    },
    {
      case actorRef: ActorRef[_] =>
        JString(actorRefResolver.toSerializationFormat(actorRef))
    }
  ))

  implicit private val formats = DefaultFormats + ActorRefSerializer

  def includeManifest: Boolean = true

  def identifier = 1234567

  def toBinary(obj: AnyRef): Array[Byte] =
    write(obj).getBytes(StandardCharsets.UTF_8)

  def fromBinary(bytes: Array[Byte], clazz: Option[Class[_]]): AnyRef = clazz match {
    case Some(cls) =>
      read[AnyRef](new String(bytes, StandardCharsets.UTF_8))(formats, ManifestFactory.classType(cls))
    case None =>
      throw new RuntimeException("Specified includeManifest but it was never passed")
  }
}
You can't serialize that class because the value is infinitely recursive (hence the stack overflow). Specifically, the value of incremented for C(4) is an instance of C(5). The value of incremented for C(5) is C(6). The value of incremented for C(6) is C(7) and so on...
Since an instance of C(n) contains an instance of C(n+1), it can never be fully serialized.
If you don't want a field to appear in the JSON, make it a function:
case class C(i: Int) {
  def incremented = copy(i = i + 1)
}
The root of this problem is trying to serialise a class that also implements application logic, which breaches the Single Responsibility Principle (the S in SOLID).
It is better to have distinct classes for serialisation and to populate them from the application data as necessary. This allows different forms of serialisation to be used without having to change the application logic.
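As a sketch of that idea (the DTO name and mapping below are made up for illustration), the wire format gets its own data-only case class and the application class is mapped into it explicitly:

// Application logic lives on the domain class and is never serialized directly.
case class C(i: Int) {
  lazy val incremented = copy(i = i + 1)
}

// A separate, data-only class used purely for serialization (hypothetical name).
case class CDto(i: Int)

object CDto {
  def fromDomain(c: C): CDto = CDto(c.i)
}

// The mapper only ever sees the DTO:
// mapper.writeValue(out, CDto.fromDomain(C(4)))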

How to use Avro serialization for scala case classes with Flink 1.7?

We've got a Flink job written in Scala using case classes (generated from avsc files by avrohugger) to represent our state. We would like to use Avro for serialising our state so that state migration will work when we update our models. We understood that since Flink 1.7 Avro serialization is supported OOTB. We added the flink-avro module to the classpath, but when restoring from a saved snapshot we notice that it is still trying to use Kryo serialization. Relevant code snippet:
case class Foo(id: String, timestamp: java.time.Instant)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = env.getConfig
conf.disableForceKryo()
conf.enableForceAvro()
val rawDataStream: DataStream[String] = env.addSource(MyFlinkKafkaConsumer)
val parsedDataSteam: DataStream[Foo] = rawDataStream.flatMap(new JsonParser[Foo])
// do something useful with it
env.execute("my-job")
When performing a state migration on Foo (e.g. by adding a field and deploying the job) I see that it tries to deserialize using Kryo, which obviously fails. How can I make sure Avro serialization is being used?
UPDATE
Found out about https://issues.apache.org/jira/browse/FLINK-10897, so POJO state serialization with Avro is only supported from 1.8 afaik. I tried it using the latest RC of 1.8 with a simple WordCount POJO that extends from SpecificRecord:
/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch

case class WordWithCount(var word: String, var count: Long)
  extends org.apache.avro.specific.SpecificRecordBase {

  def this() = this("", 0L)

  def get(field$: Int): AnyRef = {
    (field$: @switch) match {
      case 0 => {
        word
      }.asInstanceOf[AnyRef]
      case 1 => {
        count
      }.asInstanceOf[AnyRef]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
  }

  def put(field$: Int, value: Any): Unit = {
    (field$: @switch) match {
      case 0 => this.word = {
        value.toString
      }.asInstanceOf[String]
      case 1 => this.count = {
        value
      }.asInstanceOf[Long]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
    ()
  }

  def getSchema: org.apache.avro.Schema = WordWithCount.SCHEMA$
}

object WordWithCount {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"WordWithCount\",\"fields\":" +
    "[{\"name\":\"word\",\"type\":\"string\"}," +
    "{\"name\":\"count\",\"type\":\"long\"}]}")
}
This, however, also didn’t work out of the box. We then tried to define our own type information using flink-avro’s AvroTypeInfo but this fails because Avro looks for a SCHEMA$ property (SpecificData:285) in the class and is unable to use Java reflection to identify the SCHEMA$ in the Scala companion object.
I could never get reflection to work due to Scala's fields being private under the hood. AFAIK the only solution is to update Flink to use Avro's non-reflection-based constructors in AvroInputFormat (compare).
In a pinch, short of switching to Java, one could fall back to Avro's GenericRecord, and maybe use avro4s to generate them from avrohugger's Standard format (note that avro4s will generate its own schema from the generated Scala types).
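As a rough sketch of that GenericRecord fallback (using avro4s's RecordFormat; treat the exact API and version as assumptions):

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

case class WordWithCount(word: String, count: Long)

// avro4s derives its own schema from the case class and converts
// instances to and from Avro GenericRecord.
val format = RecordFormat[WordWithCount]
val record: GenericRecord = format.to(WordWithCount("hello", 1L))
val back: WordWithCount = format.from(record)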

Creating Dataset[case class] or DataFrame[case class] works only when the case class is defined outside the main method

This is working.
object FilesToDFDS {
  case class Student(id: Int, name: String, dept: String)

  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
    import ss.implicits._
    val path = "data.txt"
    val rdd = ss.sparkContext.textFile(path).map(x => x.split(" ")).map(x => Student(x(0).toInt, x(1), x(2)))
    val df = ss.read.format("csv").option("delimiter", " ").load(path).map(x => Student(x.getString(0).toInt, x.getString(1), x.getString(2)))
    val ds = ss.read.textFile(path).map(x => x.split(" ")).map(x => Student(x(0).toInt, x(1), x(2)))
    val rddToDF = ss.sqlContext.createDataFrame(rdd)
  }
}
But if the case class is moved inside main, df and ds give a compilation error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
And rddToDF gives this compilation error: No TypeTag available for Student
In these questions (ques1, ques2) people answered to move the case class outside main, and this worked. But why does it work only when the case class is moved outside the main method?
I believe if a case class is defined within another class, then it needs an instance of that class to work properly. In this case, if you put your Student class within the main, then you would need something like FilesToDFDS.Student to make it work.
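To illustrate the nesting point in plain Scala (names here are purely illustrative): a case class defined inside another class is a path-dependent type, so every value carries a reference to the enclosing instance, which is the extra dependency the answer above is describing.

class Outer {
  // Nested case class: its type is tied to one particular Outer instance.
  case class Inner(i: Int)
}

object PathDependentDemo {
  def main(args: Array[String]): Unit = {
    val a = new Outer
    val b = new Outer
    val x: a.Inner = new a.Inner(1)
    // val y: b.Inner = x  // does not compile: a.Inner and b.Inner are different types
    println(x)
  }
}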

Joda Marshalling/Unmarshalling in Scala

I'm trying to use akka.http.scaladsl.testkit.responseAs in order to test some endpoints, but I can't figure out how to handle the marshalling/unmarshalling process of an org.joda.time.DateTime object. For instance, consider the case class below:
case class ConfigEntity(id: Option[Int] = None, description: String, key: String, value: String, expirationDate: Option[DateTime] = None)
Also, consider the following route test:
"retrieve config by id" in new Context {
val testConfig = testConfigs(4)
Get(s"/configs/${testConfig.id.get}") ~> route ~> check {
responseAs[ConfigEntity] should be(testConfig)
}
}
When I run "sbt test", the code does not compile, throwing the following error: "could not find implicit value for evidence parameter of type akka.http.scaladsl.unmarshalling.FromResponseUnmarshaller[me.archdev.restapi.models.ConfigEntity]"
I know the message is very self explanatory, but I still don't know how to create the implicit FromResponseUnmarshaller that the code is complaining about.
My code is based on this example: https://github.com/ArchDev/akka-http-rest
I'm just creating some new entities and trying to play around...
Thanks in advance.
This project uses CirceSupport. That means that you need to provide a Circe Decoder for the compiler to derive the Akka Http Unmarshaller.
Put the Decoder in scope:
case class ConfigEntity(id: Option[Int] = None, description: String, key: String, value: String, expirationDate: Option[DateTime] = None)

implicit val decoder: Decoder[DateTime] = Decoder.decodeString.emap[DateTime](str =>
  Right(DateTime.parse(str))
)

"retrieve config by id" in new Context {
  val testConfig = testConfigs(4)
  Get(s"/configs/${testConfig.id.get}") ~> route ~> check {
    responseAs[ConfigEntity] should be(testConfig)
  }
}
Obviously, you must deal with the possible exceptions when parsing your DateTime and return Left instead of Right...
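A minimal sketch of that error handling (using Try; the exact error mapping and the matching encoder are just one option):

import io.circe.{Decoder, Encoder}
import org.joda.time.DateTime
import scala.util.Try

// Decode, turning any parse failure into a Left with the error message.
implicit val dateTimeDecoder: Decoder[DateTime] =
  Decoder.decodeString.emap { str =>
    Try(DateTime.parse(str)).toEither.left.map(_.getMessage)
  }

// The matching encoder, so ConfigEntity can be round-tripped in tests.
implicit val dateTimeEncoder: Encoder[DateTime] =
  Encoder.encodeString.contramap[DateTime](_.toString)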
I must say that I always use SprayJsonSupport for Akka HTTP, and this is the first time I've seen CirceSupport.
Hope this helps.

Flink read custom types - implicit value error: `Caused by: java.lang.NoSuchMethodException: <init>()`

I am trying to read an Avro file into the case class UserItemIds, which includes another case class, User (sbt, Scala 2.11):
case class User(id: Long, description: String)
case class UserItemIds(user: User, itemIds: List[Long])
val UserItemIdsInputStream = env.createInput(new AvroInputFormat[UserItemIds](user_item_ids_In, classOf[UserItemIds]))
UserItemIdsInputStream.print()
but receive the following error:
Caused by: java.lang.NoSuchMethodException: schema.User.<init>()
Can anyone guide me on how to work with these types, please? This example uses Avro files, but it could be Parquet or any custom DB input.
Do I need to use TypeInformation? For example:
val tupleInfo: TypeInformation[(User, List[Long])] = createTypeInformation[(User, List[Long])]
If yes, how do I do so? I also saw env.registerType(); does it relate to the issue at all? Any help is greatly appreciated.
I found that the usual solution to this Java error is adding a default constructor. In this case I added a factory method to the Scala case class by putting it in the companion object:
object UserItemIds {
  case class UserItemIds(user: User, itemIds: List[Long])
  def apply(user: User, itemIds: List[Long]) = new UserItemIds(user, itemIds)
}
but this has not resolved the issue.
You have to add a default constructor for the User and UserItemIds types. This could look the following way:
case class User(id: Long, description: String) {
  def this() = this(0L, "")
}

case class UserItemIds(user: User, itemIds: List[Long]) {
  def this() = this(new User(), List())
}