Using a type by reflection in ParDo causes NotSerializableException - apache-beam

I have a ParDo DoFn that deserializes protobuf from a byte stream. It outputs to a PCollection<InputMessage>, where InputMessage is a generic type parameter (<? extends Message>). It infers the Google protobuf Descriptor by a whacky method that circumvents type erasure:
Type mySuperclass = this.getClass().getGenericSuperclass();
Type inputType = ((ParameterizedType) mySuperclass).getActualTypeArguments()[0];
Class<InputT> inputTClass = (Class<InputT>) inputType;
Descriptor descriptor = MessageReflector.getDescriptor(inputTClass);
and uses the descriptor to deserialize the protobuf from the bytestream. All this is working fine.
What doesn't work is a similar DoFn that doesn't have the proto message type as a generic type parameter: it only learns the proto message class name at runtime, uses Class.forName to get the Class, and from that obtains the Descriptor it uses to deserialize the protobuf. This DoFn outputs com.google.protobuf.Message, the supertype of all protobuf message classes, and its callers receive a PCollection<Message> as the output. When I run this transformation, it fails with:
Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize DoFnWithExecutionInformation{doFn=com.google.cloud.verticals.telco.taap.dataflow.dataingestion.businesslogic.datastrategy.ProcessExtractedDataFn@72e572ee, mainOutputTag=Tag<com.google.cloud.verticals.telco.taap.dataflow.dataingestion.businesslogic.BusinessLogicPipeline#1>, sideInputMapping={}, schemaInformation=DoFnSchemaInformation{elementConverters=[]}}
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:59)
at org.apache.beam.runners.core.construction.ParDoTranslation.translateDoFn(ParDoTranslation.java:695)
at org.apache.beam.runners.core.construction.ParDoTranslation$1.translateDoFn(ParDoTranslation.java:253)
at org.apache.beam.runners.core.construction.ParDoTranslation.payloadForParDoLike(ParDoTranslation.java:817)
at org.apache.beam.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:249)
at org.apache.beam.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:210)
at org.apache.beam.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:176)
at org.apache.beam.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:248)
at org.apache.beam.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:746)
at org.apache.beam.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:761)
at org.apache.beam.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:274)
at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:290)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:595)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:587)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:214)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:469)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:268)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:218)
at org.apache.beam.runners.direct.DirectRunner.performRewrites(DirectRunner.java:254)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:175)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at . . .
Caused by: java.io.NotSerializableException: com.google.protobuf.Descriptors$Descriptor
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:55)
Why doesn't it work?
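For reference, the failing DoFn has roughly the shape sketched below (a reconstruction from the description above, written in Scala against the Beam Java SDK; the class and member names are hypothetical). The Descriptors.Descriptor instance field is what the NotSerializableException in the trace points at, since the whole DoFn is Java-serialized when the pipeline is constructed. A common restructuring is to keep only the (serializable) class name in the DoFn and resolve the Descriptor lazily in @Setup:

import com.google.protobuf.{Descriptors, DynamicMessage, Message}
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{ProcessElement, Setup}

// Sketch only: the message class is known only by name at run time.
class ParseDynamicProtoFn(messageClassName: String) extends DoFn[Array[Byte], Message] {

  // Marked @transient so the non-serializable Descriptor is never part of the
  // serialized DoFn; it is rebuilt on each worker in @Setup instead.
  @transient private var descriptor: Descriptors.Descriptor = _

  @Setup
  def setup(): Unit = {
    val clazz = Class.forName(messageClassName)
    descriptor = clazz.getMethod("getDescriptor").invoke(null).asInstanceOf[Descriptors.Descriptor]
  }

  @ProcessElement
  def processElement(c: DoFn[Array[Byte], Message]#ProcessContext): Unit =
    c.output(DynamicMessage.parseFrom(descriptor, c.element()))
}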

Related

json4s Serialize and deserialize generic type

Because I am dealing with a generic type, I can't use specific case classes. So I created a generic util that serializes and deserializes generic objects.
import org.json4s
import org.json4s.Formats._
import org.json4s.native.JsonMethods._

object JsonHelper {
  def json2Object[O](input: String): O = {
    parse(json4s.string2JsonInput(input)).asInstanceOf[O]
  }

  def object2Json[O](input: O): String = {
    write(input).toString
  }
}
The compiler throws the error:
No JSON serializer found for type O. Try to implement an implicit Writer or JsonFormat for this type.
write(input).toString
I would expect this to be thrown at runtime, so why is it thrown at compile time?
In a comment above, you asked "So how jackson can work with java object? It use reflection right? And why it's different from Scala?", which gets to the heart of this question.
The json4s "native" serializer you have imported uses compile-time reflection to create the Writer.
Jackson uses run-time reflection to do the same.
The compile-time version is more efficient; the run-time version is more flexible.
To use the compile-time version, you need to let the compiler have enough information to choose the correct Writer based on the declared type of the object to be serialized. This will rule out very generic writer methods like the one you propose. See @TimP's answer for how to fix your code for that version.
To use the run-time version, you can use Jackson via the org.json4s.jackson.JsonMethods._ package. See https://github.com/json4s/json4s#jackson
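A minimal sketch of that run-time approach, assuming the json4s-jackson module is on the classpath (Serialization, DefaultFormats, and the Manifest requirement are standard json4s; the helper object itself is just an illustration):

import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.Serialization

object JacksonJsonHelper {
  implicit val formats: Formats = DefaultFormats

  // read needs a Manifest so the target type is available at run time,
  // but no per-type Writer/Reader instances are required.
  def json2Object[O: Manifest](input: String): O = Serialization.read[O](input)

  def object2Json[O <: AnyRef](input: O): String = Serialization.write(input)
}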
The compiler error you posted comes from this location in the json4s code. The write function you're calling takes an implicit JSON Writer, which is how the method can take arbitrary types. It's caught at compile time because implicit arguments are compiled the same way explicit ones are -- it's as if you had:
def f(a: Int, b: Int) = a + b
f(5) // passed the wrong number of arguments
I'm having a bit of trouble seeing exactly which write method you're calling here -- the json4s library is pretty big and things are overloaded. Can you paste the declared write method you're using? It almost certainly has a signature like this:
def write[T](value: T)(implicit writer: Writer[T]): JValue
If it looks like the above, try including the implicit writer parameter in your method as so:
object JsonHelper {
  def json2Object[O](input: String)(implicit reader: Reader[O]): O = {
    parse(json4s.string2JsonInput(input)).asInstanceOf[O]
  }

  def object2Json[O](input: O)(implicit writer: Writer[O]): String = {
    write(input).toString
  }
}
In this example you are dealing with generic types. Scala, like other JVM languages, erases generic types at compile time (which is why the compile-time error message may not mention the generic type at all), so try appending this fragment to the signature of both methods:
(implicit tag: ClassTag[O])
It's similar to your generic example, but with Jackson.
HTH

Scala function return type based on generic

Using Scala generics I'm trying to abstract some common functions in my Play application. The functions return Seqs with objects deserialized from a REST JSON service.
def getPeople(cityName: String): Future[Seq[People]] = {
  getByEndpoint[People](s"http://localhost/person/$cityName")
}

def getDogs(): Future[Seq[Dog]] = {
  getByEndpoint[Dog]("http://localhost/doge")
}
The fetch and deserialization logic is packed into a single function using generics.
private def getByEndpoint[T](endpoint: String): Future[Seq[T]] = {
  ws.url(endpoint)
    .get()
    .map(rsp => rsp.json)
    .flatMap { json =>
      json.validate[Seq[T]] match {
        case s: JsSuccess[Seq[T]] =>
          Future.successful(s.get)
        case e: JsError =>
          Future.failed(new RuntimeException(s"Get by endpoint JSON match failed: $e"))
      }
    }
}
The problem is that I'm getting "No Json deserializer found for type Seq[T]. Try to implement an implicit Reads or Format for this type.". I'm sure I'm not using T properly in Seq[T] (going by my C#/Java memories, at least), but I can't find any clue about how to do it the proper way in Scala. Everything works as expected when I don't use generics.
Play JSON uses type classes to capture information about which types can be (de-)serialized to and from JSON, and how. If you have an implicit value of type Format[Foo] in scope, that's referred to as an instance of the Format type class for Foo.
The advantage of this approach is that it gives us a way to constrain generic types (and have those constraints checked at compile time) that doesn't depend on subtyping. For example, there's no way the standard library's String will ever extend some kind of Jsonable trait that Play (or any other library) might provide, so we need some way of saying "we know how to encode Strings as JSON" that doesn't involve making String a subtype of some trait we've defined ourselves.
In Play JSON you can do this by defining implicit Format instances, and Play itself provides many of these for you (e.g., if you've got one for T, it'll give you one for Seq[T]). The validate method on JsValue requires one of these instances (strictly speaking Reads, a supertype of Format, but that's not terribly relevant here) for its type parameter, Seq[T] in this case, and it won't compile unless the compiler can find that instance.
You can provide this instance by adding the constraint to your own generic method:
private def getByEndpoint[T: Format](endpoint: String): Future[Seq[T]] = {
...
}
Now with the T: Format syntax you've specified that there has to be a Format instance for T (even though you don't constrain T in any other way), so the compiler knows how to provide the Format instance for Seq[T] that the json.validate[Seq[T]] call requires.
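For illustration, a sketch of what the call site needs (People here is a hypothetical case class; Json.format is Play's standard macro for deriving a Format):

import play.api.libs.json._

case class People(name: String, age: Int)

object People {
  // Derived at compile time for the case class.
  implicit val peopleFormat: Format[People] = Json.format[People]
}

// With Format[People] in scope, getByEndpoint[People](...) compiles:
// the compiler derives the Reads[Seq[People]] that json.validate[Seq[T]] needs.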

How to convert nested case class into UDTValue type

I'm struggling using custom case classes to write to Cassandra (2.1.6) using Spark (1.4.0). So far, I've tried this by using the DataStax spark-cassandra-connector 1.4.0-M1 and the following case classes:
case class Event(event_id: String, event_name: String, event_url: String, time: Option[Long])
[...]
case class RsvpResponse(event: Event, group: Group, guests: Long, member: Member, mtime: Long, response: String, rsvp_id: Long, venue: Option[Venue])
In order to make this work, I've also implemented the following converter:
implicit object EventToUDTValueConverter extends TypeConverter[UDTValue] {
  def targetTypeTag = typeTag[UDTValue]

  def convertPF = {
    case e: Event => UDTValue.fromMap(toMap(e)) // toMap just transforms the case class into a Map[String, Any]
  }
}
TypeConverter.registerConverter(EventToUDTValueConverter)
If I look up the converter manually, I can use it to convert an instance of Event into UDTValue, however, when using sc.saveToCassandra passing it an instance of RsvpResponse with related objects, I get the following error:
15/06/23 23:56:29 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
com.datastax.spark.connector.types.TypeConversionException: Cannot convert object Event(EVENT9136830076436652815,First event,http://www.meetup.com/first-event,Some(1435100185774)) of type class model.Event to com.datastax.spark.connector.UDTValue.
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:42)
at com.datastax.spark.connector.types.UserDefinedType$$anon$1$$anonfun$convertPF$1.applyOrElse(UserDefinedType.scala:33)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:40)
at com.datastax.spark.connector.types.UserDefinedType$$anon$1.convert(UserDefinedType.scala:31)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anonfun$readColumnValues$2.apply(DefaultRowWriter.scala:46)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anonfun$readColumnValues$2.apply(DefaultRowWriter.scala:43)
It seems my converter is never even getting called because of the way the connector library is handling UDTValue internally. However, the solution described above does work for reading data from Cassandra tables (including user defined types). Based on the connector docs, I also replaced my nested case classes with com.datastax.spark.connector.UDTValue types directly, which then fixes the issue described, but breaks reading the data. I can't imagine I'm meant to define 2 separate models for reading and writing data. Or am I missing something obvious here?
Since version 1.3, there is no need to use custom type converters to load and save nested UDTs. Just model everything with case classes and stick to the field naming convention and you should be fine.
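A hedged sketch of that case-class-only approach (keyspace, table, and column names are hypothetical; it assumes the UDT field names match the case class field names and that sc is a live SparkContext):

import com.datastax.spark.connector._

// The nested Event case class maps to a Cassandra UDT column by field name;
// no custom TypeConverter is registered.
case class Event(event_id: String, event_name: String, event_url: String, time: Option[Long])
case class RsvpResponse(event: Event, guests: Long, response: String, rsvp_id: Long)

val rsvps = sc.parallelize(Seq(
  RsvpResponse(Event("e1", "First event", "http://www.meetup.com/first-event", Some(1435100185774L)), 2L, "yes", 1L)
))

rsvps.saveToCassandra("my_keyspace", "rsvp_responses")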

How to implement a generic REST api for tables in Play2 with squeryl and spray-json

I'm trying to implement a controller in Play2 which exposes a simple REST-style API for my db tables. I'm using squeryl for database access and spray-json for converting objects to/from JSON.
My idea is to have a single generic controller to do all the work, so I've set up the following routes in conf/routes:
GET /:tableName controllers.Crud.getAll(tableName)
GET /:tableName/:primaryKey controllers.Crud.getSingle(tableName, primaryKey)
.. and the following controller:
object Crud extends Controller {
  def getAll(tableName: String) = Action {..}
  def getSingle(tableName: String, primaryKey: Long) = Action {..}
}
(Yes, missing create/update/delete, but let's get read to work first)
I've mapped tables to case classes by extending squeryl's Schema:
object MyDB extends Schema {
  val accountsTable = table[Account]("accounts")
  val customersTable = table[Customer]("customers")
}
And I've told spray-json about my case classes so it knows how to convert them.
object MyJsonProtocol extends DefaultJsonProtocol {
  implicit val accountFormat = jsonFormat8(Account)
  implicit val customerFormat = jsonFormat4(Customer)
}
So far so good; it actually works pretty well as long as I'm using the table instances directly. The problem surfaces when I try to generify the code so that I end up with exactly one controller for accessing all tables: I'm stuck with a piece of code that doesn't compile, and I'm not sure what the next step is.
It seems to be a type issue with spray-json which occurs when I'm trying to convert the list of objects to JSON in my getAll function.
Here is my generic attempt:
def getAll(tableName: String) = Action {
  val json = inTransaction {
    // lookup table based on url
    val table = MyDB.tables.find(t => t.name == tableName).get
    // execute select all and convert to json
    from(table)(t =>
      select(t)
    ).toList.toJson // causes compile error
  }
  // convert json to string and set correct content type
  Ok(json.compactPrint).as(JSON)
}
Compile error:
[error] /Users/code/api/app/controllers/Crud.scala:29:
Cannot find JsonWriter or JsonFormat type class for List[_$2]
[error] ).toList.toJson
[error] ^
[error] one error found
I'm guessing the problem could be that the json-library needs to know at compile-time which model type I'm throwing at it, but I'm not sure (notice the List[_$2] in that compile error). I have tried the following changes to the code which compile and return results:
Remove the generic table lookup (MyDB.tables.find(.....).get) and instead use a specific table instance, e.g. MyDB.accountsTable. This proves that the JSON serialization works. However, it is not generic and would require a unique controller and route config per table in the db.
Convert the list of objects from the db query to a string before calling toJson, i.e. toList.toJson --> toList.toString.toJson. This proves that the generic lookup of tables works, but it is not a proper JSON response since it is a string-serialized list of objects.
Thoughts anyone?
Your guess is correct. MyDB.tables is a Seq[Table[_]]; in other words, it could hold any type of table. There is no way for the compiler to figure out the type of the table you locate using the find method, and that type is needed for the JSON conversion. There are ways to get around that, but you'd need some kind of access to the model class.
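One way to give the compiler that access (a sketch, not from the original answer: it pairs each table with its spray-json writer up front, so the element type never has to be recovered from a Table[_]):

import org.squeryl.Table
import org.squeryl.PrimitiveTypeMode._
import spray.json._

// Capture the table and its JsonWriter together while T is still known.
final case class TableHandler[T](table: Table[T], writer: JsonWriter[T]) {
  def allAsJson: JsValue = inTransaction {
    val rows = from(table)(t => select(t)).toList
    JsArray(rows.map(writer.write): _*)
  }
}

val handlers: Map[String, TableHandler[_]] = Map(
  "accounts"  -> TableHandler(MyDB.accountsTable, MyJsonProtocol.accountFormat),
  "customers" -> TableHandler(MyDB.customersTable, MyJsonProtocol.customerFormat)
)

// getAll then becomes a lookup:
// handlers.get(tableName).map(h => Ok(h.allAsJson.compactPrint).as(JSON))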

Jerkson (Jackson) issue with scala.runtime.BoxedUnit?

Jerkson started throwing a really strange error that I haven't seen before.
com.fasterxml.jackson.databind.JsonMappingException: No serializer found for class scala.runtime.BoxedUnit and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationConfig.SerializationFeature.FAIL_ON_EMPTY_BEANS) ) (through reference chain: scala.collection.MapWrapper["data"])
I'm parsing some basic data from an API. The class I've defined is:
case class Segmentation(
  @(JsonProperty @field)("legend_size")
  val legend_size: Int,
  @(JsonProperty @field)("data")
  val data: Data
)
and Data looks like:
case class Data(
  @(JsonProperty @field)("series")
  val series: List[String],
  @(JsonProperty @field)("values")
  val values: Map[String, Map[String, Any]]
)
Any clue why this would be triggering errors? Seems like a simple class that Jerkson can handle.
Edit: sample data:
{"legend_size": 1, "data": {"series": ["2013-04-06", "2013-04-07", "2013-04-08", "2013-04-09", "2013-04-10", "2013-04-11", "2013-04-12", "2013-04-13", "2013-04-14", "2013-04-15"], "values": {"datapoint": {"2013-04-12": 0, "2013-04-15": 4, "2013-04-14": 0, "2013-04-08":
0, "2013-04-09": 0, "2013-04-11": 0, "2013-04-10": 0, "2013-04-13": 0, "2013-04-06": 0, "2013-04-07": 0}}}}
This isn't the answer to the above example, but I'm going to offer it because it was the answer to my similar "BoxedUnit" scenario:
No serializer found for class scala.runtime.BoxedUnit and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS)
In my case Jackson was complaining about serializing an instance of a scala.runtime.BoxedUnit object.
Q: So what is scala.runtime.BoxedUnit?
A: It's the Java representation of Scala's Unit. The core part of Jackson (which is Java code) is attempting to serialize a Java representation of the Scala Unit non-entity.
Q: So why was this happening?
A: In my case this was a downstream side effect caused by a buggy method with an undeclared return type. The method in question wrapped a match clause that (unintentionally) didn't return a value for each case. Because of that buggy code, Scala inferred the type of the var capturing the result of this method as Unit. Later on, when this var gets serialized into JSON, the Jackson error occurs.
So if you are getting an issue like this, my advice would be to examine any implicitly typed vars and methods with undeclared return types and ensure they are doing what you think they are doing.
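A minimal illustration of that failure mode (not the asker's actual code; names are made up). One branch forgets to yield a value, so it evaluates to (), which surfaces as scala.runtime.BoxedUnit once it is stored in a Map[String, Any] and handed to Jackson:

def datapoint(key: String, raw: Map[String, Int]): Any = key match {
  case k if raw.contains(k) => raw(k)
  case other                => println(s"missing key: $other") // yields (), not a value
}

val payload = Map("data" -> datapoint("2013-04-12", Map.empty))
// Serializing `payload` now fails with
// "No serializer found for class scala.runtime.BoxedUnit ...".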
I had the same exception. What caused it in my case is that I defined an apply method in the companion object without '=':
object Somthing {
  def apply(s: SomthingElse) {
    ...
  }
}

instead of

object Somthing {
  def apply(s: SomthingElse) = {
    ...
  }
}
That caused the apply method's return type to be Unit, which caused the exception when I passed the object to Jackson.
Not sure if that is the case in your code or if this question is still relevant but this might help others with this kind of problem.
It's been a while since I first posted this question. The solution, as of writing this answer, appears to be moving on from Jerkson and using jackson-module-scala or Json4s with the Jackson backend. Many Scala types are covered by the default serializers and are handled natively.
In addition, the reason I'm seeing BoxedUnit is that the static type Jerkson was seeing was Any (part of Map[String, Map[String, Any]]). Any is a base type and doesn't give Jerkson/Jackson any information about what it's serializing, so it complains about a missing serializer.
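A minimal sketch of that setup, assuming the jackson-module-scala dependency is available (DefaultScalaModule is the module's standard entry point):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule) // teaches Jackson about Scala collections, Option, case classes

case class Data(series: List[String], values: Map[String, Map[String, Any]])
case class Segmentation(legend_size: Int, data: Data)

val json = mapper.writeValueAsString(
  Segmentation(1, Data(List("2013-04-06"), Map("datapoint" -> Map("2013-04-06" -> 0))))
)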