How to convert nested case class into UDTValue type - scala

I'm struggling to use custom case classes to write to Cassandra (2.1.6) using Spark (1.4.0). So far, I've tried this by using the DataStax spark-cassandra-connector 1.4.0-M1 and the following case classes:
case class Event(event_id: String, event_name: String, event_url: String, time: Option[Long])
[...]
case class RsvpResponse(event: Event, group: Group, guests: Long, member: Member, mtime: Long, response: String, rsvp_id: Long, venue: Option[Venue])
In order to make this work, I've also implemented the following converter:
implicit object EventToUDTValueConverter extends TypeConverter[UDTValue] {
  def targetTypeTag = typeTag[UDTValue]
  def convertPF = {
    case e: Event => UDTValue.fromMap(toMap(e)) // toMap just transforms the case class into a Map[String, Any]
  }
}
TypeConverter.registerConverter(EventToUDTValueConverter)
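The toMap helper is only described in the comment above; a minimal hand-written version (an assumption, since the question does not show its implementation) could look like this:

```scala
case class Event(event_id: String, event_name: String, event_url: String, time: Option[Long])

// Hypothetical toMap: the map keys must match the UDT's field names.
def toMap(e: Event): Map[String, Any] = Map(
  "event_id"   -> e.event_id,
  "event_name" -> e.event_name,
  "event_url"  -> e.event_url,
  "time"       -> e.time
)
```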
If I look up the converter manually, I can use it to convert an instance of Event into a UDTValue. However, when I call sc.saveToCassandra, passing it an instance of RsvpResponse with its related objects, I get the following error:
15/06/23 23:56:29 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
com.datastax.spark.connector.types.TypeConversionException: Cannot convert object Event(EVENT9136830076436652815,First event,http://www.meetup.com/first-event,Some(1435100185774)) of type class model.Event to com.datastax.spark.connector.UDTValue.
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:42)
at com.datastax.spark.connector.types.UserDefinedType$$anon$1$$anonfun$convertPF$1.applyOrElse(UserDefinedType.scala:33)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:40)
at com.datastax.spark.connector.types.UserDefinedType$$anon$1.convert(UserDefinedType.scala:31)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anonfun$readColumnValues$2.apply(DefaultRowWriter.scala:46)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anonfun$readColumnValues$2.apply(DefaultRowWriter.scala:43)
It seems my converter is never even getting called because of the way the connector library is handling UDTValue internally. However, the solution described above does work for reading data from Cassandra tables (including user defined types). Based on the connector docs, I also replaced my nested case classes with com.datastax.spark.connector.UDTValue types directly, which then fixes the issue described, but breaks reading the data. I can't imagine I'm meant to define 2 separate models for reading and writing data. Or am I missing something obvious here?

Since version 1.3, there is no need to use custom type converters to load and save nested UDTs. Just model everything with case classes and stick to the field naming convention and you should be fine.
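For example, assuming a hypothetical keyspace and schema (not shown in the question), a nested case class maps to a UDT purely by field name, with no converter registered:

```scala
// Assumed CQL schema (hypothetical, for illustration):
//   CREATE TYPE ks.event (event_id text, event_name text, event_url text, time bigint);
//   CREATE TABLE ks.rsvps (rsvp_id bigint PRIMARY KEY, event frozen<event>);

case class Event(event_id: String, event_name: String, event_url: String, time: Option[Long])
case class Rsvp(rsvp_id: Long, event: Event)

// With connector 1.3+, the nested case class maps to the UDT automatically:
// sc.parallelize(Seq(Rsvp(1L, Event("e1", "First event", "http://example.com", Some(1L)))))
//   .saveToCassandra("ks", "rsvps")
```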

Related

DataSet/DataStream of type class interface

I am just experimenting with the use of Scala type classes within Flink. I have defined the following type class interface:
trait LikeEvent[T] {
  def timestamp(payload: T): Int
}
Now, I want to consider a DataSet of LikeEvent[_] like this:
// existing classes that need to be adapted/normalized (without touching them)
case class Log(ts: Int, severity: Int, message: String)
case class Metric(ts: Int, name: String, value: Double)
// create instances for the raw events
object EventInstance {
  implicit val logEvent = new LikeEvent[Log] {
    def timestamp(log: Log): Int = log.ts
  }
  implicit val metricEvent = new LikeEvent[Metric] {
    def timestamp(metric: Metric): Int = metric.ts
  }
}
// add ops to the raw event classes (regular class)
object EventSyntax {
  implicit class Event[T: LikeEvent](val payload: T) {
    val le = implicitly[LikeEvent[T]]
    def timestamp: Int = le.timestamp(payload)
  }
}
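Stripped of Flink, the type class plus syntax object can be exercised directly; a minimal self-contained sketch of the machinery above (with only the Log instance, for brevity):

```scala
trait LikeEvent[T] { def timestamp(payload: T): Int }

case class Log(ts: Int, severity: Int, message: String)

implicit val logEvent: LikeEvent[Log] = new LikeEvent[Log] {
  def timestamp(log: Log): Int = log.ts
}

// Same shape as EventSyntax: pimps a `timestamp` method onto any T
// that has a LikeEvent instance in scope.
implicit class EventOps[T: LikeEvent](val payload: T) {
  def timestamp: Int = implicitly[LikeEvent[T]].timestamp(payload)
}

// Log(1586736005, 1, "invalid login").timestamp returns 1586736005
```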
The following app runs just fine:
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// underlying (raw) events
val events: DataSet[Event[_]] = env.fromElements(
  Metric(1586736000, "cpu_usage", 0.2),
  Log(1586736005, 1, "invalid login"),
  Log(1586736010, 1, "invalid login"),
  Log(1586736015, 1, "invalid login"),
  Log(1586736030, 2, "valid login"),
  Metric(1586736060, "cpu_usage", 0.8),
  Log(1586736120, 0, "end of world"),
)
// count events per hour
val eventsPerHour = events
  .map(new GetMinuteEventTuple())
  .groupBy(0).reduceGroup { g =>
    val gl = g.toList
    val (hour, count) = (gl.head._1, gl.size)
    (hour, count)
  }
eventsPerHour.print()
Printing the expected output:
(0,5)
(1,1)
(2,1)
However, if I modify the syntax object like this:
// couldn't make it work with Flink!
// add ops to the raw event classes (case class)
object EventSyntax2 {
  case class Event[T: LikeEvent](payload: T) {
    val le = implicitly[LikeEvent[T]]
    def timestamp: Int = le.timestamp(payload)
  }
  implicit def fromPayload[T: LikeEvent](payload: T): Event[T] = Event(payload)
}
I get the following error:
type mismatch;
found : org.apache.flink.api.scala.DataSet[Product with Serializable]
required: org.apache.flink.api.scala.DataSet[com.salvalcantara.fp.EventSyntax2.Event[_]]
So, guided by the message, I do the following change:
val events: DataSet[Event[_]] = env.fromElements[Event[_]](...)
After that, the error changes to:
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax2.Event[_]]
I cannot understand why EventSyntax2 results in these errors while EventSyntax compiles and runs fine. Why is using a case class wrapper in EventSyntax2 more problematic than using a regular class as in EventSyntax?
Anyway, my question is twofold:
How can I solve my problem with EventSyntax2?
What would be the simplest way to achieve my goals? Here, I am just experimenting with the type class pattern for the sake of learning, but admittedly a more object-oriented approach (based on subtyping) looks simpler to me. Something like this:
// Define trait
trait Event {
  def timestamp: Int
  def payload: Product with Serializable // Any case class
}

// Metric adapter (similar for Log)
object MetricAdapter {
  implicit class MetricEvent(val payload: Metric) extends Event {
    def timestamp: Int = payload.ts
  }
}
And then simply use val events: DataSet[Event] = env.fromElements(...) in the main.
Note that List of classes implementing a certain typeclass poses a similar question, but it considers a simple Scala List instead of a Flink DataSet (or DataStream). The focus of my question is on using the type class pattern within Flink to somehow handle heterogeneous streams/datasets, and whether it really makes sense, or whether one should simply favour a regular trait in this case and inherit from it as outlined above.
BTW, you can find the code here: https://github.com/salvalcantara/flink-events-and-polymorphism.
Short answer: Flink cannot derive TypeInformation in Scala for wildcard types.
Long answer:
Both of your questions are really asking: what is TypeInformation, how is it used, and how is it derived?
TypeInformation is Flink's internal type system, which it uses to serialize data when it is shuffled across the network or stored in a state backend (when using the DataStream API).
Serialization is a major performance concern in data processing, so Flink contains specialized serializers for common data types and patterns. Out of the box, in its Java stack, it supports all JVM primitives, POJOs, Flink tuples, some common collection types, and Avro. The type of your class is determined using reflection, and if it does not match a known type, it falls back to Kryo.
In the Scala API, type information is derived using implicits. All methods on the Scala DataSet and DataStream APIs have their generic parameters annotated for the implicit as a type class:
def map[T: TypeInformation]
This TypeInformation can be provided manually, like any type class, or derived using a macro that is imported from Flink:
import org.apache.flink.api.scala._
This macro decorates the Java type stack with support for Scala tuples, Scala case classes, and some common Scala standard library types. I say decorates because it can and will fall back to the Java stack if your class is not one of those types.
So why does version 1 work?
Because Event there is an ordinary class that the type stack cannot match, so it is resolved to a generic type and a Kryo-based serializer is returned. You can test this from the console and see that it returns a generic type:
scala> implicitly[TypeInformation[EventSyntax.Event[_]]]
res2: org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax.Event[_]] = GenericType<com.salvalcantara.fp.EventSyntax.Event>
Version 2 does not work because the macro recognizes the type as a case class and then recursively derives TypeInformation instances for each of its members. This is not possible for wildcard types, which are different from Any, and so the derivation fails.
In general, you should not use Flink with heterogeneous types, because it will not be able to derive efficient serializers for your workload.

Pre-process parameters of a case class constructor without repeating the argument list

I have this case class with a lot of parameters:
case class Document(id:String, title:String, ...12 more params.. , keywords: Seq[String])
For certain parameters, I need to do some string cleanup (trim, etc) before creating the object.
I know I could add a companion object with an apply function, but the LAST thing I want is to write the list of parameters TWICE in my code (case class constructor and companion object's apply).
Does Scala provide anything to help me on this?
My general recommendations would be:
Your goal (data preprocessing) is the perfect use case for a companion object, so it is maybe the most idiomatic solution despite the boilerplate.
If the number of case class parameters is high, the builder pattern definitely helps, since you do not have to remember the order of the parameters, and your IDE can help you with calling the builder member functions. Using named arguments for the case class constructor allows an arbitrary argument order as well, but to my knowledge there is no IDE autocompletion for named arguments, which makes a builder class slightly more convenient. However, a builder class raises the question of how to enforce the specification of certain arguments: the simple solution may cause runtime errors, while the type-safe solution is a bit more verbose. In this regard, a case class with default arguments is more elegant.
There is also this solution: Introduce an additional flag preprocessed with a default argument of false. Whenever you want to use an instance val d: Document, you call d.preprocess() implemented via the case class copy method (to avoid ever typing all your arguments again):
case class Document(id: String, title: String, keywords: Seq[String], preprocessed: Boolean = false) {
  def preprocess() = if (preprocessed) this else {
    this.copy(title = title.trim, preprocessed = true) // or whatever you want to do
  }
}
But: You cannot prevent a client from initializing preprocessed to true.
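To make the flag pattern concrete, here is a usage sketch (abbreviated to three fields; the keywords cleanup is an added assumption, since the snippet above only trims the title):

```scala
case class Document(id: String, title: String, keywords: Seq[String], preprocessed: Boolean = false) {
  def preprocess(): Document =
    if (preprocessed) this
    else this.copy(title = title.trim, keywords = keywords.map(_.trim), preprocessed = true)
}

val d = Document("1", "  My Title  ", Seq(" scala ", "json ")).preprocess()
// d.title == "My Title"; calling preprocess() again is a no-op returning the same instance
```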
Another option would be to make some of your parameters a private val and expose the corresponding getter for the preprocessed data:
case class Document(id: String, title: String, private val _keywords: Seq[String]) {
  val keywords = _keywords.map(kw => kw.trim)
}
But: Pattern matching and the default toString implementation will not give you quite what you want...
After changing context for half an hour, I looked at this problem with fresh eyes and came up with this:
case class Document(id: String, title: String, var keywords: Seq[String]) {
  keywords = keywords.map(kw => kw.trim)
}
I simply make the argument mutable by adding var and clean up the data in the class body.
Ok I know, my data is not immutable anymore and Martin Odersky will probably kill a kitten after seeing this, but hey.. I managed to do what I want adding 3 characters. I call this a win :)

Serializing case class with trait mixin using json4s

I've got a case class Game which I have no trouble serializing/deserializing using json4s.
case class Game(name: String, publisher: String, website: String, gameType: GameType.Value)
In my app I use mapperdao as my ORM. Because Game uses a surrogate id, I do not have id as part of its constructor.
However, when mapperdao returns an entity from the DB it supplies the id of the persisted object using a trait.
Game with SurrogateIntId
The code for the trait is
trait SurrogateIntId extends DeclaredIds[Int] {
  def id: Int
}

trait DeclaredIds[ID] extends Persisted

trait Persisted {
  @transient
  private var mapperDaoVM: ValuesMap = null
  @transient
  private var mapperDaoDetails: PersistedDetails = null

  private[mapperdao] def mapperDaoPersistedDetails = mapperDaoDetails
  private[mapperdao] def mapperDaoValuesMap = mapperDaoVM

  private[mapperdao] def mapperDaoInit(vm: ValuesMap, details: PersistedDetails) {
    mapperDaoVM = vm
    mapperDaoDetails = details
  }
  .....
}
When I try to serialize Game with SurrogateIntId I get empty parentheses returned; I assume this is because json4s doesn't know how to deal with the attached trait.
I need a way to serialize Game with only id added to its properties, and, almost as importantly, a way to do this for any T with SurrogateIntId, as I use these traits for all of my domain objects.
Can anyone help me out?
So this is an extremely specific solution, since the origin of my problem comes from the way mapperDao returns DOs; however, it may be helpful for general use, since I'm delving into custom serializers in json4s.
The full discussion on this problem can be found on the mapperDao google group.
First, I found that calling copy() on any persisted Entity (returned from mapperDao) returned the clean copy (just the case class) of my DO, which is then serializable by json4s. However, I did not want to have to remember to call copy() every time I serialize a DO, or to deal with mapping lists, etc., as this would be unwieldy and error-prone.
So, I created a CustomSerializer that wraps the returned Entity (case class DO + traits as an object) and gleans the class from the generic type with an implicit manifest. Using this approach, I pattern match my domain objects to determine what was passed in and then use Extraction.decompose(myDO.copy()) to serialize and return the clean DO.
// Entity[Int, Persisted, Class[T]] is how my DOs are returned by mapperDao
class EntitySerializer[T: Manifest] extends CustomSerializer[Entity[Int, Persisted, Class[T]]](formats => (
  { PartialFunction.empty }, // This PF is for extracting from JSON and not needed
  {
    case g: Game => // Each type is one of my DOs
      implicit val formats: Formats = DefaultFormats // include primitive formats for serialization
      Extraction.decompose(g.copy()) // get plain DO and then serialize with json4s
    case u: User =>
      implicit val formats: Formats = DefaultFormats + new LinkObjectEntitySerializer // See below for explanation on LinkObject
      Extraction.decompose(u.copy())
    case t: Team =>
      implicit val formats: Formats = DefaultFormats + new LinkObjectEntitySerializer
      Extraction.decompose(t.copy())
    ...
  }
))
The only need for a separate serializer is in the event that you have non-primitives as parameters of a case class being serialized, because a serializer can't use itself to serialize. In this case you create a serializer for each basic class (i.e. one with only primitives) and then include it in the formats of the next serializer, whose objects depend on those basic classes.
class LinkObjectEntitySerializer[T: Manifest] extends CustomSerializer[Entity[Int, Persisted, Class[T]]](formats => (
  { PartialFunction.empty },
  {
    // Team and User have Set[TeamUser] parameters; define this "dependency"
    // here so it can be included in their formats
    case tu: TeamUser =>
      implicit val formats: Formats = DefaultFormats
      ("Team" -> // Using a custom-built representation of the object
        ("name" -> tu.team.name) ~
        ("id" -> tu.team.id) ~
        ("resource" -> "/team/") ~
        ("isCaptain" -> tu.isCaptain)) ~
      ("User" ->
        ("name" -> tu.user.globalHandle) ~
        ("id" -> tu.user.id) ~
        ("resource" -> "/user/") ~
        ("isCaptain" -> tu.isCaptain))
  }
))
This solution is hardly satisfying. Eventually I will need to replace mapperDao or json4s(or both) to find a simpler solution. However, for now, it seems to be the fix with the least amount of overhead.

How to implement a generic REST api for tables in Play2 with squeryl and spray-json

I'm trying to implement a controller in Play2 which exposes a simple REST-style API for my db tables. I'm using squeryl for database access and spray-json for converting objects to/from JSON.
My idea is to have a single generic controller to do all the work, so I've set up the following routes in conf/routes:
GET /:tableName controllers.Crud.getAll(tableName)
GET /:tableName/:primaryKey controllers.Crud.getSingle(tableName, primaryKey)
.. and the following controller:
object Crud extends Controller {
  def getAll(tableName: String) = Action {..}
  def getSingle(tableName: String, primaryKey: Long) = Action {..}
}
(Yes, missing create/update/delete, but let's get read to work first)
I've mapped tables to case classes by extended squeryl's Schema:
object MyDB extends Schema {
  val accountsTable = table[Account]("accounts")
  val customersTable = table[Customer]("customers")
}
And I've told spray-json about my case classes so it knows how to convert them.
object MyJsonProtocol extends DefaultJsonProtocol {
  implicit val accountFormat = jsonFormat8(Account)
  implicit val customerFormat = jsonFormat4(Customer)
}
So far so good; it actually works pretty well as long as I use the table instances directly. The problem surfaces when I try to generify the code so that I end up with exactly one controller for accessing all tables: I'm stuck with a piece of code that doesn't compile, and I'm not sure what the next step is.
It seems to be a type issue with spray-json which occurs when I'm trying to convert the list of objects to json in my getAll function.
Here is my generic attempt:
def getAll(tableName: String) = Action {
  val json = inTransaction {
    // look up table based on url
    val table = MyDB.tables.find(t => t.name == tableName).get
    // execute select all and convert to json
    from(table)(t =>
      select(t)
    ).toList.toJson // causes compile error
  }
  // convert json to string and set correct content type
  Ok(json.compactPrint).as(JSON)
}
Compile error:
[error] /Users/code/api/app/controllers/Crud.scala:29:
Cannot find JsonWriter or JsonFormat type class for List[_$2]
[error] ).toList.toJson
[error] ^
[error] one error found
I'm guessing the problem could be that the JSON library needs to know at compile time which model type I'm throwing at it, but I'm not sure (notice the List[_$2] in the compile error). I have tried the following changes to the code, which compile and return results:
Remove the generic table lookup (MyDB.tables.find(.....).get) and instead use a specific table instance, e.g. MyDB.accountsTable. This proves that JSON serialization works, but it is not generic and would require a unique controller and route config per table in the db.
Convert the list of objects from the db query to a string before calling toJson, i.e. toList.toJson --> toList.toString.toJson. This proves that the generic lookup of tables works, but it is not a proper JSON response, since it is a string-serialized list of objects.
Thoughts anyone?
Your guess is correct. MyDB.tables is a Seq[Table[_]]; in other words, it can hold any type of table. There is no way for the compiler to figure out the type of the table you locate using the find method, and that type is needed for the JSON conversion. There are ways to get around that, but you'd need some form of access to the model class.
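One way around the erasure is to pair each table with its JSON format at registration time, so the element type never escapes as a bare wildcard. The sketch below uses simplified stand-ins for the squeryl Table and spray-json writer types (assumptions, not the real APIs), but the pattern carries over:

```scala
// Simplified stand-in for a JSON writer type class (assumption, not the
// real spray-json API).
trait JsonWriter[A] { def write(a: A): String }

// Bundle rows with the writer for their element type. The type parameter A
// is hidden from callers, but the writer evidence captured at construction
// time still knows how to serialize each row.
class TableWithFormat[A](val name: String, val rows: List[A])(implicit writer: JsonWriter[A]) {
  def allAsJson: String = rows.map(writer.write).mkString("[", ",", "]")
}

case class Account(id: Long, owner: String)

implicit val accountWriter: JsonWriter[Account] =
  (a: Account) => s"""{"id":${a.id},"owner":"${a.owner}"}"""

// The registry holds wildcards, yet serialization still works, because each
// entry carries its own format.
val tables: Map[String, TableWithFormat[_]] =
  Map("accounts" -> new TableWithFormat("accounts", List(Account(1, "alice"))))

// tables("accounts").allAsJson serializes without the caller naming the type
```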

Jerkson (Jackson) issue with scala.runtime.BoxedUnit?

Jerkson started throwing a really strange error that I haven't seen before.
com.fasterxml.jackson.databind.JsonMappingException: No serializer found for class scala.runtime.BoxedUnit and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationConfig.SerializationFeature.FAIL_ON_EMPTY_BEANS) ) (through reference chain: scala.collection.MapWrapper["data"])
I'm parsing some basic data from an API. The class I've defined is:
case class Segmentation(
  @(JsonProperty @field)("legend_size")
  val legend_size: Int,
  @(JsonProperty @field)("data")
  val data: Data
)
and Data looks like:
case class Data(
  @(JsonProperty @field)("series")
  val series: List[String],
  @(JsonProperty @field)("values")
  val values: Map[String, Map[String, Any]]
)
Any clue why this would be triggering errors? Seems like a simple class that Jerkson can handle.
Edit: sample data:
{"legend_size": 1, "data": {"series": ["2013-04-06", "2013-04-07", "2013-04-08", "2013-04-09", "2013-04-10", "2013-04-11", "2013-04-12", "2013-04-13", "2013-04-14", "2013-04-15"], "values": {"datapoint": {"2013-04-12": 0, "2013-04-15": 4, "2013-04-14": 0, "2013-04-08":
0, "2013-04-09": 0, "2013-04-11": 0, "2013-04-10": 0, "2013-04-13": 0, "2013-04-06": 0, "2013-04-07": 0}}}}
This isn't the answer to the above example, but I'm going to offer it because it was the answer to my similar "BoxedUnit" scenario:
No serializer found for class scala.runtime.BoxedUnit and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS)
In my case, Jackson was complaining about serializing an instance of a scala.runtime.BoxedUnit object.
Q: So what is scala.runtime.BoxedUnit?
A: It's the Java representation of Scala's Unit. The core part of Jackson (which is Java code) is attempting to serialize a Java representation of the Scala Unit non-entity.
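The boxing can be observed directly: treating the Unit value as an object yields a BoxedUnit instance, which is exactly what a Java library like Jackson ends up seeing.

```scala
// Unit is a value type; when it crosses into Java-land as an object
// (e.g. inside a Map[String, Any] handed to Jackson), it gets boxed:
val boxed: AnyRef = ().asInstanceOf[AnyRef]
println(boxed.getClass.getName) // prints scala.runtime.BoxedUnit
```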
Q: So why was this happening?
A: In my case, this was a downstream side effect of a buggy method with an undeclared return type. The method in question wrapped a match clause that (unintentionally) didn't return a value for each case. Because of this, Scala inferred the type of the var capturing the method's result as Unit. Later on, when that var was serialized into JSON, the Jackson error occurred.
So if you are getting an issue like this, my advice would be to examine any implicitly typed vars and methods with undeclared return types, and ensure they are doing what you think they are doing.
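A sketch of the kind of bug described (hypothetical names, not the original code): when one branch of a match yields a String and another only performs a side effect, the inferred result type degrades, and the Unit value can later leak into serialization unnoticed.

```scala
// Hypothetical example: the `case _` branch only prints, so it yields Unit.
// The method's inferred return type becomes Any, and callers can end up
// holding the Unit value () without noticing.
def lookup(key: String) = key match {
  case "a" => "found a"
  case _   => println(s"miss: $key") // yields Unit
}

val result = lookup("b") // result holds (), which Jackson sees as BoxedUnit
```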
I had the same exception. What caused it in my case is that I defined an apply method in the companion object without '=':
object Somthing {
  def apply(s: SomthingElse) {
    ...
  }
}
instead of
object Somthing {
  def apply(s: SomthingElse) = {
    ...
  }
}
That caused the apply method's return type to be Unit, which caused the exception when I passed the object to Jackson.
Not sure if that is the case in your code or if this question is still relevant but this might help others with this kind of problem.
It's been a while since I first posted this question. The solution, as of writing this answer, appears to be moving on from Jerkson and using jackson-module-scala or json4s with the Jackson backend. Many Scala types are covered by the default serializers and are natively handled.
In addition, the reason I'm seeing BoxedUnit is that the explicit type Jerkson was seeing was Any (as part of Map[String, Map[String, Any]]). Any is a base type and doesn't give Jerkson/Jackson information about what it's serializing, so it complains about a missing serializer.