Can I customize the data serialization mechanism in Spark?


In my project, I need to ship plenty of existing Java objects to Spark workers, and most of them do not implement java.io.Serializable. I also want to control which variables/attributes of those objects are included. I only want to serialize the useful attributes, not everything in an object.
The Spark documentation indicates that there are two ways to serialize an object in Spark: java.io.Serializable or Kryo. I think both require writing a bunch of wrappers or extra code for every business object. However, my current code base already implements Protocol Buffers serialization. I am wondering if there is any way to plug this serialization mechanism into Spark.

Related

Random Avro data generator in Java/Scala

Is it possible to generate random Avro data from a specified schema using the org.apache.avro library?
I need to produce this data with Kafka.
I tried to find a random data generator for tests, but I have only come across standalone tools and examples based on GenericRecord. The tools do not suit me because they depend on files (reading a schema file and so on), and, as far as I understand, GenericRecord instances have to be built one by one.
Are there any other solutions for Java/Scala?
UPDATE: I have found this class, but it does not seem to be accessible from org.apache.avro version 1.8.2.

The reason you need to read a file is that it holds the schema, which defines the fields that need to be created and their types.
That is not a hard requirement: nothing prevents you from creating random Generic or Specific Records from a schema built in code via Avro's SchemaBuilder class.
See this repo for an example. It uses a Java class (POJO) generated from an AVSC schema, which, again, could be defined with SchemaBuilder instead.
Even the class you linked to uses a schema file.

I personally would probably use Avro4s (https://github.com/sksamuel/avro4s) in conjunction with ScalaCheck's (https://www.scalacheck.org) Gen to model such tests.
You could use ScalaCheck to generate random instances of case classes and avro4s to convert them to generic records, extract their schema, and so on.
There's also avro-mocker https://github.com/speedment/avro-mocker though I don't know how easy it is to hook into the code.

I'd just use Podam http://mtedone.github.io/podam/ to generate POJOs and then write them out as Avro using the Java Avro library https://avro.apache.org/docs/1.8.1/gettingstartedjava.html#Serializing

How to evolve Avro schema with Akka Persistence without sending schema or using a registry?

We are considering a serialization approach for our Scala-based Akka Persistence app. We consider it likely that our persisted events will "evolve" over time, so we want to support schema evolution, and are considering Avro first.
We'd like to avoid including the full schema with every message. However, for the foreseeable future, this Akka Persistence app is the only app that will be serializing and deserializing these messages, so we don't see a need for a separate schema registry.
Checking the docs for Avro and the various Scala libs, I see ways to include the schema with messages, and also how to go "schema-less" by using a schema registry, but what about the in-between case? What's the correct approach for going schema-less, but somehow including an identifier to be able to look up the correct schema (available in the locally deployed codebase) for the deserialized object? Would I literally just create a schema that represents my case class, but with an additional "identifier" field for the schema version, and then have some sort of in-memory map of identifier->schema at runtime?
Also, is the correct approach to have one serializer/deserializer class for each version of the schema, so it knows how to translate every version to/from the most recent version?
Finally, are there recommendations on how to unit-test schema evolutions? For instance, store a message in akka-persistence, then actually change the definition of the case class, and then kill the actor and make sure it properly evolves. (I don't see how to change the definition of the case class at runtime.)

After spending more time on this, here are the answers I came up with.
Using avro4s, you can use the default data output stream to include the schema with every serialized message. Or, you can use the binary output stream, which simply omits the schema when serializing each message. ('binary' is a bit of a misnomer here since all it does is omit the schema. In either case it is still an Array[Byte].)
Akka itself supplies a Serializer trait and a SerializerWithStringManifest trait. With the latter, a manifest string is stored alongside the serialized bytes of whatever you serialize, and that string can serve as the "schema identifier".
So when you create your custom serializer, you can extend the appropriate trait, define your schema identifier, and use the binary output stream. When those techniques are combined, you'll successfully be using schema-less serialization while including a schema identifier.
One common technique is to "fingerprint" your schema - treat it as a string and then calculate its digest (MD5, SHA-256, whatever). If you construct an in-memory map of fingerprint to schema, that can serve as your application's in-memory schema registry.
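The fingerprint-to-schema map described above can be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not avro4s or Akka code; the class name SchemaFingerprints and its methods are hypothetical, and the schema is treated as plain text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory "schema registry": SHA-256 fingerprint -> schema text.
final class SchemaFingerprints {
    private final Map<String, String> schemasByFingerprint = new HashMap<>();

    // Hex-encoded SHA-256 digest of the schema text.
    static String fingerprint(String schemaJson) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(schemaJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Stores the schema and returns the identifier to ship with each message.
    String register(String schemaJson) {
        String id = fingerprint(schemaJson);
        schemasByFingerprint.put(id, schemaJson);
        return id;
    }

    // Resolves the writer's schema from the identifier found on an incoming message.
    String lookup(String id) {
        String schema = schemasByFingerprint.get(id);
        if (schema == null) {
            throw new IllegalArgumentException("Unknown schema fingerprint: " + id);
        }
        return schema;
    }

    public static void main(String[] args) {
        SchemaFingerprints registry = new SchemaFingerprints();
        String v1 = "{\"type\":\"record\",\"name\":\"Evt\",\"fields\":[]}";
        String id = registry.register(v1);
        System.out.println(id + " -> " + registry.lookup(id));
    }
}
```

Hashing the raw text means that two schemas differing only in whitespace get different fingerprints. Avro's SchemaNormalization class can fingerprint a canonical form of the schema instead, which avoids that.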
So when deserializing, your incoming object will carry the identifier of the schema that was used to serialize it (the "writer"). While deserializing, you should know the identifier of the schema to use to deserialize it (the "reader"). Avro4s lets you specify both using a builder pattern, so Avro can translate the object from the old format to the new. That's how you support "schema evolution". Because of how that works, you don't need a separate serializer for each schema version. Your custom serializer will know how to evolve your objects, because that's the part that Avro gives you for free.
As for unit testing, your best bet is exploratory testing. Actually define multiple versions of a case class in your test, and multiple accompanying versions of its schema, and then explore how Avro works by writing tests that will evolve an object between different versions of that schema.
Unfortunately that won't be directly relevant to the code you are writing, because it's hard to simulate actually changing the code you are testing as you test it.
I developed a prototype that demonstrates several of these answers, and it's available on GitHub. It uses Avro, avro4s, and Akka Persistence. For this one, I demonstrated a changing codebase by actually changing it across commits: you'd check out commit #1, run the code, then move to commit #2, etc. It runs against Cassandra, so it demonstrates replaying events that need to be evolved using a new schema, all without using an external schema registry.

How to customize serialization behavior without annotations in Salat?

I'm using the Salat library to serialize objects to be stored in MongoDB via Casbah. Sometimes I need to tune a little how fields are serialized, and Salat's annotations are a pretty convenient way to do it.
But is there any way to describe serialization parameters (Key, Ignore, etc.) not directly in the case classes (models) via annotations, but in some external place, to keep my models free of the Salat dependency (i.e. POJO/POCO)?

Yes, you can add custom serialization logic to your Salat context.
Example from Salat unit tests:
WibbleTransformer
Custom context with custom transformers added

How to convert JSON into Scala class (NOT Case class), and then populate set of case classes from that big class

I am building an application using Scala 2.10, Salat, Play framework 2.1-RC2 (will upgrade to the 2.1 release soon) and MongoDB.
This is a faceless application where JSON web services are exposed to consumers. Up until now, JSON was converted into a model object directly using Play's JSON API and implicit converters. I have to refactor some case classes to avoid the 22-field limit, so instead of a flat case class I am now refactoring to embedded case classes (and an embedded MongoDB collection).
The web service interface should remain the same: clients should still pass in JSON data in a flat structure as before, but the application needs to map it into the proper case class structure. What's the best way to handle this kind of situation? I fear I will have to write a lot of conversion code: flat JSON to the complex case class structure, and from the complex case classes back to flat JSON output.
How would you approach such a requirement? I assume many others have run into the 22-field case class limit and had to handle this kind of requirement.

The Play 2.1 JSON library relies heavily on combinators (path1 and path2). These combinators all have the same 22-field restriction. That gives you two options:
Don't use combinators and construct your objects the hard way: path(json) will give you the value at that point in the path. Searching for 'Accessing value of JsPath' at ScalaJsonCombinators will give more examples.
First transform the json into a structure that does not have more than 22 values in a single object and then use the normal combinators. More information about transforming can be found here: ScalaJsonTransformers
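The second option, reshaping the flat JSON before mapping it, can be illustrated independently of Play. The sketch below is not Play's transformer API: it uses plain Java maps as a stand-in for a parsed JSON object, and the class FlatToNested, the method nest, and the field names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: regroups a flat JSON-like object into nested objects.
final class FlatToNested {

    // Moves every flat key listed in groupOf under its group's nested object.
    // Keys without a group stay at the top level.
    static Map<String, Object> nest(Map<String, Object> flat, Map<String, String> groupOf) {
        Map<String, Object> nested = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : flat.entrySet()) {
            String group = groupOf.get(entry.getKey());
            if (group == null) {
                nested.put(entry.getKey(), entry.getValue());
            } else {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>)
                        nested.computeIfAbsent(group, g -> new LinkedHashMap<String, Object>());
                child.put(entry.getKey(), entry.getValue());
            }
        }
        return nested;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("name", "Ann");
        flat.put("street", "1 Main St");
        flat.put("city", "Oslo");

        Map<String, String> groupOf = new LinkedHashMap<>();
        groupOf.put("street", "address");
        groupOf.put("city", "address");

        // "street" and "city" end up inside a nested "address" object.
        System.out.println(nest(flat, groupOf));
    }
}
```

With the fields grouped this way, each nested object can stay under 22 fields and be mapped to its own case class. The reverse direction (nested back to flat for the response) is the same loop run the other way.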

Sending persisted JDO instances over GWT-RPC

I've just started learning Google Web Toolkit and finished writing the Stock Watcher tutorial app.
Is my thinking correct that if one wants to persist a business object (like a Stock) using JDO and send it back and forth to/from the client over RPC then one has to create two separate classes for that object: One with the JDO annotations for persisting it on the server and another which is serialisable and used over RPC?
I notice the Stock Watcher has separate classes and I can theorise why:
Otherwise the GWT compiler would try to generate JavaScript for everything the persisted class references, like JDO and com.google.blah.users.User, etc.
Also, there may be logic in the server-side class which doesn't apply to the client, and vice versa.
I just want to make sure I'm understanding this correctly. I don't want to have to create two versions of all my business object classes which I want to use over RPC if I don't have to.

The short answer is: you don't need to create duplicate classes.
I recommend that you take a look at the following Google Groups discussion on the gwt-contributors list:
http://groups.google.com/group/google-web-toolkit-contributors/browse_thread/thread/3c768d8d33bfb1dc/5a38aa812c0ac52b
Here is an interesting excerpt:
If this is all you're interested in, I described a way to make GAE and GWT-RPC work together "out of the box". Just declare your entities as:
@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable = "false")
public class MyPojo implements Serializable { }
and everything will work, but you'll have to manually deal with re-attachment when sending objects from the client back to the server.
You can use this option, and you will not need a mirror (DTO) class.
You can also try Gilead (formerly hibernate4gwt), which takes care of some of the details of serializing enhanced objects.

Your assessment is correct. JDO replaces instances of Collections with their own implementations, in order to sniff when the object graph changes, I suppose. These implementations are not known by the GWT compiler, so it will not be able to serialize them. This happens often for classes that are composed of otherwise GWT compliant types, but with JDO annotations, especially if some of the object properties are Collections.
For a detailed explanation and a workaround, check out this pretty influential essay on the topic: http://timepedia.blogspot.com/2009/04/google-appengine-and-gwt-now-marriage.html

I finally found a solution. Don't change your object at all, but for the listing do it this way:
List<YourCustomObject> secureList=(List<YourCustomObject>)pm.newQuery(query).execute();
return new ArrayList<YourCustomObject>(secureList);
The actual problem is not serializing the object itself. The problem is serializing the collection class returned by the query, which is implemented by Google and cannot be serialized.
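To see why the copy helps, here is a small stand-alone sketch. It is an analogy, not GWT or JDO code: it uses plain java.io serialization instead of GWT-RPC, and LazyResultList is a hypothetical stand-in for the list type the datastore query returns.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

final class ListCopyDemo {

    // Hypothetical stand-in for a datastore-backed list: a valid List, but not Serializable.
    static final class LazyResultList extends AbstractList<String> {
        private final String[] rows;

        LazyResultList(String... rows) {
            this.rows = rows;
        }

        @Override
        public String get(int index) {
            return rows[index];
        }

        @Override
        public int size() {
            return rows.length;
        }
    }

    // Returns true if Java serialization accepts the value, false if it is rejected.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> queryResult = new LazyResultList("AAPL", "GOOG");
        System.out.println(canSerialize(queryResult));                  // false
        // Copying keeps the elements but swaps the container for a plain ArrayList.
        System.out.println(canSerialize(new ArrayList<>(queryResult))); // true
    }
}
```

The elements are untouched by the copy. Only the container changes, which matches the point above that the list implementation, not the object, is what fails to serialize.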

You do not have to create two versions of the domain model.
Here are two tips:
Use a String encoded key, not the Appengine Key class.
pojo = pm.detachCopy(pojo)
...will remove all the JDO enhancements.

You don't have to create separate instances at all, in fact you're better off not doing it. Your JDO objects should be plain POJOs anyway, and should never contain business logic. That's for your business layer, not your persistent objects themselves.
All you need to do is include the source for the annotations you are using and GWT should compile your class just fine. Also, you want to avoid using libraries that GWT can't compile (like things that use reflection, etc.), but in all the projects I've done this has never been a problem.

I think a better format for sending objects through GWT is JSON. In this case the server sends a JSON string, which then has to be parsed on the client. The advantage is that the final JavaScript rendered in the browser is smaller, so the page loads faster.
Secondly, to send objects through GWT, the objects must be serializable. This may not be the case for all objects.
Thirdly, GWT has built-in functions to handle JSON, so there are no issues on the client end.
