Spark: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema - scala

I am creating an Avro RDD with the following code:
def convert2Avro(data: String, schema: Schema): AvroKey[GenericRecord] = {
  var wrapper = new AvroKey[GenericRecord]()
  var record = new GenericData.Record(schema)
  record.put("empname", "John")
  wrapper.datum(record)
  return wrapper
}
and creating the Avro RDD as follows:
var avroRDD = fieldsRDD.map(x => convert2Avro(x, schema))
While executing, I get the following exception on the above line:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
Any pointers?

The Schema.RecordSchema class does not implement Serializable, so it cannot be transferred over the network. We can convert the schema to a string, pass that string to the method, and reconstruct the schema object inside the method.
var schemaString = schema.toString
var avroRDD = fieldsRDD.map(x => convert2Avro(x, schemaString))
Inside the method, reconstruct the schema:
def convert2Avro(data: String, schemaString: String): AvroKey[GenericRecord] = {
  val parser = new Schema.Parser()
  var schema = parser.parse(schemaString)
  var wrapper = new AvroKey[GenericRecord]()
  var record = new GenericData.Record(schema)
  record.put("empname", "John")
  wrapper.datum(record)
  return wrapper
}
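For reference, these snippets assume the usual Avro imports (AvroKey being the one from avro-mapred; adjust if you use a different wrapper):
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey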

Another alternative (from http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html) is to use static initialization.
As they explain at that link:
we are using a static initialization block. An instance of the recordInjection object will be created per JVM, i.e. we will have one instance per Spark worker.
And since it is created fresh on each worker, no serialization is needed.
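In Scala, the same per-JVM effect can be sketched with a singleton object (the object name and schema JSON below are illustrative, not taken from the linked post):
import org.apache.avro.Schema

// A Scala object is initialized lazily, once per JVM, i.e. once per executor,
// so the Schema never has to be serialized and shipped from the driver.
object AvroSchemas {
  private val schemaJson: String =
    """{"type":"record","name":"Employee","fields":[{"name":"empname","type":"string"}]}"""

  lazy val employeeSchema: Schema = new Schema.Parser().parse(schemaJson)
}

// Inside the closure the schema is then resolved locally on each worker:
// var avroRDD = fieldsRDD.map(x => convert2Avro(x, AvroSchemas.employeeSchema))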
I prefer the static initializer, as I would worry that toString() might not contain all the information needed to reconstruct the object (it seems to work well in this case, but serialization is not toString()'s advertised purpose). However, the disadvantage of the static approach is that it's not really a correct use of static (see, for example, Java: when to use static methods).
So use whichever you prefer; since both seem to work fine, it's probably more a matter of your preferred style.
Update
Of course, depending on your program, the most elegant solution might be to avoid the problem altogether by containing all your Avro code in the worker, i.e. doing all the Avro processing you need, like writing to the Kafka topic or whatever, inside "convert2Avro". Then there is no need to return these objects back into an RDD (see the sketch below). It really depends on what you want the RDD for.
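For example, a rough sketch of that approach (the sink write is a placeholder, not something from the original code):
fieldsRDD.foreachPartition { partition =>
  // Only schemaString (a plain String) is captured by the closure; every Avro
  // object is created here on the worker and never shipped back to the driver.
  partition.foreach { x =>
    val avroKey = convert2Avro(x, schemaString) // string-based version from above
    // write avroKey to Kafka, HDFS, etc. here instead of returning it in an RDD
  }
}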

Related

NoSuchMethodException in JDBI while using it with Kotlin

I am trying to use JDBI with a PostgreSQL database, from Kotlin. Even after trying for several hours, I couldn't get to a point where I can use basic CRUD operations.
val users = jdbi.withHandleUnchecked {
  it.createQuery("select id, name from users")
    .mapToBean(User::class.java)
    .list()
}
Here is the code I am using. (There is no issue with the connection; I have verified that by executing queries using plain JDBC.)
Here is the model class:
data class User(val id: Int, val name: String)
Here is the exception I am getting:
Exception in thread "main" java.lang.NoSuchMethodException: no such constructor: User.<init>()void/newInvokeSpecial
at java.base/java.lang.invoke.MemberName.makeAccessException(MemberName.java:961)
at java.base/java.lang.invoke.MemberName$Factory.resolveOrFail(MemberName.java:1101)
at java.base/java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:2030)
at java.base/java.lang.invoke.MethodHandles$Lookup.findConstructor(MethodHandles.java:1264)
at org.jdbi.v3.core.mapper.reflect.internal.BeanPropertiesFactory$BeanPojoProperties$PropertiesHolder.<init>(BeanPropertiesFactory.java:202)
at org.jdbi.v3.core.config.JdbiCaches.lambda$declare$0(JdbiCaches.java:49)
at org.jdbi.v3.core.config.JdbiCaches$1.lambda$get$1(JdbiCaches.java:63)
at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1705)
at org.jdbi.v3.core.config.JdbiCaches$1.get(JdbiCaches.java:63)
at org.jdbi.v3.core.mapper.reflect.internal.BeanPropertiesFactory$BeanPojoProperties.getProperties(BeanPropertiesFactory.java:81)
at org.jdbi.v3.core.mapper.reflect.internal.PojoMapper.specialize0(PojoMapper.java:99)
at org.jdbi.v3.core.mapper.reflect.internal.PojoMapper.specialize(PojoMapper.java:80)
at org.jdbi.v3.core.result.ResultSetResultIterator.<init>(ResultSetResultIterator.java:38)
at org.jdbi.v3.core.result.ResultIterable.lambda$of$0(ResultIterable.java:54)
at org.jdbi.v3.core.result.ResultIterable.stream(ResultIterable.java:228)
at org.jdbi.v3.core.result.ResultIterable.collect(ResultIterable.java:284)
at org.jdbi.v3.core.result.ResultIterable.list(ResultIterable.java:273)
at MainKt$main$$inlined$withHandleUnchecked$1.withHandle(Jdbi858Extensions.kt:185)
at org.jdbi.v3.core.Jdbi.withHandle(Jdbi.java:342)
at MainKt.main(Main.kt:26)
at MainKt.main(Main.kt)
Caused by: java.lang.NoSuchMethodError: User: method 'void <init>()' not found
at java.base/java.lang.invoke.MethodHandleNatives.resolve(Native Method)
at java.base/java.lang.invoke.MemberName$Factory.resolve(MemberName.java:1070)
at java.base/java.lang.invoke.MemberName$Factory.resolveOrFail(MemberName.java:1098)
... 19 more
Process finished with exit code 1
Seemingly it has something to do with the User class, so I tried tinkering with it.
Here is what I tried (changed val to var):
data class User(var id: Int, var name: String)
The exception being thrown is the same as above.
Here is what the dependencies section of my build.gradle looks like:
dependencies {
    implementation 'org.jetbrains.kotlin:kotlin-stdlib-jdk8:1.5.31'
    implementation 'org.postgresql:postgresql:42.3.0'
    implementation 'org.jdbi:jdbi3-kotlin-sqlobject:3.23.0'
    implementation 'com.zaxxer:HikariCP:5.0.0'
    implementation 'org.slf4j:slf4j-jdk14:1.7.32'
}
Any assistance would be really helpful.

Spark throws Not Serializable Exception inside a foreachRDD operation

I'm trying to implement an observer pattern using Scala and Spark Streaming. The idea is that whenever I receive a record from the stream (from Kafka), I notify the observers by calling the method "notifyObservers" inside the closure. Here's the code:
The stream is provided by the Kafka utils.
The method notifyObservers is defined in an abstract class, following the rules of the pattern.
The error, I think, is related to the fact that methods can't be serialized.
Am I thinking correctly? And if so, what kind of solution should I follow?
Thanks
def onMessageConsumed() = {
  stream.foreachRDD(rdd => {
    rdd.foreach(consumerRecord => {
      val record = new Record[T](consumerRecord.topic(),
        consumerRecord.value())
      // notify observers with the record to compute
      notifyObservers(record)
    })
  })
}
Yes, the classes used in code that is sent to other executors (executed in foreach, etc.) should implement the Serializable interface.
Also, if your notification code requires a connection to some resource, you need to wrap the foreach into foreachPartition, something like this:
stream.foreachRDD(rdd => {
  rdd.foreachPartition(rddPartition => {
    // set up the connection to the external component
    rddPartition.foreach(consumerRecord => {
      val record = new Record[T](consumerRecord.topic(),
        consumerRecord.value())
      notifyObservers(record)
    })
    // close the connection to the external component
  })
})
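If the subject/observer classes themselves are what fail to serialize, a rough sketch of the idea (these trait and class names are illustrative, not taken from the question's code):
// Illustrative only: everything captured by the foreach closure, including the
// subject, its observers, and Record[T] itself, travels to the executors and
// therefore must be Serializable.
trait Observer[T] extends Serializable {
  def update(record: Record[T]): Unit
}

abstract class Subject[T] extends Serializable {
  private var observers: List[Observer[T]] = Nil

  def addObserver(o: Observer[T]): Unit = { observers = o :: observers }

  protected def notifyObservers(record: Record[T]): Unit =
    observers.foreach(_.update(record))
}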

Accessing Pipeline within DoFn

I'm writing a pipeline to replicate data from one source to another. Info about the data sources is stored in a db (BQ). How can I use this data to build read/write endpoints dynamically?
I tried to pass the Pipeline object to my custom DoFn, but it can't be serialized. Later I tried to call getPipeline() on a passed view, but that doesn't work either, which is actually expected.
I can't know all the tables I need in advance, so I have to read them from the db (or any other source).
// builds some random view
PCollectionView<IdWrapper> idView = ...;

// reads table meta and replicates data for each table
pipeline.apply(getTableMetaEndpont().read())
    .apply(ParDo.of(new MyCustomReplicator(idView)).withSideInputs(idView));

private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {
    private final PCollectionView<IdWrapper> idView;

    private MyCustomReplicator(PCollectionView<IdWrapper> idView) {
        this.idView = idView;
    }

    // TableMeta {string: sourceTable, string: destTable}
    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, ProcessContext ctx) {
        long id = ctx.sideInput(idView).getValue();
        // builds a read endpoint which depends on the table meta
        // updates entities
        // stores entities using another endpoint
        idView
            .getPipeline()
            .apply(createReadEndpoint(tableMeta).read())
            .apply(ParDo.of(new SomeFunction(tableMeta, id)))
            .apply(createWriteEndpoint(tableMeta).insert());
        ctx.output(tableMeta);
    }
}
I expect it to replicate the data specified by TableMeta, but I can't use the pipeline within the DoFn object because it can't be serialized/deserialized.
Is there any way to implement the intended behavior?

Replacing RocksDB with In-memory state store in Kafka Streams

I'm using the Kafka Streams 0.10.1.1 release.
The RocksDB implementation of the state store can't handle our 50k/msg rate, so I want to change the state store to the in-memory one. This should be possible according to the docs: http://docs.confluent.io/3.1.0/streams/architecture.html#state
However, when I implement this:
val stateStore = Stores.create(stateStoreName).withStringKeys().withStringValues().inMemory().build()
val procSuppl: KStreamAggregate = ... // I'll spare the implementation details
streamBuilder.addSource(
  "mysource",
  new StringDeserializer(),
  new StringDeserializer(),
  "input_topic"
).addProcessor("proc", procSuppl, "mysource").addStateStore(stateStore, "proc")
I end up with this error at runtime:
Caused by: java.lang.ClassCastException: org.apache.kafka.streams.state.internals.MeteredKeyValueStore cannot be cast to org.apache.kafka.streams.state.internals.CachedStateStore
2017-01-23T13:19:11.830674020Z at org.apache.kafka.streams.kstream.internals.KStreamAggregate$KStreamAggregateProcessor.init(KStreamAggregate.java:62)
The implementation of the above method is:
public void init(ProcessorContext context) {
    super.init(context);
    store = (KeyValueStore<K, T>) context.getStateStore(storeName);
    ((CachedStateStore) store).setFlushListener(new ForwardingCacheFlushListener<K, V>(context, sendOldValues));
}
Why is it trying to cast the state store to a CachedStateStore instance? How can I implement a simple in-memory state store, which should be possible according to the docs?
Thanks
In order to create an in-memory state store, one needs to create a store supplier (using the Stores factory object):
val storeSupplier = Stores.inMemoryKeyValueStore("in-mem")
Then you need to use the store supplier when materializing a KTable:
val wordCounts = builder
  .stream[String, String]("streams-plaintext-input")
  .flatMapValues(textLine => textLine.toLowerCase.split("\\W+"))
  .groupBy((_, word) => word)
  .count()(Materialized.as(storeSupplier))
Obtain the queryable store:
val qStore = streams.store(
  wordCounts.queryableStoreName,
  QueryableStoreTypes.keyValueStore[String, Long])
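For context, a minimal sketch of how the streams instance referenced above might be built and started (the application id and bootstrap servers are assumptions, not values from the question):
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-inmem")   // assumed id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker

// `builder` is the StreamsBuilder used to define the wordCounts topology above
val streams = new KafkaStreams(builder.build(), props)
streams.start()
// once running, the queryable store can be used, e.g. qStore.get("some-word")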

Use GuidRepresentation.Standard with MongoDB

I am implementing a custom IBsonSerializer with the official MongoDB driver (C#). I am in the situation where I must serialize and deserialize a Guid.
If I implement the Serialize method as follow, it works:
public void Serialize(BsonWriter bsonWriter, Type nominalType, object value, IBsonSerializationOptions options)
{
    BsonBinaryData data = new BsonBinaryData(value, GuidRepresentation.CSharpLegacy);
    bsonWriter.WriteBinaryData(data);
}
However, I don't want the Guid representation to be CSharpLegacy; I want to use the standard representation. But if I change the Guid representation in that code, I get the following error:
MongoDB.Bson.BsonSerializationException: The GuidRepresentation for the writer is CSharpLegacy, which requires the subType argument to be UuidLegacy, not UuidStandard.
How do I serialize a Guid value using the standard representation?
Old question, but in case someone finds it on Google like I did...
Do this once:
BsonDefaults.GuidRepresentation = GuidRepresentation.Standard;
For example, in a Web Application/Web API, your Global.asax.cs file is the best place to add it once:
public class WebApiApplication : System.Web.HttpApplication
{
    protected void Application_Start()
    {
        BsonDefaults.GuidRepresentation = GuidRepresentation.Standard;
        // Other code... below
    }
}
If you don't want to modify the global setting BsonDefaults.GuidRepresentation (and you shouldn't, because modifying globals is a bad pattern), you can specify the setting when you create your collection:
IMongoDatabase db = ???;
string collectionName = ???;
var collectionSettings = new MongoCollectionSettings {
    GuidRepresentation = GuidRepresentation.Standard
};
var collection = db.GetCollection<BsonDocument>(collectionName, collectionSettings);
Then any GUIDs written to the collection will be in the standard format.
Note that when you read records from the database, you will get a System.FormatException if the GUID format in the database is different from the format in your collection settings.
It looks like what's happening is that when you don't explicitly pass a GuidRepresentation to the BsonBinaryData constructor, it defaults to GuidRepresentation.Unspecified, which ultimately maps to GuidRepresentation.Legacy (see this line in the source).
So you need to explicitly pass the guidRepresentation as a third argument to BsonBinaryData, set to GuidRepresentation.Standard.
Edit: As was later pointed out, you can set BsonDefaults.GuidRepresentation = GuidRepresentation.Standard if that's what you always want to use.