Generating classes in Scala from an Avro schema - scala

I am trying to generate classes using avrohugger (https://github.com/julianpeeters/avrohugger#description).
Here is my schema:
{
  "name": "test1",
  "namespace": "test.testaero",
  "type": "map",
  "values": [
    {
      "type": "map",
      "values": [
        "boolean",
        {
          "type": "map",
          "values": [
            "null",
            "string",
            "boolean",
            {
              "type": "map",
              "values": [
                "null",
                "string",
                "boolean",
                "int",
                {
                  "type": "map",
                  "values": [
                    "null",
                    "string",
                    "int"
                  ],
                  "default": null
                }
              ],
              "default": null
            }
          ],
          "default": null
        }
      ]
    }
  ]
}
And the code:
import java.io.File

import avrohugger.Generator
import avrohugger.format.{SpecificRecord, Standard}
import avrohugger.types.AvroScalaTypes

object AvroParser extends App {
  val inputPath = "app/dto/roman/src/main/resources/tests.avsc"
  val outPutPath = "src/main/scala"
  val schemaFile = new File(inputPath)
  private val scalaTypes: AvroScalaTypes =
    SpecificRecord.defaultTypes.copy(map = avrohugger.types.ScalaMap)
  val generator = new Generator(Standard, avroScalaCustomTypes = Some(scalaTypes))
  generator.fileToFile(schemaFile, outPutPath)
}
The top-level type in my schema is a map, and generation fails in this function:
def getSchemaOrProtocols(
  infile: File,
  format: SourceFormat,
  classStore: ClassStore,
  classLoader: ClassLoader,
  parser: Parser = schemaParser): List[Either[Schema, Protocol]] = {

  def unUnion(schema: Schema) = {
    schema.getType match {
      case UNION  => schema.getTypes().asScala.toList
      case RECORD => List(schema)
      case ENUM   => List(schema)
      case FIXED  => List(schema)
      case _ => sys.error("""Neither a record, enum nor a union of either.
                            |Nothing to map to a definition.""".trim.stripMargin)
    }
  }
where the MAP type does not match any of the cases in the match. How can I adapt the schema, or am I not passing the right arguments?
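As the unUnion match above suggests, avrohugger only maps top-level record, enum, fixed, or union schemas to definitions, so a top-level map schema has nothing to generate a class from. A minimal workaround sketch, assuming the schema may be changed: wrap the map in a record field. The record and field names below are illustrative, and the nested map values are abbreviated to "boolean"; the original nested union would go in their place.

{
  "type": "record",
  "name": "test1",
  "namespace": "test.testaero",
  "fields": [
    {
      "name": "values",
      "type": {
        "type": "map",
        "values": "boolean"
      }
    }
  ]
}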

Related

Avro Schema: Build Avro Schema from Schema Fields

I am trying to write a function to calculate a diff between two Avro schemas and generate another schema.
schema_one = {
  "type": "record",
  "name": "schema_one",
  "namespace": "test",
  "fields": [
    { "name": "type", "type": "string" },
    { "name": "id", "type": "string" }
  ]
}
schema_two = {
  "type": "record",
  "name": "schema_two",
  "namespace": "test",
  "fields": [
    { "name": "type", "type": "string" }
  ]
}
To get the fields in schema_one that are not in schema_two:
import org.apache.avro.Schema._
import org.apache.avro.{Schema, SchemaBuilder}
import scala.collection.JavaConverters._

val diff: Set[Schema.Field] = schema_one.getFields.asScala.toSet.filterNot(schema_two.getFields.asScala.toSet)
So far, so good.
I want to build a new schema from diff and I expect it to be:
schema_three = {
  "type": "record",
  "name": "schema_three",
  "namespace": "test",
  "fields": [
    { "name": "id", "type": "string" }
  ]
}
I can't seem to find any method within Avro's SchemaBuilder to achieve this without explicitly providing named fields, i.e. building a Schema from a given set of Schema.Fields.
For example:
SchemaBuilder.record("schema_three").namespace("test").fromFields(diff)
Is there a way to achieve this? Appreciate comments.
I was able to achieve this using the Kite SDK ("org.kitesdk" % "kite-data-core" % "1.1.0"):
import org.kitesdk.data.spi.SchemaUtil
import scala.collection.JavaConverters._

val schema_namespace = schema_one.getNamespace
val schema_name = schema_one.getName

// Build a single-field record schema per diff field, then merge them all
val schemas = diff.map { f =>
  SchemaBuilder
    .record(schema_name)
    .namespace(schema_namespace)
    .fields()
    .name(f.name())
    .`type`(f.schema())
    .noDefault()
    .endRecord()
}

val schema_three = SchemaUtil.merge(schemas.asJava)
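For reference, the same result can be had without an extra dependency, using only the Avro Java API. A minimal sketch, assuming Avro 1.8+ (for Schema.Field#defaultVal); field objects cannot be reused across schemas, so each diff field is re-created:

import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Re-create each field, since a Schema.Field may belong to only one schema
val copiedFields = diff.toList.map { f =>
  new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal())
}

val schema_three = Schema.createRecord("schema_three", null, "test", false)
schema_three.setFields(copiedFields.asJava)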

How to set an array of records using GenericRecordBuilder

I'm trying to turn a Scala object (i.e. a case class) into a byte array.
In order to do so, I insert the object's content into a GenericRecordBuilder using its specific schema, and eventually turn it into a byte array using a GenericDatumWriter.
I have no problem setting primitive types, or arrays of primitive types, on the GenericRecordBuilder.
But I need help with inserting an array of records into the GenericRecordBuilder and creating a byte array from it.
What is the right way to insert an array of records into the GenericRecordBuilder?
Here is part of what I'm trying to do:
This is the schema:
{
  "type": "record",
  "name": "test1",
  "namespace": "ns",
  "fields": [
    {
      "name": "t_name",
      "type": "string",
      "default": "a"
    },
    {
      "name": "t_num",
      "type": "int",
      "default": 0
    },
    {
      "name": "t_arr",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "name": "t_arr_a",
            "type": "record",
            "fields": [
              { "name": "t_arr_f1", "type": "int", "default": 0 },
              { "name": "t_arr_f2", "type": "int", "default": 0 }
            ]
          }
        }
      ]
    }
  ]
}
This is the Scala class that populates the GenericRecordBuilder and transforms it into a byte array:
package utils

import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecordBuilder}
import org.apache.avro.io.EncoderFactory

object CheckRecBuilder extends App {
  val avroSchema: Schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/data/myschema.avsc"))
  val recordBuilder = new GenericRecordBuilder(avroSchema)
  recordBuilder.set("t_name", "X")
  recordBuilder.set("t_num", 100)
  recordBuilder.set("t_arr", ???)
  val record = recordBuilder.build()
  val w = new GenericDatumWriter[GenericData.Record](avroSchema)
  val outputStream = new ByteArrayOutputStream()
  val e = EncoderFactory.get.binaryEncoder(outputStream, null)
  w.write(record, e)
  e.flush() // the encoder buffers internally; without this the byte array may be incomplete
  val barr = outputStream.toByteArray
  println("End")
}
I managed to set the array of objects.
I wonder if there is a better or more correct way of doing it.
Here is what I did:
Created a case class:
case class t_arr_a(t_arr_f1: Int, t_arr_f2: Int)
Created a method that transforms a case class into a GenericData.Record:
def caseClassToGenericDataRecord(cc: Product, schema: Schema): GenericData.Record = {
  // `schema` is the array schema; its element type is the child record schema
  val childRecord = new GenericData.Record(schema.getElementType)
  val values = cc.productIterator
  cc.getClass.getDeclaredFields.foreach(f => childRecord.put(f.getName, values.next()))
  childRecord
}
Updated the class CheckRecBuilder above, replacing:
recordBuilder.set("t_arr", ???)
with:
// The field's schema is a union ["null", array]; index 1 is the array schema
val childSchema = avroSchema.getField("t_arr").schema().getTypes.get(1)
val tArray = Array(t_arr_a(2, 4), t_arr_a(25, 14))
val tArrayGRecords: java.util.List[GenericData.Record] =
  Some(tArray.map(x => caseClassToGenericDataRecord(x, childSchema)))
    .map(arr => java.util.Arrays.asList(arr: _*))
    .orNull
recordBuilder.set("t_arr", tArrayGRecords)

Troubles with AVRO schema update

I have a simple case class:
case class User(id: String, login: String, key: String)
I added the field "name":
case class User(id: String, login: String, name: String, key: String)
then added this field to the Avro schema (user.avsc):
{
  "namespace": "test",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "login", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "key", "type": "string" }
  ]
}
This class is used by another case class:
case class AuthRequest(user: User, session: String)
Schema (auth_request.avsc):
{
  "namespace": "test",
  "type": "record",
  "name": "AuthRequest",
  "fields": [
    { "name": "user", "type": "User" },
    { "name": "session", "type": "string" }
  ]
}
After that change my consumer starts throwing exceptions:
Consumer.committableSource(consumerSettings, Subscriptions.topics("token_service_auth_request"))
  .map { msg =>
    Try {
      val in: ByteArrayInputStream = new ByteArrayInputStream(msg.record.value())
      val input: AvroBinaryInputStream[AuthRequest] = AvroInputStream.binary[AuthRequest](in)
      val result: AuthRequest = input.iterator.toSeq.head // <- here is the exception
      msg.committableOffset.commitScaladsl()
      (msg.record.value(), result, msg.record.key())
    } match {
      case Success((a: Array[Byte], value: AuthRequest, key: String)) =>
        log.info(s"listener got $msg -> $a -> $value")
        context.parent ! value
      case Failure(e) => e.printStackTrace()
    }
  }
  .runWith(Sink.ignore)
java.util.NoSuchElementException: head of empty stream
  at scala.collection.immutable.Stream$Empty$.head(Stream.scala:1104)
  at scala.collection.immutable.Stream$Empty$.head(Stream.scala:1102)
  at test.consumers.AuthRequestListener.$anonfun$new$2(AuthRequestListener.scala:39)
  at scala.util.Try$.apply(Try.scala:209)
  at test.consumers.AuthRequestListener.$anonfun$new$1(AuthRequestListener.scala:36)
  at test.consumers.AuthRequestListener.$anonfun$new$1$adapted(AuthRequestListener.scala:35)
  at akka.stream.impl.fusing.Map$$anon$9.onPush(Ops.scala:51)
  at akka.stream.impl.fusing.GraphInterpreter.processPush(GraphInterpreter.scala:519)
  at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:482)
  at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:378)
  at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:588)
  at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:472)
  at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:563)
  at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:745)
  at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:760)
  at akka.actor.Actor.aroundReceive(Actor.scala:517)
  at akka.actor.Actor.aroundReceive$(Actor.scala:515)
  at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:670)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
  at akka.actor.ActorCell.invoke(ActorCell.scala:557)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I tried cleaning the build and invalidating caches - it seems as if a previous version of the schema is being cached somewhere.
Help please!
You need to make your change backward compatible by making the new field nullable and adding a default value to it:
{
  "namespace": "test",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "login", "type": "string" },
    { "name": "name", "type": ["null", "string"], "default": null },
    { "name": "key", "type": "string" }
  ]
}
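The underlying issue is that binary Avro carries no schema: the bytes on the topic were written with the old schema, and the consumer must resolve the writer's schema against its reader schema. A minimal sketch with the plain Avro API (rather than avro4s), assuming writerSchema is the old schema and readerSchema the new one; during resolution the missing name field is filled from its default.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

def decode(bytes: Array[Byte], writerSchema: Schema, readerSchema: Schema): GenericRecord = {
  // Schema resolution maps old-format bytes onto the new schema,
  // filling fields absent from the writer schema with their defaults
  val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  reader.read(null, DecoderFactory.get.binaryDecoder(bytes, null))
}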

Scala Map - add new key and copy value from another key

Consider the two sets of data below:
JSON1 => {
  "data": [
    {
      "id": "1-abc",
      "model": "Agile",
      "status": "open",
      "configuration": {
        "state": "running",
        "rootVolumeSize": "0.00000",
        "count": "2",
        "type": "large",
        "platform": "Linux"
      },
      "stateId": "123-567"
    }
  ]
}
JSON2 => {
  "data": [
    {
      "id": "1-abc",
      "model": "Agile",
      "configuration": {
        "state": "running",
        "diskSize": "0",
        "type": "small",
        "platform": "Windows"
      }
    }
  ]
}
I need to compare JSON1 and JSON2 based on the first field, id, and if they match, merge JSON1 into JSON2 while retaining the existing values in JSON2 (only appending fields that are not present).
I have coded it as below:
private def merger(JSON1: Seq[JSON], JSON2: Seq[JSON]): Seq[JSON] = {
  val abcKey = JSON1.groupBy(_.id) map { case (k, v) => (k, v.head) }
  val mergedRecords = for {
    xyzJSON <- JSON2
  } yield (
    abcKey.get(xyzJSON.id) match {
      // lowercase binding; an uppercase Some(JSON1) would match against the parameter
      case Some(json1) => xyzJSON.copy(status = json1.status,
                                       stateId = json1.stateId)
      case None => xyzJSON.copy(origin = "N/A")
    }
  )
  mergedRecords
}
I am not able to arrive at a solution for reconciling the fields within the configuration map; a sketch of that merge follows the expected output below.
The expected result should be:
{
  "data": [
    {
      "id": "1-abc",
      "model": "Agile",
      "status": "open",
      "configuration": {
        "state": "running",
        "diskSize": "0",
        "rootVolumeSize": "0.00000",
        "count": "2",
        "type": "small",
        "platform": "Windows"
      },
      "stateId": "123-567"
    }
  ]
}
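A minimal, self-contained sketch of the configuration reconciliation, assuming configuration is modeled as a Map[String, String] (the values below mirror the example data): ++ lets the right-hand operand win, so JSON2's values are retained and keys present only in JSON1 are appended.

val json1Config = Map(
  "state" -> "running", "rootVolumeSize" -> "0.00000",
  "count" -> "2", "type" -> "large", "platform" -> "Linux")

val json2Config = Map(
  "state" -> "running", "diskSize" -> "0",
  "type" -> "small", "platform" -> "Windows")

// JSON1 values first; JSON2 values override on key clashes
val merged = json1Config ++ json2Config
// Map(state -> running, rootVolumeSize -> 0.00000, count -> 2,
//     type -> small, platform -> Windows, diskSize -> 0)

In the merger above, the matching case would then become xyzJSON.copy(status = json1.status, stateId = json1.stateId, configuration = json1.configuration ++ xyzJSON.configuration).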

elastic4s: how to add analyzer/filter for german_phonebook to analysis?

How do I add the following german_phonebook analyzer to Elasticsearch using elastic4s?
"index": {
"analysis": {
"analyzer": {
"german": {
"filter": [
"lowercase",
"german_stop",
"german_normalization",
"german_stemmer"
],
"tokenizer": "standard"
},
"german_phonebook": {
"filter": [
"german_phonebook"
],
"tokenizer": "keyword"
},
"mySynonyms": {
"filter": [
"lowercase",
"mySynonymFilter"
],
"tokenizer": "standard"
}
},
"filter": {
"german_phonebook": {
"country": "CH",
"language": "de",
"type": "icu_collation",
"variant": "#collation=phonebook"
},
"german_stemmer": {
"language": "light_german",
"type": "stemmer"
},
"german_stop": {
"stopwords": "_german",
"type": "stop"
},
"mySynonymFilter": {
"synonyms": [
"swisslift,lift"
],
"type": "synonym"
}
}
},
The core question here is: which filter definition should be used for the german_phonebook filter of type icu_collation?
...
Following the answer I came up with this code:
case class GPhonebook() extends TokenFilterDefinition {
  val filterType = "phonebook"

  def name = "german_phonebook"

  override def build(source: XContentBuilder): Unit = {
    source.field("tokenizer", "keyword")
    source.field("country", "CH")
    source.field("language", "de")
    source.field("type", "icu_collation")
    source.field("variant", "#collation=phonebook")
  }
}
The analyzer definition looks like this now:
CustomAnalyzerDefinition(
  "german_phonebook",
  KeywordTokenizer("myKeywordTokenizer2"),
  GPhonebook()
)
What you really want is some way to say
CustomTokenFilter("german_phonebook") or BuiltInTokenFilter("german_phonebook"), but you can't (I'll add that).
So for now, you need to extend TokenFilterDefinition.
E.g., something like:
case class GPhonebook() extends TokenFilterDefinition {
  val filterType = "phonebook"

  def name = "german_phonebook"

  override def build(source: XContentBuilder): Unit = {
    // set extra params in here
  }
}
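For completeness, a hedged usage sketch: assuming the elastic4s 5.x create-index DSL used above (the index name here is illustrative), the custom filter is registered through the analyzer when the index is created.

client.execute {
  createIndex("companies").analysis(
    CustomAnalyzerDefinition(
      "german_phonebook",
      KeywordTokenizer("myKeywordTokenizer2"),
      GPhonebook()
    )
  )
}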