Avro Generic Record not taking aliases into account - scala

I have some JSON data (fastxml.jackson objects) that I want to convert into a generic Avro record. As I don't know beforehand what data I will be getting, only that there is an Avro schema available in a schema repository, I can't have predefined classes. Hence the generic record.
When I pretty-print my schema, I can see my keys/values and their aliases. However, the GenericRecord "put" method does not seem to know these aliases.
I get the following exception: Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: device/id
Is this by design? How can I make the record look at the aliases as well?
schema extract:
"fields" : [ {
"name" : "device_id",
"type" : "long",
"doc" : " The id of the device.",
"aliases" : [ "deviceid", "device/id" ]
}, {
............
}]
code:
def jsonToAvro(jSONObject: JsonNode, schema: Schema): GenericRecord = {
  val converter = new JsonAvroConverter
  println(jSONObject.toString) // correct
  println(schema.toString(true)) // correct
  println(schema.getField("device_id")) // correct
  println(schema.getField("device_id").aliases().toString) // correct
  val avroRecord = new GenericData.Record(schema)
  val iter = jSONObject.fields() // java.util.Iterator[java.util.Map.Entry[String, JsonNode]]
  while (iter.hasNext) {
    val entry = iter.next()
    println(s"adding ${entry.getKey} and ${entry.getValue} with ${entry.getValue.getClass.getName}") // adding device/id and 8711 with com.fasterxml.jackson.databind.node.IntNode
    avroRecord.put(entry.getKey, entry.getValue) // throws
  }
  avroRecord
}

I tried on Avro 1.8.2, and it still throws this exception when I read a JSON string into a GenericRecord:
org.apache.avro.AvroTypeException: Expected field name not found:
But I saw a sample that used aliases correctly two years ago:
https://www.waitingforcode.com/apache-avro/serialization-and-deserialization-with-schemas-in-apache-avro/read
So I guess Avro changed that behaviour recently.

It seems the schema is only this flexible when reading.
Writing Avro only looks at the current field name.
On top of that, I'm using "/" in my field names (coming from the JSON), which is not supported as an Avro field name.
Schema validation does not complain when it appears in an alias, so that might work (I haven't tested this).
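If you need to keep writing records from JSON keys that only show up as aliases, one workaround is to resolve each incoming key against the schema's aliases yourself before calling put. A minimal sketch (this helper is not part of Avro's API; it only relies on Schema.getField, Schema.getFields and Field.aliases):

import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Sketch of a helper: map an incoming JSON key to the schema's canonical field
// name, falling back to a declared alias when the key is not a field name itself.
def resolveFieldName(schema: Schema, key: String): Option[String] =
  Option(schema.getField(key)).map(_.name())
    .orElse(schema.getFields.asScala.find(_.aliases().contains(key)).map(_.name()))

// Usage inside the loop above:
// avroRecord.put(resolveFieldName(schema, entry.getKey).getOrElse(entry.getKey), entry.getValue)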

Related

How to apply a function to every string in a dataframe

{
"cars": {
"Nissan": {
"Sentra": {"doors":4, "transmission":"automatic"},
"Maxima": {"doors":4, "transmission":"automatic","colors":["b#lack","pin###k"]}
},
"Ford": {
"Taurus": {"doors":4, "transmission":"automatic"},
"Escort": {"doors":4, "transmission":"auto#matic"}
}
}
}
I have read this JSON, and I want to remove every # symbol in every string that may exist. My problem is making this function generic, so that it works on every schema I may encounter and not only the schema of the JSON above.
You could do something like this: get all the fields from the schema, use foldLeft with the DataFrame itself as the accumulator, and apply the function you want:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}

def replaceSymbol(df: DataFrame): DataFrame =
  df.schema.fieldNames.foldLeft(df)((df, field) =>
    df.withColumn(field, regexp_replace(col(field), "#", "")))
You might need to check whether the column is a String column or not, as in the sketch below.
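A minimal sketch of that check, assuming a flat schema (the nested cars example above would need its struct fields flattened or recursed into first): only columns whose type is StringType get the replacement.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

def replaceSymbolInStrings(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    // Only rewrite string columns; leave numeric/struct/array columns untouched.
    if (field.dataType == StringType)
      acc.withColumn(field.name, regexp_replace(col(field.name), "#", ""))
    else acc
  }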

How to retrieve and add a new document to a MongoDB collection through com.mongodb.reactivestreams.client.MongoClient

Context: I coded a Kafka consumer which receives a simple message, and I want to insert it into MongoDB using com.mongodb.reactivestreams.client.MongoClient. Although I understand my issue is all about how to use MongoClient properly, here is my stack: Micronaut + MongoDB reactive + Kotlin.
Disclaimer: if someone provides an answer in Java I should be able to translate it to Kotlin. You can ignore the Kafka part below since it works as expected.
Here is my code
package com.mybank.consumer
import com.mongodb.reactivestreams.client.MongoClient
import com.mongodb.reactivestreams.client.MongoCollection
import com.mongodb.reactivestreams.client.MongoDatabase
import io.micronaut.configuration.kafka.annotation.KafkaKey
import io.micronaut.configuration.kafka.annotation.KafkaListener
import io.micronaut.configuration.kafka.annotation.OffsetReset
import io.micronaut.configuration.kafka.annotation.Topic
import org.bson.Document
import org.reactivestreams.Publisher
import javax.inject.Inject
@KafkaListener(offsetReset = OffsetReset.EARLIEST)
class DebitConsumer {

    @Inject
    //@Named("another")
    var mongoClient: MongoClient? = null

    @Topic("debit")
    fun receive(@KafkaKey key: String, name: String) {
        println("Account - $name by $key")
        var mongoDb: MongoDatabase? = mongoClient?.getDatabase("account")
        var mongoCollection: MongoCollection<Document>? = mongoDb?.getCollection("account_collection")
        var mongoDocument: Publisher<Document>? = mongoCollection?.find()?.first()
        print(mongoDocument.toString())
        //println(mongoClient?.getDatabase("account")?.getCollection("account_collection")?.find()?.first())
        //val mongoClientClient: MongoDatabase = mongoClient.getDatabase("account")
        //println(mongoClient.getDatabase("account").getCollection("account_collection").find({ "size.h": { $lt: 15 } })
        //println(mongoClient.getDatabase("account").getCollection("account_collection").find("1").toString())
    }
}
Well, the code above is the closest I got. It does not report any error; it prints
com.mongodb.reactivestreams.client.internal.Publishers$$Lambda$618/0x0000000800525840@437ec11
I guess this proves the code is connecting properly to the database, but I was expecting it to print the first document.
There are three documents in the collection.
My final goal is to insert the message I receive from the Kafka listener into MongoDB. Any clue will be appreciated.
The whole code can be found on GitHub.
*** edited after Susan's question
Here is what is printed with
var mongoDocument = mongoCollection?.find()?.first()
print(mongoDocument.toString())
It looks like you are using Reactive Streams for MongoDB. Is there a reason you are using reactive streams?
The result you are getting is of type Publisher. You will need to use the subscribe() method in order to get the document.
See the documentation on Publisher:
http://www.howsoftworks.net/reacstre/1.0.2/Publisher
If you don't want to use reactive streams, here is a great example of how/what to use for MongoDB in Kotlin:
https://kb.objectrocket.com/mongo-db/retrieve-mongodb-document-using-kotlin-1180
Here is a similar Stack Overflow question using MongoDB, Reactive Streams and Publisher:
how save document to MongoDb with com.mongodb.reactivestreams.client
=============== Edited ==============
Publisher<Document> publisher = collection.find().first();
PrintDocumentSubscriber subscriber = new PrintDocumentSubscriber();
publisher.subscribe(subscriber);
subscriber.await(); // blocks until the publisher completes
The example will print the following document:
{ "_id" : { "$oid" : "551582c558c7b4fbacf16735" },
"name" : "MongoDB", "type" : "database", "count" : 1,
}
If you want non-blocking behaviour, subscribe without awaiting:
publisher.subscribe(new PrintDocumentSubscriber());
http://mongodb.github.io/mongo-java-driver-reactivestreams/1.4/javadoc/tour/SubscriberHelpers.PrintDocumentSubscriber.html
http://mongodb.github.io/mongo-java-driver-reactivestreams/1.6/getting-started/quick-tour/

spark parse json field and match to different case class

I have some JSON like the examples below. When I load it, some fields are themselves strings containing JSON.
How can I parse this JSON using Spark and Scala and look for the keywords I am interested in?
{"main":"{\"payload\": { \"mode\": [\"Node\"], \"currentSatate\": \"Ready\", \"Previousstate\": \"slow\", \"trigger\": [\"11\", \"12\"], \"AllStates\": [\"Ready\", \"slow\", \"fast\", \"new\"],\"UnusedStates\": [\"slow\", \"new\"],\"Percentage\": \"70\",\"trigger\": [\"11\"]}"}
{"main":"{\"payload\": {\"trigger\": [\"11\", \"22\"],\"mode\": [\"None\"],\"cangeState\": \"Open\"}}"}
{"main":"{\"payload\": { \"trigger\": [\"23\", \"45\"], \"mode\": [\"Edge\"], \"node.postions\": [\"12\", \"23\", \"45\", \"67\"], \"node.names\": [\"aa\", \"bb\", \"cc\", \"dd\"]}}" }
This is how its looking after loading in to data frame
val df = spark.read.json("<path to json>")
df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|main |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"payload": { "mode": ["Node"], "currentSatate": "Ready", "Previousstate": "slow", "trigger": ["11", "12"], "AllStates": ["Ready", "slow", "fast", "new"],"UnusedStates": ["slow", "new"],"Percentage": "70","trigger": ["11"]}|
|{"payload": {"trigger": ["11", "22"],"mode": ["None"],"cangeState": "Open"}} |
|{"payload": { "trigger": ["23", "45"], "mode": ["Edge"], "node.postions": ["12", "23", "45", "67"], "node.names": ["aa", "bb", "cc", "dd"]}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Since the JSON field is different for all 3 JSON strings, is there a way to define 3 case classes and match against them?
I only know how to match against one class:
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val parsedJson = mapper.readValue[classname](jsonstring)
Is there a way to create multiple case classes and match each string to the appropriate one?
Since you are using Spark SQL, the first thing to do is turn the data into a Dataset and then use Spark's methods to deal with it. Don't pass raw JSON all over the place (e.g., like in Play); the first task is to turn it into a Dataset.
You can deserialize the JSON into a case class:
import sparkSession.implicits._

val jsonFilePath: String = "/whatever/data.json"
val myDataSet = sparkSession.read.json(jsonFilePath).as[StudentRecord]
Now you have a Dataset of StudentRecord, so you can use Spark's groupBy method to work with the column you want:
myDataSet.groupBy("whateverTable.whateverColumn").max() //could be min(), count(), etc...
Extra note: your JSON should be cleaned up a little. For example, if it lives inside your program you can declare it as a multi-line string, so you don't need escape characters all over the place:
val myJson: String =
"""
{
}
""".stripMargin
If it is in a file, then the JSON you wrote is not correct. So first, make sure you have syntactically correct JSON to work with.
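As for matching several case classes: the answer above doesn't show it, but one possible sketch (the class and field names below are illustrative assumptions, not from the thread) is to parse the inner JSON string with Jackson, check which distinguishing key the payload contains, and build the corresponding case class:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import scala.collection.JavaConverters._

sealed trait Payload
case class StateEvent(currentState: String, previousState: String, triggers: Seq[String]) extends Payload
case class ChangeEvent(changeState: String, triggers: Seq[String]) extends Payload
case class NodeEvent(names: Seq[String], positions: Seq[String]) extends Payload

val mapper = new ObjectMapper()

def asStrings(node: JsonNode): Seq[String] =
  node.elements().asScala.map(_.asText()).toSeq

// Each payload shape carries a distinguishing key, so key presence decides
// which case class to build (key spellings follow the source data).
def parsePayload(json: String): Payload = {
  val p = mapper.readTree(json).get("payload")
  if (p.has("currentSatate"))
    StateEvent(p.get("currentSatate").asText, p.get("Previousstate").asText, asStrings(p.get("trigger")))
  else if (p.has("cangeState"))
    ChangeEvent(p.get("cangeState").asText, asStrings(p.get("trigger")))
  else
    NodeEvent(asStrings(p.get("node.names")), asStrings(p.get("node.postions")))
}

Because Payload is a sealed trait, downstream code can pattern match over the three shapes exhaustively.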

Parsing Json in Spark and populate a column in dataframe dynamically based on nodes value

I am using Spark 1.6.3 to parse a JSON structure.
I have the JSON structure below:
{
"events":[
{
"_update_date":1500301647576,
"eventKey":"depth2Name",
"depth2Name":"XYZ"
},
{
"_update_date":1500301647577,
"eventKey":"journey_start",
"journey_start":"2017-07-17T14:27:27.144Z"
}]
}
I want to parse the above JSON into 3 columns in a dataframe. The value of eventKey (e.g. depth2Name) is itself the name of a node in the JSON, and I want to read the value from that corresponding node and put it into a column "eventValue", so that I can accommodate any new events dynamically.
Here is the expected output:
_update_date,eventKey,eventValue
1500301647576,depth2Name,XYZ
1500301647577,journey_start,2017-07-17T14:27:27.144Z
sample code:
val x = sc.wholeTextFiles("/user/jx665240/events.json").map(x => x._2)
val namesJson = sqlContext.read.json(x)
namesJson.printSchema()
namesJson.registerTempTable("namesJson")
val eventJson=namesJson.select("events")
val mentions1 =eventJson.select(explode($"events")).toDF("events").select($"events._update_date",$"events.eventKey",$"events.$"events.eventKey"")
$"events.$"events.eventKey"" is not working.
Can you please suggest how to fix this issue.
Thanks,
Sree
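One possible approach, sketched under the assumption that passing a struct column to a UDF as a Row works on Spark 1.6.3 (the UDF and column names below are illustrative, not tested against that version): pass the whole exploded event struct into a UDF together with eventKey and look the field up by name on the Row.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{explode, udf}
import scala.util.Try

// Looks up, on each event struct, the field whose name is carried in eventKey;
// returns null when that field is missing from the struct.
val valueOfEventKey = udf { (event: Row, key: String) =>
  Try(Option(event.getAs[Any](key)).map(_.toString).orNull).getOrElse(null: String)
}

val events = namesJson.select(explode($"events").as("event"))
val result = events.select(
  $"event._update_date",
  $"event.eventKey",
  valueOfEventKey($"event", $"event.eventKey").as("eventValue"))
result.show(false)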

Issue with mondogb-morphia in grails application to store Map correctly in database

I'm using the mongodb-morphia plugin (0.7.8) in a Grails (2.0.3) application and I'm running into an issue with the Map type.
I want to store a map of type Map<String,?> in my database, but when I put this in my Groovy file:
class ServiceInfo {
    String name
    Map<String,?> args

    Date dateCreated // autoset by plugin
    Date lastUpdated // autoset by plugin

    static constraints = {
        name nullable: false
    }
}
I get the following error:
2012-04-29 14:39:43,876 [pool-2-thread-3] ERROR MongodbMorphiaGrailsPlugin - Error processing mongodb domain Artefact > fr.unice.i3s.modalis.yourcast.provider.groovy.ServiceInfo: Unknown type... pretty bad... call for help, wave your hands... yeah!
I tried just to specify Map in my file:
Map args
In that case I obtain the following simple warning:
INFO: MapKeyDifferentFromString complained about fr.unice.i3s.modalis.yourcast.provider.groovy.ServiceInfo.args : Maps cannot be keyed by Object (Map); Use a parametrized type that is supported (Map)
and when I try to save an object, the attribute args is simply omitted in the database.
For information, my objects are created like this:
def icalReader= new ServiceInfo(name:"IcalReader", args:['uri':DEFAULT_URL, 'path':"fr.unice.i3s.modalis.yourcast.sources.calendar.icalreader/"])
icalReader.save()
Finally, if I just say that args is a List:
List args
and change my objects to take a List containing a single Map, I just get a warning:
ATTENTION: The multi-valued field 'fr.unice.i3s.modalis.yourcast.provider.groovy.ServiceInfo.args' is a possible heterogenous collection. It cannot be verified. Please declare a valid type to get rid of this warning. null
but everything works fine and my map is correctly stored in the database:
{ "_id" : ObjectId("4f9be39f0364bf4002cd48ad"), "name" : "IcalReader", "args" : [ { "path" : "fr.unice.i3s.modalis.yourcast.sources.calendar.icalreader/", "uri" : "http://localhost:8080/" } ], "dateCreated" : ISODate("2012-04-28T12:33:35.838Z"), "lastUpdated" : ISODate("2012-04-28T12:33:35.838Z") }
So is there something I forgot when defining my map?
My service does work, but I don't like hacks like "wrap the map in a list to serialize it" ;)
I don't know about Map, but could you use an embedded data model instead?
class ServiceInfo {
    ...
    @Embedded
    MyArgs args
    ...
}

class MyArgs {
    String key
    String value
}