AvroTypeException: Not an enum: MOBILE on DataFileWriter - scala

I am getting the following error message when I tried to write avro records using build-in AvroKeyValueSinkWriter in Flink 1.3.2 and avro 1.8.2:
My schema looks like this:
{"namespace": "com.base.avro",
"type": "record",
"name": "Customer",
"doc": "v6",
"fields": [
{"name": "CustomerID", "type": "string"},
{"name": "platformAgent", "type": {
"type": "enum",
"name": "PlatformAgent",
"symbols": ["WEB", "MOBILE", "UNKNOWN"]
}, "default":"UNKNOWN"}
]
}
And I am calling the following Flink code to write data:
var properties = new util.HashMap[String, String]()
val stringSchema = Schema.create(Type.STRING)
val myTypeSchema = Customer.getClassSchema
val keySchema = stringSchema.toString
val valueSchema = myTypeSchema.toString
val compress = true
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, keySchema)
properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, valueSchema)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS, compress.toString)
properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS_CODEC, DataFileConstants.SNAPPY_CODEC)
val sink = new BucketingSink[org.apache.flink.api.java.tuple.Tuple2[String, Customer]]("s3://test/flink")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd/HH/mm/"))
sink.setInactiveBucketThreshold(120000) // this is 2 minutes
sink.setBatchSize(1024 * 1024 * 64) // this is 64 MB,
sink.setPendingSuffix(".avro")
val writer = new AvroKeyValueSinkWriter[String, Customer](properties)
sink.setWriter(writer.duplicate())
However, it throws the following errors:
Caused by: org.apache.avro.AvroTypeException: Not an enum: MOBILE
at org.apache.avro.generic.GenericDatumWriter.writeEnum(GenericDatumWriter.java:177)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:119)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
... 10 more
Please suggest!
UPDATE 1:
I found this is kind of bug in avro 1.8+ based on this ticket: https://issues-test.apache.org/jira/browse/AVRO-1810

It turns out this is an issue with Avro 1.8+, I have to override the version flink uses dependencyOverrides += "org.apache.avro" % "avro" % "1.7.3", the bug can be found here https://issues-test.apache.org/jira/browse/AVRO-1810

Related

How To parse Json in Flink Kafka + Scala

Hi I am new developer to Apache Flink. I am trying to build a Data-streaming Application in Apache Flink integrated with Kafka.
I already an Have a Topic named source-json created in Kafka containing a json which i am required to parse using Apache Flink(Scala).
I have already written the code to extract the Json in Apacheflink facing issue in parsing the json in apache Flink using the deseralizer for spark i apparently i could not find much support for apache flink spark in docs.
json file i am Trying to parse
{
"timestamp": "2020-01-09",
"integrations": {},
"context": {
"page": {
"path": "/",
"title": "shop",
"url": "www.example.com"
},
"library": {
"version": "3.10.1"
}
},
"properties": {
"url": "https://www.example.com/",
"page_type": "Home",
},
"Time": "2020-01-09",
"_sorting": {
"SortAs": ["SGML"],
"Acronym": ["SGML"]
}
}
I was able to write the code for print the json from kafka using flink.
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = KafkaSource.builder[String].
setBootstrapServers("localhost:9092")
.setTopics("source-json")
.setGroupId("my-group")
.setStartingOffsets(OffsetsInitializer.earliest)
.setValueOnlyDeserializer(new SimpleStringSchema())
.build
val streamEnv : DataStream[String] = env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")
streamEnv.print()
env.execute("running stream")
}
Need help in parsing the json storing the filtered json in a different topic in kafka named sink-json in apache flink.

How to get Quantile/median values in pydruid

My goal is to query the median value of column height in my druid datasource. I was able to use other aggregations like count and count distinct values. Here's my query so far:
group = query.groupby(
datasource=datasource,
granularity='all',
intervals='2020-01-01T00:00:00+00:00/2101-01-01T00:00:00+00:00',
dimensions=[
"category_a"
],
filter=(Dimension("country") == country_id),
aggregations={
'count': longsum('count'),
'count_distinct_city': aggregators.thetasketch('city'),
}
)
There's a class Quantile under postaggregator.py so I tried using this.
class Quantile(Postaggregator):
def __init__(self, name, probability):
Postaggregator.__init__(self, None, None, name)
self.post_aggregator = {
"type": "quantile",
"fieldName": name,
"probability": probability,
}
Here's my attempt at getting the median:
post_aggregations={
'median_value': postaggregator.Quantile(
'height', 50
)
}
The error I'm getting here is 'Could not resolve type id \'quantile\' as a subtype of [simple type, class io.druid.query.aggregation.PostAggregator]:
Druid Error: {'error': 'Unknown exception', 'errorMessage': 'Could not resolve type id \'quantile\' as a subtype of [simple type, class io.druid.query.aggregation.PostAggregator]: known type ids = [arithmetic, constant, doubleGreatest, doubleLeast, expression, fieldAccess, finalizingFieldAccess, hyperUniqueCardinality, javascript, longGreatest, longLeast, quantilesDoublesSketchToHistogram, quantilesDoublesSketchToQuantile, quantilesDoublesSketchToQuantiles, quantilesDoublesSketchToString, sketchEstimate, sketchSetOper, thetaSketchEstimate, thetaSketchSetOp] (for POJO property \'postAggregations\')\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 856] (through reference chain: io.druid.query.groupby.GroupByQuery["postAggregations"]->java.util.ArrayList[0])', 'errorClass': 'com.fasterxml.jackson.databind.exc.InvalidTypeIdException', 'host': None}
I modified the code of pydruid to get this working on our end. I've created new aggregator and postaggregator under /pydruid/utils.
aggregator.py
def quantilesDoublesSketch(raw_column, k=128):
return {"type": "quantilesDoublesSketch", "fieldName": raw_column, "k": k}
postaggregator.py
class QuantilesDoublesSketchToQuantile(Postaggregator):
def __init__(self, name: str, field_name: str, fraction: float):
self.post_aggregator = {
"type": "quantilesDoublesSketchToQuantile",
"name": name,
"fraction": fraction,
"field": {
"fieldName": field_name,
"name": field_name,
"type": "fieldAccess",
},
}
My first time to create a PR! Hopefully they accept and publish officially.
https://github.com/druid-io/pydruid/pull/287

'TypeError: StructType can not accept object

I'm trying to convert this json string data to Dataframe in Databricks
a = """{ "id": "a",
"message_type": "b",
"data": [ {"c":"abcd","timestamp":"2022-03-
01T13:10:00+00:00","e":0.18,"f":0.52} ]}"""
the schema I defined for the data is this
schema=StructType(
[
StructField("id",StringType(),False),
StructField("message_type",StringType(),False),
StructField("data", ArrayType(StructType([
StructField("c",StringType(),False),
StructField("timestamp",StringType(),False),
StructField("e",DoubleType(),False),
StructField("f",DoubleType(),False),
])))
,
]
)
and when I run this command
df = sqlContext.createDataFrame(sc.parallelize([a]), schema)
I get this error
PythonException: 'TypeError: StructType can not accept object '{ "id": "a",\n"message_type": "JobMetric",\n"data": [ {"c":"abcd","timestamp":"2022-03- \n01T13:10:00+00:00","e":0.18,"f":0.52=} ]' in type <class 'str'>'. Full traceback below:
anyone could help me with this, would much appreciate it!
Your a variable is wrong.
"data": [ "{"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-
01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439}" ]
Should be
"data": [ {"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-
01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439} ]
And check if it is OK to match name with JobId to job_id and Timestamp to timestamp.
Issue is whenever you're passing the string object to struct schema it expects RDD([StringType, StringType,...]) however, in your current scenario it is getting just string object. In order to fix it first you need to convert your string to a json object and from there you'll need to create a RDD. See the below logic for details -
Input Data -
a = """{"run_id": "1640c68e-5f02-4f49-943d-37a102f90146",
"message_type": "JobMetric",
"data": [ {"JobId":"ATLUPS10m2101V1","timestamp":"2022-03-01T13:10:00+00:00",
"score":0.9098145961761475,
"severity":0.5294908881187439
}
]
}"""
Converting to a RDD using json object -
from pyspark.sql.types import *
import json
schema=StructType(
[
StructField("run_id",StringType(),False),
StructField("message_type",StringType(),False),
StructField("data", ArrayType(StructType([
StructField("JobId",StringType(),False),
StructField("timestamp",StringType(),False),
StructField("score",DoubleType(),False),
StructField("severity",DoubleType(),False),
])))
,
]
)
df = spark.createDataFrame(data=sc.parallelize([json.loads(a)]),schema=schema)
df.show(truncate=False)
Output -
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|run_id |message_type|data |
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|1640c68e-5f02-4f49-943d-37a102f90146|JobMetric |[{ATLUPS10m2101V1, 2022-03-01T13:10:00+00:00, 0.9098145961761475, 0.5294908881187439}]|
+------------------------------------+------------+--------------------------------------------------------------------------------------+

Handling empty/invalid Mqtt Messages with Kafka Connect

I am trying to ingest data from Mqtt into Kafka. Unfortunately, some of those Mqtt-Messages are either empty or invalid JSON. I assume that is what leads to the following exception:
{
"name": "source_mqtt_alarms",
"connector": {
"state": "RUNNING",
"worker_id": "-redacted-:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "-redacted-:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException:
Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:196)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:122)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.convertTransformedRecord(WorkerSourceTask.java:314)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:340)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:264)\n\t
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)\n\t
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:235)\n\t
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\t
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\t
at java.base/java.lang.Thread.run(Thread.java:834)\n
Caused by: org.apache.kafka.connect.errors.DataException: Conversion error: null value for field that is required and has no default value\n\t
at org.apache.kafka.connect.json.JsonConverter.convertToJson(JsonConverter.java:611)\n\t
at org.apache.kafka.connect.json.JsonConverter.convertToJsonWithEnvelope(JsonConverter.java:592)\n\t
at org.apache.kafka.connect.json.JsonConverter.fromConnectData(JsonConverter.java:346)\n\t
at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.lambda$convertTransformedRecord$2(WorkerSourceTask.java:314)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:146)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:180)\n\t
... 11 more\n"
}
],
"type": "source"
}
From what I've learned so far, it looks like the incoming (empty/invalid) messages do not contain values that are declared as non-optional, which leads to the exception above.
My question would be, where is the connector taking that expectation from? It says "null value for field that is required and has no default value", but how is that field required if the schema is (I assume) created per message?
Additional information:
I am using the Lenses.io Stream Reactor Mqtt Source Connector. The configuration is as follows:
{
"name": "source_mqtt_alarms",
"config": {
"topics": "alarms",
"connect.mqtt.kcql": "INSERT INTO alarms SELECT * FROM `-redacted-/+/alarms` WITHCONVERTER=`com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter`",
"connect.mqtt.client.id": "kafka_connect_alarms",
"tasks.max": 1,
"connector.class": "com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector",
"connect.mqtt.service.quality": 2,
"connect.mqtt.hosts": "ssl://-redacted-:8883",
"connect.mqtt.ssl.ca.cert": "/usr/share/certs/cumu.crt",
"connect.mqtt.ssl.cert": "/usr/share/certs/mqtt.crt",
"connect.mqtt.ssl.key": "/usr/share/certs/mqtt.pem",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true,
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": true,
}
}
Edit: I just went through the logs of the Kafka Connect worker and it's giving a bit more information. Prior to the exception above, I get a lost of these:
[2021-05-26 08:27:19,552] ERROR Error handling message with id:0 on topic:-redacted-/alarms (com.datamountaineer.streamreactor.connect.mqtt.source.MqttManager)
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:430)
at scala.collection.immutable.Nil$.head(List.scala:427)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter$.convert(JsonSimpleConverter.scala:76)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter$.convert(JsonSimpleConverter.scala:70)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter.convert(JsonSimpleConverter.scala:37)
at com.datamountaineer.streamreactor.connect.mqtt.source.MqttManager.messageArrived(MqttManager.scala:110)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.deliverMessage(CommsCallback.java:514)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.handleMessage(CommsCallback.java:417)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.run(CommsCallback.java:214)
at java.base/java.lang.Thread.run(Thread.java:834)

How to properly use context variables in OrientDB ETL configuration file?

Summary
Trying to learn about OrientDB ETL configuration json file.
Assuming a CSV file where:
each row is a single vertex
a 'class' column gives the intended class of the vertex
there are multiple classes for the vertices (Foo, Bar, Baz)
How do I set the class of the vertex to be the value of the 'class' column?
Efforts to Troubleshoot
I have spent a LOT of time in the OrientDB ETL documentation trying to solve this. I have tried many different combinations of let and block and code components. I have tried variable names like className and $className and ${classname}.
Current Results:
The code component is able to correctly print the value of `className', so I know that it is being set correctly.
The vertex component isn't referencing the variable correctly, and consequently sets the class of each vertex to null.
Context
I have a freshly created database (PLOCAL GRAPH) on localhost called 'deleteme'.
I have an vertex CSV file (nodes.csv) that looks like this:
id,name,class
1,Jack,Foo
2,Jill,Bar
3,Gephri,Baz
And an ETL configuration file (test.json) that looks like this:
{
"config": {
"log": "DEBUG"
},
"source": {"file": {"path": "nodes.csv"}},
"extractor": {"csv": {}},
"transformers": [
{"block": {"let": {"name": "$className",
"value": "$input.class"}}},
{"code": {"language": "Javascript",
"code": "print(className + '\\n'); input;"}},
{"vertex": {"class": "$className"}}
],
"loader": {
"orientdb": {
"dbURL": "remote:localhost:2424/deleteme",
"dbUser": "admin",
"dbPassword": "admin",
"dbType": "graph",
"tx": false,
"wal": false,
"batchCommit": 1000,
"classes": [
{"name": "Foo", "extends": "V"},
{"name": "Bar", "extends": "V"},
{"name": "Baz", "extends": "V"}
]
}
}
}
And when I run the ETL job, I have output that looks like this:
aj#host:~/bin/orientdb-community-2.1.13/bin$ ./oetl.sh test.json
OrientDB etl v.2.1.13 (build 2.1.x#r9bc1a54a4a62c4de555fc5360357f446f8d2bc84; 2016-03-14 17:00:05+0000) www.orientdb.com
BEGIN ETL PROCESSOR
[file] INFO Reading from file nodes.csv with encoding UTF-8
[orientdb] DEBUG - OrientDBLoader: created vertex class 'Foo' extends 'V'
[orientdb] DEBUG orientdb: found 0 vertices in class 'null'
+ extracted 0 rows (0 rows/sec) - 0 rows -> loaded 0 vertices (0 vertices/sec) Total time: 1001ms [0 warnings, 0 errors]
[orientdb] DEBUG - OrientDBLoader: created vertex class 'Bar' extends 'V'
[orientdb] DEBUG orientdb: found 0 vertices in class 'null'
[orientdb] DEBUG - OrientDBLoader: created vertex class 'Baz' extends 'V'
[orientdb] DEBUG orientdb: found 0 vertices in class 'null'
[csv] DEBUG document={id:1,class:Foo,name:Jack}
[1:block] DEBUG Transformer input: {id:1,class:Foo,name:Jack}
[1:block] DEBUG Transformer output: {id:1,class:Foo,name:Jack}
[1:code] DEBUG Transformer input: {id:1,class:Foo,name:Jack}
Foo
[1:code] DEBUG executed code=OCommandExecutorScript [text=print(className); input;], result={id:1,class:Foo,name:Jack}
[1:code] DEBUG Transformer output: {id:1,class:Foo,name:Jack}
[1:vertex] DEBUG Transformer input: {id:1,class:Foo,name:Jack}
[1:vertex] DEBUG Transformer output: v(null)[#3:0]
[csv] DEBUG document={id:2,class:Bar,name:Jill}
[2:block] DEBUG Transformer input: {id:2,class:Bar,name:Jill}
[2:block] DEBUG Transformer output: {id:2,class:Bar,name:Jill}
[2:code] DEBUG Transformer input: {id:2,class:Bar,name:Jill}
Bar
[2:code] DEBUG executed code=OCommandExecutorScript [text=print(className); input;], result={id:2,class:Bar,name:Jill}
[2:code] DEBUG Transformer output: {id:2,class:Bar,name:Jill}
[2:vertex] DEBUG Transformer input: {id:2,class:Bar,name:Jill}
[2:vertex] DEBUG Transformer output: v(null)[#3:1]
[csv] DEBUG document={id:3,class:Baz,name:Gephri}
[3:block] DEBUG Transformer input: {id:3,class:Baz,name:Gephri}
[3:block] DEBUG Transformer output: {id:3,class:Baz,name:Gephri}
[3:code] DEBUG Transformer input: {id:3,class:Baz,name:Gephri}
Baz
[3:code] DEBUG executed code=OCommandExecutorScript [text=print(className); input;], result={id:3,class:Baz,name:Gephri}
[3:code] DEBUG Transformer output: {id:3,class:Baz,name:Gephri}
[3:vertex] DEBUG Transformer input: {id:3,class:Baz,name:Gephri}
[3:vertex] DEBUG Transformer output: v(null)[#3:2]
END ETL PROCESSOR
+ extracted 3 rows (4 rows/sec) - 3 rows -> loaded 3 vertices (4 vertices/sec) Total time: 1684ms [0 warnings, 0 errors]
Oh, and what does DEBUG orientdb: found 0 vertices in class 'null' mean?
Try this. I wrestled with this for awhile too, but the below setup worked for me.
Note that setting #class before the vertex transformer will initialize a Vertex with the proper class.
"transformers": [
{"block": {"let": {"name": "$className",
"value": "$input.class"}}},
{"code": {"language": "Javascript",
"code": "print(className + '\\n'); input;"}},
{ "field": {
"fieldName": "#class",
"expression": "$className"
}
},
{"vertex": {}}
]
To get your result, you could use "ETL" to import data from csv into a CLASS named "Generic".
Through an JS function, "separateClass ()", create new classes taking the name from the property 'Class' imported from csv, and put vertices from class Generic to new classes.
File json:
{
"source": { "file": {"path": "data.csv"}},
"extractor": { "row": {}},
"begin": [
{ "let": { "name": "$className", "value": "Generic"} }
],
"transformers": [
{"csv": {
"separator": ",",
"nullValue": "NULL",
"columnsOnFirstLine": true,
"columns": [
"id:Integer",
"name:String",
"class:String"
]
}
},
{"vertex": {"class": "$className", "skipDuplicates": true}}
],
"loader": {
"orientdb": {
"dbURL": "remote:localhost/test",
"dbType": "graph"
}
}
}
After importing the data from etl, in javascript creates the function
var g = orient.getGraphNoTx();
var queryResult= g.command("sql", "SELECT FROM Generic");
//example filed vertex: ID, NAME, CLASS
if (!queryResult.length) {
print("Empty");
} else {
//for each value create or insert in class
for (var i = 0; i < queryResult.length; i++) {
var className = queryResult[i].getProperty("class").toString();
//chech is className is already created
var countClass = g.command("sql","select from V where #class = '"+className+"'");
if (!countClass.length) {
g.command("sql","CREATE CLASS "+className+" extends V");
g.command("sql"," CREATE PROPERTY "+className+".id INTEGER");
g.command("sql"," CREATE PROPERTY "+className+".name STRING");
g.commit();
}
var id = queryResult[i].getProperty("id").toString();
var name = queryResult[i].getProperty("name").toString();
g.command("sql","INSERT INTO "+className+ " (id, name) VALUES ("+id+",'"+name+"')");
g.commit();
}
//remove class generic
g.command("sql","truncate class Generic unsafe");
}
the result should be like the one shown in the picture.