Parsing dates in the format dd.MM.yyyy in Kafka Connect using the kafka-connect-spooldir connector

I am trying to use the SpoolDirCsvSourceConnector from https://github.com/jcustenborder/kafka-connect-spooldir
I have the following configuration for the connector in Kafka:
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
csv.first.row.as.header=true
finished.path=/csv/finished
tasks.max=1
parser.timestamp.date.formats=[dd.MM.yyyy, yyyy-MM-dd'T'HH:mm:ss, yyyy-MM-dd' 'HH:mm:ss]
key.schema={"name":"com.github.jcustenborder.kafka.connect.model.Key","type":"STRUCT","isOptional":false,"fieldSchemas":{}}
csv.separator.char=59
input.file.pattern=umsaetze_.*.csv
topic=test-csv
error.path=/csv/error
input.path=/csv/input
value.schema={"name":"com.github.jcustenborder.kafka.connect.model.Value","type":"STRUCT","isOptional":false,"fieldSchemas":{"Buchungstag":{"name":"org.apache.kafka.connect.data.Timestamp","type":"INT64","version":1,"isOptional":true},"Wertstellung":{"name":"org.apache.kafka.connect.data.Timestamp","type":"INT64","version":1,"isOptional":true},"Vorgang":{"type":"STRING","isOptional":false},"Buchungstext":{"type":"STRING","isOptional":false},"Umsatz":{"name":"org.apache.kafka.connect.data.Decimal","type":"BYTES","version":1,"parameters":{"scale":"2"},"isOptional":true}}}
The value schema is the following:
{
"name": "com.github.jcustenborder.kafka.connect.model.Value",
"type": "STRUCT",
"isOptional": false,
"fieldSchemas": {
"Buchungstag": {
"name": "org.apache.kafka.connect.data.Date",
"type": "INT32",
"version": 1,
"isOptional": true
},
"Wertstellung": {
"name": "org.apache.kafka.connect.data.Timestamp",
"type": "INT64",
"version": 1,
"isOptional": true
},
"Vorgang": {
"type": "STRING",
"isOptional": false
},
"Buchungstext": {
"type": "STRING",
"isOptional": false
},
"Umsatz": {
"name": "org.apache.kafka.connect.data.Decimal",
"type": "BYTES",
"version": 1,
"parameters": {
"scale": "2"
},
"isOptional": true
}
}
}
I have tried Date instead of Timestamp:
{
"name" : "org.apache.kafka.connect.data.Date",
"type" : "INT32",
"version" : 1,
"isOptional" : true
}
Neither Timestamp nor Date works for me; both fail with the same exception shown below for the fields Buchungstag and Wertstellung. I tried to solve it with the option parser.timestamp.date.formats, but it doesn't help.
Here is an example of the CSV I am trying to import into Kafka:
Buchungstag;Wertstellung;Vorgang;Buchungstext;Umsatz;
08.02.2019;08.02.2019;Lastschrift / Belastung;Auftraggeber: BlablaBuchungstext: Fahrschein XXXXXX Ref. U3436346/8423;-55,60;
08.02.2019;08.02.2019;Lastschrift / Belastung;Auftraggeber: Bank AGBuchungstext: 01.02.209:189,34 Ref. ZMKDVSDVS/5620;-189,34;
I am getting the following exception in Kafka Connect:
org.apache.kafka.connect.errors.ConnectException: org.apache.kafka.connect.errors.DataException: Exception thrown while parsing data for 'Buchungstag'. linenumber=2
at com.github.jcustenborder.kafka.connect.spooldir.AbstractSourceTask.read(AbstractSourceTask.java:277)
at com.github.jcustenborder.kafka.connect.spooldir.AbstractSourceTask.poll(AbstractSourceTask.java:144)
... 10 more
Caused by: org.apache.kafka.connect.errors.DataException: Could not parse '08.02.2019' to 'Date'
at com.github.jcustenborder.kafka.connect.utils.data.Parser.parseString(Parser.java:113)
... 11 more
Caused by: java.lang.IllegalStateException: Could not parse '08.02.2019' to java.util.Date
at com.google.common.base.Preconditions.checkState(Preconditions.java:588)
... 12 more
Do you have any idea what the value schema should be to parse dates like 01.01.2001?

I think the issue is with your parser.timestamp.date.formats value. You pass [dd.MM.yyyy, yyyy-MM-dd'T'HH:mm:ss, yyyy-MM-dd' 'HH:mm:ss].
In the configuration, that property (parser.timestamp.date.formats) is of type List. A List should be passed as a string with a comma (,) as the delimiter.
In your case it should be: dd.MM.yyyy,yyyy-MM-dd'T'HH:mm:ss,yyyy-MM-dd' 'HH:mm:ss. The whitespace might also be a problem, because of how the values are trimmed.
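Assuming everything else stays the same, the relevant line of the connector configuration would then look roughly like this:
parser.timestamp.date.formats=dd.MM.yyyy,yyyy-MM-dd'T'HH:mm:ss,yyyy-MM-dd' 'HH:mm:ss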

Related

What column type do I need for this nested data in BigQuery?

I have a JSON schema for a Kafka stream that I am integrating with BigQuery but I can't get the data type correct at the BigQuery end. This is the schema:
"my_meta_data": {
"type": "object",
"properties": {
"property_1": {
"type": "array",
"items": {
"type": "number"
}
},
"property_2": {
"type": "array",
"items": {
"type": "number"
}
},
"property_3": {
"type": "array",
"items": {
"type": "number"
}
}
}
}
I tried this in the JSON file defining the BigQuery table:
{
"name": "my_meta_data",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "property_1",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_2",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_3",
"type": "INT64",
"mode": "REPEATED"
}
]
}
I am using a hosted connector from Confluent, the Kafka provider, and the error message is:
The connector is failing because it cannot write a non-array element to an array column. Please check the schemas of the data in Kafka and the BigQuery tables the connector is writing to, and ensure that all data from Kafka that will be written to an array column in BigQuery is contained in an array.
I haven't defined an array column, though; I've defined a RECORD column that contains arrays. Any idea how I can set up the BigQuery table to capture this data? Thanks in advance.
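For reference, since my_meta_data in the JSON schema above is a single object (not an array of objects) whose properties are arrays of numbers, a table definition along these lines would match the data more closely; this is only a sketch, not a confirmed fix from this thread:
{
  "name": "my_meta_data",
  "type": "RECORD",
  "mode": "NULLABLE",
  "fields": [
    { "name": "property_1", "type": "FLOAT", "mode": "REPEATED" },
    { "name": "property_2", "type": "FLOAT", "mode": "REPEATED" },
    { "name": "property_3", "type": "FLOAT", "mode": "REPEATED" }
  ]
}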

Avro: org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT

I am working on an Avro schema and trying to create some test data to test it with Kafka, but when I produce the message I get this error: "Caused by: org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT"
The schema I created is like this:
{
"name": "MyClass",
"type": "record",
"namespace": "com.acme.avro",
"doc":"This schema is for streaming information",
"fields":[
{"name":"batchId", "type": "long"},
{"name":"status", "type": {"type": "enum", "name": "PlannedTripRequestedStatus", "namespace":"com.acme.avro.Dtos", "symbols":["COMPLETED", "FAILED"]}},
{"name":"runRefId", "type": "int"},
{"name":"tripId", "type": ["null", "int"]},
{"name": "referenceNumber", "type": ["null", "string"]},
{"name":"errorMessage", "type": ["null", "string"]}
]
}
The test data is like this:
{
"batchId": {
"long": 3
},
"status": "COMPLETED",
"runRefId": {
"int": 1000
},
"tripId": {
"int": 200
},
"referenceNumber": {
"string": "ReferenceNumber1111"
},
"errorMessage": {
"string": "Hello World"
}
}
However, when I registered this schema and tried to produce a message with the Confluent console tool, I got the error org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT. The whole error message is like this:
org.apache.kafka.common.errors.SerializationException: Error deserializing {"batchId": ...} to Avro of schema {"type":...}" at io.confluent.kafka.formatter.AvroMessageReader.readFrom(AvroMessageReader.java:134)
at io.confluent.kafka.formatter.SchemaMessageReader.readMessage(SchemaMessageReader.java:325)
at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:51)
at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
Caused by: org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT
at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:511)
at org.apache.avro.io.JsonDecoder.readLong(JsonDecoder.java:177)
at org.apache.avro.io.ResolvingDecoder.readLong(ResolvingDecoder.java:169)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:197)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at io.confluent.kafka.schemaregistry.avro.AvroSchemaUtils.toObject(AvroSchemaUtils.java:213)
at io.confluent.kafka.formatter.AvroMessageReader.readFrom(AvroMessageReader.java:124)
Does anyone know what I did wrong with my schema or test data? Thank you so much!
You only need the wrapping type object if the type is ambiguous (a union of string and number, for example) or if it is nullable.
For batchId and runRefId, just use simple values:
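With that change, the test data from the question would look like the following (the wrappers stay only on the nullable union fields, because the Avro JSON encoding still needs a branch name for unions):
{
  "batchId": 3,
  "status": "COMPLETED",
  "runRefId": 1000,
  "tripId": { "int": 200 },
  "referenceNumber": { "string": "ReferenceNumber1111" },
  "errorMessage": { "string": "Hello World" }
}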

Avro union field with "local-timestamp-millis" deserialization issue

Avro: 10.1
Dataflow (Apache Beam): 2.28.0
Runner: org.apache.beam.runners.dataflow.DataflowRunner
Avro schema piece:
{
"name": "client_timestamp",
"type": [
"null",
{ "type": "long", "logicalType": "local-timestamp-millis" }
],
"default": null,
"doc": "Client side timestamp of this xxx"
},
The exception when writing the Avro output file:
Caused by: org.apache.avro.UnresolvedUnionException:
Not in union ["null",{"type":"long","logicalType":"local-timestamp-millis"}]:
2021-03-12T12:21:17.599
Link to a longer stacktrace
Some of the steps taken:
Replacing "logicalType":"local-timestamp-millis" with "logicalType":"timestamp-millis" causes the same error.
Writing Avro locally also works.
Removing "type": "null" option eliminates the exception
Try something like the following:
{
"name": "client_timestamp",
"type": ["null", "long"],
"doc": "Client side timestamp of this xxx",
"default": null,
"logicalType": "timestamp-millis"
}
This is one way to define a logicalType in an Avro schema (link).

How to use Schema registry for Kafka Connect AVRO

I have started exploring Kafka and Kafka Connect recently and did some initial setup.
But I wanted to explore the Schema Registry part more.
My Schema Registry is started; now what should I do?
I have an Avro schema stored in avro_schema.avsc.
Here is the schema:
{
"name": "FSP-AUDIT-EVENT",
"type": "record",
"namespace": "com.acme.avro",
"fields": [
{
"name": "ID",
"type": "string"
},
{
"name": "VERSION",
"type": "int"
},
{
"name": "ACTION_TYPE",
"type": "string"
},
{
"name": "EVENT_TYPE",
"type": "string"
},
{
"name": "CLIENT_ID",
"type": "string"
},
{
"name": "DETAILS",
"type": "string"
},
{
"name": "OBJECT_TYPE",
"type": "string"
},
{
"name": "UTC_DATE_TIME",
"type": "long"
},
{
"name": "POINT_IN_TIME_PRECISION",
"type": "string"
},
{
"name": "TIME_ZONE",
"type": "string"
},
{
"name": "TIMELINE_PRECISION",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_UTC_DT",
"type": [
"string",
"null"
]
},
{
"name": "AUDIT_EVENT_TO_DATE_PITP",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_DATE_TZ",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_DATE_TP",
"type": "string"
},
{
"name": "GROUP_ID",
"type": "string"
},
{
"name": "OBJECT_DISPLAY_NAME",
"type": "string"
},
{
"name": "OBJECT_ID",
"type": [
"string",
"null"
]
},
{
"name": "USER_DISPLAY_NAME",
"type": [
"string",
"null"
]
},
{
"name": "USER_ID",
"type": "string"
},
{
"name": "PARENT_EVENT_ID",
"type": [
"string",
"null"
]
},
{
"name": "NOTES",
"type": [
"string",
"null"
]
},
{
"name": "SUMMARY",
"type": [
"string",
"null"
]
}
]
}
Is my schema valid? I converted it online from JSON.
Where should I keep this schema file? I am not sure about the location.
Please guide me with the steps to follow.
I am sending records from both a Lambda function and a JDBC source.
So basically, how can I enforce the Avro schema and test it?
Do I have to change anything in the avro-consumer properties file?
Or is this the correct way to register a schema?
./bin/kafka-avro-console-producer \
--broker-list b-3.**:9092,b-**:9092,b-**:9092 --topic AVRO-AUDIT_EVENT \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema" : "{\"type\":\"struct\",\"fields\":[{\"type\":\"string\",\"optional\":false,\"field\":\"ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"VERSION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"ACTION_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"EVENT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"CLIENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"DETAILS\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"UTC_DATE_TIME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"POINT_IN_TIME_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIME_ZONE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIMELINE_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"GROUP_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"PARENT_EVENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"NOTES\"},{\"type\":\"string\",\"optional\":true,\"field\":\"SUMMARY\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_UTC_DT\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_PITP\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TZ\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TP\"}],\"optional\":false,\"name\":\"test\"}"}' http://localhost:8081/subjects/view/versions
What do I have to do next?
But when I try to see my schema, I only get the following:
curl --silent -X GET http://localhost:8081/subjects/AVRO-AUDIT-EVENT/versions/latest
This is the result:
{"subject":"AVRO-AUDIT-EVENT","version":1,"id":161,"schema":"{\"type\":\"string\",\"optional\":false}"}
Why do I not see my full registered schema?
Also, when I try to delete the schema, I get the following error:
{"error_code":405,"message":"HTTP 405 Method Not Allowed"
I am not sure if my schema is registered correctly.
Please help me.
Thanks in advance.
Is my schema valid?
You can use the REST API of the Registry to try and submit it and see...
Where should I keep this schema file? I am not sure about the location.
It's not clear how you're sending messages...
If you actually wrote Kafka producer code, you store it within your code (as a string) or as a resource file. If using Java, you can instead use the SchemaBuilder class to create the Schema object.
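As a small illustration of the SchemaBuilder route, a sketch like the following builds a record schema in code (the record name FspAuditEvent and the handful of fields are only illustrative; note that Avro record names may not contain hyphens, so the name FSP-AUDIT-EVENT itself would be rejected by the parser):
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AuditEventSchemaSketch {
    public static void main(String[] args) {
        // Build the record schema programmatically instead of reading avro_schema.avsc
        Schema schema = SchemaBuilder.record("FspAuditEvent")
                .namespace("com.acme.avro")
                .fields()
                .requiredString("ID")
                .requiredInt("VERSION")
                .requiredLong("UTC_DATE_TIME")
                .optionalString("NOTES") // union of null and string, default null
                .endRecord();
        System.out.println(schema.toString(true)); // pretty-printed JSON form of the schema
    }
}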
You need to rewrite your producer to use the Avro schema and serializers, if you have not already.
If we create an Avro schema, will it work for JSON as well?
Avro is a binary format, but there is a JsonDecoder for it.
What should the URL be in our Avro schema properties file?
It needs to be the IP of your Schema Registry, once you figure out how to start it (with schema-registry-start).
Do I have to change anything in the avro-consumer properties file?
You need to use the Avro deserializer.
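For example, a minimal consumer configuration using the Confluent Avro deserializer might contain lines like these (the registry URL is just a placeholder):
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://localhost:8081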
Is this the correct way to register a schema?
.> /bin/kafka-avro-console-producer \
Not quite. That's how you produce a message with a schema (and you need to use the correct schema). You also must provide --property schema.registry.url.
You use the REST API of the Registry to register and verify schemas.
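As a concrete illustration of that last point, registering and then checking a schema against the Registry's REST API could look roughly like this (the subject name AVRO-AUDIT_EVENT-value and the trimmed one-field schema are only for illustration):
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"FspAuditEvent\",\"namespace\":\"com.acme.avro\",\"fields\":[{\"name\":\"ID\",\"type\":\"string\"}]}"}' \
  http://localhost:8081/subjects/AVRO-AUDIT_EVENT-value/versions
curl --silent http://localhost:8081/subjects/AVRO-AUDIT_EVENT-value/versions/latest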

Ingesting a multi-valued dimension from a comma-separated string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
This, however, leads to campaigns being ingested as a plain string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your ingestion spec. When this flag is set to true (which is the default), all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level are interpreted as columns.
Here is a good example and reference link for using the flattenSpec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
It looks like Druid expressions have supported typed constructors for creating arrays since Druid 0.17.0, so using the string_to_array expression should do the trick!
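A rough sketch of how that could look, placed alongside the parser in the dataSchema of the spec above (assuming the values are separated by a comma followed by a space, as in the sample event; the transform simply shadows the flattened campaigns field with its array form):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "campaigns",
      "expression": "string_to_array(\"campaigns\", ', ')"
    }
  ]
}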