MSK S3 sink not working with Field Partitioner - apache-kafka

I am using AWS MSK and MSK Connect. The S3 sink connector stops working properly when I add io.confluent.connect.storage.partitioner.FieldPartitioner. Note: without the FieldPartitioner the S3 sink worked. Other than this Stack Overflow question I was not able to find any resource.
Error
ERROR [FieldPart-sink|task-0] Value is not Struct type. (io.confluent.connect.storage.partitioner.FieldPartitioner:81)
Caused by: io.confluent.connect.storage.errors.PartitionException: Error encoding partition.
ERROR [Sink-FieldPartition|task-0] WorkerSinkTask{id=Sink-FieldPartition-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Error encoding partition. (org.apache.kafka.connect.runtime.WorkerSinkTask:612)
MSK Connect Config
connector.class=io.confluent.connect.s3.S3SinkConnector
format.class=io.confluent.connect.s3.format.avro.AvroFormat
flush.size=1
schema.compatibility=BACKWARD
tasks.max=2
topics=MSKTutorialTopic
storage.class=io.confluent.connect.s3.storage.S3Storage
topics.dir=mskTrials
s3.bucket.name=clickstream
s3.region=us-east-1
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=name
value.converter.schemaAutoRegistrationEnabled=true
value.converter.registry.name=datalake-schema-registry
value.convertor.schemaName=MSKTutorialTopic-value
value.converter.avroRecordType=GENERIC_RECORD
value.converter.region=us-east-1
value.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
Data schema, which is stored in the Glue schema registry:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "UserData",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_number",
      "type": ["int", "null"]
    },
    {
      "name": "favourite_color",
      "type": ["string", "null"]
    }
  ]
}

In order to partition by fields, your records need actual fields.
StringConverter does not parse the data it consumes, so the sink only ever sees plain strings rather than a Struct with named fields, which is exactly what the "Value is not Struct type" error is saying. Use an Avro converter if you have Avro data in the topic. Also, Avro always has a schema, so remove the schemas.enable configuration.
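Since the question's converter settings already reference the AWS Glue Schema Registry (registry.name, avroRecordType, region), the Avro data is presumably registered there, so the Glue Schema Registry Avro converter is the likely replacement for StringConverter. A minimal sketch of just the converter lines, assuming com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter (from the aws-glue-schema-registry project) is available on the worker's plugin path:

# Assumes the aws-glue-schema-registry converter jar is on the connector's plugin path
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter.region=us-east-1
value.converter.registry.name=datalake-schema-registry
value.converter.avroRecordType=GENERIC_RECORD
value.converter.schemaAutoRegistrationEnabled=true
# Keys are plain strings, so StringConverter can stay for the key
key.converter=org.apache.kafka.connect.storage.StringConverter
# Note: no value.converter.schemas.enable line; Avro always carries a schema

With a value converter that produces a Struct, the FieldPartitioner can then resolve partition.field.name=name.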

Related

Kafka connect with ibm-messaging connector not ingesting data into postgres

I am trying to ingest data from a Kafka topic to Postgres. I have successfully connected Kafka and Postgres through the ibm-messaging/kafka-connect-jdbc-sink connector. Below is my jdbc-sink-connector file:
name=jdbc-sink-connector1
connector.class=com.ibm.eventstreams.connect.jdbcsink.JDBCSinkConnector
tasks.max=1
# Below is the postgresql driver
driver.class=org.postgresql.Driver
# The topics to consume from - required for sink connectors
topics=connector7
# Configuration specific to the JDBC sink connector.
# We want to connect to a PostgreSQL database and auto-create tables.
connection.url=jdbc:postgresql://localhost:5432/abc
connection.user=xyz
connection.password=***
connection.ds.pool.size=5
connection.table=company
insert.mode.databaselevel=true
put.mode=insert
table.name.format=company
auto.create=true
Now in the Kafka producer I am sending data as follows:
{"schema":
{"type": "struct","fields": [
{ "field": "id", "type": "integer", "optional": false },
{ "field": "time", "type": "timestamp", "optional": false },
{ "field": "name", "type": "string", "optional": false },
{ "field": "company", "type": "string", "optional": false }]},
"payload":
{"id":"1",
"time":"2022-02-08 16:18:00-05",
"name":"ned",
"company":"advice"}}
It is giving the following error:
ERROR WorkerSinkTask{id=jdbc-sink-connector1-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: 1 (org.apache.kafka.connect.runtime.WorkerSinkTask:607)
java.lang.ArrayIndexOutOfBoundsException: 1
ERROR WorkerSinkTask{id=jdbc-sink-connector1-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:184)
Now, according to my understanding, I think there is some problem with the JSON I am sending to the producer, but I cannot seem to figure out what it is.
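For reference, the schema-and-payload envelope that Kafka Connect's built-in JsonConverter expects (with schemas enabled) uses Connect type names such as int32 and int64 rather than integer or timestamp, and carries timestamps as epoch-millis longs under a logical name. A hedged sketch of the same record in that shape; the field names come from the question, while the type names and the epoch value are assumptions based on JsonConverter conventions, not on the IBM connector's documentation:

{
  "schema": {
    "type": "struct",
    "fields": [
      { "field": "id", "type": "int32", "optional": false },
      { "field": "time", "type": "int64", "name": "org.apache.kafka.connect.data.Timestamp", "optional": false },
      { "field": "name", "type": "string", "optional": false },
      { "field": "company", "type": "string", "optional": false }
    ]
  },
  "payload": {
    "id": 1,
    "time": 1644355080000,
    "name": "ned",
    "company": "advice"
  }
}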

How to ingest data into Apache Pinot from a Kafka topic with an Avro schema?

I have started exploring Apache Pinot and have a few questions regarding Pinot schemas. I want to understand how Apache Pinot works with a Kafka topic that has an Avro schema (a schema that includes nested objects, arrays of objects, etc.), because I didn't find any resource or example that shows how to ingest data from Kafka that carries an Avro schema.
As per my understanding, with Apache Pinot we either have to provide a flat schema or, for nested JSON objects, use transform functions. Is there any kind of Kafka Connect connector for Pinot for doing data ingestion?
Avro schema
{
  "namespace": "my.avro.ns",
  "name": "MyRecord",
  "type": "record",
  "fields": [
    {"name": "uid", "type": "int"},
    {"name": "somefield", "type": "string"},
    {"name": "options", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "lvl2_record",
        "fields": [
          {"name": "item1_lvl2", "type": "string"},
          {"name": "item2_lvl2", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "lvl3_record",
              "fields": [
                {"name": "item1_lvl3", "type": "string"},
                {"name": "item2_lvl3", "type": "string"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
Kafka Avro Message:
{
  "uid": 29153333,
  "somefield": "somevalue",
  "options": [
    {
      "item1_lvl2": "a",
      "item2_lvl2": [
        {
          "item1_lvl3": "x1",
          "item2_lvl3": "y1"
        },
        {
          "item1_lvl3": "x2",
          "item2_lvl3": "y2"
        }
      ]
    }
  ]
}
You don't need a separate connector to ingest data into Pinot from Kafka or from other stream systems such as Kinesis or Apache Pulsar. You simply configure the Pinot table to point to the stream source (a Kafka broker in your case), along with any transformations you may want in order to map the Kafka schema (Avro or otherwise) to the schema in Pinot.
How you should store the data in Pinot (table schema in Pinot) is more a function of how you want to query it.
If you are only interested in a particular field inside your nested field, you can configure a simple ingestion transform to extract that field during ingestion and store it as a column in Pinot.
If you want to preserve the entire nested JSON blob for a column, and then query the blob, then you can use JSON indexing.
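As a rough illustration of the first option, here is a sketch of what the relevant parts of a Pinot real-time table config could look like for this topic. Everything here is a placeholder: table and schema names, the time column, the broker address, the decoder (shown for a Confluent-style schema registry) and the example transform, so treat it as a shape to adapt rather than a working config:

{
  "tableName": "myRecord",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "myRecord",
    "timeColumnName": "ingestionTime",
    "replicasPerPartition": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "my-avro-topic",
      "stream.kafka.broker.list": "broker:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder"
    }
  },
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "firstOptionItem1",
        "transformFunction": "jsonPathString(options, '$[0].item1_lvl2')"
      }
    ]
  },
  "tenants": {},
  "metadata": {}
}

The key idea is the streamConfigs block pointing at the Kafka topic and decoder, plus a transformConfigs entry pulling a nested value into a flat Pinot column; the docs linked below cover the exact functions available.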
Here are some pointers for your reference:
Ingestion Transforms
Flattening JSON
JSON functions
JSON Indexing
Pinot Docs
You may also want to consider joining the Apache Pinot Slack community for Pinot-related questions.

Avro invalid default for union field

I'm trying to serialise an Avro schema and then write it to the Hortonworks schema registry, but I'm getting the following error message during the write operation.
Caused by: java.lang.RuntimeException: An exception was thrown while processing request with message: [Invalid default for field viewingMode: null not a [{"type":"record","name":"aName","namespace":"domain.assembled","fields":[{"name":"aKey","type":"string"}]},{"type":"record","name":"anotherName","namespace":"domain.assembled","fields":[{"name":"anotherKey","type":"string"},{"name":"yetAnotherKey","type":"string"}]}]]
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.handleSchemaIdVersionResponse(SchemaRegistryClient.java:678)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.doAddSchemaVersion(SchemaRegistryClient.java:664)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.lambda$addSchemaVersion$1(SchemaRegistryClient.java:591)
This is the avro schema
{
  "type": "record",
  "name": "aSchema",
  "namespace": "domain.assembled",
  "fields": [
    {
      "name": "viewingMode",
      "type": [
        {
          "name": "aName",
          "type": "record",
          "fields": [
            {"name": "aKey", "type": "string"}
          ]
        },
        {
          "name": "anotherName",
          "type": "record",
          "fields": [
            {"name": "anotherKey", "type": "string"},
            {"name": "yetAnotherKey", "type": "string"}
          ]
        }
      ]
    }
  ]
}
However, if I add a "null" as the first type of the union, this succeeds. Do Avro union types require a "null"? In my case this would be an incorrect representation of the data, so I'm not keen on doing it.
If it makes any difference I'm using avro 1.9.1.
Also, apologies if the tags are incorrect, but I couldn't find a hortonworks-schema-registry tag and don't have enough rep to create a new one.
It turns out it was an issue with Hortonworks' schema registry.
This has actually already been fixed here, and I've requested a new release here. Hopefully this happens soon.

Debezium might produce invalid schema

I'm facing an issue with Avro and the Schema Registry. After Debezium created a schema and a topic, I downloaded the schema from the Schema Registry, put it into an .avsc file, and it looks like this:
{
  "type": "record",
  "name": "Envelope",
  "namespace": "my.namespace",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
          "type": "record",
          "name": "MyValue",
          "fields": [
            {
              "name": "id",
              "type": "int"
            }
          ]
        }
      ],
      "default": null
    },
    {
      "name": "after",
      "type": [
        "null",
        "MyValue"
      ],
      "default": null
    }
  ]
}
I ran two experiments:
I tried to put it back into the Schema Registry, but I get this error: MyValue is not correct. When I remove the "after" record, the schema seems to work well.
I used 'generate-sources' from avro-maven-plugin to generate the Java classes. When I try to consume the topic above, I see this error:
Exception in thread "b2-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. [...]: Error registering Avro schema: [...]
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered is incompatible with an earlier schema; error code: 409
Did anyone face the same problem? Is it Debezium that is producing an invalid schema, or is it the Schema Registry that has a bug?
MyValue is not correct.
That's not an Avro type. You would have to embed the actual record within the union, just like the before value.
In other words, you're not able to cross-reference the record types within a schema, AFAIK.
When I try to consume the topic above, I see this error:
A consumer does not register schemas, so it's not clear how you're getting that error, unless maybe you are using Kafka Streams, which produces into intermediate topics.

Partition column disappears in result set dataframe Spark

I tried to split a Spark data frame by the timestamp column update_database_time and write it into HDFS with a defined Avro schema. However, after calling the repartition method I get this exception:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(random_pk,DecimalType(38,0),true), StructField(random_string,StringType,true), StructField(code,StringType,true), StructField(random_bool,BooleanType,true), StructField(random_int,IntegerType,true), StructField(random_float,DoubleType,true), StructField(random_double,DoubleType,true), StructField(random_enum,StringType,true), StructField(random_date,DateType,true), StructField(random_decimal,DecimalType(4,2),true), StructField(update_database_time_tz,TimestampType,true), StructField(random_money,DecimalType(19,4),true)) to Avro type {"type":"record","name":"TestData","namespace":"DWH","fields":[{"name":"random_pk","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"random_string","type":["string","null"]},{"name":"code","type":["string","null"]},{"name":"random_bool","type":["boolean","null"]},{"name":"random_int","type":["int","null"]},{"name":"random_float","type":["double","null"]},{"name":"random_double","type":["double","null"]},{"name":"random_enum","type":["null",{"type":"enum","name":"enumType","symbols":["VAL_1","VAL_2","VAL_3"]}]},{"name":"random_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"random_decimal","type":["null",{"type":"bytes","logicalType":"decimal","precision":4,"scale":2}]},{"name":"update_database_time","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"update_database_time_tz","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"random_money","type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":4}]}]}.
I assume that the partition column disappears in the result. How can I redefine the operation so that this does not happen?
Here is the code I use:
dataDF.write
  .partitionBy("update_database_time")
  .format("avro")
  .option(
    "avroSchema",
    SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
  .save(s"${hdfsURL}${pathToSave}")
Judging by the exception you have provided, the error seems to stem from incompatible schemas between the fetched Avro schema and Spark's schema. Taking a quick look, the most worrisome parts are probably these two:
1. (Possibly Catalyst doesn't know how to transform a string into enumType)
Spark schema:
StructField(random_enum,StringType,true)
AVRO schema:
{
  "name": "random_enum",
  "type": [
    "null",
    {
      "type": "enum",
      "name": "enumType",
      "symbols": ["VAL_1", "VAL_2", "VAL_3"]
    }
  ]
}
(update_databse_time_tz appears only once in the dataframe's schema, but twice in the AVRO schema)
Spark schema:
StructField(update_database_time_tz,TimestampType,true)
AVRO schema:
{
  "name": "update_database_time",
  "type": [
    "null",
    {
      "type": "long",
      "logicalType": "timestamp-millis"
    }
  ]
},
{
  "name": "update_database_time_tz",
  "type": [
    "null",
    {
      "type": "long",
      "logicalType": "timestamp-millis"
    }
  ]
}
I'd suggest consolidating the schemas first and getting rid of that exception before digging into other possible partitioning problems.
EDIT: In regards to number 2, I missed the fact that there are different names in the AVRO schema, which leads to the problem of a missing column, update_database_time, in the Dataframe.
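One thing worth adding here (standard Spark behaviour, not something the question states): columns passed to partitionBy are written into the directory structure and dropped from the data files, which is why update_database_time is absent from the Catalyst struct that gets checked against the Avro schema. A hedged workaround sketch, partitioning on a duplicated helper column so the original stays in the written records; the helper column name is made up:

import org.apache.spark.sql.functions.col

// Duplicate the timestamp into a helper column and partition on the copy,
// so update_database_time itself remains in the written Avro records.
dataDF
  .withColumn("update_database_time_part", col("update_database_time"))
  .write
  .partitionBy("update_database_time_part")
  .format("avro")
  .option(
    "avroSchema",
    SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
  .save(s"${hdfsURL}${pathToSave}")

This only addresses the missing column; the enum mismatch from point 1 would still need to be resolved separately.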