Kafka Connect FileStreamSink connector does not include the KEY in the output file - apache-kafka

I'm trying a simple file sink connector to extract data from a topic. The generated file does not include the event key, and I am not able to find a setting that enables that. Eventually the goal is to load the file back with a source connector and reproduce the same sample data, so the event key is very important.
Thanks
{
  "name": "save-seed-data",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "name": "save-seed-data",
    "topics": "FIRM",
    "file": "/tmp/FIRM.txt",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false"
  }
}

Not sure where you found that the key should be in the output, since the source code only references the value.
You can download and use a message transform to move the key over into the value, though.
https://github.com/jcustenborder/kafka-connect-transform-archive
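As a hedged sketch, applying that transform would add something like the following to the connector config (assuming the transform jar is on the Connect plugin path; the class name below is taken from that project and should be checked against the version you install):
"transforms": "archive",
"transforms.archive.type": "com.github.jcustenborder.kafka.connect.archive.Archive"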
Also worth mentioning: the FileStreamSource connector does not parse the data either; each line goes only into the value.
Generally, using kafkacat is much more straightforward for dumping/loading data from files.
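For example, a minimal kafkacat (kcat) sketch that dumps keys and values to a file and loads them back; the broker address and tab delimiter are assumptions:

# dump: one "key<TAB>value" line per record
kcat -b localhost:9092 -t FIRM -C -K $'\t' -e > /tmp/FIRM.txt
# load the file back into another topic, splitting key and value on the same delimiter
kcat -b localhost:9092 -t FIRM_COPY -P -K $'\t' < /tmp/FIRM.txt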

Related

Kafka Connect - From JSON records to Avro files in HDFS

My current setup contains Kafka, HDFS, Kafka Connect, and a Schema Registry all in networked docker containers.
The Kafka topic contains simple JSON data without a Schema:
{
  "repo_name": "ironbee/ironbee"
}
The Schema Registry contains a JSON Schema describing the data in the Kafka Topic:
{"$schema": "https://json-schema.org/draft/2019-09/schema",
"$id": "http://example.com/example.json",
"type": "object",
"title": "Root Schema",
"required": [
"repo_name"
],
"properties": {
"repo_name": {
"type": "string",
"default": "",
"title": "The repo_name Schema",
"examples": [
"ironbee/ironbee"
]
}
}}
What I am trying to achieve is a connector that reads JSON data from a topic and dumps it into files in HDFS (Avro or Parquet).
{
  "name": "kafka to hdfs",
  "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
  "topics": "repo",
  "hdfs.url": "hdfs://namenode:9000",
  "flush.size": 3,
  "confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
  "value.converter.schemas.enable": "false",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
If I try to read the raw JSON value via the StringConverter (no schema used) and dump it into Avro files, it works, resulting in
Key=null Value={my json}
tuples, so there is no usable structure at all.
When I try to use my schema via the JsonSchemaConverter I get the errors
“Converting byte[] to Kafka Connect data failed due to serialization error of topic”
“Unknown magic byte”
I think that there is something wrong with the configuration of my connection, but after a week of trying everything my google-skills have reached their limits.
All the code is available here: https://github.com/SDU-minions/7-Scalable-Systems-Project/tree/dev/Kafka
raw JSON value via the StringConverter (no schema used)
The schemas.enable property only exists on the JsonConverter. Strings don't have schemas, and JSON Schema always has a schema, so the property doesn't exist on those converters either.
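In other words, the converter section of your config reduces to something like this sketch (your own settings with the extraneous schemas.enable properties dropped):
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081"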
When I try to use my schema via the JsonSchemaConverter I get the errors
Your producer needs to use the Confluent JSON Schema serializer. Otherwise, the data doesn't get sent to Kafka with the "magic byte" referred to in your error.
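As a hedged sketch, producing test records with the Confluent JSON Schema console producer (ships with Schema Registry; the exact script name and flags can vary with the Confluent Platform version) writes messages with the expected wire format:

kafka-json-schema-console-producer \
  --bootstrap-server kafka-1:19092 \
  --topic repo \
  --property schema.registry.url=http://schema-registry:8081 \
  --property value.schema='{"type":"object","required":["repo_name"],"properties":{"repo_name":{"type":"string"}}}'
# then type records such as:
# {"repo_name": "ironbee/ironbee"}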
I personally haven't tried converting JSON schema records to Avro directly in Connect. Usually the pattern is to either produce Avro directly, or convert within ksqlDB, for example to a new Avro topic, which is then consumed by Connect.

How to pass data when meets a condition from MongoDB to a Kafka topic with a source connector and a pipeline property?

I'm working on a source connector to watch for changes in a Mongo collection and send them to a Kafka topic. This works nicely until I add the requirement to only put them in the Kafka topic if they meet a specific condition (name=Kathe). That is, I need to put data in the topic only if the update changes the name to Kathe.
My connector's config looks like:
{
  "connection.uri": "xxxxxx",
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "topic.prefix": "qu",
  "database": "sample_analytics",
  "collection": "customers",
  "copy.existing": "true",
  "pipeline": "[{\"$match\":{\"name\":\"Kathe\"}}]",
  "publish.full.document.only": "true",
  "flush.timeout.ms": "15000"
}
I have also tried with
"pipeline":"[{\"$match\":{\"name\":{ \"$eq\":\"Kathe\"}}}]"
But it is not producing messages when the condition is met.
Am I making a mistake?

Kafka FileStreamSourceConnector: write an Avro file to a topic with a key field

I want to use the Kafka FileStreamSourceConnector to write a local Avro file into a topic.
My connector config looks like this:
curl -i -X PUT -H "Content-Type:application/json" http://localhost:8083/connectors/file_source_connector/config \
  -d '{
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "topic": "my_topic",
    "file": "/data/log.avsc",
    "format.include.keys": "true",
    "source.auto.offset.reset": "earliest",
    "tasks.max": "1",
    "value.converter.schemas.enable": "true",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter"
  }'
Then when I print out the topic, the key fields are null.
Updated on 2021-03-29:
After watching this video 🎄Twelve Days of SMT 🎄 - Day 2: ValueToKey and ExtractField from Robin,
I applied SMT to my connector config:
curl -i -X PUT -H "Content-Type:application/json" http://localhost:8083/connectors/file_source_connector_02/config \
  -d '{
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "topic": "my_topic",
    "file": "/data/log.avsc",
    "tasks.max": "1",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "transforms": "ValueToKey, ExtractField",
    "transforms.ValueToKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.ValueToKey.fields": "id",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.ExtractField.field": "id"
  }'
However, the connector failed:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [copying fields from value to key], found: java.lang.String
I would use the ValueToKey transformer, or in the worst case ignore the values and set a random key.
For details, look at ValueToKey.
FileStreamSource assumes UTF-8-encoded, line-delimited files as input, not binary files such as Avro. Last I checked, format.include.keys is not a valid config for the connector either.
Therefore each consumed event will be a string, and consequently transforms that require Structs with field names will not work.
You can use the Hoist transform to create a Struct from each "line", but this still will not parse your data to make the ID field accessible to move to the key.
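A minimal sketch of that Hoist transform (the wrapper field name "line" is an arbitrary choice here):
"transforms": "HoistLine",
"transforms.HoistLine.type": "org.apache.kafka.connect.transforms.HoistField$Value",
"transforms.HoistLine.field": "line"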
Also, your file is AVSC, which is JSON formatted, not Avro, so I'm not sure what the goal is in using the AvroConverter or setting "schemas.enable": "true". In any case, the lines read by the connector are not parsed by converters in a way that makes fields accessible; they are only serialized when sent to Kafka.
My suggestion would be to write a separate CLI script using plain producer libraries to parse the file, extract the schema, register it with the Schema Registry, build a producer record for each entity in the file, and send them.
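If the records were available as line-delimited JSON matching an Avro schema, one hedged alternative to a custom script is the kafka-avro-console-producer that ships with Schema Registry. The input file, broker address, separator, and key handling below are assumptions, and flag names can vary with the Confluent Platform version:

kafka-avro-console-producer \
  --bootstrap-server localhost:9092 \
  --topic my_topic \
  --property schema.registry.url=http://schema-registry:8081 \
  --property value.schema="$(cat /data/log.avsc)" \
  --property parse.key=true \
  --property key.separator='|' \
  --property key.schema='{"type":"string"}' \
  < /data/records.jsonl
# hypothetical input file: each line looks like  some-id|{"id": "some-id", ...}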

defining unique name for Kafka file sink connector

I have a service which generates an XML string and sends it to a Kafka topic, which should then produce an XML file.
At the moment I am using the Kafka FileStreamSink connector, which generates the file with a predefined, fixed name.
The filename of that XML file should be generated from the XML content; how can I do that?
Below is my FileStreamSink connector configuration with the predefined filename.
{
  "name": "file_sink_stream_01",
  "config": {
    "connector.class": "FileStreamSink",
    "group.id": "file_sink_stream_connector",
    "tasks.max": "1",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "topics": "stream_userid_stream",
    "file": "file.xml"
  }
}
It's not possible to do this with the file sink: the file name is static, and even an SMT wouldn't let you redefine it.
Note: the JSON converter would output JSON, not XML.
If you absolutely need this, you could try Apache NiFi.

ElasticsearchSinkConnector Failed to deserialize data to Avro

I created the simplest Kafka sink connector config, and I'm using Confluent 4.1.0:
{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "type.name": "test-type",
  "tasks.max": "1",
  "topics": "dialogs",
  "name": "elasticsearch-sink",
  "key.ignore": "true",
  "connection.url": "http://localhost:9200",
  "schema.ignore": "true"
}
and in the topic I save the messages in JSON
{ "topics": "resd"}
But in the result I get an error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
As cricket_007 says, you need to tell Connect to use the JSON deserialiser, if that's the format your data is in. Add this to your connector configuration:
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false"
That error happens because the connector is trying to read Avro messages that were not encoded with the Confluent Schema Registry wire format.
If the topic data is Avro, it needs to use the Schema Registry.
Otherwise, if the topic data is JSON, then you've started the Connect cluster with the AvroConverter for your keys or values in the worker property file, where you need to use the JsonConverter instead.
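For example, the relevant lines in the Connect worker properties file (e.g. connect-distributed.properties) would look roughly like this for JSON data without embedded schemas:

key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false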