Kafka HDFS Sink Connector Protobuf not being written

I am trying to use the Kafka HDFS 3 sink connector to write protobuf binary files to HDFS. However, the connector keeps writing Avro files.
I have set up my sink connector with the following config:
{
  "name": "hdfs3-connector-test",
  "config": {
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "tasks.max": "1",
    "topics": "testproto",
    "hdfs.url": "hdfs://10.8.0.1:9000",
    "flush.size": "3",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.protobuf.ProtobufConverter",
    "value.converter.schema.registry.url": "http://10.8.0.1:8081",
    "confluent.topic.bootstrap.servers": "10.8.0.1:9092",
    "confluent.topic.replication.factor": "1",
    "key.converter.schemas.enable": "true",
    "value.converter.schemas.enable": "true"
  }
}
As you can see, I am using the ProtobufConverter for the value converter, and the plugin is installed. (Is the ProtobufConverter converting to the Avro file format?)
I also registered my schema and sent data to the topic using the following Java file:
package app;

import java.util.Properties;

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.clients.producer.KafkaProducer;

import test.Test.*;

public class App
{
    public static void main( String[] args )
    {
        try {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.8.0.1:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer.class.getName());
            props.put("schema.registry.url", "http://10.8.0.1:8081");

            KafkaProducer<String, MyMsg> producer = new KafkaProducer<String, MyMsg>(props);

            String topic = "testproto";
            String key = "testkey";
            MyMsg m = MyMsg.newBuilder().setF1("testing").build();

            ProducerRecord<String, MyMsg> record = new ProducerRecord<String, MyMsg>(topic, key, m);
            producer.send(record).get();
            producer.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}
Here is my proto file:
syntax = "proto3";
package test;
message MyMsg {
  string f1 = 1;
}
So my question is: is this correct? Can I only write Avro files to HDFS using this connector, or is my configuration incorrect and I should be expecting protobuf files in HDFS?

You need to set the format.class config. From the documentation, format.class is:
The format class to use when writing data to the store. Format classes implement the io.confluent.connect.storage.format.Format interface.
Type: class
Default: io.confluent.connect.hdfs3.avro.AvroFormat
Importance: high
These classes are available by default:
io.confluent.connect.hdfs3.avro.AvroFormat
io.confluent.connect.hdfs3.json.JsonFormat
io.confluent.connect.hdfs3.parquet.ParquetFormat
io.confluent.connect.hdfs3.string.StringFormat
https://docs.confluent.io/kafka-connect-hdfs3-sink/current/configuration_options.html#hdfs3-config-options
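For example, a minimal sketch of the sink config from the question with format.class set explicitly; ParquetFormat is chosen purely as an illustration of one of the classes listed above, since there is no Protobuf output format in that default list:
{
  "name": "hdfs3-connector-test",
  "config": {
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "format.class": "io.confluent.connect.hdfs3.parquet.ParquetFormat",
    "tasks.max": "1",
    "topics": "testproto",
    "hdfs.url": "hdfs://10.8.0.1:9000",
    "flush.size": "3",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.protobuf.ProtobufConverter",
    "value.converter.schema.registry.url": "http://10.8.0.1:8081",
    "confluent.topic.bootstrap.servers": "10.8.0.1:9092",
    "confluent.topic.replication.factor": "1"
  }
}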

Related

MirrorSourceConnector: override consumer key.serializer property

I am trying to run MirrorSourceConnector from a topic in cluster A to cluster B.
After creating the connector and consuming the first message, I noticed that the mirrored topic's key and value are always serialized as a ByteArray, which in the case of the key is a bit of a problem when doing transformations with a custom class.
After checking the MirrorSourceConfig class on GitHub, I found out that with the source.admin. and target.admin. prefixes I could basically add consumer and producer properties. But it seems it does not make any difference (in the logs I could still see that the ByteArray serializer is being used).
My connector config looks like this:
{"target.cluster.status.storage.replication.factor": "-1",
"connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
"auto.create.mirror.topics.enable": true,
"offset-syncs.topic.replication.factor": "1",
"replication.factor": "1",
"sync.topic.acls.enabled": "false",
"topics": "test-topic",
"target.cluster.config.storage.replication.factor": "-1",
"source.cluster.alias": "source-cluster-dev",
"source.cluster.bootstrap.servers": "source-cluster-dev:9092",
"target.cluster.offset.storage.replication.factor": "-1",
"target.cluster.alias": "target-cluster-dev",
"target.cluster.security.protocol": "PLAINTEXT",
"header.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"name": "test-mirror-connector",
"source.admin.key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
"source.admin.value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"target.admin.key.serializer": "org.apache.kafka.common.serialization.StringDeserializer",
"target.admin.value.serializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"target.cluster.bootstrap.servers": "target-cluster-dev:9092"}
Is there a way to override the consumer and producer serialization/deserialization properties, or any other way to make the mirror topic exactly the same as the source topic in terms of serialization?

How to parse JSON in Flink Kafka + Scala

Hi, I am a new developer to Apache Flink. I am trying to build a data-streaming application in Apache Flink integrated with Kafka.
I already have a topic named source-json created in Kafka, containing JSON which I am required to parse using Apache Flink (Scala).
I have already written the code to read the JSON in Apache Flink, but I am facing issues parsing it. I was looking for a deserializer like the one I used with Spark, but apparently I could not find much support for this in the Flink docs.
The JSON I am trying to parse:
{
  "timestamp": "2020-01-09",
  "integrations": {},
  "context": {
    "page": {
      "path": "/",
      "title": "shop",
      "url": "www.example.com"
    },
    "library": {
      "version": "3.10.1"
    }
  },
  "properties": {
    "url": "https://www.example.com/",
    "page_type": "Home"
  },
  "Time": "2020-01-09",
  "_sorting": {
    "SortAs": ["SGML"],
    "Acronym": ["SGML"]
  }
}
I was able to write the code to print the JSON from Kafka using Flink:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val source = KafkaSource.builder[String]
    .setBootstrapServers("localhost:9092")
    .setTopics("source-json")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest)
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build
  val streamEnv: DataStream[String] = env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")
  streamEnv.print()
  env.execute("running stream")
}
I need help parsing the JSON and storing the filtered JSON in a different Kafka topic named sink-json from Apache Flink.
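One way this could be approached: treat each record as a string, parse it with Jackson, keep only the records you want, and forward the surviving JSON strings to sink-json with a KafkaSink. This is only a minimal sketch, assuming a recent Flink release with the KafkaSink API and Jackson on the classpath; the page_type == "Home" check is a made-up placeholder filter condition.
import java.time.Duration

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

object JsonFilterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = KafkaSource.builder[String]
      .setBootstrapServers("localhost:9092")
      .setTopics("source-json")
      .setGroupId("my-group")
      .setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build

    val raw: DataStream[String] =
      env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")

    // Parse each record with Jackson and keep only the ones of interest.
    // The page_type == "Home" check is a placeholder condition.
    val filtered: DataStream[String] = raw.filter { json =>
      val mapper = new ObjectMapper() // created per record for simplicity; reuse via a RichFilterFunction in real code
      val node = mapper.readTree(json)
      node.path("properties").path("page_type").asText() == "Home"
    }

    // Forward the surviving JSON strings unchanged to the sink-json topic.
    val sink = KafkaSink.builder[String]()
      .setBootstrapServers("localhost:9092")
      .setRecordSerializer(
        KafkaRecordSerializationSchema.builder[String]()
          .setTopic("sink-json")
          .setValueSerializationSchema(new SimpleStringSchema())
          .build()
      )
      .build()

    filtered.sinkTo(sink)
    env.execute("filter json")
  }
}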

How to migrate consumer offsets using MirrorMaker 2.0?

With Kafka 2.7.0, I am using MirrorMaker 2.0 as a Kafka Connect connector to replicate all the topics from the primary Kafka cluster to the backup cluster.
All the topics are being replicated perfectly except __consumer_offsets. Below is the connector configuration:
{
  "name": "test-connector",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "topics.blacklist": "some-random-topic",
    "replication.policy.separator": "",
    "source.cluster.alias": "",
    "target.cluster.alias": "",
    "exclude.internal.topics": "false",
    "tasks.max": "10",
    "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "source.cluster.bootstrap.servers": "xx.xx.xxx.xx:9094",
    "target.cluster.bootstrap.servers": "yy.yy.yyy.yy:9094",
    "topics": "test-topic-from-primary,primary-kafka-connect-offset,primary-kafka-connect-config,primary-kafka-connect-status,__consumer_offsets"
  }
}
In a similar question here, the accepted answer says the following:
Add this in your consumer.config:
exclude.internal.topics=false
And add this in your producer.config:
client.id=__admin_client
Where do I add these in my configuration?
The Connector Configuration Properties here do not have any such property named client.id; I have set the value of exclude.internal.topics to false, though.
Is there something I am missing here?
UPDATE
I learned that Kafka 2.7 and above supports automated consumer offset sync using MirrorCheckpointTask, as mentioned here.
I have created a connector for this with the following configuration:
{
  "name": "mirror-checkpoint-connector",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
    "sync.group.offsets.enabled": "true",
    "source.cluster.alias": "",
    "target.cluster.alias": "",
    "exclude.internal.topics": "false",
    "tasks.max": "10",
    "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "source.cluster.bootstrap.servers": "xx.xx.xxx.xx:9094",
    "target.cluster.bootstrap.servers": "yy.yy.yyy.yy:9094",
    "topics": "__consumer_offsets"
  }
}
Still no luck.
Is this the correct approach? Is there something else needed?
You do not want to replicate __consumer_offsets. The offsets from the source to the destination cluster will not be the same, for various reasons.
MirrorMaker 2 provides the ability to do offset translation. It will populate the destination cluster with a translated offset generated from the source cluster: https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0
__consumer_offsets is ignored by default:
topics.exclude = [.*[\-\.]internal, .*\.replica, __.*]
You'll need to override this config.
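A minimal sketch of that override on the MirrorSourceConnector config from the question, keeping every other setting as it is above; dropping __.* from the default exclusion list is an assumption about the intent here:
{
  "name": "test-connector",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "exclude.internal.topics": "false",
    "topics.exclude": ".*[\\-\\.]internal,.*\\.replica"
  }
}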

Kafka JDBC sink connector is not creating new postgreSQL tables in real-time

I am writing a test to check whether my JDBC Kafka connector creates a new Postgres table properly when I publish topics with new names. Although the creation eventually works, the Postgres table is created with a delay (some minutes), and therefore my real-time test fails.
I wonder if there is a way to force the JDBC connector to create Postgres tables in real time.
test_connector.py
from mymodule.producer import produce
from mymodule.athlete import Athlete
import unittest
import psycopg2


class TestConnector(unittest.TestCase):
    ''' Test for JDBC Kafka Connector '''
    conn = psycopg2.connect("postgres://******")
    cursor = conn.cursor()

    def test_table_auto_create(self, topic_name='sometopic_test'):
        ''' Tests whether a postgreSQL table is auto-created when the producer publishes a new topic.
        Topic name must match the regex: sometopic_(.*)

        Args:
            topic_name (str): The name of the topic (default: 'sometopic_test')
        '''
        produce(topic_name)
        self.cursor.execute(
            f"SELECT EXISTS (SELECT 1 AS result FROM pg_tables WHERE schemaname = 'public' AND tablename = '{topic_name}');")
        table_created = self.cursor.fetchone()[0]
        self.assertEqual(table_created, True)


if __name__ == '__main__':
    unittest.main()
Connector config
{
  "name": "jdbc_sink",
  "connector.class": "io.aiven.connect.jdbc.JdbcSinkConnector",
  "tasks.max": "1",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "****",
  "key.converter.basic.auth.credentials.source": "USER_INFO",
  "key.converter.basic.auth.user.info": "***",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "***",
  "value.converter.basic.auth.credentials.source": "USER_INFO",
  "value.converter.basic.auth.user.info": "***",
  "topics.regex": "sometopic_(.*)",
  "connection.url": "***",
  "connection.user": "****",
  "connection.password": "***",
  "insert.mode": "insert",
  "table.name.format": "${topic}",
  "auto.create": "true"
}
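One thing worth checking: with topics.regex, the sink's consumer only discovers newly created topics when it refreshes its metadata, which is governed by metadata.max.age.ms (5 minutes by default) and would match a delay of "some minutes". A sketch of lowering that interval for this connector, assuming the worker's connector.client.config.override.policy permits consumer overrides (only the relevant keys are shown; everything else stays as in the config above):
{
  "name": "jdbc_sink",
  "connector.class": "io.aiven.connect.jdbc.JdbcSinkConnector",
  "topics.regex": "sometopic_(.*)",
  "consumer.override.metadata.max.age.ms": "10000"
}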

Kafka Connect issue when reading from a RabbitMQ queue

I'm trying to read data into my topic from a RabbitMQ queue using the Kafka connector with the configuration below:
{
  "name": "RabbitMQSourceConnector1",
  "config": {
    "connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max": "1",
    "kafka.topic": "rabbitmqtest3",
    "rabbitmq.queue": "taskqueue",
    "rabbitmq.host": "localhost",
    "rabbitmq.username": "guest",
    "rabbitmq.password": "guest",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true"
  }
}
But I'm having trouble converting the source stream to JSON format, as I'm losing the original message.
Original:
{'id': 0, 'body': '010101010101010101010101010101010101010101010101010101010101010101010'}
Received:
{"schema":{"type":"bytes","optional":false},"payload":"eyJpZCI6IDEsICJib2R5IjogIjAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMCJ9"}
Does anyone have an idea why this is happening?
EDIT: I tried to convert the message to a String using "value.converter": "org.apache.kafka.connect.storage.StringConverter", but the result is the same:
11/27/19 4:07:37 PM CET , 0 , [B#1583a488
EDIT 2:
I'm now receiving the JSON file, but the content is still encoded in Base64.
Any idea on how to convert it back to UTF-8 directly?
{
  "name": "adls-gen2-sink",
  "config": {
    "connector.class": "io.confluent.connect.azure.datalake.gen2.AzureDataLakeGen2SinkConnector",
    "tasks.max": "1",
    "topics": "rabbitmqtest3",
    "flush.size": "3",
    "format.class": "io.confluent.connect.azure.storage.format.json.JsonFormat",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "internal.value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "topics.dir": "sw66jsoningest",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1",
    "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner"
  }
}
UPDATE:
I got the solution, considering this flow:
Message (JSON) --> RabbitMQ (ByteArray) --> Kafka (ByteArray) --> ADLS (JSON)
I used this converter on the RabbitMQ-to-Kafka connector to pass the message through as raw bytes:
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
Afterwards, I treated the message as a String and saved it as JSON:
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
Many thanks!
If you set "schemas.enable": "false", you shouldn't be getting the schema and payload fields.
If you want no translation to happen at all, use ByteArrayConverter
If your data is just a plain string (which includes JSON), use StringConverter
It's not clear how you're printing the resulting message, but it looks like you're printing the byte array rather than decoding it to a String.
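For reference, the first option applied to the RabbitMQ source connector from the question would just flip the converter flags; a sketch, with everything else left as configured above:
{
  "name": "RabbitMQSourceConnector1",
  "config": {
    "connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false"
  }
}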