In Kafka Connector, how do I get the bootstrap-server address My Kafka Connect is currently using? - apache-kafka

I'm developing a Kafka Sink connector on my own. My deserializer is JSONConverter. However, when someone send a wrong JSON data into my connector's topic, I want to omit this record and send this record to a specific topic of my company.
My confuse is: I can't find any API for me to get my Connect's bootstrap.servers.(I know it's in the confluent's etc directory but it's not a good idea to write hard code of the directory of "connect-distributed.properties" to get the bootstrap.servers)
So question, is there another way for me to get the value of bootstrap.servers conveniently in my connector program?

Instead of trying to send the "bad" records from a SinkTask to Kafka, you should instead try to use the dead letter queue feature that was added in Kafka Connect 2.0.
You can configure the Connect runtime to automatically dump records that failed to be processed to a configured topic acting as a DLQ.
For more details, see the KIP that added this feature.

Related

Send Apache Kafka messages through localtunnel.me

I am new to Apache Kafka and I'm trying to build a Python app which is able to handle Kafka messages. I've set Kafka up to produce and consume messages locally. Now I also want this to work non-locally, so that I can send messages from everywhere to my Python app.
My idea was to just expose the specific port that Kafka is using by using Localtunnel. I thought this would just mirror the local messages, so that I can consume them via the generated URL. But surprise, it doesn't work.
I just don't receive any messages at all. Do you have an idea why this is? Do I maybe have to configure the listeners in the Kafka server.properties first?
Thanks!

Kafka to BigQuery, best way to consume messages

I need to receive messages to my BigQuery tables and I want to know what is the best way to consume those messages.
My Kafka servers who are at AWS they produce AVRO messages and from what I saw Dataflow needs receive JSON format messages. So I googled and found an article explaining how to receive messages to PubSub, but on PubSub what I only see in this type of architecture, they create a Kafka VM on GCP to produce the messages.
What I need to know is:
It's possible to receive AVRO messages on PubSub from external Kafka Servers and then deserialize the message using my Schema, sending it to Dataflow and finally send it to BigQuery tables?
Or do I need to create a Kafka VM and use it to consume messages from external servers?
This might seem a bit confusing but it is what I am feeling right now. The main goal here is to get messages from Kafka (AVRO format) at AWS and put them on BigQuery tables. If you have any suggestions they are very welcomed
Thanks a lot in advance
The Kafka Connect BigQuery Connector may be exactly what you need. It is a Kafka sink connector that allows you to export messages from Kafka directly to BigQuery. The README page provides detailed configuration instructions, including how to let the connector recognize your Kafka queue and how to enter the information for the destination BigQuery table. This connector should be able to retrieve the AVRO schema automatically from your Kafka project.

How can I manage Kafka connect schema errors?

I'm using kafka connect (confluent distribution) to connect an mqtt broker to a kafka topic (https://docs.lenses.io/connectors/source/mqtt.html), but when a message arrives and it isn't conform to the expected schema, the connector stops!
How can I prevent this from happening?
I'd like also to manage the error and for example keep track of it!
If you are using a ready made connector, you need to satisfy the proper schema. If any error occurs it will stop the connector. So, best way is to identify the schema error based on error message.
If its impossible to use the existing connector, create one for your own which could satisfy your need.

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is placed on the Schema registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates a json which includes also some bytes, for example when deserializing BigDecimal. We didn't even know which parsing option to choose (it is not avro, it is not json)
Other unsuitable strategies:
deploying a special kafka consumer would require us to package and place that code in some production server, since we are talking about our production cluster. It is just too long. After all, isn't kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a kafka connect Sink. We didn't find a simple way to reset the consumer offset since apparently the connector created consumer is still active even when we delete the sink
Isn't there a simply, easy way to dump the content of the value (not the schema) of a Kafka topic containing avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus using the correct Java Api of Avro.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use regular console consumer. You would use kafka-avro-console-consumer which deserializes the binary avro data into json for you to read on the console. You can redirect > topic.txt to the console to read it.
If you did use the console consumer, you can't parse the Avro immediately because you still need to extract the schema ID from the data (4 bytes after the first "magic byte"), then use the schema registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read this file as the console consumer writes it expects one entire schema to be placed at the header of the file, not only an ID pointing to anything in the registry at every line. (The basic Avro library doesn't know anything about the registry either)
The only thing configurable about the console consumer is the formatter and the registry. You can add decoders by additionally exporting them into the CLASSPATH
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls if a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest standalone connector, where you can set offset.storage.file.filename property to control how the offsets are stored.
KIP-199 discusses reseting consumer offsets for Connect, but feature isn't implemented.
However, did you see Kafka 0.11 how to reset offsets?
Alternative options include Apache Nifi or Streamsets, both integrate into the Schema Registry and can parse Avro data to transport it to numerous systems
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka Mirror Maker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another- such as a test environment.

Publish message to kafka via http

I'm new with kafka and I'm trying to publish data from external application via http but I cannot find the way to do this.
I already created a topic in kafka and test it producing and consuming the message but I don't know how to insert/publish message via http, I tried to invoke the following url to retrieve the topics but it does not retrieve any data http://servername:2181/topics/
I'm using cloudera 5.12.1.
You can access to your topics, if it was already created, using APIs. The easy way...(see client list)
Or see Connects Config to manage connectors by REST (rest.host.name, rest.port parameters). But only connectors...
To consume or produce message in a topic, use a middleware. IT is more feaseble.
Check out the open source Kafka REST Proxy from Confluent. It does exactly what you want.
You can get it standalone, or as part of Confluent Platform.