How can I log Kafka topic access

From what I understand, Kafka works on the model of producers sending data to topics and consumers reading the data back from those topics.
I will have many consumers (each on a different computer) on a single topic.
I would like to log each consumer's access to each topic so the log would look like this:
02-02-2022 IP:56.54.45.54 accessed topic "test topic" fragment 15
How can I do this?

Kafka has no such built-in mechanism.
You would have to write or find an Authorizer implementation for this and configure it as authorizer.class.name on the brokers.
Two examples that I know of include:
Apache Ranger with auditing enabled
Open Policy Agent with decision logs
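For illustration, here is a minimal sketch of such an Authorizer, assuming Kafka 2.4+ (the pluggable org.apache.kafka.server.authorizer.Authorizer API) and a ZooKeeper-based broker whose built-in ACL authorizer is kafka.security.authorizer.AclAuthorizer; the class name and log format are made up. It logs every topic access with the client address and then delegates the real decision to the parent:

// LoggingAuthorizer.java - logs topic access, then delegates authorization to the built-in ACL authorizer.
import java.util.List;
import kafka.security.authorizer.AclAuthorizer;
import org.apache.kafka.common.resource.ResourceType;
import org.apache.kafka.server.authorizer.Action;
import org.apache.kafka.server.authorizer.AuthorizableRequestContext;
import org.apache.kafka.server.authorizer.AuthorizationResult;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingAuthorizer extends AclAuthorizer {

    private static final Logger AUDIT = LoggerFactory.getLogger("kafka.audit");

    @Override
    public List<AuthorizationResult> authorize(AuthorizableRequestContext context,
                                               List<Action> actions) {
        for (Action action : actions) {
            if (action.resourcePattern().resourceType() == ResourceType.TOPIC) {
                // Produces lines like: IP:56.54.45.54 READ topic "test topic" (timestamp comes from the log appender)
                AUDIT.info("IP:{} {} topic \"{}\"",
                        context.clientAddress().getHostAddress(),
                        action.operation(),
                        action.resourcePattern().name());
            }
        }
        return super.authorize(context, actions);
    }
}

You would then put this class on the broker classpath and set authorizer.class.name to its fully qualified name in server.properties. On KRaft-mode brokers the built-in class to extend would be org.apache.kafka.metadata.authorizer.StandardAuthorizer instead.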

Related

Separate Dead letter queue for each topic in kafka connect

I'm implementing streaming using Kafka Connect in one of my projects. I have created an S3 sink connector to consume messages from different topics (using a regex) and then write a file to S3. Consuming messages from different topics is done using the property below.
"topics.regex": "my-topic-sample\\.(.+)",
I have 3 different topics, as below. Using the above property, the S3 sink connector will consume messages from these 3 topics and write a separate file (one per topic) to S3.
my-topic-sample.test1
my-topic-sample.test2
my-topic-sample.test3
For now, I am ignoring all the invalid messages. However, I want to implement a dead letter queue.
We can achieve that using the properties below:
'errors.tolerance'='all',
'errors.deadletterqueue.topic.name' = 'error_topic'
With the above properties, we can move all the invalid messages to a DLQ. The issue is that we will have only 1 DLQ even though there are 3 different topics from which the S3 sink connector is consuming; invalid messages from all 3 topics will be pushed to the same DLQ.
Is there a way to have multiple DLQs and write each message to a different DLQ based on the topic it was consumed from?
Thanks
You can only configure a single DLQ topic per connector; it's not possible to use a different DLQ topic for each source topic.
Today, if you want to split the records your connector fails to process into different DLQ topics, you need to start multiple connector instances, each consuming from a different source topic (see the sketch below).
Apache Kafka is an open source project, so if you are up for the challenge, you can propose implementing such a feature. See https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals for the overall process.
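For illustration, a sketch of that workaround with two separate connector instances, each getting its own DLQ (the connector names and DLQ topic names are made up; the remaining S3 sink settings stay the same in both):

Connector instance 1:
'name' = 's3-sink-test1',
'topics' = 'my-topic-sample.test1',
'errors.tolerance' = 'all',
'errors.deadletterqueue.topic.name' = 'error_topic_test1'

Connector instance 2:
'name' = 's3-sink-test2',
'topics' = 'my-topic-sample.test2',
'errors.tolerance' = 'all',
'errors.deadletterqueue.topic.name' = 'error_topic_test2'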

MQTT topics and kafka topics mapping

I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker (Mosquitto) messages into my Kafka cluster.
Since every vehicle sends its data to its own topic on the MQTT broker within a single organisation, I would like to push all this data into Kafka. I know it is not advisable to create so many topics in Kafka (more than a million). I would also not like to save all the vehicles' data in one Kafka topic, as I would like to later put this data into S3, differentiated by vehicle id.
How can I achieve this without creating so many topics in Kafka? One way is for the Kafka consumer to segregate the events and put them in S3, but I believe that will create a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, Kafka Connect's RegexRouter Single Message Transform (SMT) to modify the topic name that messages are written to, and another SMT to set the message key. That way you get all the messages in one topic, partitioned by vehicle id, which is probably the best way to store it.
From there, you can use the data however you want. When it comes to streaming it to S3, you can use the Kafka Connect S3 sink and, as cricket_007 mentioned, partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or areas of the same bucket, you could use stream processing (e.g. Kafka Streams / ksqlDB) to pre-process the topic and populate others.
See here for an example of the MQTT connector.
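As a rough sketch of the routing part (the connector class and its MQTT connection settings depend on which MQTT source connector you install, and the topic name vehicle-data and field name vehicle_id are made up), using only stock Kafka Connect SMTs and assuming the vehicle id is a field in the message value:

'connector.class' = '<your MQTT source connector>',
... MQTT connection settings for that connector ...
'transforms' = 'route,setKey',
'transforms.route.type' = 'org.apache.kafka.connect.transforms.RegexRouter',
'transforms.route.regex' = '.*',
'transforms.route.replacement' = 'vehicle-data',
'transforms.setKey.type' = 'org.apache.kafka.connect.transforms.ValueToKey',
'transforms.setKey.fields' = 'vehicle_id'

The RegexRouter collapses every per-vehicle topic into the single vehicle-data topic, and ValueToKey keys each record by vehicle_id so the default partitioner groups a vehicle's messages into the same partition.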

Sending from external system to Kafka without duplicates in a transaction

I have a requirement to send data from an external system to a Kafka topic with exactly once semantics.
The source has an offset; we can consume messages from a given offset.
Looking at Kafka documentation, I see there are 2 ways to do this.
Use a Kafka source connector.
Use a plain Kafka producer with transactions.
It looks like option 1 doesn't support exactly-once semantics yet; the Kafka JIRA (KAFKA-6080) is unresolved. I would also like to understand how we can do this directly with the producer APIs.
For option 2, the (consume, transform, produce) loop in all the documents shows committing consumer offsets using AddOffsetsToTxn. What is the recommended strategy if the source is not a Kafka topic? It looks like writing the source offset to a different topic as part of the transaction and using it during recovery would work. Is this the recommended way?
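A minimal sketch of option 2 under that assumption (the topic names "events" and "source-offsets", the transactional.id, and the SourceBatch / readFromSource / readLastCommittedOffset helpers are hypothetical stand-ins for the external system): the source offset is sent to a separate topic inside the same transaction as the data, so on restart the producer reads back the last committed offset and resumes from there.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExternalSourceLoader {

    // Hypothetical stand-ins for the external system's client.
    record SourceBatch(List<String> records, long nextOffset) {}
    static SourceBatch readFromSource(long fromOffset) { throw new UnsupportedOperationException("external system client"); }
    static long readLastCommittedOffset() { throw new UnsupportedOperationException("read the tail of the source-offsets topic"); }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "external-source-1");  // stable id per source instance
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            long offset = readLastCommittedOffset();          // recover the last committed source position

            while (true) {
                SourceBatch batch = readFromSource(offset);
                producer.beginTransaction();
                try {
                    for (String record : batch.records()) {
                        producer.send(new ProducerRecord<>("events", record));
                    }
                    // The source position is committed atomically with the data records.
                    producer.send(new ProducerRecord<>("source-offsets",
                            "external-source-1", Long.toString(batch.nextOffset())));
                    producer.commitTransaction();
                    offset = batch.nextOffset();
                } catch (ProducerFencedException fatal) {
                    throw fatal;                              // another instance took over; stop
                } catch (Exception e) {
                    producer.abortTransaction();              // nothing from this batch becomes visible; retry the batch
                }
            }
        }
    }
}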

Apache Kafka Lineage Information

Is there any way to obtain lineage data for Kafka jobs?
For example, we have a Job History URL for all MapReduce jobs.
Is there anything similar for Kafka where I can get the metadata of the producer writing to a particular topic (e.g. the IP address of the producer)?
Out of the box, no, not for Producers.
You can list consumer client addresses, but not the producers writing into a topic.
One option would be to look into Apache Atlas for lineage information, which is one component of Hortonworks' Streams Messaging Manager for analyzing Kafka connection information.
Otherwise, you would be stuck trying to force producers to send this data along with their messages.
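As a small illustration of the consumer side (the group id "my-group" and broker address are made up), the AdminClient API can list the client id and host of each member of a consumer group:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class ListConsumerHosts {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Describe one consumer group and print where each member connects from.
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(Collections.singletonList("my-group"))
                    .describedGroups().get("my-group").get();
            group.members().forEach(m ->
                    System.out.printf("clientId=%s host=%s%n", m.clientId(), m.host()));
        }
    }
}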

Kafka - collect logs from multiple servers. Should each producer write to the same topic?

I am learning to use Kafka. I would like to implement a centralized log service using Kafka. I have multiple servers running my application, and I would like each application to write its logs to Kafka (i.e. act as a producer) and then have a consumer on the other side read the logs back. I would like to use the same topic for all my applications. For example, I would like my applications to write to a topic called "AppLog" and then have the consumer just read the AppLog topic back.
Does Kafka support multiple producers writing to the same topic?
Note: The relative sequences of the log does not matter to me.
Any help is appreciated. Thanks in advance.
If you take a look here you'll see that having even 40 producers for 1 topic does not affect performance that much, so yes, you may use multiple producers to write into one topic.
Yes, you can have as many producers as you want writing to the same topic. To improve parallelism you can scale your topic by increasing the number of partitions and using more brokers (Kafka servers) in your cluster.
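For illustration, a minimal sketch of one application instance writing to the shared topic (the topic name "AppLog" is from the question; the broker address is made up). Every server would run the same code:

import java.net.InetAddress;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AppLogProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Keying by hostname keeps each server's logs in order within one partition,
        // while logs from different servers spread across the topic's partitions.
        String host = InetAddress.getLocalHost().getHostName();

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("AppLog", host, "application started"));
            producer.flush();
        }
    }
}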