How to run a Kafka Producer and Kafka Consumer via CLI commands for 24 hours - apache-kafka

We have a requirement to showcase the resiliency of a Kafka cluster. To prove this, we need to run a producer and a consumer (I am thinking kafka-console-producer and kafka-console-consumer), preferably via CLI commands and/or scripts, continuously for 24 hours. We are not concerned with the message size or contents; the messages can be as small as possible and contain any random value, say the current timestamp.
How can I achieve this?

There's nothing preventing you from doing this, and the problem isn't unique to Kafka.
You can use nohup to keep a script running after you log out; otherwise, the commands will terminate when that console session ends. You could also use cron to schedule any script, at most once per minute...
Or you can write your own app with a simple while(true) loop.
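As a rough illustration of such a loop, here is a minimal sketch that prints one UTC timestamp per line for a fixed duration, so its output can be piped into the console producer. The script name, broker address, and topic name in the comments are assumptions, not anything from the question.

```python
import sys
import time
from datetime import datetime, timezone

# Example pipeline (script name, broker address and topic are hypothetical;
# older Kafka releases use --broker-list instead of --bootstrap-server):
#   python3 emit_timestamps.py 86400 1 | kafka-console-producer.sh \
#       --bootstrap-server localhost:9092 --topic resilience-test


def emit_timestamps(duration_s, interval_s, out=sys.stdout,
                    clock=time.monotonic, sleep=time.sleep):
    """Write one UTC timestamp per line until duration_s has elapsed.

    Returns the number of lines written.
    """
    deadline = clock() + duration_s
    count = 0
    while clock() < deadline:
        out.write(datetime.now(timezone.utc).isoformat() + "\n")
        out.flush()  # hand each message to the pipe immediately
        count += 1
        sleep(interval_s)
    return count


# Runs only when given two arguments: <duration seconds> <interval seconds>.
if __name__ == "__main__" and len(sys.argv) == 3:
    emit_timestamps(float(sys.argv[1]), float(sys.argv[2]))
```

The clock and sleep parameters exist only so the loop can be exercised quickly in a test; in normal use the defaults apply.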
Regardless, you will want a process supervisor to truly ensure the command remains running at all times.
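To show what a supervisor does in principle, here is a minimal sketch of a restart loop. It is not a substitute for a real supervisor such as systemd or supervisord; the function name and parameters are made up for this example.

```python
import subprocess
import sys
import time


def keep_running(cmd, max_restarts, backoff_s=1.0, sleep=time.sleep):
    """Run cmd, restarting it whenever it exits with a non-zero code.

    Stops after a clean exit (code 0) or once max_restarts restarts
    have been used. Returns the list of exit codes observed.
    """
    exit_codes = []
    while True:
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        sleep(backoff_s)  # brief pause before restarting


# Runs only when a command is passed on the command line, e.g.
#   python3 keep_running.py ./producer.sh   (script names are hypothetical)
if __name__ == "__main__" and len(sys.argv) > 1:
    keep_running(sys.argv[1:], max_restarts=1000)
```

A real supervisor also handles logging, signals, and restarts after a reboot, which this loop does not attempt.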

Related

Apache Flink Streaming Job: deployment patterns

We want to use Apache Flink for the streaming job – read from one Kafka topic and write to another. The infrastructure will be deployed to Kubernetes. I want to restart the job on any PR merge to master branch.
Therefore, I wonder whether Flink guarantees that resubmitting the job will continue the data stream from the last processed message, because one of the job's most important features is message deduplication over a time window.
What are the patterns for updating streaming jobs in Apache Flink? Should I just stop the old job and submit the new one?
My suggestion would be to simply try it.
Deploy your app manually and then stop it. Run the kafka-consumer-groups script to find your consumer group. Then restart/upgrade the app, and run the command again with the same group. If the lag goes down (as it should), rather than the offsets resetting to the beginning of the topic, then it's working as expected, as it would for any Kafka consumer.
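The check described above boils down to simple offset arithmetic, sketched here. The offsets are invented for illustration and would in practice come from the kafka-consumer-groups output; the function names are hypothetical.

```python
def partition_lag(committed, log_end):
    """Lag per partition: log-end offset minus committed offset."""
    return {p: log_end[p] - committed[p] for p in log_end}


def resumed_from_committed(before, after):
    """True if no partition's committed offset moved backwards.

    An offset that dropped after the restart suggests the consumer
    started over instead of resuming where it left off.
    """
    return all(after[p] >= before[p] for p in before)


# Hypothetical offsets for a two-partition topic.
before_restart = {0: 120, 1: 95}
after_restart = {0: 150, 1: 110}
log_end = {0: 150, 1: 112}

print(partition_lag(after_restart, log_end))
print(resumed_from_committed(before_restart, after_restart))
```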
"read from one Kafka topic and write to another."
Ideally, Kafka Streams is used for this.
Kafka consumer offsets are saved as part of the checkpoint. So as long as your workflow is running in exactly-once mode, and your Kafka source is properly configured (e.g. you've set a group id), then restarting your job from the last checkpoint or savepoint will guarantee no duplicate records in the destination Kafka topic.
If you're stopping/restarting the job as part of your CI processing, then you'd want to:
Stop the job with a savepoint.
Restart the job from that savepoint.
You could also set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true (so that a checkpoint is taken when the job is terminated), enable externalized checkpoints, and then restart from the last checkpoint, but the savepoint approach is easier.
And you'd want to have some regular process that removes older savepoints, though the size of the savepoints will be very small (only the consumer offsets, for your simple workflow).
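A CI step along these lines could be sketched as follows. The helpers only build the flink CLI argument lists (using its --savepointPath and --fromSavepoint options) and prune old savepoint directories; the function names are hypothetical, and the pruning assumes savepoints live on a local or mounted filesystem, which may not match your Kubernetes setup.

```python
import os
import shutil


def stop_with_savepoint_cmd(job_id, savepoint_dir):
    """Arguments for stopping a job and writing a savepoint."""
    return ["flink", "stop", "--savepointPath", savepoint_dir, job_id]


def run_from_savepoint_cmd(savepoint_path, jar_path):
    """Arguments for submitting a job that resumes from a savepoint."""
    return ["flink", "run", "--fromSavepoint", savepoint_path, jar_path]


def prune_savepoints(root, keep):
    """Delete all but the `keep` most recently modified directories
    under root. Returns the names of the directories removed."""
    entries = sorted(
        (e for e in os.scandir(root) if e.is_dir()),
        key=lambda e: e.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = []
    for entry in entries[keep:]:
        shutil.rmtree(entry.path)
        removed.append(entry.name)
    return removed
```

The argument lists can be passed to subprocess.run as they are. The job id and the exact savepoint path written by the stop command have to be taken from the CLI output or your deployment tooling.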

How to stop a sinkTask in Kafka?

I am running a sinkTask using connect-standalone.sh and connect-standalone.properties. I am doing this in a shell script and I am not sure how to stop the sinkTask once the data is consumed by the consumer.
I tried various settings in the properties file like connections.max.idle.ms=5000. But nothing is stopping the sink.
I don't want to try distributed mode, as it requires REST API calls. Any suggestions on how to stop the sinkTask once there are no more messages from the producer?
When running in standalone, the only way to stop a connector is to stop the connect process you started with connect-standalone.sh.
If you want to start and stop connectors often, I'd recommend reconsidering distributed mode, as it makes the life cycle of connectors easy to manage via the REST API.
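For reference, the REST calls involved are small. The sketch below builds requests for the Kafka Connect pause, resume, and delete endpoints using only the standard library; the base URL and the connector name my-sink are assumptions (8083 is merely the default REST port).

```python
import urllib.parse
import urllib.request


def connector_request(base_url, name, action):
    """Build (but do not send) a Kafka Connect REST request."""
    quoted = urllib.parse.quote(name, safe="")
    routes = {
        "pause": ("PUT", f"/connectors/{quoted}/pause"),
        "resume": ("PUT", f"/connectors/{quoted}/resume"),
        "delete": ("DELETE", f"/connectors/{quoted}"),
    }
    method, path = routes[action]
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


# Hypothetical worker address and connector name.
req = connector_request("http://localhost:8083", "my-sink", "pause")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen sends it to the worker.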

Why does my Kafka Connect sink cluster have only one worker processing messages?

I've recently set up a local Kafka on my computer for testing and development purposes:
3 brokers
One input topic
Kafka Connect sink between the topic and Elasticsearch
I managed to configure it in standalone mode, so everything runs on localhost, and Kafka Connect was started using the ./connect-standalone.sh script.
What I'm trying to do now is to run my connectors in distributed mode, so the Kafka messages can be split across both workers.
I've started the two workers (still everything on the same machine), but when I send messages to my Kafka topic, only one worker (the last one started) is processing messages.
So my question is: why is only one worker processing Kafka messages instead of both?
When I kill one of the workers, the other one takes over the message flow, so I think the cluster is set up correctly.
What I think:
I don't put keys inside my Kafka messages; can it be related to this?
I'm running everything on localhost; can distributed mode work this way? (I've correctly configured the worker-specific unique fields such as rest.port.)
Resolved:
From Kafka documentation:
The division of work between tasks is shown by the partitions that each task is assigned
If you don't use multiple partitions (i.e. all messages are pushed to the same partition), the workers won't be able to divide the messages between them.
You don't need to use message keys; you can just push your messages to different partitions in a round-robin fashion.
See: https://docs.confluent.io/current/connect/concepts.html#distributed-workers
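The effect can be illustrated with a small sketch. The assignment function is a simplified stand-in for the consumer-group assignment Kafka actually performs, and all names here are invented for the example.

```python
from itertools import cycle


def round_robin(messages, num_partitions):
    """Pair each message with a partition number, cycling 0..n-1."""
    partitions = cycle(range(num_partitions))
    return [(next(partitions), msg) for msg in messages]


def assign_partitions(num_partitions, num_tasks):
    """Spread partitions over tasks (simplified illustration only)."""
    assignment = {task: [] for task in range(num_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_tasks].append(partition)
    return assignment


# With a single partition, the second task has nothing to read.
print(assign_partitions(1, 2))
# With four partitions, both tasks get work.
print(assign_partitions(4, 2))
```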

Running a single kafka s3 sink connector in standalone vs distributed mode

I have a Kafka topic "mytopic" with 10 partitions and want to use the S3 sink connector to sink records to an S3 bucket. For scaling purposes it should run on multiple nodes to write partition data in parallel to the same S3 bucket.
In the Kafka Connect user guide, and in many other blogs/tutorials, it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is as follows:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode on multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In the standalone scenario described above I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and the other standalone workers will handle the freed partitions.
Is my understanding correct or am I missing something?
Unfortunately I couldn't find much information on this topic other than this google groups discussion, where the author came to the same conclusion as I did.
In theory, that might work, but you'll end up ssh-ing to multiple machines and maintaining basically the same config files, only to run connect-standalone instead of the connect-distributed command.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports.
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?
You don't need to run multiple instances of standalone processes. In distributed mode, the Kafka Connect workers take care of distributing the tasks, rebalancing, and offset management; you just need to specify the same group id ...
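In distributed mode, the parallelism of a single connector is governed by its tasks.max setting, and the resulting tasks are spread across the workers that share a group.id. The sketch below builds a request body for POST /connectors under that assumption. The connector name and bucket are hypothetical, and the other settings the S3 sink requires are deliberately left out.

```python
import json


def s3_sink_payload(name, topic, bucket, tasks_max):
    """Build the JSON body for creating a connector via POST /connectors."""
    config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": topic,
        # Upper bound on parallel tasks; more tasks than topic
        # partitions would leave the extra tasks idle.
        "tasks.max": str(tasks_max),
        "s3.bucket.name": bucket,
        # Remaining required S3 settings (region, storage and format
        # classes, flush size) are omitted from this sketch.
    }
    return json.dumps({"name": name, "config": config})


# Hypothetical connector name and bucket; 10 matches the partition count.
print(s3_sink_payload("s3-sink", "mytopic", "my-bucket", 10))
```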

Logging Celery on Client/Producer

This is not a question about how to capture logging on Celery workers. Is there any way to capture Celery logging on a Producer? What I want is to capture every log that gets generated by Celery on the Producer when I call task.delay(...) or task.apply_async(...).
EDIT:
I don't want to capture worker logs on the producer. I want to capture everything that happens in Celery from the time of my call to apply_async until the task is sent to the broker.
No, there is no way to capture worker logs on the producer. All you get is the exception, if thrown. Any logging is happening on the worker side, so you have to examine logs of that particular worker, or if you use some centralised log system then you have to look for logs from that worker...
Update: seems like you want to capture eventual logging from Celery on the producer (client) side. As far as I know Celery and the underlying transport handling library (Kombu) do not log anything. I could be wrong of course, but I can't remember seeing any logging there and I have read Celery (Kombu not that much to be fair) code many times...
A possible solution is to make Celery workers send logs to some centralised system that your Celery client can access...
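If Celery or its transport libraries do emit anything on the client side, it would go through Python's standard logging module, so one way to find out is to attach a handler to the relevant loggers around the call. The sketch below uses only the standard library. The logger names celery, kombu, and amqp are assumptions about where such records would appear, and it may well capture nothing, as noted above.

```python
import logging
from contextlib import contextmanager


class ListHandler(logging.Handler):
    """Collects log records in a list instead of printing them."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextmanager
def capture_logs(names=("celery", "kombu", "amqp")):
    """Temporarily capture DEBUG-and-above records from the given loggers."""
    handler = ListHandler()
    saved = []
    for name in names:
        logger = logging.getLogger(name)
        saved.append((logger, logger.level))
        logger.setLevel(logging.DEBUG)
        logger.addHandler(handler)
    try:
        yield handler.records
    finally:
        # Restore the original levels and detach the handler.
        for logger, level in saved:
            logger.removeHandler(handler)
            logger.setLevel(level)
```

Usage would be to wrap the call, for example "with capture_logs() as records:" followed by task.apply_async(...), and then inspect records afterwards.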