Kafka Streams vs Flink

I use Flink to move data from a source to a sink.
My Flink app consumes data from Kafka and sends it to the destination.
The destination is also a Kafka topic, just with a different topic name.
Flink is used purely for delivery, without any business logic.
In this case, I think that replacing Flink with Kafka Streams would increase throughput, because Flink contributes nothing except delivering data from source to sink. Also, since both my source and sink are Kafka, I think Kafka Streams would be faster at delivering the data.
I would appreciate any opinion on my question.
Thanks.

There's no guarantee one will be faster than the other. You still need to do JVM and network tuning.
Either will work, but the limitation of Kafka Streams is that the data must remain in the same Kafka cluster. Flink has no such limitation.
Or you can simply use MirrorMaker for moving data between Kafka topics of different clusters.

Related

Improve performance by using Flink instead of Kafka Streams when Source and Sink are in Kafka?

Assuming I have input data coming in via Kafka topics, and output data to be sent to Kafka topics as well, under what circumstances would Flink be able to process data faster than Kafka Streams? At least when it comes to the time spent consuming and producing, I would not expect Flink to be any faster than Kafka Streams.
Both Flink and Kafka Streams are built on top of the same Producer and Consumer API, so they'll behave similarly, up to a point. Once you get into each framework's specific API/DSL, the call stack gets deeper.
Beyond record serialization, Flink can do more, such as Flink SQL as a counterpart to KSQL, but in those cases you're managing an external cluster.
Personally, I find Kafka Streams faster to develop and maintain because the application itself is the deployable unit, not something to submit to a pool of resources that might be preempted by some scheduler. But if you want to use a non-JVM language, you will need to venture into Flink or even Beam. Those other languages will tend to be slower, because the code then has to interface with the underlying Java libraries.

Multiple Flink pipelines for the same Kafka topic

Background
We have a Kafka topic with a steady stream of data. To process it we have a stateless Flink pipeline that consumes that topic and writes to another topic.
From time to time we have bursts of data that our Flink pipeline is not configured to handle. We don't want to configure our Flink pipeline and cluster to always support the maximum load we can have; we want to scale dynamically according to the load (budget reasons $$$).
Solutions we thought of
One way to do so is to add nodes to or remove nodes from the Flink cluster and change the parallelism of the Flink pipeline operators. This requires stopping the Flink job with a snapshot (savepoint), reconfiguring the parallelism, and restarting with the new parallelism.
This would be great, but we cannot afford the downtime it causes. We have to scale up and down without downtime.
If we used regular Kafka consumers, it would be as simple as adding a consumer (assuming we have enough Kafka partitions), and Kafka would redistribute the topic partitions among all the consumers.
The Flink Kafka consumer manages the partition assignment and the offsets on its own, which allows exactly-once semantics (we don't need it). The drawback is that a single Flink job always uses all the topic partitions.
We thought we could create another Flink instance that would subscribe to the same topic with the same group and let Kafka distribute the partitions between them. But for that, we would need the Flink Kafka consumer to let Kafka manage which partitions are assigned to which consumer.
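To illustrate the redistribution that regular Kafka consumers in one group would give us, here is a minimal sketch in plain Python. It is only a simplified round-robin model, not Kafka's actual assignor, and the function name `assign_partitions` and the consumer names are made up for the example.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over group members round-robin.

    Simplified model for illustration only; real Kafka assignors
    (range, round-robin, sticky, ...) behave differently in detail.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


partitions = list(range(6))

# Two consumers share six partitions.
before = assign_partitions(partitions, ["consumer-a", "consumer-b"])

# A third consumer joins the group: partitions are spread over three members.
after = assign_partitions(partitions, ["consumer-a", "consumer-b", "consumer-c"])

# With more consumers than partitions, the extra consumer stays idle.
too_many = assign_partitions([0, 1], ["consumer-a", "consumer-b", "consumer-c"])

print(before)
print(after)
print(too_many)
```

The last case is why the "enough Kafka partitions" assumption matters: scaling out beyond the partition count adds no capacity.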
What we are looking for
We couldn't find a library that contains such a consumer or a configuration in the existing consumer. We could write it on our own (not so difficult) but if there is an existing solution we'd rather use it.
Are we missing something? Are we misunderstanding something? Is there a better solution?
Thanks!
The most straightforward approach, since you said that at worst you'll need double the capacity, would be to modify your topology to be able to write Kafka messages you can't process quickly enough to a second overflow Kafka topic. Both input and output Kafka topic names would be configurable. Maybe you would have a threshold backlog delay that automatically triggers this writing or maybe you would have a flag in the topology that you can externally set while the topology is running. That's a design detail you can work through that has operational implications.
This gives you a Flink topology that can handle some maximum number of messages in a timely fashion while writing the rest of the messages that can't be handled to a second Kafka topic. You can then run a second instance of the same Flink topology that reads from that secondary topic and writes, if necessary, to a third topic. If the writing to the overflow topic happens very early in the topology processing, you could chain several of these instances together via Kafka with minimal latency and without having to reconfigure and restart any topologies.
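The routing decision described above could look roughly like the following plain-Python sketch. It is not Flink code; the `Record` type, the `choose_topic` function, the topic names, and the 5-second threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    event_time: float  # seconds since epoch when the record was produced


def choose_topic(
    record: Record,
    now: float,
    output_topic: str,
    overflow_topic: str,
    max_delay_seconds: float,
    force_overflow: bool = False,
) -> str:
    """Return the topic a record should go to.

    A record is diverted to the overflow topic when the externally set
    flag is on, or when its backlog delay exceeds the threshold.
    """
    backlog_delay = now - record.event_time
    if force_overflow or backlog_delay > max_delay_seconds:
        return overflow_topic
    return output_topic


now = 100.0
fresh = Record(key="a", event_time=99.0)  # 1 second behind
stale = Record(key="b", event_time=80.0)  # 20 seconds behind

# Hypothetical topic names; both would be configurable in practice.
fresh_topic = choose_topic(fresh, now, "events-out", "events-overflow", 5.0)
stale_topic = choose_topic(stale, now, "events-out", "events-overflow", 5.0)
flagged_topic = choose_topic(fresh, now, "events-out", "events-overflow", 5.0, force_overflow=True)

print(fresh_topic, stale_topic, flagged_topic)
```

A second instance of the same topology would be configured with `events-overflow` as its input and, if needed, a third topic as its own overflow.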

Is it possible to use Flume Kafka Source by itself?

Let's say there are a bunch of producers writing Avro records (that have the same schema) into a Kafka topic.
Can I use the Flume Kafka Source to read those records and write them to HDFS, even though the records were not published using a Flume Sink?
Yes, you can. In general, what a Kafka consumer can or cannot do is fully independent of who produced the data; it depends only on the format in which the data has been encoded.
That is the whole point of Kafka, and of an enterprise service bus in general.

Where to run the processing code in Kafka?

I am trying to set up a data pipeline using Kafka.
Data go in (with producers), get processed, enriched and cleaned, and move out to different databases or storage (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In a data pipeline, a Kafka client can serve as both a consumer and a producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
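As a rough illustration of a client that is both consumer and producer, here is a minimal plain-Python sketch. The `InMemoryBroker` class is a made-up stand-in for Kafka (no real client library is used), and the topic names and the `clean`/`enrich` functions are hypothetical.

```python
from collections import defaultdict, deque
from typing import Any, Callable


class InMemoryBroker:
    """Toy stand-in for Kafka: each topic is a FIFO queue."""

    def __init__(self) -> None:
        self.topics = defaultdict(deque)

    def produce(self, topic: str, value: Any) -> None:
        self.topics[topic].append(value)

    def consume(self, topic: str) -> Any:
        queue = self.topics[topic]
        return queue.popleft() if queue else None


def run_stage(broker: InMemoryBroker, source: str, sink: str,
              transform: Callable[[Any], Any]) -> int:
    """Consume everything from `source`, transform it, produce to `sink`."""
    processed = 0
    while True:
        message = broker.consume(source)
        if message is None:  # topic drained
            break
        broker.produce(sink, transform(message))
        processed += 1
    return processed


def clean(raw: str) -> str:
    return raw.strip().lower()


def enrich(text: str) -> dict:
    return {"text": text, "length": len(text)}


broker = InMemoryBroker()
broker.produce("raw", "  Hello ")
broker.produce("raw", "WORLD  ")

# ClientA: consumer of "raw", producer to "cleaned".
cleaned_count = run_stage(broker, "raw", "cleaned", clean)
# ClientB: consumer of "cleaned", producer to "enriched".
enriched_count = run_stage(broker, "cleaned", "enriched", enrich)

results = list(broker.topics["enriched"])
print(results)
```

Each stage only knows its input and output topic, which is what lets you move the boundaries between stages later.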
It can be part of either the producer or the consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
It is possible either way. Consider all possible options and choose the one that suits you best. Let's assume you have a source, raw data in CSV or some DB (Oracle), and you want to do your ETL work and load the result into some different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer that consumes from these topics (this could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore; any sink, basically.
This has the benefit of ingesting your data with little or no code, since Kafka Connect source connectors are easy to set up.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or write directly to a sink unless you want to reuse the produced data for further processing.
Read from the Kafka topic, do some further processing, and write the result to a persistent store.
It all boils down to your design choice, the throughput you need from the system, and how complicated your data structure is.

Kafka Consumer Vs Apache Flink

I did a PoC in which I read data from Kafka using Spark Streaming. But our organization uses either Apache Flink or a plain Kafka consumer to read data from Apache Kafka as a standard process, so I need to replace Spark Streaming with a Kafka consumer or Apache Flink. In my use case, I need to read data from Kafka, filter the JSON data, and put fields into Cassandra, so the recommendation is to use a Kafka consumer rather than Flink or another streaming framework, as I don't really need to do any processing on the Kafka JSON data. I need your help with the questions below:
1) Using a Kafka consumer, can I achieve the same continuous data read as with Spark Streaming or Flink?
2) Is a Kafka consumer sufficient for me, considering I need to read data from Kafka, deserialize it using an Avro schema, filter fields, and put them into Cassandra?
3) A Kafka consumer application can be created using the Kafka Consumer API, right?
4) Are there any downsides in my case if I just use a Kafka consumer instead of Apache Flink?
First, look at Flink's Kafka connector and Spark Streaming's Kafka integration: both use the Kafka Consumer API (either the simple API or the high-level API) internally to consume messages from Apache Kafka for their jobs.
So, regarding your questions:
1) Yes.
2) Yes. However, if you use Spark, you can consider using the Spark Cassandra connector, which helps save data into Cassandra efficiently.
3) Right.
4) As mentioned above, Flink also uses a Kafka consumer for its job. Moreover, it is a distributed stream and batch data processing engine, which helps process data efficiently after consuming it from Kafka. In your case, to save data into Cassandra, you could consider using the Flink Cassandra connector rather than coding it yourself.
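For the plain-consumer route, the filtering step itself is small. The sketch below is a hedged illustration in plain Python: the message list stands in for whatever a real Kafka consumer loop would yield, the `sink` list stands in for a Cassandra write, and the names `extract_fields`, `process`, and the field names are made up for the example.

```python
import json
from typing import Iterable, List, Optional, Sequence


def extract_fields(raw: bytes, wanted: Sequence[str]) -> Optional[dict]:
    """Parse one JSON message and keep only the wanted fields.

    Returns None for malformed JSON, non-object payloads, or messages
    that lack one of the wanted fields.
    """
    try:
        document = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return None
    if not isinstance(document, dict):
        return None
    if any(field not in document for field in wanted):
        return None
    return {field: document[field] for field in wanted}


def process(messages: Iterable[bytes], wanted: Sequence[str], sink: List[dict]) -> int:
    """Filter each message and hand the result to the sink (stand-in for Cassandra)."""
    written = 0
    for raw in messages:
        row = extract_fields(raw, wanted)
        if row is not None:
            sink.append(row)
            written += 1
    return written


# Stand-in for records returned by a consumer poll loop.
messages = [
    b'{"id": 1, "name": "alice", "debug": "x"}',
    b'not json',
    b'{"id": 2}',
    b'{"id": 3, "name": "carol"}',
]

rows: List[dict] = []
written = process(messages, ["id", "name"], rows)
print(written, rows)
```

In a real application the loop would run continuously over polled records, which is what gives the same continuous read as Spark Streaming or Flink.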