Apache Flink: Best Practice

I have a few questions regarding Apache Flink.
I have reference data stored in multiple relational databases, which I can fetch via a REST API.
This data is static and only needs to be loaded once, and it is meant to be shared by all Flink operators. What should I do here? Can I just load it within my Flink job?
Does it make sense to have Flink parallelism greater than the number of Kafka partitions? What do we gain here? I am assuming Flink will automatically take data from the partitions, rebalance it, and redistribute it to more threads for computation. So the gain is mainly on the computation side, but the sourcing speed cannot be improved because it is strictly bound by how many partitions you have in Kafka.
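For the first question, one common pattern is to load the reference data once per parallel subtask in a RichFunction's open() method, so every task instance keeps its own in-memory copy. A minimal sketch, where the ReferenceClient helper and the URL are hypothetical placeholders for your REST call:

```java
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichWithReferenceData extends RichMapFunction<String, String> {

    // Loaded once per parallel subtask in open(), then reused for every record.
    private transient Map<String, String> referenceData;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Hypothetical helper and URL standing in for your REST call; executed once
        // when the task starts, not once per record.
        referenceData = ReferenceClient.fetchAll("https://reference-service.internal/api/all");
    }

    @Override
    public String map(String value) {
        // Enrich the incoming record with the static reference data.
        return value + "," + referenceData.getOrDefault(value, "UNKNOWN");
    }
}
```

Note that each parallel subtask makes its own call, so the REST endpoint is hit once per task slot rather than once per job; if that matters, broadcast state is the usual alternative.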

Related

Kafka Streams vs Flink

I wrote an application that reads 100,000 Avro records per second from a Kafka topic, aggregates by key, uses tumbling windows with 5 different sizes, does some calculations to determine the highest, lowest, initial, and end values, and writes back to another Kafka topic.
This application already exists in Flink, but the source is RSocket in CSV format and the sink is Cassandra. The problem is that the new application is using a lot more CPU and memory. I checked this article and noticed performance is not mentioned.
Am I correct to assume the difference is mostly because of Avro serialisation / deserialisation, or is Flink supposed to be faster for this use case? If the difference is small, I'd prefer Kafka Streams to avoid needing to manage the cluster.
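For reference, a minimal sketch of what the topology described above could look like in Kafka Streams, assuming String keys, Double values, a recent Kafka Streams release (3.x, for TimeWindows.ofSizeWithNoGrace), and a hypothetical Serde for the aggregate; only one of the five window sizes is shown:

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OhlcTopology {

    // Simple aggregate tracking the initial, end, highest and lowest values per window.
    public static class Ohlc {
        Double first, last, min, max;

        Ohlc update(Double v) {
            if (first == null) first = v;
            last = v;
            min = (min == null || v < min) ? v : min;
            max = (max == null || v > max) ? v : max;
            return this;
        }
    }

    // ohlcSerde is assumed to exist (e.g. a small JSON or Avro serde for Ohlc).
    public static Topology build(Serde<Ohlc> ohlcSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, Double> records =
                builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Double()));

        records
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .aggregate(
                Ohlc::new,                               // empty aggregate per window
                (key, value, agg) -> agg.update(value),  // track first, last, min, max
                Materialized.with(Serdes.String(), ohlcSerde))
            .toStream()
            .map((windowedKey, ohlc) -> KeyValue.pair(windowedKey.key(), ohlc))
            .to("output-topic", Produced.with(Serdes.String(), ohlcSerde));

        return builder.build();
    }
}
```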
I don't think this question can be answered in general. Both Flink and Kafka Streams can be tuned to the workload, and small changes in parameters can make a large difference in performance. Generally, there is no fundamental reason why Flink should be a lot faster than Kafka Streams for such a use case. One exception may be repartitioning, which always needs to go through the Kafka cluster for Kafka Streams but can stay within the cluster for Flink; as I understand it, though, you are not repartitioning in your use case.
The serialization format may play a large role, however. Some benchmarks I remember for protobuf (Avro is similar) showed that the size in (Java) memory can be 100x larger than the serialized data on the wire. Again, this depends on many things, in particular how nested/complex your schema is. If Avro is deserialized into a complex object model, this will cause significant CPU/memory overhead compared to passing strings around.
However, the only way to tell for certain what is slowing down your use case is profiling it and seeing where the additional resources are spent.
Without benchmarks on your own hardware, or JVM profiling your code, it's hard to say which will be faster.
Flink does invoke more JVM function calls than Kafka Streams, from what I've seen.
Kafka Streams doesn't work well (or at all) with external systems such as RSocket or Cassandra. Therefore, you would still need Flink or some other ETL tool like Kafka Connect (i.e. manage a cluster) to get data into a Kafka topic to then process, regardless of framework.
The serialization format shouldn't matter. Flink and Kafka Streams will use the exact same JVM methods from the Avro (or any other format) SDK.

Improve performance by using Flink instead of Kafka Streams when Source and Sink are in Kafka?

Assuming I have input data coming in via Kafka topics, and output data to be sent to Kafka topics as well, under what circumstances would Flink be able to process data faster than Kafka Streams? At least when it comes to the time spent consuming and producing, I would not expect Flink to be any faster than Kafka Streams.
Both Flink and Kafka Streams are built on top of the same Producer and Consumer APIs, so they'll act similarly, up to a point. Once you get into the specific API/DSL, the stack trace gets more nested.
Outside of record serialization, Flink can perform more tasks like using Flink SQL compared to Kafka's KSQL, but in those cases, you're managing an external cluster.
Personally, I find Kafka Streams faster to develop and maintain because the application itself is the deployable unit, not something to submit to a pool of resources that might be preempted by some scheduler. But if you want to use a language other than a JVM language, then you will need to venture into Flink or even Beam, and those other languages will likely be slower because the code then has to interface with the native Java libraries.
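To illustrate the "deployable unit" point, a minimal sketch of a Kafka Streams application as an ordinary JVM process you run directly, with no cluster to submit to; the topic names and application.id below are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class PassThroughApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pass-through-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");   // consume, produce, nothing else

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();   // scale out by simply running more copies of this process
    }
}
```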

Apache NiFi & Kafka Integration

I am not sure whether this question is already addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka - consuming data from Kafka using Apache NiFi. Below are a few questions that come to mind before proceeding.
Q-1) Our use case is to read data from Kafka in real time, parse it, do some basic validations on the data, and later push the data to HBase. I know Apache NiFi is the right candidate for this kind of processing, but how easy is it to build the workflow if the JSON we are processing is complex? We were initially thinking of doing this in Java code, but later realised it can be done with minimal effort in NiFi. Please note that 80% of the data we process from Kafka will be simple JSON, but 20% will be complex (involving arrays).
Q-2) The trickiest part of writing a Kafka consumer is handling offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How will offsets be properly committed if rebalancing is triggered in the middle of processing? Frameworks like Spring Kafka provide options to commit offsets (to some extent) when a rebalance is triggered mid-processing. How does NiFi handle this?
I have deployed a number of pipelines on a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing the JSON, I'm assuming generic tasks. Generic tasks involving JSON include schema validation, which can be achieved with the ValidateRecord processor; transformation with the JoltTransformRecord processor; extraction of attribute values with EvaluateJsonPath; and conversion of JSON to some other format, say Avro, with the ConvertJSONToAvro processor.
NiFi gives you the flexibility to scale each stage/processor in the pipeline independently. For example, if transformation using JoltTransformRecord is time-consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means you get an at-least-once guarantee by default.
When Kafka triggers a rebalance of consumers for a given partition, the processor quickly commits whatever it has (processor session and Kafka offsets) and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when the members of the consumer group change or the members' subscriptions change. This can occur when processes die, new process instances are added, or old instances come back to life after a failure. It also takes care of the case where the number of partitions of a subscribed topic is administratively adjusted.

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing with updateStateByKey.
Depending on the streaming source (I've been using Kafka), you may not need to use checkpoints, etc.
Since Spark 1.3, a direct approach has been available, with several benefits:
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Efficiency: Achieving zero data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka's high-level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use a simple Kafka API that does not use Zookeeper; offsets are tracked only by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more here (see Approach 2):
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
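For example, a minimal sketch of the direct approach in Java, using the Spark 1.3-era spark-streaming-kafka API (Kafka 0.8); the broker addresses and topic name are placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectKafkaExample {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("direct-kafka-example");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");

        Set<String> topics = new HashSet<>(Arrays.asList("my-topic"));

        // One RDD partition per Kafka partition; offsets are tracked by Spark Streaming
        // in its checkpoints rather than in ZooKeeper.
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        stream.map(record -> record._2())   // keep just the message value
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```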

Does it make sense to build a data processing pipeline using only Kafka?

I am building a data processing pipeline using Kafka.
The pipeline is linear with 4 stages.
The data volume is medium (it will need more than one machine, but not hundreds or thousands; a few tens of gigabytes).
My question: can I use only Kafka, having each pipeline stage consume from one topic and produce to another topic? Should I be using Spark or Storm, and why? Of course, I prefer the simplest possible architecture. If I can do it all with Kafka, I'd prefer that. In the future I may need some additional machine learning stages, and that may affect the answer. I have no strict exactly-once requirements; I can accept some message loss and some duplication with no problem.
My question: can I use only Kafka, having each pipeline stage consume from one topic and produce to another topic? Should I be using Spark or Storm, and why?
Technically, yes, you can, if you are ready to handle the whole distributed architecture on your own: writing your own multi-threaded producers, managing those consumers, and so on. You also need to think about scalability, performance, durability, and so forth. That is the beauty of using a computation engine like Storm or Spark: you can concentrate on the core logic and leave the infrastructure to be maintained by the engine.
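To make that concrete, a minimal sketch of one "Kafka only" pipeline stage: a plain consumer/producer loop that reads from one topic, applies a processing step, and writes to the next topic. The topic names, group id, and process() logic are placeholders, and scaling, error handling, and offset tuning are left entirely to you:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StageOne {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "stage-one");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("stage-one-input"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String result = process(record.value());   // your stage logic
                    producer.send(new ProducerRecord<>("stage-two-input", record.key(), result));
                }
            }
        }
    }

    private static String process(String value) {
        return value.toUpperCase();   // placeholder transformation
    }
}
```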
For example, using a combination of Kafka and Storm for your architecture, you can store terabytes of data in Kafka and feed it to Storm for processing. If you are familiar with Storm, a sample topology could look something like this:
(kafka-spout consuming messages from the topic) --> (Bolt-A processing the data received through the spout and feeding it to Bolt-B) --> (Bolt-B pushing the processed data back into another Kafka topic)
Such an architecture offers a great deal of scalability, throughput, and performance. With some easy configuration changes, you will be able to tune your application to your requirements.
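If it helps, a rough sketch of that topology using the Storm 1.x storm-kafka spout; BoltA and BoltB are hypothetical bolt classes for the two processing stages (BoltB would use something like the storm-kafka KafkaBolt, or a plain producer, to write back to Kafka):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.topology.TopologyBuilder;

public class PipelineTopology {
    public static void main(String[] args) throws Exception {
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zookeeper:2181"),   // ZooKeeper used by the Kafka cluster
                "input-topic",                   // topic to consume
                "/kafka-spout",                  // ZK root for offset storage
                "pipeline-spout");               // consumer id

        TopologyBuilder builder = new TopologyBuilder();
        // BoltA and BoltB are hypothetical; plug in your own bolt implementations.
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
        builder.setBolt("bolt-a", new BoltA(), 4).shuffleGrouping("kafka-spout");
        builder.setBolt("bolt-b", new BoltB(), 2).shuffleGrouping("bolt-a");

        StormSubmitter.submitTopology("kafka-pipeline", new Config(), builder.createTopology());
    }
}
```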