Classical Architecture to Kafka, how do you realize the following? - apache-kafka

We are trying to move away from our classical architecture (J2EE application server plus relational database) to Kafka. I have a use case that I am not sure how to approach.
Our application currently runs scheduled exports from the relational database. In the future we plan to stop storing this information in the relational database altogether and run the export directly from the data in Kafka topic(s).
What I am not sure about is the best solution: should I configure a consumer that polls the topic(s) on the same schedule as the scheduler and exports the data?
Or should I create a Kafka Streams application at the schedule trigger point to collect this information from the topic(s)?
What do you think?

The approach you want to adopt is technically feasible. A few possible solutions:
1) A continuously running Kafka consumer with Duration=<export schedule time>.
2) A cron-triggered Kafka streaming consumer with a batch duration equal to the schedule, committing offsets to Kafka.
3) A cron-triggered Kafka consumer that handles offsets programmatically and pulls records by offset according to your schedule.
Important considerations:
Increase retention.ms to well beyond the interval between scheduled batch runs.
Increase disk space to accommodate the larger data volume, since you will be holding data for longer.
Risks & Issues:
Records produced over a weekend could expire before the next export if retention is too short.
If another application uses the same group.id by mistake, it will interfere with the committed offsets.
No aggregation/math function can be applied before retrieval.
Your application cannot filter or extract records by any parameter.
Unless offsets are managed externally, the application cannot re-read records.
Records are not formatted, i.e. they are mostly JSON strings or possibly some other format.
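Option 3 (a scheduled job that manages its own offsets) can be sketched without a broker. This is a minimal in-memory model, not Kafka client code: `PartitionLog`, `OffsetStore`, and `run_export` are hypothetical names invented for illustration, and a real job would use the consumer's own seek and commit calls instead.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """In-memory stand-in for one topic partition (hypothetical, not a Kafka API)."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the appended record

    def read_from(self, offset):
        # Return (offset, value) pairs starting at the given offset.
        return list(enumerate(self.records))[offset:]


@dataclass
class OffsetStore:
    """Stand-in for an external offset store, e.g. one database row per partition."""
    next_offset: int = 0


def run_export(log, store, sink):
    """One scheduled run: export every record after the last stored offset."""
    batch = log.read_from(store.next_offset)
    for _offset, value in batch:
        sink.append(value)
    if batch:
        # Advance the stored position only after the batch has been written.
        store.next_offset = batch[-1][0] + 1
    return len(batch)
```

Because the position lives outside the consumer group, a second run picks up only what arrived since the first, and resetting `next_offset` allows a deliberate re-read as long as the records are still within retention.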

Related

kafka consumer to store history of events in a data store

We are working with Kafka as an event streaming platform. So far there is one producer of data and three consumers, each subscribed to one or several Kafka topics. This is working perfectly fine. FYI, the Kafka retention period is set to 5s since we don't need to persist the events for longer than that.
We now have a new use case: persisting all the events of the latest 20 minutes (in another data store) for post-analysis (mainly for training purposes). This new Kafka consumer should therefore subscribe to all existing topics. We only want to persist the history of the latest 20 minutes of events in the data store, not all the events of a session (which can span several hours or days). The targeted throughput is 170kb/s, and over 20 minutes that is almost 1M messages to be persisted.
We are wondering which architecture pattern suits such a situation. This is not a nominal use case compared to the current ones, so we don't want to reduce the performance of the system in order to handle it. Our idea is to drain the topics as fast as we can, push the data into a queue, and have another app, running at a different rate, read the data from the queue and persist it into the store.
We would greatly appreciate any experience or feedback on handling such a use case, especially about the expiration/purge mechanism to use. We certainly need something highly available and scalable.
Regards
You could use Kafka Connect with topics.regex=.* to consume everything and write to one location, but you'll end up with a really high total lag, especially if you keep adding new topics.
If you have retention.ms=5000, then I don't know if Kafka is the proper tool for your use case, but perhaps you could ingest into Splunk, Elasticsearch, or another time-series system where you can properly slice by 20-minute windows.
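Whatever store is chosen, the expiration mechanism asked about amounts to dropping everything older than 20 minutes. A minimal sketch of that pruning rule follows, assuming events arrive with non-decreasing timestamps in seconds; `RollingHistory` is a hypothetical name and the in-memory deque merely stands in for the real data store. For sizing, if "kb" is read as kilobytes (an assumption, the question does not say), 170 kB/s over 1200 s is roughly 204 MB per window.

```python
from collections import deque

WINDOW_SECONDS = 20 * 60  # keep only the latest 20 minutes, as in the question


class RollingHistory:
    """Keeps events whose timestamp falls within the trailing window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, payload) pairs, oldest first

    def add(self, timestamp, payload):
        self._events.append((timestamp, payload))
        self._prune(timestamp)

    def _prune(self, now):
        # Drop events from the left until the oldest one is inside the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def snapshot(self):
        return list(self._events)
```

Pruning on every insert keeps the cost proportional to the number of expired events, so the writer never has to scan the whole window.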

Starting new Kafka Streams microservice, when there is data retention period on input topics

Let's assume I have a (somewhat) high-velocity input topic, for example sensor.temperature, with a retention period of 1 day.
Multiple microservices are already consuming data from it. I am also backing up events in a historical event store.
Now (as a simplified example) I have a new requirement: calculating the all-time maximum temperature per sensor.
This fits Kafka Streams very well, so I have prepared a new microservice that creates a KTable aggregating temperature (with max), grouped per sensor.
Simply deploying this microservice would be enough if the input topic had infinite retention, but with 1-day retention the maximum would not be all-time, as our requirement demands.
I feel this could be a common scenario, but somehow I was not able to find a satisfying solution on the internet.
Maybe I am missing something, but my ideas for making it work do not feel great:
Replay all past events into the input topic sensor.temperature. This is a large amount of data and it would cause all subscribing microservices to run excessive computation, which is most likely not acceptable.
Create a duplicate of the input topic for my microservice, sensor.temperature.local, into which I would always copy all events and then process (aggregate) them from this local topic.
This way I can freely replay historical events into the local topic without affecting other microservices.
However, this local duplicate would be required for every Kafka Streams microservice, and if the input topic is high velocity this could be too much duplication.
Maybe there is some way to modify KTables more directly, so one could query the historical event store for the max value per sensor and put it in the KTable once?
But what if the streams topology is more complex? It would require orchestrating consistent state in all of the microservice's KTables, rather than simply replaying events.
How should I design the solution?
Thanks in advance for your help!
In this case I would create a topic that stores the max periodically (so that it won't fall off the topic because of a cleanup). Then you could make your service report the larger of the max from the max-topic and the max from the measurement-topic.
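The idea of combining a periodically stored max with the max over the retained readings can be sketched as plain dictionary logic. This is only an illustration of the merge rule, not Kafka Streams code; `update_max` and `all_time_max` are hypothetical helper names, and the two dicts stand in for the max-topic and the aggregate over the measurement-topic.

```python
def update_max(table, sensor_id, temperature):
    """Fold one reading into a per-sensor max table (mimics a max aggregation)."""
    current = table.get(sensor_id)
    if current is None or temperature > current:
        table[sensor_id] = temperature
    return table


def all_time_max(snapshot_max, live_max):
    """Combine the stored max per sensor with the max over retained readings."""
    sensors = set(snapshot_max) | set(live_max)
    return {
        s: max(v for v in (snapshot_max.get(s), live_max.get(s)) if v is not None)
        for s in sensors
    }
```

Since max is associative and commutative, the stored value can be seeded once from the historical event store and then only ever needs to be compared against new readings.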

Apache NiFi & Kafka Integration

I am not sure whether this question has already been addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka, consuming data from Kafka using Apache NiFi. Below are a few questions that came to mind before proceeding.
Q-1) Our use case is: read data from Kafka in real time, parse the data, do some basic validations on it, and later push it to HBase. I know Apache NiFi is the right candidate for this kind of processing, but how easy is it to build the workflow if the JSON we are processing is complex? We were initially thinking of doing this in Java code, but later realised it can be done with minimal effort in NiFi. Please note that 80% of the data we process from Kafka would be simple JSON, but 20% would be complex (involving arrays).
Q-2) The trickiest part of writing a Kafka consumer is handling offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How would offsets be committed properly if a rebalance is triggered during processing? Frameworks like Spring-Kafka provide options to commit the offsets (to some extent) if a rebalance is triggered in the middle of processing. How does NiFi handle this?
I have deployed a number of pipelines in a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing the JSON, I'm assuming generic tasks. Generic tasks involving JSON include schema validation, which can be achieved with the ValidateRecord processor; transformation with the JoltTransformRecord processor; extraction of attribute values with EvaluateJsonPath; and conversion of JSON to some other format, say Avro, with the ConvertJSONToAvro processor.
NiFi gives you the flexibility to scale each stage/processor in the pipeline independently. For example, if transformation using JoltTransformRecord is time consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means we have an at-least-once guarantee by default.
When Kafka triggers a rebalance of consumers for a given partition, the processor quickly commits (processor session and Kafka offsets) whatever it has received and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when the members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added, or old instances come back to life after a failure. It also covers the case where the number of partitions of a subscribed topic is administratively adjusted.
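Why committing the processing result before the offsets yields at-least-once delivery can be shown with a toy model. This sketch only models the ordering of the two commits; it is not NiFi code, and `consume_batch` and its `crash_before_offset_commit` flag are hypothetical names used to simulate a failure between the two steps.

```python
def consume_batch(log, committed_offset, sink, crash_before_offset_commit=False):
    """Process everything after committed_offset and return the new offset.

    Step 1 writes the results (the "session commit"); step 2 advances the
    offset. A crash between the two leaves the old offset in place.
    """
    batch = log[committed_offset:]
    sink.extend(batch)  # step 1: results become durable first
    if crash_before_offset_commit:
        # Simulated failure: the offset commit never happens.
        return committed_offset
    return committed_offset + len(batch)  # step 2: offsets are committed


log = ["a", "b"]
sink = []
offset = consume_batch(log, 0, sink, crash_before_offset_commit=True)
offset = consume_batch(log, offset, sink)  # the retry re-reads the same records
```

After the retry, `sink` holds each record twice and nothing is lost, which is the at-least-once trade-off: duplicates are possible, gaps are not. Reversing the two steps would flip the failure mode to possible loss.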

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using Kafka in which I want to ensure that duplicate records are not inserted into the topic I created.
I found that iteration is possible in the consumer. Is there any way to do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
1) Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
2) Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
1) Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
2) The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
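The last alternative in the quoted FAQ, storing the offset with the loaded data and deduplicating on the topic/partition/offset combination, can be sketched as follows. `OffsetKeyedStore` is a hypothetical in-memory stand-in for the target storage system; in a database the same effect would typically come from a unique key on those three columns.

```python
class OffsetKeyedStore:
    """Stores each record under its (topic, partition, offset) coordinates."""

    def __init__(self):
        self._rows = {}

    def load(self, topic, partition, offset, value):
        """Insert the record unless these coordinates were already loaded."""
        key = (topic, partition, offset)
        if key in self._rows:
            return False  # redelivered record, ignored
        self._rows[key] = value
        return True

    def __len__(self):
        return len(self._rows)
```

Because a record's coordinates never change, a consumer that restarts from an older checkpoint can replay the same records and the store still ends up with exactly one copy of each.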

Reliable usage of DirectKafkaAPI

I am planning to develop a reliable streaming application based on the direct Kafka API. I will have one producer and one consumer. I wanted to know the best approach to achieve reliability in my consumer. I can employ two solutions:
Increasing the retention time of messages in Kafka
Using writeahead logs
I am a bit confused about the use of write-ahead logs with the direct Kafka API, as there is no receiver, but the documentation says:
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
So I wanted to know which approach is best: does it suffice to increase the TTL of messages in Kafka, or do I also have to enable write-ahead logs?
I guess it would be good practice to avoid relying on only one of the above, since the backup data (retained messages, checkpoint files) can be lost and recovery could then fail.
The direct approach eliminates the data duplication problem because there is no receiver, and hence there is no need for write-ahead logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, the direct approach supports exactly-once message delivery semantics by default and does not use ZooKeeper; offsets are tracked by Spark Streaming within its checkpoints.
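The dependence on Kafka retention can be illustrated with a small model of offset ranges. This is a sketch under assumptions, not Spark code: `offset_ranges` and `recoverable` are hypothetical helpers, and the dicts map partition numbers to offsets the way a checkpoint and a broker might report them.

```python
def offset_ranges(checkpointed, latest):
    """Return {partition: (from_offset, until_offset)} for the next batch."""
    return {
        p: (checkpointed.get(p, 0), until)
        for p, until in latest.items()
        if until > checkpointed.get(p, 0)
    }


def recoverable(ranges, earliest_retained):
    """True if every range still starts at or after the oldest retained offset."""
    return all(
        start >= earliest_retained.get(p, 0)
        for p, (start, _until) in ranges.items()
    )
```

A batch defined purely by its offset ranges can be re-read after a failure only while Kafka still holds those offsets, which is why retention has to outlast the longest expected recovery time.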