Kafka use case to send data to external system - apache-kafka

Studying kafka in the documentation I found next sentence:
Queuing is the standard messaging type that most people think of: messages are produced by one part of an application and consumed by another part of that same application. Other applications aren't interested in these messages, because they're for coordinating the actions or state of a single system. This type of message is used for sending out emails, distributing data sets that are computed by another online application, or coordinating with a backend component.
It means that Kafka topics aren't suitable for streaming data to external applications. However, in our application, we use Kafka for such purpose. We have some consumers which read messages from Kafka topics and try to send them to an external system. With such approach we have a number of problems:
Need a separet topic for each external application (assume that the number of external application numbers > 300, doesn't suite well)
Messages to an external system can fail when the external application is unavailable or for some another reason. It is incorrect to keep retrying to send the same message and not to commit offset. Another way there is no nicely configured log when I can see all fail messages and try to resend them.
What are other best practice approach to stream data to an external application? OR Kafka is a good choice for the purpose?

Just sharing a piece of experience. We use Kafka extensively for integrating external applications in the enterprise landscape.
We use topic-per-event-type pattern. The current number of topics is about 500. The governance is difficult but we have our own utility tool, so it is feasible.
Where possible we extend an external application to integrate with Kafka. So the consumers become a part of the external application and when the application is not available they just don't pull the data.
If the extension of the external system is not possible, we use connectors, which are mostly implemented by us internally. We distinguish two type of errors: recoverable and not recoverable. If the error is not recoverable, for example, the message is corrupted or not valid, we log the error and commit the offset. If the error is recoverable, for example, the database for writing the message is not available then we do not commit the message, suspend consumers for some period of time and after that period try again. In your case it is probably makes sense to have more topics with different behavior (logging errors, rerouting the failed messages to different topics and so on)

Related

Which messaging system for a web dashboard?

I would like to make a Web Dashboard system and I am facing a problem. I need to get an information that is in the cache of one of the instances of my program, for this I had thought of doing Pub/Sub with Kafka however I don't know how to do to Publish and get a response from one of my Subscriber. Do you know a pattern that allows this and a service that allows me to do this?
EDIT: I would like to design an infrastructure that follows this pattern:
Attached diagram is showing simple request->response flow, Kafka is designed for different types of architecture, so IMHO you should not focus on Kafka in this case.
However, if you still want to use Kafka for some other reasons I can suggest to you two options:
Stick with request->response flow and use ReplyingKafkaTemplate or AggregatingKafkaTemplate to handle it, second one is an extension of first one, this adds functionality to handle more responses then one. You can send a request to Kafka topic from the Dashboard application, then poll the message by one of the Bot instances, next, send reply to reply topic, and then process reply in Dashboard application.
Use Kafka to implement Event-Carried State Transfer pattern, move state (mutual guilds data) from Bot Instances directly to Dashboard application via Kafka topic. You can use several tools to implement this:
Bot applications send events to Kafka topic via simple KafkaProducer or KafkaTemplate, then use one of the Kafka Connect sink connectors to save data in Dashboards database.
Bot applications send events to Kafka topic via simple KafkaProducer or KafkaTemplate. Run Kafka Streams thread in Dashboard application and build a state using Kafka Streams functionalities - grouping, aggregating etc. Then read the state directly from Kafka Streams internal RocksDB database.

Are Kafka and Kafka Streams right tools for our case?

I'm new to Kafka and will be grateful for any advice
We are updating a legacy application together with moving it from IBM MQ to something different.
Application currently does the following:
Reads batch XML messages (up to 5 MB)
Parses it to something meaningful
Processes data parallelizing this procedure somehow manually for parts of the batch. Involves some external legacy API calls resulting in DB changes
Sends several kinds of email notifications
Sends some reply to some other queue
input messages are profiled to disk
We are considering using Kafka with Kafka Streams as it is nice to
Scale processing easily
Have messages persistently stored out of the box
Built-in partitioning, replication, and fault-tolerance
Confluent Schema Registry to let us move to schema-on-write
Can be used for service-to-service communication for other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them to Kafka this way, as from how I understand it - Kafka is not a huge fan of big messages. Also it will let us parallelize processing on partition basis.
After that use Kafka Streams for actual processing and further on for aggregating some batch responses back using state store. Also to push some messages to some other topics (e.g. for sending emails)
But I wonder if it is a good idea to do actual processing in Kafka Streams at all, as it involves some external API calls?
Also I'm not sure what is the best way to handle the cases when this external API is down for any reason. It means temporary failure for current and all the subsequent messages. Is there any way to stop Kafka Stream processing for some time? I can see that there are Pause and Resume methods on the Consumer API, can they be utilized somehow in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages together? Sounds like an overcomplication
Is Kafka a good tool for these purposes at all?
Overall I think you would be fine using Kafka and probably Kafka Streams as well. I would recommend using streams for any logic you need to do i.e. filtering or mapping that you have todo. Where you would want to write with a connector or a standard producer.
While it is ideal to have smaller messages I have seen streams users have messages in the GBs.
You can make remote calls, to send and email, from a Kafka Streams Processor but that is not recommend. It would probably be better to write the event to send an email to an output topic and use a normal consumer to read and send the messages. This would also take care of your concern about the API being down as you can always remember the last offset in case and restart from there. Or use the Pause and Resume methods.

Kafka user - project design advise

I am new to Kafka and data streaming and need some advice for the following requirement,
Our system is expecting close to 1 million incoming messages per day. The message carries a project identifier. The message should be pushed to users of only that project. For our case, lets say we have projects A, B and C. Users who opens project A's dashboard only sees / receives messages of project A.
This is my idea so far on implementing solution for the requirement,
The messages should be pushed to a Kafka Topic as they arrive, lets call this topic as Root Topic. The messages once pushed to the Root Topic, can be read by a Kafka Consumer/Listener and based on the project identifier in the message can push that message to a project specific Topic. So any message can end up at Topic A or B or C. Thinking of using websockets to update the message as they arrive on the project users' dashboards. There will be N Consumers/Listeners for the N project Topics. These consumers will push the project specific message to the project specifc websocket endpoints.
Please advise if I can make any improvements to the above design.
Chose Kafka as the messaging system here as it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before it gets sent to the client. Will it makes sense to use Apache Flink or Hazelcast Jet for the streaming or Kafka streaming is good enough for this simple requirement.
Also, when should I consider using Hazelcast Jet or Apache Flink in my project.
Should i use Flink say when I have to update few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to arrive at a property value? or will using Jet bring some benefits even for my simple use case specified above. Please advise.
Kafka Streams are a great tool to convert one Kafka topic to another Kafka topic.
What you need is a tool to move data from a Kafka topic to another system via web sockets.
Stream processor gives you a convenient tooling to build this data pipeline (among others connectors to Kafka and web sockets and scalable, fault-tolerant execution environment). So you might want use stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is it's embedded scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
See how to use Jet to read from Kafka and how to write data to a TCP socket (not websocket).
I would like to give you another option. I'm not Spark/Jet expert at all, but I've studying them for a few weeks.
I would use Pentaho Data Integration(kettle) to consume from the Kafka and I would write a kettle step (or User Defined Java Class step) to write the messages to a Hazelcast IMAP.
Then, would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provided the Websockets for the end-users.

Kafka connect or Kafka Client

I need to fetch messages from Kafka topics and notify other systems via HTTP based APIs. That is, get message from topic, map to the 3rd party APIs and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice or I should go with Kafka Client.
Kafka clients when you have full control on your code and you are expert developer, you want to connect an application to Kafka and can modify the code of the application.
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Kafka Connect when you don’t have control on third party code new in Kafka and to you have to connect Kafka to datastores that you can’t modify code.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding few lines form other blogs to explain differences
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application as well because consumers also have the benefits of fault tolerance/scalability and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing using Kafka Connect would give you compared to using the consumer in this case would be that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
you should use kafka connect sink when you are using kafka connect source for producing messages to a specific topic.
for e.g. when you are using file-source then you should use file-sink to consume what source have been produced. or when you are using jdbc-source you should use jdbc-sink to consume what you have produced.
because the schema of the producer and sink consumer should be compatible then you should use compatible source and sink in both sides.
if in some cases the schemas are not compatible you can use SMT (Simple message transform) capability that is added since version 10.2 of kafka onward and you will be able to write message transformers to transfer message between incompatible producers and consumers.
Note: if you want to transfer messages faster I suggest that you use avro and schema registry to transfer message more efficiently.
If you can code with java you can use java kafka stream, Spring-Kafka project or stream processing to achieve what you desire.
In the book that is called Kafka In Actionis explained like following:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka that really can make it simple to use pieces that have been already been built to start your streaming journey.
As for your problem, Firstly, one of the simpliest questions that one should ask is if you can modify the application code of the systems from which you need data interaction.
Secondly, If you would write custom connector which have the in-depth knowledge the ability and this connector will be used by others, it worth it. Because it may help others that may not be the experts in those systems. Otherwise, this kafka connector is used only by yourself, I think you should write Kafka connector. So you can get more flexibility and can write more easily implementing.

Is Kafka suitable for running a public API?

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. All the features that seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. I can't see that being a problem seeing Kafka's architecture. The rate of messages probably won't be more than 10 per second.
Is integration with zookeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?
Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id to both nodes, forming a consumer group.
But, what if someone else in the world uses the same group.id? They would affect the first client, causing it to lose messages. There seems to be no security at that level.