Using Kafka to Transfer Files between two clients

I have a Kafka cluster set up between two machines (machine#1 and machine#2), with the following configuration:
1) Each machine is configured to run one broker and one ZooKeeper instance.
2) The server and ZooKeeper properties are configured for a multi-broker, multi-node ZooKeeper setup.
I currently have the following understanding of KafkaProducer and KafkaConsumer:
1) If I send a file from machine#1 to machine#2, it's broken down into lines using some default delimiter (LF or \n).
2) Therefore, if machine#1 publishes 2 different files to the same topic, that doesn't mean that machine#2 will receive the two files intact. Instead, every line will be appended to the topic's log partitions and machine#2 will read them from the log partitions in the order of arrival, i.e. the order is not necessarily:
file1-line1
file1-line2
end-of-file1
file2-line1
file2-line2
end-of-file2
but it might be something like:
file1-line1
file2-line1
file1-line2
end-of-file1
file2-line2
end-of-file2
Assuming that the above is correct (I'm happy to be wrong), I believe that simple producer/consumer usage is not the correct approach for transferring files (the Connect API is probably the solution here). Since the Kafka website says that "Log Aggregation" is a very popular use case, I was wondering if someone has any example projects or websites that demonstrate file exchange using Kafka.
P.S. I know that, by definition, the Connect API is for reliable data exchange between Kafka and "other" systems - but I don't see why the other system cannot also be Kafka. So I am hoping that my question doesn't have to focus on "other", non-Kafka systems.

Your understanding is correct. However, if you want to preserve order, you can use just 1 partition for that topic,
so the order in which machine#2 reads will be the same as the order in which you sent.
However, this will be inefficient and will lack the parallelism for which Kafka is widely used.
Kafka has an ordering guarantee only within a partition. To quote the documentation:
"Kafka only provides a total order over records within a partition, not between different partitions in a topic."
In order to send all the lines from a file to only one partition, send an additional key with each record from the producer client; the producer hashes the key, so every message with the same key lands in the same partition.
This will make sure you receive the events from one file in the same order on machine#2. If you have any questions feel free to ask, as we use Kafka in production for ordering guarantees of events generated from multiple sources, which is basically your use case as well.
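For illustration, here is a minimal sketch of such a keyed producer that uses the file name as the record key; the topic name, bootstrap servers, and file path are illustrative assumptions, not part of the original question:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileLineProducer {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "machine1:9092,machine2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String fileName = "file1.txt";
            for (String line : Files.readAllLines(Paths.get(fileName))) {
                // Same key => same partition => lines of one file stay in order.
                producer.send(new ProducerRecord<>("file-transfer", fileName, line));
            }
        }
    }
}
```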

Related

Apache Kafka as a REST Replacement?

I would like to harness the speed and power of Apache Kafka to replace REST calls in my Java application.
My app has the following architecture:
Producer P1 writes a search query to topic search
Consumer C1 reads/consumes the search query and produces search results which it writes to another topic search_results.
Both Producer P1 and Consumer C1 are part of a group of producers/consumers living on different physical machines. I need the Producer P1 server to be the one to consume/read the search results output produced by Consumer C1, so it can serve the search results to the client who submitted the search query.
The above example was simplified for demonstration purposes - in reality the process entails several additional intermediate Producers and Consumers where the query may be thrown to/from multiple servers to be processed. The main point is that the value produced by the last Producer needs to be read/consumed by the first Producer.
In the typical Apache Kafka architecture, there's no way to ensure that the final output is read by the server that originally produced the search query - as there are multiple servers reading the same topic.
I do not want to use REST for this purpose because it is very slow when processing thousands of queries. Apache Kafka can handle millions of queries with 10-millisecond latency, and in my particular use case it is critical that the query is transmitted with sub-millisecond speed. Scaling with REST is also much more difficult: suppose our traffic increases and we need to add a dozen more servers to intercept client queries. With Apache Kafka it's as simple as spinning up new servers and adding them to the Producer P1 group; with REST, not so simple. Apache Kafka also provides a very high level of decoupling, which REST does not.
What design/architecture can be used to force a specific server/producer to consume the end result of the initial query?
Thanks
In the typical Apache Kafka architecture, there's no way to ensure that the final output is read by the server that originally produced the search query - as there are multiple servers reading the same topic.
You can use a custom partitioner in your producer that determines which search query lands in which partition.
Similarly, you can use a custom partition assignor in the consumer to determine which partitions are assigned to which consumer; the relevant consumer configuration is partition.assignment.strategy.
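As a minimal sketch of the producer side (the keying scheme and class name are illustrative, not from the original answer), a custom partitioner could look like this:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class QueryPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Key the query by the originating server's ID so every query from
        // (and every result for) that server maps to the same partition.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

Register it on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, QueryPartitioner.class.getName()), and implement the mirror-image logic in a partition assignor referenced by the consumer's partition.assignment.strategy.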
The fact that Kafka is faster than REST is due to the way it is implemented. What is important here is to decide which pattern works for you: request-response, publish-subscribe, or something else. You can check this answer for REST vs Kafka.
Maybe it makes sense to have multiple topics for the answers, not just one big topic: this way the "results" topics act as "mailboxes".
You'll probably need to set auto.create.topics.enable=true, since pre-creating topics for all of P1, ..., PN could be complicated.
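A minimal sketch of this mailbox pattern, assuming per-instance topic names like search_results.p1 and a convention of carrying the reply topic in the record key (all names here are illustrative assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SearchFrontend {
    public static void main(String[] args) {
        String instanceId = "p1";                           // unique per frontend server
        String replyTopic = "search_results." + instanceId; // this server's mailbox

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Put the reply topic into the key (or a header) so the last hop in the
        // pipeline knows which mailbox to write the final result to.
        try (Producer<String, String> producer = new KafkaProducer<>(pp)) {
            producer.send(new ProducerRecord<>("search", replyTopic, "my search query"));
        }

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "frontend-" + instanceId);
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(Collections.singletonList(replyTopic));
            while (true) {
                ConsumerRecords<String, String> results = consumer.poll(1000);
                results.forEach(r -> System.out.println("result: " + r.value()));
            }
        }
    }
}
```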

Kafka Streams application does not consume data after restart

After I restarted our Kafka cluster, my Kafka Streams application didn't receive messages from its input topic and I got a "can't create internal topic" exception. After some research, I reset the input topic and the application with the Kafka tool kafka-streams-application-reset.sh.
Unfortunately, it didn't resolve the problem, and I got the exception again.
From the error message you can infer that the topic already exists and thus cannot be created. The reason for the failure is that the existing topic does not have the expected number of partitions (it has 1 instead of 150); if the number of partitions matched, Kafka Streams would just use the existing topic.
This can happen if you have topic auto-creation enabled at the brokers (and the topic was created with the wrong number of partitions), or if the number of partitions of your input topic changed. Kafka Streams does not automatically change the number of partitions for the repartition topic, because this might result in data corruption and thus lead to incorrect results.
One way to fix this is to manually delete the topic; note that this might result in data loss, and you should only do it if you know that it is what you want.
Another (better) way is to reset the application cleanly using bin/kafka-streams-application-reset.sh in combination with KafkaStreams#cleanUp().
Because you need to clean up the application, and users should be aware of the implications, Kafka Streams fails fast to make the user aware of the issue instead of "auto-magically" taking actions that might be undesired from the user's point of view.
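A minimal sketch of the clean reset (the application ID and topology below are illustrative): first run bin/kafka-streams-application-reset.sh --application-id my-app with the relevant topic options, then start the application with a local-state cleanup:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CleanRestart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");           // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                   // illustrative topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.cleanUp();   // wipe the local state directory; must be called before start()
        streams.start();
    }
}
```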
Check out the docs for more details. There is also a blog post that explains application reset in detail:
https://kafka.apache.org/11/documentation/streams/developer-guide/app-reset-tool.html
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/

Kafka: multiple consumers in the same group

Let's say I have a Kafka cluster with several topics spread over several partitions. Also, I have a cluster of applications acting as clients for Kafka. Each application in that cluster has a client subscribed to the same set of topics, identical across the whole cluster, and all of these clients share the same Kafka group ID.
Now, speaking of commit mode: I really do not want to specify offsets manually, but I do not want to use auto-commit either, because I need to do some handling after I receive my data from Kafka.
With this setup, I expect the "same data received by different consumers" problem to occur, because I do not specify an offset before I read (consume), and I read data concurrently from different clients.
Now, my question: what are the solutions to get rid of multiple reads? A couple of options come to mind:
1) Exclusive (sequential) Kafka access: until one consumer has committed its read, no other consumer accesses Kafka.
2) Somehow specify the offset before each read. I do not even know how to do that given that a read might fail (and the offset would not be committed); we would need some complicated distributed offset storage.
I'd like to ask people experienced with Kafka to recommend something to achieve the behavior I need.
Every partition is consumed by only one client: another client with the same group ID won't get access to that partition, so concurrent reads won't occur...
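For the commit concern, a minimal sketch of a consumer that disables auto-commit and commits only after its post-receive handling (topic names, group ID, and the handle method are illustrative assumptions):

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "shared-group-id");   // same ID on every app instance
        props.put("enable.auto.commit", "false");   // commit manually after handling
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("topic-a", "topic-b"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);        // your post-receive processing
                }
                consumer.commitSync();     // offsets advance only after processing succeeded
            }
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```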

Correlating in Kafka and dynamic topics

I am building a correlated system using Kafka. Suppose there's a service A that performs data processing, and there are thousands of its clients B that submit jobs to it. Bs are short-lived: they appear on the network, push their data to A, and then two important things happen:
B will immediately receive a status from A;
B then will either drop out completely, stay online to receive further updates on status, or sporadically pop back on to check the status.
(This is not dissimilar to grid computing or MPI.)
Both points should be achieved using the well-known concept of a correlation ID: B possesses a unique ID (a UUID in my case), which it sends to A in the headers; A, in turn, uses it as the reply-to topic on which to send status updates. This means topics have to be created on the fly; they can't be predetermined.
I have auto.create.topics.enable switched on, and it indeed creates topics dynamically, but existing consumers are not aware of them and need to be restarted [to fetch topic metadata, I suppose, if I understood the docs right]. I also checked the consumer's metadata.max.age.ms setting, but it doesn't seem to help, even if I set it to a very low value.
As far as I've read, this is as yet unanswered (i.e.: kafka filtering/Dynamic topic creation, kafka consumer to dynamically detect topics added, Can a Kafka producer create topics and partitions?) or answered unsatisfactorily.
As there are hundreds of As and thousands of Bs, I can't possibly use shared topics or anything like that, lest I overload my network. I could use Kafka's admin tools (AdminClient, I believe) to pre-create topics, but I find that somewhat silly (even though I have seen real-life examples of people using them to talk to the ZooKeeper and Kafka infrastructure itself).
So the question is: is there a way to dynamically create Kafka topics in a way that makes both consumer and producer aware of them without being restarted or anything? And, in the worst case, will the admin tools really help, and on which side must I use them: A or B?
Kafka 0.11, Java 8
UPDATE
Creating topics with AdminClient doesn't help for whatever reason; consumers still throw LEADER_NOT_AVAILABLE when I try to subscribe.
OK, so I'll answer my own question.
1) Creating topics with AdminClient works only if performed before the corresponding consumers are created (see the sketch after this list).
2) I changed my topology, taking 1) into account and introducing an exchange of correlation IDs in message headers (same as in JMS). I also had to implement certain topology-management methodologies, grouping Bs into containers.
It should be noted that, as many people have said, this only works when each B is in a single-consumer group and listens to a topic with 1 partition.
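A minimal sketch of point 1): create the reply topic and wait for the creation to complete before constructing and subscribing the consumer. The broker address, topic naming scheme, and replication factor are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplyTopicBootstrap {
    public static void main(String[] args) throws Exception {
        String replyTopic = "reply-" + UUID.randomUUID();   // correlation-ID topic

        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(adminProps)) {
            admin.createTopics(Collections.singleton(new NewTopic(replyTopic, 1, (short) 1)))
                 .all()
                 .get();   // block until the topic actually exists
        }

        // Only now create and subscribe the consumer, so it never sees a
        // not-yet-existing topic (and no LEADER_NOT_AVAILABLE).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", replyTopic);          // single-consumer group
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
        consumer.subscribe(Collections.singletonList(replyTopic));
    }
}
```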
To get some idea of the work I'm into, you might have a look at the middleware framework I've been working on: https://github.com/ikonkere/magic.
Creating an unbounded number of topics is not recommended. I'd advise you to redesign your topology/system.
I've thought of making dynamic topics myself, but then realized that eventually ZooKeeper will fail as it runs out of memory due to stale topics (imagine how many topics could exist a year from now). Maybe this could work if you make sure you have some upper bound on the topics ever created; overall, it is an administrative headache.
If you look up using Kafka for request-response, you will find that others also say it is awkward to do so (Does Kafka support request-response messaging?).

Create Kafka Topic using Vert.x

How can we create producers for different topics using Vert.x? The Vert.x zanox module does a pretty good job, but it seems limited to one topic: there is no way to send messages to a desired topic, as it sticks to the one topic given in the config file.
If you are using the zanox module (like me), just deploy one module instance per topic that you need to produce to, passing in a different configuration file (with the appropriate topic name) for each.
Not the most efficient approach, I agree, but short of writing your own Kafka module/integration this is the only option available.
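For what it's worth, a minimal sketch of that per-topic deployment in a Vert.x 2 verticle; the module identifier and config keys below are placeholders, so check the zanox module's documentation for the real ones:

```java
import org.vertx.java.core.json.JsonObject;
import org.vertx.java.platform.Verticle;

public class KafkaModuleDeployer extends Verticle {
    @Override
    public void start() {
        // One module instance per topic, each with its own config.
        for (String topic : new String[] {"topic-a", "topic-b"}) {
            JsonObject config = new JsonObject()
                    .putString("topic", topic)                    // config key assumed
                    .putString("broker.list", "localhost:9092");  // config key assumed
            // Module identifier below is a placeholder for the actual
            // zanox Kafka module coordinates.
            container.deployModule("com.zanox~mod-kafka~1.0.0", config);
        }
    }
}
```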