How can we check like how many streams are supported by a Kafka cluster with 3 nodes
My project is related to videos, I am transferring video from source to destination in the form of meta-data using Kafka. i.e I am doing some process on the videos then forming metadata and then sending this data to Kafka using Kafka-Producer API through topic 'test-topic'. I am a consumer class which takes that data(meta-data) and do some process. Like this I have implemented 3Kafka processors. I am running each process 6 times for different kind of inputs, so totally I have 18 processors. But here my doubt is not related to Kafka processors. My doubt is related to Kafka-Cluster. How streams are related to kafka-cluster and is there any way to know the number of streams supported by a kafka cluster.
It will hugely depend on what your application is doing, the throughput, and so on. Some general resources to help you:
Elastic Scaling in the Streams API in Kafka
Kafka Streams Capacity planning and sizing
For general deployment sizing of your Kafka clusters, see the Enterprise Reference Architecture
Related
I am learning Apache Kafka (as a messaging system) and in that process came to know of term StateStore , link here
I am also aware of Apache kafka streams, the client API.
Is StateStore applicable for Apache kafka in the context of messaging systems or it is applicable to Apache Kafka Streams.
Does Apache have their "own" implementation of StateStore or use third party implementation (for example, rockdsb.
Can anyone help me understand this.
Adding an overview to the good concise explanation about StateStore in the context of Kafka Streams and your question.
Kafka Broker in a nutshell
In a messaging context your work simplified would be:
Publishing state (producing messages)
Saving messages for a period of time for later consumption (retention time)
Consuming state (getting the messages)
And in a nutshell #2 plus fault tolerance and keeping track of the position of your consumer groups' reads (offsets) is what a Kafka broker does for you.
Kafka client API's
Apart from that Kafka provides client libraries for your common patterns of working with messages:
Producer - Publish messages to Kafka topics
Consumer - Subscribe to Kafka topics
Connect - Create reliable integrations with external stores such as various DBMS.
Streams - DSL and utilities aimed to simplify development of common streaming application patterns.
Admin - Programmatically manage / monitor Kafka resources.
Kafka Streams State Stores
I'll quote the great explanation from the Streams Architecture docs (I highly recommend Kafka docs as they are built very good and for any level of experience).
Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you are calling stateful operators such as join() or aggregate(), or when you are windowing a stream.
As you can see the StateStore is used as a helper for extending the built-in abilities from a single message processing context to multi-message processing, thus enabling more complex functions over a bunch of messages (all the messages passed in a time window, aggregation functions over several messages, etc.)
I'll add to that that RocksDB is the default implementation used by Kafka and can be changed as was mentioned in previous answer.
Also if you want to explore more here is a link to the great intro videos form Apache Kafka's official docs:
Streams API intro videos
Have an awesome learning experience!
StateStore is applicable to kafka streams context.
Some processors like reduce or aggregate are stateful operations.
Kafka streams use state stores to manage this. By default, it uses rocksDB, but it is customizable.
I need to fetch messages from Kafka topics and notify other systems via HTTP based APIs. That is, get message from topic, map to the 3rd party APIs and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice or I should go with Kafka Client.
Kafka clients when you have full control on your code and you are expert developer, you want to connect an application to Kafka and can modify the code of the application.
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Kafka Connect when you don’t have control on third party code new in Kafka and to you have to connect Kafka to datastores that you can’t modify code.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding few lines form other blogs to explain differences
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application as well because consumers also have the benefits of fault tolerance/scalability and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing using Kafka Connect would give you compared to using the consumer in this case would be that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
you should use kafka connect sink when you are using kafka connect source for producing messages to a specific topic.
for e.g. when you are using file-source then you should use file-sink to consume what source have been produced. or when you are using jdbc-source you should use jdbc-sink to consume what you have produced.
because the schema of the producer and sink consumer should be compatible then you should use compatible source and sink in both sides.
if in some cases the schemas are not compatible you can use SMT (Simple message transform) capability that is added since version 10.2 of kafka onward and you will be able to write message transformers to transfer message between incompatible producers and consumers.
Note: if you want to transfer messages faster I suggest that you use avro and schema registry to transfer message more efficiently.
If you can code with java you can use java kafka stream, Spring-Kafka project or stream processing to achieve what you desire.
In the book that is called Kafka In Actionis explained like following:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka that really can make it simple to use pieces that have been already been built to start your streaming journey.
As for your problem, Firstly, one of the simpliest questions that one should ask is if you can modify the application code of the systems from which you need data interaction.
Secondly, If you would write custom connector which have the in-depth knowledge the ability and this connector will be used by others, it worth it. Because it may help others that may not be the experts in those systems. Otherwise, this kafka connector is used only by yourself, I think you should write Kafka connector. So you can get more flexibility and can write more easily implementing.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I would like to ask if my understanding of Kafka is correct.
For really really big data stream, conventional database is not adequate so people use things such as Hadoop or Storm. Kafka sits on top of said databases and provide ...directions where the real time data should go?
I don't think so.
Kafka is messaging system and it does not sit on top of database.
You can compare Kafka with messaging systems like ActiveMQ, RabbitMQ etc.
From Apache documentation page
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Key takeaways:
Kafka maintains feeds of messages in categories called topics.
We'll call processes that publish messages to a Kafka topic producers.
We'll call processes that subscribe to topics and process the feed of published messages consumers..
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol.
Use Cases:
Messaging: Kafka works well as a replacement for a more traditional message broker. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ
Website Activity Tracking: The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds
Metrics: Kafka is often used for operational monitoring data, which involves aggregating statistics from distributed applications to produce centralized feeds of operational data
Log Aggregation
Stream Processing
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
Commit Log: Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data
To fully understand Apache Kafka's role you should get a wider picture and know Kafka's use cases. Modern data processing systems try to break with the classic application architecture. You can start from the kappa architecture overview:
http://milinda.pathirage.org/kappa-architecture.com
In this architecture you don't store the current state of the world in any SQL or key-value database. All data is processed and stored as one or more series of events in an append-only immutable log. Immutable events are easier to replicate and store in a distributed environment. Apache Kafka is a system that is used storing these events and for brokering them between other system components.
Use cases on Apache Kafka's official site: http://kafka.apache.org/documentation.html#uses
More use cases :-
Kafka-Storm Pipeline -
Kafka can be used with Apache Storm to handle data pipeline for high speed filtering and pattern matching on the fly.
Apache Kafka is not just a message broker. It was initially designed and implemented by LinkedIn in order to serve as a message queue. Since 2011, Kafka has been open sourced and quickly evolved into a distributed streaming platform, which is used for the implementation of real-time data pipelines and streaming applications.
It is horizontally scalable, fault-tolerant, wicked fast, and runs in
production in thousands of companies.
Modern organisations have various data pipelines that facilitate the communication between systems or services. Things get a bit more complicated when a reasonable number of services needs to communicate with each other at real time.
The architecture becomes complex since various integrations are required in order to enable the inter-communication of these services. More precisely, for an architecture that encompasses m source and n target services, n x m distinct integrations need to be written. Also, every integration comes with a different specification, meaning that one might require a different protocol (HTTP, TCP, JDBC, etc.) or a different data representation (Binary, Apache Avro, JSON, etc.), making things even more challenging. Furthermore, source services might address increased load from connections that could potentially impact latency.
Apache Kafka leads to more simple and manageable architectures, by decoupling data pipelines. Kafka acts as a high-throughput distributed system where source services push streams of data, making them available for target services to pull them at real-time.
Also, a lot of open-source and enterprise-level User Interfaces for managing Kafka Clusters are available now. For more details refer to my answer to this question.
You can find more details about Apache Kafka and how it works in the blog post "Why Apache Kafka?"
Apache Kafka is an open-source software platform written in Scala and Java, mainly used for stream processing.
The use cases of Apache Kafka are:
Messaging
Website Activity Tracking
Metrics
Log Aggregation
Stream Processing
Event Sourcing
Commit Log
For more information use the official apache Kafka site.
https://kafka.apache.org/uses
Kafka is a pub-sub highly scalable messaging system. It acts as a transport layer guaranteeing exactly once semantics and Spark steaming does the processing. The next question that comes to my mind is even spark can poll directories to check for files and even read from a socket or port. How this Kafka and spark work in tandem ? I mean does an application written in some language instead of writing to a database for storage directly feds to the port (or places the files which would not really be tak time and would rather be some kind of batch processing) from which the data is then read by a Kafka producer and then via the Kafka consumer API is then read and processing by spark streaming?
I am developing a mobile messaging app. I was going through technology needed and found two MQTT & Apache Kafta. To me both seems doing the same thing in the same way (in terms of subscribing & publishing to a topic).
I heard that MQTT is fit for mobiles as it is very light weight ? So basically what is the difference between these two and what are the advantage of each on other?
The main motive behind Kafka is scalability.
MQTT is a protocol with public specification for lightweight client / message broker communications, allowing publish/subscribe exchanges. Multiple implementations of client libraries and brokers (Mosquitto, JoramMQ...) exist and are virtually compatible. MQTT just specifies the transport, and vaguely the application part (i.e. how data is handled and possibly stored, how clients are authorized...). The spec is not clear if data consumed on a topic is only real-time or possibly persistent. The spec doesn't state anything about how the message broker implementing MQTT could/should scale.
On the other hand, Apache Kafka is a message broker based on an internal "commit log": its focus is storing massive amounts of data on disk, and allowing consumption in real-time or later (as long as data is still available on disk). It's designed to be deployable as cluster of multiple nodes, with good scalability properties. Kafka uses its own network protocol.
So you are comparing two different things here: a standard pub/sub protocol (with multiple implementations), and a specific message storing/distributing software, vaguley of the same family with its own protocol.
I'd say that if you need to store massive amount of messages, to ensure batch processing, look more at Kafka. If you have lots of clients/apps exchanging messages in real-time on many independent topics look more at the MQTT (or even AMQP) message broker implementations.
MQTT is a standard protocol (with many implementations). Kafka (which is also a protocol) is normally used by downloading it from the Apache website or e.g. a Confluent Docker image.
It is like comparing apples and oranges, both exist for very different reasons.
Most use cases I see in IoT environments combine both MQTT and Apache Kafka. The edge devices speak MQTT protocol (for the benefits it has in edge environments. These are then forwarded to Apache Kafka to get the events into the rest of the enterprise architecture.
You can do this either via a MQTT Broker like HiveMQ + Apache Kafka or via a MQTT Proxy (so that you don't need the MQTT Broker). Both options have trade-offs, of course.
See this example of how to combine MQTT with Apache Kafka. Or go directly to the Github code: "Deep Learning UDF for KSQL for Streaming Anomaly Detection of MQTT IoT Sensor Data".
I also created a live demo about how to integrate Apache Kafka and MQTT.
I am writing a clustered application sitting on top of Kafka -- it uses Kafka exclusively for interprocess communications and coordination. I could use Zookeeper to manage my cluster -- but it would not be very difficult to use Kafka topics to manage the cluster. And the more I think about it, other than for historical reasons, it seems like Kafka could drop Zookeeper and just use a topic-based solution
For example, there could be a special topic or topics in Kafka where you publish all of the same data currently kept track of in Zookeeper. Brokers, Topics, Partitions, Leaders, etc -- seems like this is just as easily tracked via Kafka topics as via Zookeeper.
I know in Kafka 0.9.0 there's some movement away from Zookeeper, more towards this model, and remember my question is less about Kafka development or more me trying to figure out which direction to go in my application.
I'm not asking for an opinion -- what I want to know is are there any specific functions provided by Zookeeper that are going to be difficult with a Kafka/topic-based approach to coordination. But I can't think of anything.
Even heartbeat monitoring -- which was the reason I started looking at Zookeeper in the first place -- you could have a client connection topic, and clients could publish to it when they join the cluster, publish heartbeats at a given interval, and publish as they leave it.
Let us start from a space eyed view: You have two distributed
systems which store data. Zookeeper organizes it's data in nodes in some kind
of directory like structure. Kafka stores messages within topics.
From a bird eye view kafka is build for high-throughput and scalability while one of zookeepers
main design goal is consistency. Zookeeper is mean to be a a Distributed Coordination Service for
Distributed Applications while Kafka can be thought as a distributed commit log.
So the answer to your question is surprisingly: 'It depends'. For coordinating
a distributed system I would use zookeeper: Thats what it was build for. You could
do this also with kafka but there are couple of things which needs to be done
manualy which comes out of the box if you are using zookeeper.
Some examples:
Consistency: The ZK-Client can choose if he needs strong or a eventual consistency
Ephemeral nodes: Together with ZK-Watches a great thing to react on failing services
Sequential Consistency: It's not granted that you recieve the kafka-message in the order you wrote it to the broker (it's only granted that messages within a partion are ordered)
ACLs: Never used it but its at least something which is not offered out of the box by kafka
Sequence Nodes
A pretty nice overview about what you can do with zookeeper are the zookeeper-recipes: https://zookeeper.apache.org/doc/trunk/recipes.html
[EDIT]: Heartbeating an application using kafka is of course possible. But ephemeral nodes in zookeeper are in my eyes the easier option.
This is currently being worked on in scope of KIP-500.