What is the difference between Apache Kafka and GCP Pub/Sub? [closed] - apache-kafka

What is the difference between Apache Kafka and GCP Pub/Sub? When should I use Kafka and when should I use Pub/Sub?

Since you did not provide your use case, I will state below the main characteristics of each tool.
PubSub:
It is an asynchronous cloud messaging service provided by Google Cloud that decouples senders and receivers. It offers high availability and consistent performance at scale.
No Ops: in PubSub you do not need to worry about partitions and shards.
Scalability: it is built in and requires no operational effort; scaling is handled automatically (see the sketch after this list).
Monitoring: you can monitor your process at a Topic and Subscription level within StackDriver.
Access management: you can configure access at a project, Topic and Subscriber level.
Reliability: it guarantees that each message will be delivered at least once. However, it does not guarantee ordering (which can be handled in Dataflow).
Message retention in PubSub: the minimum is 10 minutes and the maximum is 7 days.
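To make the "No Ops" point concrete, here is a minimal publishing sketch using the google-cloud-pubsub Java client (the project ID "my-project" and topic ID "my-topic" are placeholders); notice that nothing in it mentions partitions or shards:

    import com.google.api.core.ApiFuture;
    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;

    public class PublishExample {
        public static void main(String[] args) throws Exception {
            // Placeholder project and topic IDs
            TopicName topic = TopicName.of("my-project", "my-topic");
            Publisher publisher = Publisher.newBuilder(topic).build();
            try {
                PubsubMessage message = PubsubMessage.newBuilder()
                        .setData(ByteString.copyFromUtf8("hello from Pub/Sub"))
                        .build();
                // Publishing is asynchronous; the future resolves to the server-assigned message ID
                ApiFuture<String> messageId = publisher.publish(message);
                System.out.println("Published message " + messageId.get());
            } finally {
                publisher.shutdown();
            }
        }
    }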
Kafka:
It is an open-source distributed publish-subscribe messaging ecosystem. It can be used on-prem or deployed in cloud environments.
Scalability: it does not scale automatically; you need to increase partitions, replicas, etc. manually (see the sketch after this list).
Ordering: it supports ordered messages at the partition level.
Reliability: it guarantees no data loss.
Monitoring: it offers various types of built-in monitoring systems.
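To illustrate the manual-scalability point above, here is a minimal sketch using Kafka's AdminClient (the broker address, topic name, partition count, and replication factor are arbitrary examples, not recommendations):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // You choose the partition count and replication factor up front,
                // and must change them yourself if the load grows.
                NewTopic topic = new NewTopic("orders", 6, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }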
Note that I have only shared the main characteristics of each tool; there are many others that may be more relevant for your use case. Here are some links where you can find more information about each: 1, 2, 3.

Related

Kafka operations [closed]

Hi, I'm very new to Kafka operations. All I understand is that event data is stored in so-called topics. These topics are like logs; they are written to disk and even replicated.
What are producers and consumers? Are they essentially just parts of the application, like microservices, where one produces data and another requests data?
My question is what exactly is the difference between a conventional database and Kafka topics?
Is it just that the data type is different?
In databases, objects are stored and in topics events are stored? They are both written to hard disk?
What problem does Kafka actually solve?
There are some problems with decentralised microservices that have dependencies across services.
How does Kafka solve this problem?
Thanks everyone
First off, producers and consumers can be part of the same application. You don't need to have "microservices" to use Kafka.
one produces data and another requests data?
Yes
what exactly is the difference between a conventional database and Kafka topics?
Unclear what you consider a "conventional" database, but Kafka itself has no query capabilities nor any defined record schema. Such features are enabled by external tooling.
They are both written to hard disk?
Not all databases write to disk. Kafka does write to disk.
What problem does Kafka actually solve?
There are use cases mentioned on the website, but the original goal was log/metric aggregation into a data lake, not intra-service communication.
But if you have a point-to-point-to-point dependency chain, you need to ensure all applications in that chain are up, whereas with a replicated log they could instead fail occasionally and pick up from where they stopped reading.
Data is stored in so-called topics. These topics are like logs and are written to disk and even replicated.
Data in Kafka is seen as events. Each event usually represents that something happened. The event is stored in a given topic on a Kafka broker. The topic can be seen as a way to organize data into categories.
What are producers and consumers?
Producers create events and submit them to Kafka brokers, which then store these events in the appropriate topic. Consumers then pull the events that producers wrote to that topic.
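For illustration only, here is a minimal producer/consumer sketch with the Java client; the broker address, topic name ("orders"), and group ID are made-up placeholders:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerConsumerExample {
        public static void main(String[] args) {
            // Producer: creates an event and submits it to a broker, which stores it in the "orders" topic
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}"));
            }

            // Consumer: pulls the events that producers wrote to the topic
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "order-processors");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("orders"));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }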
My question is what exactly is the difference between a conventional database and Kafka topics?
Hard to define "conventional". But I suppose no, Kafka is not a conventional database. You will probably often find yourself using other databases alongside Kafka. Kafka is primarily suited for capturing real-time events and storing them in order to direct them elsewhere in real time (historical retrieval is also possible).
What problem does Kafka actually solve?
Handling anything that requires event streaming. It does so durably and provides a large amount of guarantees and flexibility in handling large amounts of data.
In conclusion: I would suggest you start by going through the first part of the documentation found at Kafka Documentation.
If you really want to dive in, you can also find a book titled Kafka: The Definitive Guide.

Kafka Connect vs Streams for Sinks [closed]

I am trying to understand what Connect buys you that Streams does not. We have a part of our application where we want to consume a topic and write to MariaDB.
I could accomplish this with a simple processor: read the record, store it in a state store, and then bulk insert into MariaDB.
Why is this a bad idea? What does JDBC Sink Connector buy you?
Great question! It's all about using the right tool for the job. Kafka Connect's specific purpose is streaming integration between source systems and Kafka, or from Kafka down to other systems (including RDBMS).
What does Kafka Connect give you?
Scalability; you can deploy multiple workers and Kafka Connect will distribute tasks across them
Resilience; if a node fails, Kafka Connect will restart the work on another worker
Ease of use; connectors exist for numerous technologies, so implementing a connector usually means just a few lines of JSON
Schema management; support for schemas in JSON, full integration with the Schema Registry for Avro, pluggable converters from the community for Protobuf
Inline transformations with Single Message Transform
Unified and centralised management and configuration for all your integration tasks
That's not to say that you can't do this in Kafka Streams, but you would end up having to code a lot of this yourself, when it's provided out of the box for you by Kafka Connect. In the same way you could use the Consumer API and a bunch of bespoke code to do the stream processing that Kafka Streams API gives you, similarly you could use Kafka Streams to get data from a Kafka topic into a database—but why would you?
If you need to transform data before it's sent to a sink then a recommended pattern is to decouple the transformation from the sending. Transform the data in Kafka Streams (or KSQL) and write it back to another Kafka topic. Use Kafka Connect to listen to that new topic and write the transformed messages to the target sink.
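To sketch the Streams half of that pattern (the topic names "orders" and "orders-transformed" and the uppercase transformation are placeholders, not part of the original question):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class TransformExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-transformer");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> orders = builder.stream("orders");
            orders.mapValues(value -> value.toUpperCase())   // stand-in for a real transformation
                  .to("orders-transformed");                 // a JDBC sink connector would read this topic
            new KafkaStreams(builder.build(), props).start();
        }
    }

The JDBC Sink Connector is then pointed at "orders-transformed" and handles the actual writes to MariaDB, along with retries, scaling across workers, and schema handling.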

Confluent Platform or Solace? [closed]

Our enterprise has both Solace and Confluent Platform capabilities.
While Solace also supports real-time streaming and an appliance-based offering, why and when should an enterprise go with Confluent Platform?
Answer from an employee of Solace Corporation:
This is a great question. In fact, at Solace we are working on a more thorough blog/document to answer this exact question. We plan to put the details on the Solace site in the next week or so. I will post the URL when this is available.
Kafka was designed to be a batched (micro-batch) log aggregation system. Its primary purpose was to deal with large volumes of data with a focus on data at rest. The default quality of service (QoS) is rather low, which allows high throughput at the expense of higher latency, potential data loss, out-of-order delivery, and weak security enforcement. While it is possible to use the thick client API to improve QoS with Kafka, it comes at a great expense of performance, throughput, and latency. Kafka is also generally restricted to a publish/subscribe Message Exchange Pattern (MEP).
Confluent adds some extensions to Apache Kafka that improve administration, but it still uses the same Apache Kafka core and suffers from the same issues.
Solace was designed as a high-performance, low-latency, extremely reliable distributed event-driven messaging system targeted at data in motion. Solace supports all modern Message Exchange Patterns (MEPs) and natively supports industry standards and accepted specifications such as REST, WebSockets, AMQP, MQTT, and JMS, without requiring adapters or gateways. Solace also provides a set of Solace/Kafka source and sink connectors if you require movement of data between Solace and Confluent (Kafka). The connectors make it easy to use Solace and Kafka together.
Solace also provides security and the highest level of QoS while maintaining predictable throughput and latency, even with extremely high client connection counts. This is why Solace is used by financial institutions, government agencies, manufacturers, connected-vehicle applications, etc. for their most stringent MEP requirements of no data loss, no duplication, and ordered processing with 24/7/365 processing support. You cannot lose or duplicate multimillion-dollar transactions! A recent quote from a financial client (RBC) discusses how their globally connected Solace event mesh reliably processes 65 billion messages a day.
If your requirement is large volumes of data-at-rest processing with low QoS or security requirements, Confluent may be your choice. If you have high QoS requirements, stringent security, and real-time data-in-motion processing with advanced MEPs and 24/7/365 processing, Solace is your best choice. If you have both requirements, the Solace connectors will provide bi-directional integration.

I am evaluating Google Pub/Sub vs Kafka. What are the differences? [closed]

I have not worked with Kafka much, but I want to build a data pipeline in GCE, so I wanted to compare Kafka and Pub/Sub. Basically, I want to know how message consistency, availability, and reliability are maintained in both Kafka and Pub/Sub.
Thanks
In addition to Google Pub/Sub being managed by Google and Kafka being open source, the other difference is that Google Pub/Sub is a message queue (e.g. RabbitMQ) whereas Kafka is more of a streaming log. You can't "re-read" or "replay" messages with Pub/Sub. (EDIT - as of Feb 2019, you CAN replay messages and seek backwards in time to a certain timestamp, per comment below)
With Google Pub/Sub, once a message is read out of a subscription and ACKed, it's gone. In order to have more copies of a message to be read by different readers, you "fan-out" the topic by creating "subscriptions" for that topic, where each subscription will have an entire copy of everything that goes into the topic. But this also increases cost because Google charges Pub/Sub usage by the amount of data read out of it.
With Kafka, you set a retention period (I think it's 7 days by default) and the messages stay in Kafka regardless of how many consumers read it. You can add a new consumer (aka subscriber), and have it start consuming from the front of the topic any time you want. You can also set the retention period to be infinite, and then you can basically use Kafka as an immutable datastore, as described here: http://stackoverflow.com/a/22597637/304262
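As a sketch of that replay behaviour, a brand-new consumer group can start from the oldest retained record simply by setting auto.offset.reset=earliest (the topic name and group ID below are made up):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplayExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-replay");  // a fresh group with no committed offsets
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the oldest retained record
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders"));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }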
Amazon AWS Kinesis is a managed version of Kafka, whereas I think of Google Pub/Sub as a managed version of RabbitMQ.
Amazon SNS with SQS is also similar to Google Pub/Sub (SNS provides the fanout and SQS provides the queueing).
I have been reading the answers above and I would like to complement them, because I think some details are still missing:
Fully managed system: both systems can have a fully managed version in the cloud. Google provides Pub/Sub, and there are fully managed Kafka versions out there that you can run in the cloud or on-prem.
Cloud vs on-prem: I think this is a real difference between them, because Pub/Sub is only offered as part of the GCP ecosystem, whereas Apache Kafka can be used both as a cloud service and on-prem (doing the cluster configuration yourself).
Message duplication
- With Kafka, consumers track their position using offsets. Older clients stored these offsets in external storage such as Apache ZooKeeper, while modern clients commit them to an internal Kafka topic; either way, this is how you track which messages the consumers have read so far. Pub/Sub works by acknowledging messages: if your code doesn't acknowledge a message before the deadline, the message is sent again. You can avoid duplicated messages by acknowledging in time, or by using Cloud Dataflow's PubsubIO.
Retention policy: both Kafka and Pub/Sub have options to configure the maximum retention time; by default, I think it is 7 days.
Consumer groups vs subscriptions: be careful how you read messages in each system. Pub/Sub uses subscriptions: you create a subscription and then start reading messages from it. Once a message is read and acknowledged, it is gone for that subscription. Kafka uses the concepts of "consumer group" and "partition": every consumer process belongs to a group, and when a message is read from a specific partition, no other consumer process in the same consumer group will read that message again (because the offset eventually increases). You can see the offset as a pointer that tells each process which message to read next.
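For comparison, here is a minimal Pub/Sub subscriber sketch (the project and subscription IDs are placeholders): each message must be acknowledged before the deadline, and once acknowledged it is gone for that subscription.

    import com.google.cloud.pubsub.v1.AckReplyConsumer;
    import com.google.cloud.pubsub.v1.MessageReceiver;
    import com.google.cloud.pubsub.v1.Subscriber;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import com.google.pubsub.v1.PubsubMessage;

    public class SubscribeExample {
        public static void main(String[] args) {
            ProjectSubscriptionName subscription =
                    ProjectSubscriptionName.of("my-project", "my-subscription");
            MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
                System.out.println("Received: " + message.getData().toStringUtf8());
                consumer.ack(); // ack before the deadline, otherwise the message is redelivered
            };
            Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
            subscriber.startAsync().awaitRunning();
            subscriber.awaitTerminated(); // keep the process alive to keep receiving messages
        }
    }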
I do not think there is a single correct answer to your question; it really depends on what you need and the constraints you have (below are some example scenarios):
If the solution must be in GCP, obviously use Google Cloud Pub/Sub. You will avoid all the setup effort that Kafka requires, or the extra cost of a fully managed Kafka system.
If the solution needs to process data in a streaming way but also (eventually) support batch processing, it is a good idea to use Cloud Dataflow + Pub/Sub.
If the solution requires some Spark processing, you could explore Spark Streaming (which you can configure with Kafka for the stream processing).
In general, both are very solid stream processing systems. The point that makes the big difference is that Pub/Sub is a cloud service attached to GCP, whereas Apache Kafka can be used both in the cloud and on-prem.
Update (April 6th 2021):
Finally Kafka without Zookeeper
One big difference between Kafka and Cloud Pub/Sub is that Cloud Pub/Sub is fully managed for you. You don't have to worry about machines, setting up clusters, fine-tuning parameters, etc., which means a lot of DevOps work is handled for you; this is important, especially when you need to scale.

What do you use Apache Kafka for? [closed]

I would like to ask if my understanding of Kafka is correct.
For really, really big data streams, a conventional database is not adequate, so people use things such as Hadoop or Storm. Does Kafka sit on top of said databases and provide... directions for where the real-time data should go?
I don't think so.
Kafka is a messaging system and it does not sit on top of a database.
You can compare Kafka with messaging systems like ActiveMQ, RabbitMQ etc.
From the Apache documentation page:
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Key takeaways:
Kafka maintains feeds of messages in categories called topics.
We'll call processes that publish messages to a Kafka topic producers.
We'll call processes that subscribe to topics and process the feed of published messages consumers.
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol.
Use Cases:
Messaging: Kafka works well as a replacement for a more traditional message broker. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ
Website Activity Tracking: The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds
Metrics: Kafka is often used for operational monitoring data, which involves aggregating statistics from distributed applications to produce centralized feeds of operational data
Log Aggregation
Stream Processing
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
Commit Log: Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data
To fully understand Apache Kafka's role you should get a wider picture and know Kafka's use cases. Modern data processing systems try to break with the classic application architecture. You can start from the kappa architecture overview:
http://milinda.pathirage.org/kappa-architecture.com
In this architecture you don't store the current state of the world in any SQL or key-value database. All data is processed and stored as one or more series of events in an append-only immutable log. Immutable events are easier to replicate and store in a distributed environment. Apache Kafka is a system used for storing these events and for brokering them between other system components.
Use cases on Apache Kafka's official site: http://kafka.apache.org/documentation.html#uses
More use cases:
Kafka-Storm pipeline:
Kafka can be used with Apache Storm to handle data pipeline for high speed filtering and pattern matching on the fly.
Apache Kafka is not just a message broker. It was initially designed and implemented by LinkedIn in order to serve as a message queue. Since 2011, Kafka has been open sourced and quickly evolved into a distributed streaming platform, which is used for the implementation of real-time data pipelines and streaming applications.
It is horizontally scalable, fault-tolerant, wicked fast, and runs in
production in thousands of companies.
Modern organisations have various data pipelines that facilitate the communication between systems or services. Things get a bit more complicated when a reasonable number of services needs to communicate with each other at real time.
The architecture becomes complex, since various integrations are required to enable the inter-communication of these services. More precisely, for an architecture that encompasses m source and n target services, n x m distinct integrations need to be written. Also, every integration comes with a different specification, meaning that one might require a different protocol (HTTP, TCP, JDBC, etc.) or a different data representation (Binary, Apache Avro, JSON, etc.), making things even more challenging. Furthermore, source services might face increased load from connections, which could potentially impact latency.
Apache Kafka leads to simpler and more manageable architectures by decoupling data pipelines. Kafka acts as a high-throughput distributed system where source services push streams of data, making them available for target services to pull in real time.
Also, a lot of open-source and enterprise-level User Interfaces for managing Kafka Clusters are available now. For more details refer to my answer to this question.
You can find more details about Apache Kafka and how it works in the blog post "Why Apache Kafka?"
Apache Kafka is an open-source software platform written in Scala and Java, mainly used for stream processing.
The use cases of Apache Kafka are:
Messaging
Website Activity Tracking
Metrics
Log Aggregation
Stream Processing
Event Sourcing
Commit Log
For more information, see the official Apache Kafka site.
https://kafka.apache.org/uses
Kafka is a highly scalable pub-sub messaging system. It acts as a transport layer guaranteeing exactly-once semantics, and Spark Streaming does the processing. The next question that comes to my mind is: Spark can also poll directories to check for files, and can even read from a socket or port. How do Kafka and Spark work in tandem? I mean, does an application written in some language, instead of writing to a database for storage, feed data directly to the port (or place the files, which would not really be real time and would rather be some kind of batch processing), from which the data is then read by a Kafka producer, and then read and processed by Spark Streaming via the Kafka consumer API?
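To sketch one common way they fit together (this assumes Spark Structured Streaming with the spark-sql-kafka integration; the broker address and topic name are placeholders): the application writes events to Kafka with a producer, and a separate Spark job subscribes to the topic and processes the stream.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KafkaToSpark {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("kafka-to-spark").getOrCreate();

            // Spark subscribes to the Kafka topic and receives new records as they arrive
            Dataset<Row> stream = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "orders")
                    .load();

            // Kafka records arrive as binary key/value pairs; cast them before processing
            stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                    .writeStream()
                    .format("console")
                    .start()
                    .awaitTermination();
        }
    }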