Does a subscription lie in a particular region (like a topic) in Google Pub/Sub - publish-subscribe

I am working on a project that uses Google Pub/Sub to transfer messages. I have seen that a topic created in Pub/Sub can specify regions (if we define multiple regions, a message is stored only in the region nearest its publisher). I have two questions regarding this.
Is there any replication between regions of the messages stored for a topic?
Does a subscription created on a topic also lie in a particular region?

Cloud Pub/Sub is a global service. That means topics and subscriptions are not tied to a specific region. Publishers can publish messages to topics from any region, and subscribers can connect from any region (the same as or different from the publishers') and receive the messages published to the topic via subscriptions.
The ability to specify regions for storage in Cloud Pub/Sub is a feature called message storage policy. The intention of this feature is to let you specify the regions in which messages can or cannot be stored (though they may still transit through other regions if subscribers are attached there). This allows one to place restrictions on data storage to, for example, comply with local laws.
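For illustration, a minimal sketch of setting a message storage policy with the Java admin client; the project and topic names and the single allowed region are hypothetical:

```java
import com.google.cloud.pubsub.v1.TopicAdminClient;
import com.google.pubsub.v1.MessageStoragePolicy;
import com.google.pubsub.v1.Topic;
import com.google.pubsub.v1.TopicName;

public class CreateTopicWithStoragePolicy {
  public static void main(String[] args) throws Exception {
    try (TopicAdminClient admin = TopicAdminClient.create()) {
      Topic topic = Topic.newBuilder()
          .setName(TopicName.of("my-project", "my-topic").toString()) // hypothetical names
          .setMessageStoragePolicy(MessageStoragePolicy.newBuilder()
              // messages on this topic may only be persisted in these regions
              .addAllowedPersistenceRegions("europe-west1")
              .build())
          .build();
      admin.createTopic(topic);
    }
  }
}
```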
If using the global Pub/Sub endpoint, publishers and subscribers connect to the network-nearest region to send or receive messages. When running publishers or subscribers on GCP, this means they connect to the region where those resources are located. If using regional endpoints, then the publisher or subscriber connects to the region specified. However, even when using regional endpoints, the service is still global in that messages published from different regions to the same topic are all available to subscribers through the same single subscription tied to that topic.
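As a sketch of the regional-endpoint option (the topic name is hypothetical; regional endpoints follow the `<region>-pubsub.googleapis.com` pattern):

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;

public class RegionalPublisher {
  public static void main(String[] args) throws Exception {
    Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic"))
        // connect to a specific region instead of the global endpoint
        .setEndpoint("us-east1-pubsub.googleapis.com:443")
        .build();
    // ... publish as usual, then shut down
    publisher.shutdown();
  }
}
```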
Messages are not replicated across regions when published. They are replicated to three zones within a region, though. If subscribers are located in a different region than the messages were published, then Cloud Pub/Sub routes messages across the regions to deliver them to the subscribers, wherever they are currently running.

Related

Kafka P2P Header based routing

I have a requirement to send an event to multiple systems based on their system code. Destination systems can grow in the future, and each should be able to subscribe only to the events it is interested in. It's a security mandate, so as the producer we need to enforce this.
We could use a RabbitMQ header exchange and use multiple shovel configurations to different queues in different vhosts or clusters, but I am looking for a similar pattern with Kafka.
If we maintain a different topic per system and authorise each consumer for its corresponding topic, this can grow in the future, so as the producer I need to implement the topic-routing logic and the number of topics will grow.
The other option is to use AWS SNS and subscribe multiple SQS queues, where messages can be routed based on filter policies.
Could anyone think of a better solution to this problem?
send an event to multiple systems based on their system code
Using the Kafka Streams API, you can use branching to route data to different topics based on Predicate logic, as sketched below.
Once the data is in their respective topics, the "multiple systems" can consume them.
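A minimal sketch of that branching with the Kafka Streams split() API; the topic names, and the assumption that the system code is the record key, are illustrative:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Named;

public class SystemCodeRouter {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    // assumes events arrive on one input topic with the system code as the key
    KStream<String, String> events = builder.stream("all-events");

    events.split(Named.as("route-"))
        .branch((code, event) -> "SYS_A".equals(code),
                Branched.withConsumer(ks -> ks.to("events-sys-a")))
        .branch((code, event) -> "SYS_B".equals(code),
                Branched.withConsumer(ks -> ks.to("events-sys-b")))
        // anything unmatched goes to a catch-all topic
        .defaultBranch(Branched.withConsumer(ks -> ks.to("events-unrouted")));
    // builder.build() yields the Topology to run with KafkaStreams
  }
}
```

With per-topic ACLs on top of this, each destination system can be authorised only for its own output topic.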

Is it possible for a single message to be given to multiple instances of the same subscription in gcloud PubSub

I have a publisher that publishes messages to a particular topic (myTopic). On Pub/Sub I create a subscription named myTopicSub to this topic (myTopic), and I have a VM that runs a service that listens on my subscription myTopicSub.
THIS WORKS
MY PROBLEM IS: if there is a need to scale and I add 5 more VMs to handle more messages from my subscription... is it possible for Pub/Sub to send the same message to more than one VM?
Because I need only one VM to process each message, and only once. Please, I need help.
Cloud Pub/Sub offers at-least-once delivery. That means a message can be delivered multiple times and, in some cases, can be delivered to two different subscribers for the same subscription within a short period of time. That particular type of duplicate delivery is rare, but not impossible.
Subscribers have to be able to handle the delivery of duplicates, and depending on their nature may handle it in different ways. For some, all actions are idempotent, so re-processing the same message has no ill effects. In other cases, the subscribers need to track which messages they have received and processed, and if a message is a duplicate, just immediately ack it instead of processing it.
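A rough sketch of the tracking approach, using the subscription name from the question; the in-memory set is a stand-in for whatever shared, persistent store a multi-VM deployment would actually need:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupSubscriber {
  // Local set of processed message IDs; with multiple VMs this would need to
  // be a shared, persistent store so all instances see the same history.
  private static final Set<String> processed = ConcurrentHashMap.newKeySet();

  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "myTopicSub"); // names from the question
    MessageReceiver receiver =
        (PubsubMessage message, AckReplyConsumer consumer) -> {
          if (!processed.add(message.getMessageId())) {
            consumer.ack(); // duplicate: ack immediately, skip processing
            return;
          }
          process(message); // your idempotent or tracked processing logic
          consumer.ack();
        };
    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    // keep the process alive to continue receiving messages
  }

  private static void process(PubsubMessage m) { /* ... */ }
}
```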

How do I limit a Kafka client to only access one customer's data?

I'm evaluating Apache Kafka for publishing some event streams (and commands) between my services running on machines.
However, most of those machines are owned by customers, on their premises, and connected to their networks.
I don't want a machine owned by one customer to have access to another customer's data.
I see that Kafka has an access control module, which looks like it lets you restrict a client's access based on topic.
So, I could create a topic per customer and restrict each customer to just their own topic (see the sketch below). This seems like a bad idea that I could regret in the future, because I've seen recommendations to keep the number of Kafka topics in the low thousands at most.
Another design is to create a partition per customer. However, I don't see a way to restrict access if I do that.
Is there a way out of this quandary?
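For reference, a hedged sketch of the per-topic restriction described above, using the Kafka Admin client; the broker address, topic, and principal names are hypothetical:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import java.util.List;
import java.util.Properties;

public class CustomerTopicAcl {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");
    try (Admin admin = Admin.create(props)) {
      // allow exactly one customer principal to read exactly one topic
      AclBinding readAcl = new AclBinding(
          new ResourcePattern(ResourceType.TOPIC, "events-customer-42", PatternType.LITERAL),
          new AccessControlEntry("User:customer-42", "*",
              AclOperation.READ, AclPermissionType.ALLOW));
      admin.createAcls(List.of(readAcl)).all().get();
    }
  }
}
```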

Using a Kafka state store changelog topic to share state

I am trying to do a small POC on event sourcing in the e-commerce domain. I have an order service, a customer contact service, and a delivery service. For logistical reasons, some customer contact information should be available to the delivery person so that they can call the customer in case they are not at home, etc. Obviously, customers can update their contact information, and the delivery should ideally show the latest contact info. I am using Kafka as the messaging framework and state store.
When I create a state store by reading events related to an aggregate's state changes (e.g. customer contact info changes) and applying them to the aggregate, the state store is backed by a changelog topic (e.g. contact-service-customer-contact-changelog). Now if I need some of this data in another service (e.g. the delivery service), can I use the same changelog topic to create another state store local to that service? In the available literature this type of topic is termed "internal", so it seems we are not supposed to use it for anything other than rebuilding the state store for instances of the original service. Should we instead re-publish the state store updates to a new topic for other services to join with this data? Or is there another way to address this use case?
It's called an internal topic because it's managed by the Kafka Streams library. This implies that the topic name might change, and thus it might not be safe to consume the topic. Nothing prevents you from consuming it, though.
To avoid messing up the state, you should never write to this topic, though. Reading from it is "conceptually safe".
It's hard to advise; you have to make your own decision whether to read the changelog topic, or to create a topic manually and write the same data a second time.
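A sketch of the re-publish option (topic names are illustrative): the contact service materializes the latest state as a KTable and writes every update to a topic it explicitly owns, which the delivery service can then consume into its own local store:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;

public class PublishContactState {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    // latest contact info per customer, materialized as a table
    KTable<String, String> contacts = builder.table(
        "customer-contact-events",
        Consumed.with(Serdes.String(), Serdes.String()));
    // publish every update to a public, explicitly-owned topic instead of
    // letting other services read the Streams-managed internal changelog
    contacts.toStream().to("customer-contact-public");
  }
}
```

This keeps the internal changelog private to the contact service while giving other services a stable, contract-like topic to build their own state from.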

I am evaluating Google Pub/Sub vs Kafka. What are the differences? [closed]

I have not worked with Kafka much, but I want to build a data pipeline in GCE, so I wanted to compare Kafka and Pub/Sub. Basically, I want to know how message consistency, message availability, and message reliability are maintained in both Kafka and Pub/Sub.
Thanks
In addition to Google Pub/Sub being managed by Google and Kafka being open source, the other difference is that Google Pub/Sub is a message queue (like RabbitMQ) whereas Kafka is more of a streaming log. You can't "re-read" or "replay" messages with Pub/Sub. (EDIT: as of February 2019, you CAN replay messages and seek backwards in time to a certain timestamp, per the comment below.)
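For example, seeking a subscription back one hour might look like this (a sketch with hypothetical names; it requires the subscription to retain messages, e.g. via retain-acked-messages):

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.util.Timestamps;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.SeekRequest;

public class ReplaySubscription {
  public static void main(String[] args) throws Exception {
    try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
      admin.seek(SeekRequest.newBuilder()
          .setSubscription(
              ProjectSubscriptionName.of("my-project", "my-sub").toString())
          // rewind the subscription's acknowledgement state to one hour ago
          .setTime(Timestamps.fromMillis(System.currentTimeMillis() - 3_600_000L))
          .build());
    }
  }
}
```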
With Google Pub/Sub, once a message is read out of a subscription and ACKed, it's gone. In order to have more copies of a message to be read by different readers, you "fan-out" the topic by creating "subscriptions" for that topic, where each subscription will have an entire copy of everything that goes into the topic. But this also increases cost because Google charges Pub/Sub usage by the amount of data read out of it.
With Kafka, you set a retention period (I think it's 7 days by default) and the messages stay in Kafka regardless of how many consumers read it. You can add a new consumer (aka subscriber), and have it start consuming from the front of the topic any time you want. You can also set the retention period to be infinite, and then you can basically use Kafka as an immutable datastore, as described here: http://stackoverflow.com/a/22597637/304262
Amazon AWS Kinesis is a managed version of Kafka, whereas I think of Google Pub/Sub as a managed version of RabbitMQ.
Amazon SNS with SQS is also similar to Google Pubsub (SNS provides the fanout and SQS provides the queueing).
I have been reading the answers above and I would like to complement them, because I think some details are pending:
Fully Managed System Both systems can have a fully managed version in the cloud. Google provides Pub/Sub, and there are some fully managed Kafka offerings out there that you can configure in the cloud and on-prem.
Cloud vs On-prem I think this is a real difference between them, because Pub/Sub is offered only as part of the GCP ecosystem, whereas Apache Kafka can be used as both a cloud service and an on-prem service (doing the cluster configuration yourself).
Message duplication With Kafka, consumers track their position in each partition via offsets; older clients stored these offsets in external storage such as Apache ZooKeeper, while modern clients commit them to Kafka itself. Either way, that is how you track the messages read so far by the consumers. Pub/Sub works by acknowledging messages: if your code doesn't acknowledge a message before the deadline, the message is sent again. You can deal with duplicates by deduplicating in the subscriber, or by using Cloud Dataflow's PubsubIO.
Retention policy Both Kafka and Pub/Sub have options to configure the maximum retention time; by default, I think it is 7 days.
Consumers Group vs Subscriptions Be careful how you read messages in the two systems. Pub/Sub uses subscriptions: you create a subscription and then start reading messages from it. Once a message is read and acknowledged, it is gone for that subscription. Kafka uses the concepts of "consumer group" and "partition": every consumer process belongs to a group, and when a message is read from a specific partition, no other consumer process belonging to the same consumer group can read it again (because the offset eventually increases). You can see the offset as a pointer that tells a process which message to read next.
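To make the consumer-group point concrete, a minimal sketch (broker address, group id, and topic are placeholders); every process started with the same group.id divides the topic's partitions among themselves:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");
    props.put("group.id", "delivery-service"); // members share committed offsets per partition
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("orders"));
      while (true) {
        for (ConsumerRecord<String, String> record :
            consumer.poll(Duration.ofSeconds(1))) {
          // each partition is assigned to exactly one member of the group,
          // so records are not re-delivered to other group members
          System.out.printf("%s -> %s%n", record.key(), record.value());
        }
      }
    }
  }
}
```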
I think there is no single correct answer to your question; it really depends on what you need and the constraints you have (below are some example scenarios):
If the solution must be in GCP, obviously use Google Cloud Pub/Sub. You will avoid all the setup effort that Kafka requires, or the extra cost of a fully managed Kafka system.
If the solution requires processing data in a streaming way but also (eventually) needs to support batch processing, it is a good idea to use Cloud Dataflow + Pub/Sub.
If the solution requires some Spark processing, you could explore Spark Streaming (which you can configure with Kafka for the stream processing).
In general, both are very solid stream-processing systems. The point that makes the huge difference is that Pub/Sub is a cloud service attached to GCP, whereas Apache Kafka can be used in both cloud and on-prem environments.
Update (April 6th 2021):
Finally Kafka without Zookeeper
One big difference between Kafka and Cloud Pub/Sub is that Cloud Pub/Sub is fully managed for you. You don't have to worry about machines, setting up clusters, fine-tuning parameters, etc., which means that a lot of DevOps work is handled for you; this is important, especially when you need to scale.