Let's say you are using either Service Fabric or Kubernetes, and you are hosting a transaction data warehouse microservice (maybe a bad example, but suppose all it does is implement a simple CQRS architecture: it stores the ID of the sender, the receiver, the date and the payment amount, and writes to and reads from the DB).
For the sake of argument, say that this microservice needs to be replicated across different geographic locations to ensure that the data is recoverable if one database goes down.
Now the naive approach I'm thinking of is to have an event that fires when a transaction is received, and the orchestrator microservice will expect to receive an event-processed acknowledgment within a specific timeframe.
But the question remains: what about the database? What will happen when we scale out the microservice and new instances are brought up?
They will write to the same database, no?
One solution could be to put the database inside the Docker container and let each replica own its own copy. Is this a good solution?
Please share your thoughts and best practices.
What will happen when we scale out the microservice and new instances are brought up? They will write to the same database?
Yes, the instances of your service all share the same logical database. To achieve high availability, you typically run a distributed database cluster, but it appears to your service as a single database system.
One solution could be to put the database inside the Docker container and let each replica own its own copy. Is this a good solution?
No, you typically want all instances of your service to see the same consistent data. For example, a read request sent to two different instances of your service should return the same data.
If the database becomes your bottleneck, you can mitigate that by implementing caching, sharding your data, or serving read requests from dedicated read instances.
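To make the sharding option a bit more concrete, here is a minimal sketch of key-based shard routing. It assumes the data is partitioned by sender ID; the shard names and the `database_for` helper are hypothetical, not part of any particular database product. The point is only that every service instance computes the same shard for the same key, so they still agree on where a given record lives.

```python
import hashlib

# Hypothetical shard names; every service instance must use the same list.
SHARDS = ["transactions-db-0", "transactions-db-1", "transactions-db-2"]


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for(sender_id: str) -> str:
    """Pick the (hypothetical) database that holds this sender's transactions."""
    return SHARDS[shard_for(sender_id, len(SHARDS))]
```

Note that plain modulo routing moves most keys when the number of shards changes, so a real deployment would usually rely on the database's own sharding support instead.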
We have a microservice that acts as a cache service, and we decided to run only 2 instances of it. The microservice receives data through a Kafka topic and stores it in an in-memory cache. But we are having trouble keeping the data in sync between the 2 instances. We decided to use a different consumer group for each instance so that both receive the same data and stay in sync. Given that both instances run the same codebase, how can each one subscribe to a different consumer group at startup? For example, if instance 1 subscribes to consumergrp1, instance 2 should subscribe to consumergrp2. Please suggest how to achieve this.
You cannot keep in-memory data in sync across multiple instances of a microservice when the data comes from a streaming system or arrives multiple times. If you load the data only once in the pod's lifetime, then you can keep the in-memory data in sync. For example, while the service is starting up, it can fetch the data from the source and keep it in memory. In this case both pods have the same data.
You need to use a distributed cache such as Redis or Couchbase. That would be the cleaner approach.
You haven't specified any details about how you use Kafka (language, third-party libraries, etc.), so, speaking in general, you can:
Specify a random (or partially random) consumer group ID. It won't be as "clean" as "consumergrp1" and "consumergrp2", but it's a string after all, so you can generate it randomly. This idea includes embedding an identifier of the process in the name of the consumer group; for example, if the microservice instances are supposed to run on different machines, you could include the machine name as part of the consumer group name.
More complicated, but still possible: if you have some shared storage, you could use it for synchronization and store a monotonically increasing counter for the "current consumer group to create". Once the value is read, it has to be incremented, and the read and increment must be atomic so that two instances never get the same value. Of course the implementation details depend on the shared storage you actually use (a DB, something like Redis, whatever).
So there are many different possible solutions. As a suggestion, whichever solution you choose, do not rely on having exactly two instances of the service; you may reconsider that in the future.
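As an illustration of the first option, here is a minimal sketch that builds a per-instance consumer group ID from the host name plus a short random suffix. The `cache-sync` prefix and the function name are made up for the example; the resulting string is simply what you would pass as the group ID to whatever Kafka client you use. On Kubernetes the host name is normally the pod name, so it already differs per instance.

```python
import socket
import uuid


def make_consumer_group_id(prefix: str = "cache-sync") -> str:
    """Build a consumer group ID that is unique per running instance.

    The prefix is a hypothetical naming convention; the host name and the
    random suffix make collisions between instances very unlikely.
    """
    host = socket.gethostname()
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{host}-{suffix}"


if __name__ == "__main__":
    # Each instance (and each restart) gets its own group ID.
    print(make_consumer_group_id())
```

One thing to keep in mind with this approach: a brand-new group has no committed offsets, so where the instance starts reading after a restart depends on the client's offset reset setting.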
I am re-designing a dotnet backend api using the CQRS approach. This question is about how to handle the Query side in the context of a Kubernetes deployment.
I am thinking of using MongoDB as the query database. The app is a dotnet Web API app. So what would be the best approach:
Create a sidecar Pod which containerizes the dotnet app AND the MongoDb together in one pod. Scale as needed.
Containerize the MongoDb in its own pod and deploy one MongoDb pod PER REGION. And then have the dotnet containers use the MongoDb pod within its own region. Scale the MongoDb by region. And the dotnet pod as needed within and between Regions.
Some other approach I haven't thought of
I would start with the simplest approach, which is to place the write and read sides together, because they belong to the same bounded context.
Then in the future, if needed, I would consider adding more read sides or scaling out to other regions.
To get started, I would also consider placing the read side in the same VM as the write side, just to keep it simple, as getting it all up and working in production is always a big task with a lot of pitfalls.
I would consider using a Kafka-like system to transport the data to the read sides, because queues can be troublesome if you later add a new read-side instance or want to rebuild an existing one. With queues, the sender needs to know which read sides you have. With a Kafka style of integration, each read side can consume the events at its own pace. You can also more easily add more read sides later on, and the sender does not need to be aware of the receivers.
Kafka allows you to decouple the producers of data from the consumers of that data. [Figure from one of my training classes]
In Kafka you have a set of producers appending data to the Kafka log.
Then you can have one or more consumers processing this log of events.
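The "each read side consumes at its own pace" idea can be sketched with a toy in-memory log. This is not Kafka and not a Kafka client API; `EventLog` and its methods are invented purely to show the mechanism: the log is append-only, and each consumer keeps its own offset, so the producer never needs to know who is reading.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log; each consumer tracks its own read offset."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, event: str) -> None:
        """Producer side: append without knowing anything about consumers."""
        self._events.append(event)

    def poll(self, consumer: str, max_events: int = 10) -> List[str]:
        """Consumer side: return the next batch and advance this consumer's offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for event in ("a", "b", "c"):
    log.append(event)

first_batch = log.poll("read-side-a", max_events=2)   # ["a", "b"]
second_batch = log.poll("read-side-a")                 # ["c"]
# A read side added later starts from the beginning and can rebuild itself.
replayed = log.poll("read-side-b")                     # ["a", "b", "c"]
```

With a classic queue, a message is gone once it has been consumed, which is why adding or rebuilding a read side later is harder there.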
It has been almost 2 years since I posted this question. Now, with 20/20 hindsight, I thought I would post my solution. I ended up simply provisioning an Azure Cosmos DB in the region where my cluster lives, and hitting the Cosmos DB for all my query-side requirements.
(My cluster already lives in the Azure Cloud)
I maintain one Postgres DB in my original cluster for my write-side requirements. And my app scales nicely in the cluster.
I have not yet needed to deploy clusters to new regions. When that happens, I will provision a replica of the Cosmos Db to that additional region or regions. But still just one postgres db for write-side requirements. Not going to bother to try to maintain/sync replicas of the postgres db.
Additional insight #1. By provisioning the Cosmos DB separately from my cluster (but in the same region), I am taking the load off of my cluster nodes. In effect, the Cosmos DB has its own dedicated compute resources. And backup etc.
Additional insight #2. It is obvious now, but wasn't back then, that tightly coupling a document DB (such as MongoDB) to a particular pod is...a bonkers bad idea. Imagine horizontally scaling your app and, with each new instance of your app, instantiating a new document DB. You would quickly bloat up your nodes and crash your cluster. One read-side document DB per cluster is an efficient and easy way to roll.
Additional insight #3. The read side of any CQRS system can get a nice jolt of adrenaline with the help of an in-memory cache like Redis. You can first check whether some data is available in the cache before you hit the document DB. I use this approach for data such as a checkout cart, where I leave data in the cache for 24 hours and then let it expire. You could conceivably use Redis for all your read-side requirements, but memory could quickly become bloated. So the idea here is to consider deploying an in-memory cache on your cluster (only one instance of the cache) and have all your apps hit it for low latency and high availability, but do not use the cache as a replacement for the document DB.
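Here is roughly what that "check the cache first, then fall back to the document DB" flow looks like. This is a sketch with a tiny in-process stand-in for Redis so that it runs on its own; `TtlCache`, `read_through` and the loader callback are names I made up, and with a real Redis you would use the client's get/set-with-expiry calls instead.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Minimal in-process stand-in for an expiring cache such as Redis."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def read_through(cache: TtlCache, key: str,
                 load_from_db: Callable[[str], str]) -> str:
    """Return the cached value, or load it from the document DB and cache it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)  # hypothetical query against the read-side DB
    cache.set(key, value)
    return value


# 24 hours, matching the checkout cart example.
cart_cache = TtlCache(ttl_seconds=24 * 60 * 60)
```

The document DB stays the source of truth here; the cache only saves a round trip for data that was read recently.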
I am currently planning some server infrastructure. I have two servers in different locations. My apps (APIs and so on) run on both of them. The client connects to the nearest one (best connection). If one server fails, the other can process the requests.
I want to use MongoDB for my projects. My first idea is to use a replica set, so that I can ensure the data is consistent. If one server fails, the data is still accessible and the secondary becomes the primary. When the app on the primary server wants to use the data, that is fine, but the app on the other server must connect to the primary server in order to handle data (that would solve the failover problem, but not the "best connection" problem).
In MongoDB there is an option to read data from secondary servers, but then I have to ensure that the inserts (only possible on primary) are consistent on every secondary. There is also an option for this, "writeConcern". Is it possible to somehow specify "writeConcern on a specific secondary"? If I add a second secondary without the apps on it, "writeConcern" on every secondary would not be necessary. And if I specify a specific value, I don't really know on which secondary the data is available, right?
Summary: I want to reduce the connections between the servers when the API is called.
Please share some thoughts or ideas to fix my problem.
Writes can only be done on primaries.
To control which secondary the reads are directed to, you can use max staleness as well as tags.
that the inserts (only possible on primary) are consistent on every secondary.
I don't understand what you mean by this phrase.
If you have two geographically separated datacenters, A and B, it is physically impossible to write data in A and instantly see it in B. You must either wait for the write to propagate or wait for the read to fetch data from the remote node.
To pay the cost at write time, set your write concern to the number of nodes in the deployment (2, in your proposal). To pay the cost at read time, use primary reads.
Note that merely setting write concern equal to the number of nodes doesn't make all nodes have the same data at all times - it just makes your application only consider the write successful when all nodes have received it. The primary can still be ahead of a particular secondary in terms of operations committed.
And, as noted in comments, a two-node replica set will not accept writes unless both members are operational, which is why it is generally not a useful configuration to employ.
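The two-node problem comes down to simple arithmetic: a replica set needs a majority of its voting members to elect and keep a primary. A small sketch of that calculation (the function names are mine, not part of any MongoDB API):

```python
def majority(voting_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return voting_members - majority(voting_members)


for members in (2, 3, 5):
    print(members, majority(members), fault_tolerance(members))
```

With 2 members the majority is 2, so losing either one leaves the set without a primary; with 3 members one can fail. This is why a third member (possibly in a third location) is the usual recommendation.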
Summary: I want to reduce the connections between the servers when the api is called.
This has nothing to do with the rest of the question, and if you really mean this it's a premature optimization.
If what you want is faster network I/O I suggest looking into setting up better connectivity between your application and your database (for example, I imagine AWS would offer pretty good connectivity between their various regions).
I am currently researching different databases to use for my next project. I want to use a decentralized database. For example, Apache Cassandra claims to be decentralized. MongoDB, however, says it uses replication. From what I can see, as far as these databases are concerned, replication and decentralization are basically the same thing. Is that correct, or is there some difference or feature between decentralization and replication that I'm missing?
Short answer: no, replication and decentralization are two different things. As a simple example, let's say you have three instances (i1, i2 and i3) that replicate the same data. You also have a client that fetches data only from i1. If i1 goes down, you will still have the data replicated to i2 and i3 as a backup. But since i1 is down, the client has no way of getting the data. This is an example of a centralized database with a single point of failure.
A centralized database has a centralized location that the majority of requests go through. It could, as in MongoDB's case, be instances that route queries to the instances that can handle the query.
A decentralized database is obviously the opposite. In Cassandra any node in a cluster can handle any request. This node is called the coordinator for the request. The node then reads/writes data from/to the nodes that are responsible for that data before returning a result to the client.
Decentralization means that there should be no single point of failure in your application architecture. Such systems provide a deployment scheme in which no leader (or master) is elected during the service life cycle. They often deliver services in a peer-to-peer fashion.
Replication simply means that your data is copied to another server instance to ensure redundancy and fault tolerance. Client requests can still be served from the copies, but your system should ensure some level of "consistency" when making copies.
Cassandra serves requests in a peer-to-peer fashion, meaning that clients can send requests to any node participating in the cluster. It also provides replication and tunable consistency.
MongoDB offers a master/slave deployment, so it is not considered decentralized. You can deploy a replica set, in which a secondary is elected as the new primary, to ensure that requests can still be served if the master node goes down. It also provides replication out of the box.
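For the tunable consistency mentioned here, the usual rule of thumb is that a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N), because the two sets must then overlap in at least one replica. A small sketch of that arithmetic (helper names are mine, not a Cassandra API):

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a quorum for the given replication factor."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # quorum reads and writes
print(reads_see_latest_write(1, 1, rf))                    # ONE / ONE
print(reads_see_latest_write(1, rf, rf))                   # read ONE, write ALL
```

Lower levels trade that guarantee for latency and availability, which is the "tunable" part.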
Links
Cassandra's tunable consistency
MongoDB's master-slave configuration
Introduction to Cassandra's architecture
My web app uses ADO.NET against SQL Server 2008. Database writes happen against a primary (publisher) database, but reads are load balanced across the primary and a secondary (subscriber) database. We use SQL Server's built-in transactional replication to keep the secondary up-to-date. Most of the time, the couple of seconds of latency is not a problem.
However, I do have a case where I'd like to block until the transaction is committed at the secondary site. Blocking for a few seconds is OK, but returning a stale page to the user is not. Is there any way in ADO.NET or T-SQL to specify that I want to wait for the replication to complete? Or can I, from the publisher, check the replication status of the transaction without manually connecting to the secondary server?
[edit]
99.9% of the time, the data in the subscriber is "fresh enough". But there is one operation that invalidates it. I can't read from the publisher every time on the off chance that it has become invalid. If I can't solve this problem under transactional replication, can you suggest an alternate architecture?
There's no such solution for SQL Server, but here's how I've worked around it in other environments.
Use three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of subscribers. No writes go here, only reads. Used for the vast majority of OLTP reads.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, especially in the case you're experiencing.
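A rough sketch of how the three connection strings might be selected in application code. It is written in Python for brevity rather than ADO.NET, and the connection strings, the `Freshness` enum and `connection_for` are all hypothetical names for illustration only.

```python
from enum import Enum


class Freshness(Enum):
    REALTIME = "realtime"                    # the one master server
    NEAR_REALTIME = "near_realtime"          # load-balanced subscribers
    DELAYED_REPORTING = "delayed_reporting"  # reporting pool (same as subscribers for now)


# Hypothetical connection strings; in a real app these come from configuration.
CONNECTION_STRINGS = {
    Freshness.REALTIME: "Server=sql-primary;Database=app",
    Freshness.NEAR_REALTIME: "Server=sql-subscribers;Database=app",
    Freshness.DELAYED_REPORTING: "Server=sql-subscribers;Database=app",
}


def connection_for(freshness: Freshness, is_write: bool = False) -> str:
    """Pick a connection string; writes always go to the master."""
    if is_write:
        return CONNECTION_STRINGS[Freshness.REALTIME]
    return CONNECTION_STRINGS[freshness]
```

The one operation that cannot tolerate stale data would ask for `Freshness.REALTIME`, while everything else keeps reading from the subscribers. Pointing the reporting tier at a different pool later is then only a configuration change.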
You are describing a synchronous mirroring situation. Replication cannot, by definition, support your requirement. Replication must wait for a transaction to commit before reading it from the log and delivering it to the distributor and from there to the subscriber, which means replication by definition has a window of opportunity for data to be out of sync.
If you have a requirement for an operation to read the authoritative copy of the data, then you should make that decision in the client and ensure you read from the publisher in that case.
While you can, in theory, validate whether a certain transaction was distributed to the subscriber or not, you should not base your design on it. Transactional replication makes no latency guarantee, by design, so you cannot rely on a 'perfect day' operation mode.