I am working with a backend service (Spring Boot 2.2.13.RELEASE + Hikari + JOOQ) that uses an AWS Aurora PostgreSQL DB cluster configured with a Writer (primary) node and a Reader (read replica) node. Until now the reader node had just been sitting there idle/warm, waiting to be promoted to primary in case of failover.
Recently, we decided to start serving queries exclusively from the reader node for some of our GET endpoints. To achieve this we used a "flavor" of RoutingDataSource so that whenever a service method is annotated with @Transactional(readOnly = true) the queries are performed against the reader datasource.
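For context, the routing itself is only a handful of lines. A minimal sketch of the idea (class and method names here are illustrative, not our exact implementation):

import org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

import javax.sql.DataSource;
import java.util.HashMap;
import java.util.Map;

/**
 * Routes to the reader pool whenever the current transaction is marked read-only,
 * e.g. via @Transactional(readOnly = true) on the service method.
 */
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    public enum Route { WRITER, READER }

    @Override
    protected Object determineCurrentLookupKey() {
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                ? Route.READER
                : Route.WRITER;
    }

    /** Wires the two pools and defers connection acquisition until first use. */
    public static DataSource build(DataSource writer, DataSource reader) {
        ReadWriteRoutingDataSource routing = new ReadWriteRoutingDataSource();
        Map<Object, Object> targets = new HashMap<>();
        targets.put(Route.WRITER, writer);
        targets.put(Route.READER, reader);
        routing.setTargetDataSources(targets);
        routing.setDefaultTargetDataSource(writer);
        routing.afterPropertiesSet();
        // LazyConnectionDataSourceProxy ensures the routing decision happens only when
        // a statement actually needs a connection, i.e. after the readOnly flag is set.
        return new LazyConnectionDataSourceProxy(routing);
    }
}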
Up to this point everything was going smoothly. However, after applying this solution I noticed a latency increase of up to 3x compared with the primary datasource.
After drilling down on this I found out that each transaction was doing a couple of extra round trips to the DB to set the session characteristics:
SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY
ACTUAL QUERY/QUERIES
SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
To improve this I tried to play with the readOnlyMode setting that was introduced in pgjdbc 42.2.10. This setting allows you to control the driver's behavior when a connection is set to read-only (readOnly=true).
https://jdbc.postgresql.org/documentation/head/connect.html
In my first attempt I used readOnly=true and readOnlyMode=always. Even though I stopped seeing the SET SESSION CHARACTERISTICS statements, the latency remained unchanged. Finally I tried readOnly=false and readOnlyMode=ignore. This last option caused the latency to decrease, however it is still worse than it was before.
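For reference, the reader pool configuration I've been experimenting with looks roughly like this (endpoint, credentials, and pool size are placeholders; readOnlyMode requires pgjdbc 42.2.10 or later):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ReaderPool {

    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        // Aurora reader endpoint (placeholder host name).
        config.setJdbcUrl("jdbc:postgresql://my-cluster.cluster-ro-xxxx.eu-west-1.rds.amazonaws.com:5432/mydb"
                + "?readOnlyMode=ignore"); // driver ignores setReadOnly(), so no SET SESSION CHARACTERISTICS round trips
        config.setUsername("app_user");
        config.setPassword("secret");
        config.setPoolName("reader");
        config.setMaximumPoolSize(20);
        // With readOnlyMode=ignore we also leave readOnly=false so Hikari does not toggle
        // the flag on checkout/return; the routing layer alone decides the target.
        config.setReadOnly(false);
        return new HikariDataSource(config);
    }
}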
Does anyone else have experience with this kind of setup? What is the optimal configuration?
I don't have a need to flag the transaction as read-only (other than to tell the routing datasource to use the read replica), so I would like to figure out whether it's possible to do anything else so that the latency stays the same between the Writer and Reader nodes.
Note: at the moment the reader node is only serving about 1% of all the traffic (roughly 20 req/s).
Related
I am currently planning some server infrastructure. I have two servers in different locations. My apps (APIs and so on) run on both of them, and the client connects to the nearest one (best connection). In case of failure of one server, the other can process the requests.

I want to use MongoDB for my projects. The first idea is to use a replica set, so I can ensure the data is consistent: if one server fails the data is still accessible, and the secondary switches to primary. When the app on the primary server wants to use the data, that is fine, but the app on the other server must connect to the primary server in order to handle data (that would solve the failover, but not the "best connection" problem).

In MongoDB there is an option to read data from secondary servers, but then I have to ensure that the inserts (only possible on the primary) are consistent on every secondary. There is also an option for this, "writeConcern". Is it possible to somehow specify "writeConcern on a specific secondary"? Because if I add a second secondary without the apps on it, "writeConcern" on every secondary would not be necessary. And if I specify a specific value, I don't really know on which secondary the data is available, right?
Summary: I want to reduce the connections between the servers when the api is called.
Please share some thoughts or ideas to fix my problem.
Writes can only be done on primaries.
To control which secondary the reads are directed to, you can use max staleness as well as tags.
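For illustration, this is roughly how tags and max staleness combine into a read preference with the MongoDB Java driver (host names, the dc tag, and the 90-second bound are made-up values, and the tags must also be set on the replica set members' configuration):

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadPreference;
import com.mongodb.Tag;
import com.mongodb.TagSet;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

import java.util.concurrent.TimeUnit;

public class LocalReads {
    public static void main(String[] args) {
        // Prefer a secondary tagged { "dc": "eu" } that is at most 90 seconds behind the primary.
        ReadPreference localSecondary = ReadPreference.secondaryPreferred(
                new TagSet(new Tag("dc", "eu")), 90, TimeUnit.SECONDS);

        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb://server-a,server-b/?replicaSet=rs0"))
                .readPreference(localSecondary)
                .build();

        try (MongoClient client = MongoClients.create(settings)) {
            Document first = client.getDatabase("app")
                    .getCollection("items")
                    .find()
                    .first();
            System.out.println(first);
        }
    }
}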
that the inserts (only possible on primary) are consistent on every secondary.
I don't understand what you mean by this phrase.
If you have two geographically separated datacenters, A and B, it is physically impossible to write data in A and instantly see it in B. You must either wait for the write to propagate or wait for the read to fetch data from the remote node.
To pay the cost at write time, set your write concern to the number of nodes in the deployment (2, in your proposal). To pay the cost at read time, use primary reads.
Note that merely setting write concern equal to the number of nodes doesn't make all nodes have the same data at all times - it just makes your application only consider the write successful when all nodes have received it. The primary can still be ahead of a particular secondary in terms of operations committed.
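To make the write-time option concrete, a w:2 write concern with the Java driver would look roughly like this (database, collection, and timeout values are illustrative):

import com.mongodb.ConnectionString;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.concurrent.TimeUnit;

public class AcknowledgedWrites {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create(
                new ConnectionString("mongodb://server-a,server-b/?replicaSet=rs0"))) {

            // w=2: the insert is acknowledged only after both nodes have the write;
            // the wtimeout avoids blocking forever if a member is down.
            MongoCollection<Document> items = client.getDatabase("app")
                    .getCollection("items")
                    .withWriteConcern(new WriteConcern(2).withWTimeout(5, TimeUnit.SECONDS));

            items.insertOne(new Document("name", "example"));
        }
    }
}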
And, as noted in comments, a two-node replica set will not accept writes unless both members are operational, which is why it is generally not a useful configuration to employ.
Summary: I want to reduce the connections between the servers when the api is called.
This has nothing to do with the rest of the question, and if you really mean this it's a premature optimization.
If what you want is faster network I/O I suggest looking into setting up better connectivity between your application and your database (for example, I imagine AWS would offer pretty good connectivity between their various regions).
I have an application which runs stably under heavy load, but only when the load increases gradually.
I run 3 or 4 pods at the same time, and it scales to 8 or 10 pods when necessary.
The standard load is about 4,000 requests per minute (roughly 66 requests per second in total, i.e. about 16 requests per second per pod).
There is a certain scenario when we receive a huge load spike (from 4k per minute to 20k per minute). New pods are correctly created and then start to receive new load.
The problem is that in about 10-20% of cases a newly created pod struggles to handle the initial load: DB requests take over 5000 ms and pile up, finally resulting in an exception that the connection pool was exhausted: The connection pool has been exhausted, either raise MaxPoolSize (currently 200) or Timeout (currently 15 seconds)
Here are some screenshots from New Relic:
I can see that other pods are doing well, and also that after the initial struggle all pods handle the load without any issue.
Here is what I did when attempting to fix it:
Got rid of non-async calls. I had a few lines of blocking code inside async methods. I changed everything to async; I no longer have non-async methods.
Removed long-running transactions. We had long running transactions, like this:
- beginTransactionAsync
- selectDataAsync
- saveDataAsync
- commitTransactionAsync
which I refactored to:
- selectDataAsync
- saveDataAsync // under the hood, EF Core uses a short-lived transaction
This helped a lot, but did not solve problem completely.
Ensured some connections are always open and ready. We added Minimum Pool Size=20 to the connection string, to always keep at least 20 connections open.
This also helped, but still sometimes pods struggle.
Our pods start receiving traffic only after the readiness probe returns success. The readiness probe checks the connection to the DB using the standard .NET Core health check.
Our connection string has the MaxPoolSize=100;Timeout=15; settings.
I do of course expect that a new pod instance will initially need some spin-up time during which it operates at reduced capacity, but I do not expect a pod to suffocate and throw 90% errors this often.
Important note here:
I have 2 different DbContexts sharing the same connection string (thus the same connection pool). Each of these DbContexts accesses a different schema in the DB. This was done to have a modular architecture. The DbContexts never communicate with each other and are never used together in the same request.
My current guess is that when a pod is freshly created and receives a huge load immediately, it tries to open all 100 connections (this is visible in the DB open-sessions chart), which is too much at the beginning. What could be another reason? How can I make sure that a pod operates at its optimal performance from the very beginning?
Final notes:
The DB CPU is not at its max (about 30%-40% usage under the heaviest load).
Most of the SQL queries and commands are relatively simple SELECTs and INSERTs.
After the initial struggle, queries take no more than 10-20 ms each.
I don't want to patch the problem by increasing the number of connections in the pool beyond 100, because after the initial struggle the pods operate properly with around 50 connections in use.
I don't think there is a connection leak, because in that case exceptions would also show up under normal load after some time.
I use a scoped DbContext, which is disposed (and thus the connection is released to the pool) at the end of each request.
EDIT 25-Nov-2020
My current guess is that the newly created pod is not receiving enough of either network bandwidth or CPU. This reasoning is supported by the fact that even requests which do NOT query the DB were struggling.
Question: is it possible that a newly created pod is granted insufficient resources (CPU or network bandwidth) at the beginning?
EDIT 2 26-Nov-2020 (answering Arca Artem)
The app runs on a k8s cluster on AWS.
The app connects to the DB using the standard ADO.NET connection pool (max 100 connections per pod).
I'm monitoring DB connections and DB CPU (all within reasonable limits). The hosts' CPU utilization is also around 20-25%.
I assumed that once a pod starts and the /health endpoint responds successfully (it checks the DB connection with a simple SELECT probe), and the pod's max capacity is e.g. 200 rps, then the pod would be able to handle that traffic from the very first moment after the /health probe succeeded. However, from my logs I see that after the /health probe succeeds 4 times in a row in under 20 ms and traffic starts coming in, the first few seconds of the pod handling traffic take more than 5 s per request (sometimes even 40 seconds per request).
I'm NOT monitoring the hosts' network.
At this point it's just speculation on my part without knowing more about the code and architecture, but it's worth mentioning one thing that jumps out to me: the health check might not be using the normal code path that your other endpoints use, potentially leading to a false positive. If you have the option, using a profiler could help you pinpoint exactly when and how this happens. If not, we can take educated guesses about where the problem might be. There could be a number of things at play here, and you may already be familiar with these, but I'm covering them for completeness' sake:
First of all, it's worth bearing in mind that connections in Postgres are very expensive (to put it simply, each connection is a forked database process), and your pods are consequently creating them in bulk when you scale your app all at once. A relatively considerable time is needed to set each one up, and if you're creating them in bulk it adds up (how long depends on configuration, available resources, etc.).
Assuming you're using ASP.NET Core (because you mentioned DbContext), the initial request(s) will take the penalty of initialising the whole stack (creating the minimum required connections in the pool, initialising the ASP.NET stack, dependencies, etc.). Again, this all depends on how you structure your code and what your app actually does during initialisation. If your health endpoint connects to the DB directly (without utilising the connection pool), it skips the costly pool initialisation, leaving your initial real requests to carry that burden.
You're not observing the same behaviour when your load increases gradually, possibly because these things are usually an interplay between different components and generally a non-linear function of available resources, code behaviour, etc. Specifically, if it's just one new pod that spun up, it will require far fewer connections than, say, 5 new pods spinning up, and Postgres would be able to satisfy it much more quickly. Postgres is the shared resource here: creating 1 new connection is significantly faster than creating 100 new connections (5 pods x 20 min connections in a pool) for all pods waiting on a new connection.
There are a few things you can do to speed up this process, such as config changes or using an external connection pooler like PgBouncer, but they won't be effective unless your health endpoint represents the actual state of your pods.
Again it's all based on assumptions but if you're not doing that already, try using the DbContext in your health endpoint to ensure the pool is initialised and ready to take connections. As someone mentioned in the comments, it's worth looking at other types of probes that might be better suited to implementing this pattern.
I found the ultimate reason for the above issue: insufficient CPU resources assigned to the pod.
The problem was hard to isolate because the New Relic APM CPU usage charts are calculated in a different way than expected (please refer to the New Relic docs). The real pod CPU usage vs. CPU limit can be seen only in the New Relic Kubernetes cluster viewer (which probably uses a different algorithm to chart CPU usage).
What is more, when a pod starts up it needs a little extra CPU at the beginning. On top of that, the pods were starting because of high traffic, and there simply was not enough CPU to handle those requests.
We're using Amazon Web Services for a business application which uses a Node.js server and MongoDB as the database. Currently the Node.js server is running on an EC2 medium instance, and we're keeping our MongoDB database in a separate micro instance. Now we want to deploy a replica set for our MongoDB database, so that if MongoDB gets locked or becomes unavailable, we can still run our database and get data from it.
So we're trying to keep each member of the replica set in a separate instance, so that we can get data from the database even if the instance of the primary member shuts down.
Now, I want to add load balancing to the database, so that the database works fine even under a huge traffic load. In that case I can balance reads across the database by adding the slaveOK config to the replica set. But it won't load balance the database if there is a huge traffic load of write operations on the database.
To solve this problem I got two options till now.
Option 1: I have to shard the database and keep each shard in a separate instance. And under each shard there will be a replica set in the same instance. But there is a problem: as the shard divides the database into multiple parts, each shard will not keep the same data within it. So if one instance shuts down, we'll not be able to access the data from the shard within that instance.
To solve this problem I'm trying to divide the database into shards, where each shard has a replica set spread across separate instances. So even if one instance shuts down, we'll not face any problem. But if we have 2 shards and each shard has 3 members in the replica set, then I need 6 AWS instances. So I think it's not the optimal solution.
Option 2: We can create a master-master configuration in MongoDB, meaning all the databases will be primary and all will have read/write access, but I would also like them to auto-sync with each other every so often, so they all end up being clones of each other. And all these primary databases will be in separate instances. But I don't know whether MongoDB supports this structure or not.
I haven't found any MongoDB docs/blogs for this situation. So, please suggest what would be the best solution for this problem.
This won't be a complete answer by far; there are too many details, and I could write an entire essay about this question, as could many others. However, since I don't have that kind of time to spare, I will add some commentary about what I see.
Now, I want to add load balancer in the database, so that the database works fine even in huge traffic load at a time.
Replica sets are not designed to work like that. If you wish to load balance you might in fact be looking for sharding which will allow you to do this.
Replication is for automatic failover.
In that case I can read balance the database by adding slaveOK config in the replicaSet.
Since, to stay up to date, your members will be getting just as many ops as the primary, it seems like this might not help too much.
In reality, instead of having one server with many connections queued, you have many connections on many servers queueing for stale data, since member consistency is eventual, not immediate (unlike ACID technologies). That being said, they are only eventually consistent by 32-odd ms, which means they are not lagging enough to give decent extra throughput if the primary is loaded.
Since reads ARE concurrent, you will get the same speed whether you are reading from the primary or a secondary. I suppose you could delay a slave to create a pause in ops, but that would bring back massively stale data in return.
Not to mention that MongoDB is not multi-master; as such, you can only write to one node at a time, which makes slaveOK not the most useful setting in the world any more, and I have seen numerous times where 10gen themselves recommend you use sharding over this setting.
Option 2: We can create a master-master configuration in the mongodb,
This would require your own coding. At that point you may want to consider actually using a database that supports multi-master replication: http://en.wikipedia.org/wiki/Multi-master_replication
This is because the speed you are looking for is most likely in fact in writes, not reads, as I discussed above.
Option 1: I've to shard the database and keep each shard in separate instance.
This is the recommended way, but you have found the caveat with it. This is unfortunately something that remains unsolved and that multi-master replication is supposed to solve. However, multi-master replication does add its own ship of plague rats to Europe, and I would strongly recommend you do some serious research before you decide whether MongoDB cannot currently service your needs.
You might be worrying about nothing, really, since the fsync queue is designed to deal with the IO bottleneck slowing down your writes (as it would in SQL), and reads are concurrent, so if you plan your schema and working set right you should be able to get a massive amount of ops.
There is in fact a linked question around here from a 10gen employee that is very good to read: https://stackoverflow.com/a/17459488/383478 and it shows just how much throughput MongoDB can achieve under load.
It will grow soon with the new document-level locking that is already in the dev branch.
Option 1 is the recommended way, as pointed out by @Sammaye, but you would not need 6 instances; you can manage it with 4 instances.
Assuming you need the configuration below:
2 shards (S1, S2)
1 copy for each shard (Replica set secondary) (RS1, RS2)
1 Arbiter for each shard (RA1, RA2)
You could then divide your server configuration like below.
Instance 1 : Runs : S1 (Primary Node)
Instance 2 : Runs : S2 (Primary Node)
Instance 3 : Runs : RS1 (Secondary Node S1) and RA2 (Arbiter Node S2)
Instance 4 : Runs : RS2 (Secondary Node S2) and RA1 (Arbiter Node S1)
You could run the arbiter nodes alongside your secondary nodes, which would help in elections during failovers.
We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means you can tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication factor would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
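I don't know your exact store definition, but as a rough sketch: the three settings above live in the server-side store definition (stores.xml), and the application then just issues plain puts and gets. The store name and bootstrap URL below are made up:

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;

public class VoldemortExample {
    public static void main(String[] args) {
        // "user-profiles" is assumed to be defined in stores.xml with, e.g.,
        // replication-factor=2, required-reads=1, required-writes=1.
        SocketStoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://node-eu:6666"));
        StoreClient<String, String> store = factory.getStoreClient("user-profiles");

        store.put("user:42", "{\"name\": \"example\"}"); // succeeds once required-writes nodes ack
        String value = store.getValue("user:42");        // returns after required-reads nodes answer
        System.out.println(value);

        factory.close();
    }
}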
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "NetworkTopologyStrategy"). For example, you can specify that there should always be two replicas in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).
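As a rough illustration with the DataStax Java driver (3.x API; the contact point, data center names, and keyspace are made up): the keyspace below keeps two replicas in each DC via NetworkTopologyStrategy, the load-balancing policy keeps the coordinator local, and LOCAL_ONE reads are served from the local DC while writes replicate to the remote DC in the background.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class MultiDcExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1") // a node in the local data center
                .withLoadBalancingPolicy(
                        DCAwareRoundRobinPolicy.builder().withLocalDc("dc_eu").build())
                .build();

        try (Session session = cluster.connect()) {
            // Two replicas per data center.
            session.execute("CREATE KEYSPACE IF NOT EXISTS app WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'dc_eu': 2, 'dc_us': 2}");

            // LOCAL_ONE: answered by a replica in the local DC, no cross-DC round trip.
            Row row = session.execute(new SimpleStatement(
                    "SELECT release_version FROM system.local")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)).one();
            System.out.println(row.getString("release_version"));
        } finally {
            cluster.close();
        }
    }
}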
I have decided to start developing a little web application in my spare time so I can learn about MongoDB. I was planning to get an Amazon AWS micro instance and start the development and the alpha stage there. However, I stumbled across a question here on Stack Overflow that concerned me:
But for durability, you need to use at least 2 mongodb server instances as master/slave. Otherwise you can lose the last minute of your data.
Is that true? Can't I just have my box with everything installed on it (Apache, PHP, MongoDB) and rely on the data being correctly stored? At least, there must be a config option in MongoDB to make it behave reliably even if installed on a single box - isn't there?
The information you have on master/slave setups is outdated. Running single-server MongoDB with journaling gives you a durable data store, so for use cases where you don't need replica sets, or if you're in the development stage, journaling will work well.
However, if you're in production, we recommend using replica sets. For the bare minimum setup, you would ideally run three (or more) instances of mongod: a 'primary' which receives reads and writes, a 'secondary' to which the writes from the primary are replicated, and an arbiter, a single instance of mongod that allows a vote to take place should the primary become unavailable. This 'automatic failover' means that, should your primary be unable to receive writes from your application at a given time, the secondary will become the primary and take over receiving data from your app.
You can read more about journaling and replication in the MongoDB documentation, and you should definitely familiarize yourself with the docs in general in order to get a better sense of what MongoDB is all about.
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity. Clients have the ability to send read and write operations to different servers. You can also maintain copies in different data centers to increase the locality and availability of data for distributed applications.
Replication in MongoDB
A replica set is a group of mongod instances that host the same data set. One mongod, the primary, receives all write operations. All other instances, secondaries, apply operations from the primary so that they have the same data set.
The primary accepts all write operations from clients. A replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency. To support replication, the primary logs all changes to its data sets in its oplog. See primary for more information.