I'm looking for more detailed guidance / other people's experience of using Npgsql in production with Pgbouncer.
Basically we have the following setup using GKE and Google Cloud SQL:
Right now - I've got npgsql configured as if pgbouncer wasn't in place, using a local connection pool. I've added pgbouncer as a deployment in my GKE cluster as Google SQL has very low max connection limits - and to be able to scale my application horizontally inside of Kubernetes I need to protect against overwhelming it.
My problem is one of reliability when one of the pgbouncer pods dies (due to a node failure or as I'm scaling up/down).
When that happens (1) all of the existing open connections from the client side connection pools in the application pods don't immediately close (2) - and basically result in exceptions to my application as it tries to execute commands. Not ideal!
As I see it (and looking at the advice at https://www.npgsql.org/doc/compatibility.html) I have three options.
Live with it, and handle retries of SQL commands within my application. Possible, but seems like a lot of effort and creates lots of possible bugs if I get it wrong.
Turn on keep alives and let npgsql itself 'fail out' relatively quickly the bad connections when those fail. I'm not even sure if this will work or if it will cause further problems.
Turn off client side connection pooling entirely. This seems to be the official advice, but I am loathe to do this for performance reasons, it seems very wasteful for Npgsql to have to open a connnection to pgbouncer for each session - and runs counter to all of my experience with other RDBMS like SQL Server.
Am I on the right track with one of those options? Or am I missing something?
You are generally on the right track and your analysis seems accurate. Some comments:
Option 2 (turning out keepalives) will help remove idle connections in Npgsql's pool which have been broken. As you've written your application will still some failures (as some bad idle connections may not be removed in time). There is no particular reason to think this would cause further problems - this should be pretty safe to turn on.
Option 3 is indeed problematic for perf, as a TCP connection to pgbouncer would have to be established every single time a database connection is needed. It will also not provide a 100% fail-proof mechanism, since pgbouncer may still drop out while a connection is in use.
At the end of the day, you're asking about resiliency in the face of arbitrary network/server failure, which isn't an easy thing to achieve. The only 100% reliable way to deal with this is in your application, via a dedicated layer which would retry operations when a transient exception occurs. You may want to look at Polly, and note that Npgsql helps our a bit by exposing an IsTransient exception which can be used as a trigger to retry (Entity Framework Core also includes a similar "retry strategy"). If you do go down this path, note that transactions are particularly difficult to handle correctly.
Related
Forgive me if there's already been an answer to this somewhere, but I haven't seen anything definitive in the docs.
Is there a limit to the size of the connection pool?
I have a situation where there could be 100s or 1000s of connections open at once - should the connection pool be used for this or would that be an abuse of the feature?
Is there a limit to the size of the connection pool?
Probably not, however a greater concern is that each connection will take up RAM.
I have a situation where there could be 100s or 1000s of connections open at once - should the connection pool be used for this or would that be an abuse of the feature?
I don't see this being an abuse. I think during the times you have 100 or 1000 clients connecting at once, the server would handle the connections much better.
However if there is only 10 clients connecting and you have a connection pool of 1000, 900 connections could be seen as wasted resources.
Source: Deep Dive into Connection Pooling
I do not have any experience with connection pools, so don't take my information as granted. I would like to hear from someone with experience on the topic.
It turns out that using the connection pool to manage traffic to more than one database isn't possible - which is what I was aiming to do. The solution I'll be using involves creating numerous connections using createConnection(), closing any unused ones.
See an issue I opened in the Mongoose git project for a fuller explanation https://github.com/Automattic/mongoose/issues/6206
As far as I know database connectivity technologies like the entity framework open and close connections automatically to enhance scalability. (Managing Connections and Transactions)
For example a form using asp.net mvc and the entity framework will connect to retrieve a record and them will immediately disconnect and remains disconnected until I modify the data in the controls and save it.
I wonder if the same behavior applies for an access 2013 form linked via odbc to SQL Server. Once a record is retrieved, is the connection closed until my next operation or the connection remains open until I close the form? Is the behavior configurable?
The fact or existence of a connection does not change nor increase scalability for typical applications. So if you have 10, or 1000 connections, and those connections are NOT doing anything, then SQL server not doing any work, and hence no increase in scalability will occur in these typical cases.
And OFTEN there is additional chatter over the network to open the connection pull the data, close the connection.
Then when you write the data back you AGAIN have 3 steps. So you again open the connection, open the table, write the data, and then close the table!
In fact keeping the connection open means you don’t waste network bandwidth opening and closing the connection!
The MAIN reason for disconnected datasets is that such connections work far more reliable in the case when you have a poor or less than ideal connection (such as over the internet or via Wi-Fi at a coffee shop). I these cases, if the open connection command fails, then the connection does NOT occur, and you don’t pull any data. And if a bit of time delay or re-try occurs as the connection is re-attempted, then no big deal. So you grab that data and close the connection.
However, this opening, and then closing often as noted causes additional overhead. However, given how the internet works (as opposed to a typical office network), then this disconnected approach is much the norm for pulling data over the internet, or when using something like Wi-Fi. So the approach is one of expecting that a minor disconnect will and can occur.
The second “common” reason for this 3 step process is other development platforms “promote” the use of disconnected data because the forms are NOT bound to the actual data tables (or bound to a query). The downside of this disconnected approach is you thus in general have to write code to pull the data down to the client, and THEN render the data from the recordset object to the form. The result is a TON of additional work to edit data in a form. So expect the typical asp .net application to cost 5 or even 10 times as much as writing that application in Access.
In the case of Access bound form model, it eliminates the developer having to code the data pull and coding and eliminates the need of the developer to pull that data into some object, and then close that connection. Once Access establishes a connection to the SQL server, then that connection remains open until you shut down the Access application.
The keep the connection open and active has the advantage of rapid application development due to the bound forms model. So you DO NOT need to write code to pull data from the server and THEN transfer that data from some type of object into the form.
So the downside to the Access approach has little (if anything) to do with scalability. The downside is a simple break in the connection is NOT at all well handled by Access.
So if you build a form in Access that is bound to a SQL server table of say 1 million records, and you launch that form with a where clause of InvoiceNumer = 12356, then Access is smart, and ONLY pulls down the ONE record from SQL server. So in terms of scalability and performance, the use of disconnected system as opposed to the bound connected model in Access will not result in a performance difference.
However, because Access keeps that connection open, then any breakage in that connection will result in an ODBC error. Worse is Access is NOT able to recover from such errors when using bound forms – your only recourse is to restart Access.
It is certainly possible to build un-bound forms in Access, but this is NOT how Access was designed. In fact if one is going to adopt a disconnected data model in Access then Access is the wrong tool due to no wizards or “developer aids” for such an approach. So say .net has wizards built around the disconnected system, and Access has tools built around the connected system. And a LOT of functionally that makes Access such a great rapid application development tool is lost if you build un-bound forms in Access. So bound Access forms have MANY additional events that you not find in say .net forms.
So it not scalability or better performance that is lost by the bound forms approach in Access, but simple ease of development is the main feature and gain that the Access development approach results in.
A developer will STILL need and should LIMIT the number of records pulled into a form (by use of the forms “where” clause.
(so .net forms don’t have such indulgences as a where clause).
So the major shortcoming is that Access does not recover from ODBC disconnections since it was designed to keep such connections open.
I am trying to build a server which can handle as many concurrent connections as possible. (100k at least, for a start)
Right now, when i test it through LAN, it can go up to 50k+ concurrent connections easily (did not test more yet). However when I test it from outside my LAN, it never goes beyond about 8k...
To be more precise, when going past 8k, the first sockets no longer receive any data, as if the new ones replaced them...
Does anyone have any idea what could cause this?
I have done some research, and it seems, although it isn't clear, that routers/modems may have a limited amount of supported concurrent connections, is that true?
If so, and if that's my problem, do I have to get one that can support more? Or get rid of it somehow?
My application uses Postgresql 9.0 and is composed by one or more stations that interacts with a global database: it is like a common client server application but to avoid any additional hardware, all stations include both client and server: a main station is promoted to act also as server, and any other act as a client to it. This solution permits me to be scalable: a user may initially need a single station but it can decide to expand to more in future without a useless separate server in the initial phase.
I'm trying to avoid that if main station goes down all others stop working; to do it the best solution could be to continuously replicate the main database to unused database on one or more stations.
Searching I've found that pgpool can be used for my needs but from all examples and tutorial it seems that point of failure moves from main database to server that runs pgpool.
I read something about multiple pgpool and heartbeat tool but it isn't clear how to do it.
Considering my architecture, where doesn't exist separated and specialized servers, can someone give me some hints ? In case of failover it seems that pgpool do everything in automatic, can I consider that failover situation can be handled by a standard user without the intervention of an administrator ?
For these kind of applications I really like Amazon's Dynamo design. The document by the link is quite big, but it is worth reading. In fact, there're applications that already implement this approach:
mongoDB
Cassandra
Project Voldemort
Maybe others, but I'm not aware. Cassandra started within Facebook, Voldemort is the one used by LinkedIn. Making things distributed and adding redundancy into your data distribution you will step away from traditional Master-Slave replication approaches.
If you'd like to stay with PostgreSQL, it shouldn't be a big deal to implement such approach. You will need to implement an extra layer (a proxy), that will decide based on pre-configured options how to retrieve/save the data.
The proxying layer can be implemented in:
application (requires lot's of work IMHO);
database;
as a middleware.
You can use PL/Proxy on the middleware layer, project originated in Skype. It is deeply integrated into the PostgreSQL, so I'd say it is a combination of options 2 and 3. PL/Proxy will require you to use functions for all kind of queries against the database.
In case you will hit performance issues, PgBouncer can be used.
Last note: any way you decide to go, a known amount of development will be required.
EDIT:
It all depends on what you call “failure” and what you consider system being in an interrupted state.
Let's look on the pgpool features.
Connection Pooling PostgreSQL is using a single process (fork) per session. Obviously, if you have a very busy site, you'll hit the OS limit. To overcome this, connection poolers are used. They also allow you to use your resources evenly, so generally it's a good idea to have pooler before your database.In case of pgpool outage you'll face a big number of clients unable to reach your database. If you'll point them directly to the database, avoiding pooler, you'll face performance issues.
Replication All your queries will be auto-replicated to slave instances. This has meaning for the DML and DDL queries.In case of pgpool outage your replication will stop and slaves will not be able to catchup with master, as there's no change tracking done outside pgpool (as far as I know).
Load Balance Your read-only queries will be spread across several instances, achieving nice response times, allowing you to put more bandwidth on the system.In case of pgpool outage your queries will suddenly run much slower, if the system is capable of handling such a load. And this is in the case that master database will catchup instead of failed pgpool.
Limiting Exceeding Connections pgpool will queue connections in case they're not being able to process immediately.In case of pgpool outage all such connections will be aborted, which might brake the DB/Application protocol, i.e. Application was designed to never get connection aborts.
Parallel Query A single query is executed on several nodes to reduce response time.In case of pgpool outage such queries will not be possible, resulting in a longer processing.
If you're fine to face such conditions and you don't treat them as a failure, then pgpool can serve you well. And if 5 minutes of outage will cost your company several thousands $, then you should seek for a more solid solution.
The higher is the cost of the outage, the more fine tuned failover system should be.
Typically, it is not just single tool used to achieve failover automation.
In each failure you will have to tweak:
DNS, unless you want all clients' reconfiguration;
re-initialize backups and failover procedures;
make sure old master will not try to fight for it's role in case it comes back (STONITH);
in my experience we're people from DBA, SysAdmin, Architects and Operations departments who decide proper strategies.
Finally, in my view, pgpool is a good tool, I do use it. But it is not designed as a complete failover solution, not without extra thinking, measures taken, scripts written. Thus I've provided links to the distributed databases, they provide a much higher level of availability.
And PostgreSQL can be made distributed with a little effort due to it's great extensibility.
First of all, I'd recommend checking out pgBouncer rather than pgpool. Next, what level of scaling are you attempting to reach? You might just choose to run your connection pooler on all your client systems (bouncer is light enough for this to work).
That said, vyegorov's answer is probably the direction you should really be looking at in this day and age. Are you sure you really need a database?
EDIT
So, the rather obvious answer is that pgPool creates a single point of failure if you only have one box running it. The obvious solution is to run multiple poolers across multiple boxes. You then need to engineer your application code to handle database disconnections. This is not as easy at it sounds, but basically you need to use 2-phase commit for non-idempotent changes. So to the greatest extent possible you should make your changes idempotent.
Based on your comments, I'd guess that maybe you have limited experience dealing with database replication? pgPool does statement based replication. There are tradeoffs here. The benefit is that it's very easy to set up. The downside is that there is no guarantee that data on the replicated databases will be identical. It is also (I believe but haven't checked lately) not compatible with 2pc.
My prior comment asking if you really need a database was driven by my perception that you have designed a system without going into much detail around this part of it. I have about 2 decades experience working on "this part" of similar systems. I expect you will find that there are no out of the box solutions and that the issues involved get very complicated. In other words, I'm suggesting you re-consider your design.
Try reading this blog (with lots of information about PostgreSQL and PgPool-II):
https://www.itenlight.com/blog/2016/05/21/PostgreSQL+HA+with+pgpool-II+-+Part+5
Search for "WATCHDOG" on that same blog. With that you can configure a PgPool-II cluster. Two machines on the same subnet are required, though, and a virtual IP on the same subnet.
Hope that this is useful for anyone trying the same thing (even if this answer is a lot late).
PGPool certainly becomes a single point of failure, but it is a much smaller one than a Postgres instance.
Though I have not attempted it yet, it should be possible to have two machines with PGPool installed, but only running on one. You can then use Linux-HA to restart PGPool on the standby host if the primary becomes unavailable, and to optionally fail it back again when the primary comes back. You can at the same time use Linux-HA to move a single virtual IP over as well, so that your clients can connect to a single IP for their Postgres services.
Death of the postgres server will make PGPool send queries to the backup Postgres (promoting it to master if necessary).
Death of the PGPool server will cause a brief outage (configurable, but likely in the region of <1min) until PGPool starts up on the standby, the IP address is claimed, and a gratuitous ARP sent out. Of course, the client will have to be intelligent enough to reconnect without dying.
I have this question asked in the Go mailing list, but I think it is more general to get better response from SO.
When work with Java/.Net platform, I never had to manage database connection manually as the drivers handle it. Now, when try to connect to a no sql db with very basic driver support, it is my responsibility to manage the connection. The driver let connect, close, reconnect to a tcp port, but not sure how should i manage it (see the link). Do i have to create a new connection for each db request? can I use other 3rd party connection pooling libraries?
thanks.
I don't know enough about MongoDB to answer this question directly, but do you know how MongoDB handles requests over TCP? For example, one problem with a single TCP connection can be that the db will handle each request serially, potentially causing high latency even though it may be bottlenecking on a single machine and could handle a higher capacity.
Are the machines all running on a local network? If so, the cost of opening a new connection won't be too high, and might even be insignificant from a performance perspective regardless.
My two cents: Do one TCP connection per request and just profile it and see what happens. It is very easy to add pooling later if you're DoSing yourself, but it may never be a problem. That'll work right now, and you won't have to mess around with a third party library that may cause more problems than it solves.
Also, TCP programming is really easy. Don't be intimidated by it, detecting a closed socket, and reconnecting synchronously or asynchronously is simple.
Most mongodb drivers (clients) will create and use a connection pool when connecting to the server. Each socket (connection) can do one operation at a time at the server; because of how data is read off the socket you can issue many requests and server will just get them one after another and return data as each one completes.
There is a Go mongo db driver but it doesn't seem to do connection pooling. http://github.com/mikejs/gomongo
In addition to the answers here: if you find you do need to do some kind of connection pooling redis.go is a decent example of a database driver that pools connections. Specifically, look at the Client.popCon and Client.pushCon methods in the source.