I am seeing a strange, recurring (but not constant) error: "2013, 'Lost connection to MySQL server during query'". These are the premises:
a Python app runs for around 15–20 minutes every hour and then stops (scheduled hourly by cron)
the app is on a GCE n1-highcpu-2 instance; the db is on a D1 with a per-package pricing plan and the following MySQL flags
max_allowed_packet 1073741824
slow_query_log on
log_output TABLE
log_queries_not_using_indexes on
the database is accessed only by this app, so the usage pattern is the same every hour: around 20 consecutive minutes of activity, then nothing at all for the other 40 minutes
the first query it runs is:
SELECT users.user_id, users.access_token, users.access_token_secret, users.screen_name, metadata.last_id
FROM users
LEFT OUTER JOIN metadata ON users.user_id = metadata.user_id
WHERE users.enabled = 1
the above query joins two tables that are each around 700 rows long and have no indexes
after this query (which takes 0.2 seconds when it runs without problems) the app proceeds without any issues
Looking at the logs, I see that each time this error occurs, the interval between the start of the query and the error is 15 minutes.
I've also enabled the slow query log, and these queries are logged like this:
start_time: 2014-10-27 13:19:04
query_time: 00:00:00
lock_time: 00:00:00
rows_sent: 760
rows_examined: 1514
db: foobar
last_insert_id: 0
insert_id: 0
server_id: 1234567
sql_text: ...
Any ideas?
If your connection is idle for the 15-minute gap, then you are probably seeing GCE disconnect your idle TCP connection, as described at https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet. Try the workaround that page suggests:
sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
(You may need to put this configuration into /etc/sysctl.conf to make it permanent.)
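If the fix has to live in the Python app rather than in the instance configuration, the same keepalive values can be applied per-socket. This is a minimal sketch using only the standard library; the enable_keepalive helper and the idea of applying it to a driver's underlying socket (e.g. a PyMySQL connection's socket after connecting) are assumptions, not part of the original setup:

```python
import socket

def enable_keepalive(sock, idle=60, interval=60, probes=5):
    """Turn on TCP keepalive so GCE does not drop the connection as idle.

    Mirrors the sysctl workaround (tcp_keepalive_time=60, intvl=60, probes=5),
    but only for this one socket instead of system-wide.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket TCP knobs are Linux-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero once enabled
```

With a real driver you would connect first and then apply the helper to the connection's socket, so the probes start before the 15-minute idle window elapses.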
Related
We are using Grails 4 with Hibernate 5 and PostgreSQL, and I ran into a strange problem that I don't know how to troubleshoot.
I have created a test that creates a Family (a large object graph producing around 20 INSERT and UPDATE queries) and then tries to retrieve it by its id.
So there is a FamilyController extending grails.rest.RestfulController, having static responseFormats = ['json'] and the following methods:
@Override
protected Family createResource() {
    def instance = new NewFamilyCommand()
    bindData(instance, getObjectToBind())
    instance.validate()
    // Check for binding errors
    if (instance.hasErrors()) {
        throw new ValidationException('Unable to bind new family', instance.errors)
    }
    // Send the command to the service, which will return an actual Family domain object
    return familyService.addFamilyToUser(instance)
}
and
@Override
protected Family queryForResource(Serializable id) {
    def family = familyService.safeGetFamily(Long.parseLong(id))
    ...
    return family
}
I run this in a loop using a Cypress test, and it works fine most of the time.
The problem is that from time to time (it seems to happen at a multiple of 50, which is, coincidentally, the number of maxConnections configured in Tomcat) querying the Family by the returned id does not find it.
Here is the Tomcat configuration for the datasource (default recommended by Grails):
dataSource:
    pooled: true
    jmxExport: true
    driverClassName: org.postgresql.Driver
    dialect: "net.kaleidos.hibernate.PostgresqlExtensionsDialect"
    username: "test"
    password: "test"
    properties:
        initialSize: 2
        maxActive: 50
        minIdle: 2
        maxIdle: 2
        maxWait: 10000
        maxAge: 10 * 60000
        timeBetweenEvictionRunsMillis: 5000
        minEvictableIdleTimeMillis: 60000
        validationQuery: "SELECT 1"
        validationQueryTimeout: 3
        validationInterval: 15000
        testOnBorrow: true
        testWhileIdle: true
        testOnReturn: false
        jdbcInterceptors: "ConnectionState;StatementCache(max=200)"
        defaultTransactionIsolation: 2 # TRANSACTION_READ_COMMITTED
        removeAbandoned: true
        removeAbandonedTimeout: 300
PostgreSQL runs in a Docker container with 2 PIDs for this database, let's say 73 and 74.
I looked in the PostgreSQL logs and noticed a difference between a successful test and a failed one.
In a successful scenario, both the Family creation and the Family retrieval run in the same PID.
In a failed scenario, the Family creation is performed by PID 74 and the Family retrieval by PID 73.
What is more curious is that PID 73 is idle most of the time: it runs its first query around the creation of Family 50 (presumably when connections start being reused), and then at Family 101 it is used for the Family retrieval query, whose transaction begins before PID 74's Family creation transaction commits (at least that is what the Postgres logs show, but maybe the logs are not printed chronologically).
Checking the database right after the test fails, I see the Family saved in the database; the tests also pass if I add a little wait time before querying for the result.
I am wondering how that could be. I assumed that PostgreSQL returns the ID only after the transaction is committed, so why would the other PID not see it?
Any help in troubleshooting this would be greatly appreciated.
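The visibility rule at play here is standard: a second backend cannot see rows that another backend has flushed but not yet committed. A minimal sketch of that rule, using SQLite from the Python standard library purely as a stand-in for Postgres (the two connections play the roles of PIDs 74 and 73; under READ COMMITTED in Postgres the behavior is the same):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path)  # "PID 74": creates the Family
writer.execute("CREATE TABLE family (id INTEGER PRIMARY KEY)")
writer.commit()

# The INSERT is sent to the database (like a Hibernate flush) but NOT committed.
writer.execute("INSERT INTO family (id) VALUES (1)")

reader = sqlite3.connect(path)  # "PID 73": tries to retrieve the Family
before_commit = reader.execute("SELECT COUNT(*) FROM family").fetchone()[0]

writer.commit()
after_commit = reader.execute("SELECT COUNT(*) FROM family").fetchone()[0]

print(before_commit, after_commit)  # 0 1 -- the row appears only after commit
```

So if the application hands out the generated id as soon as the session is flushed, a retrieval routed to a different backend can race the commit, which matches what the logs show.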
UPDATE: it seems to be related to Hibernate's flush mode. If I set sessionFactory.currentSession.setFlushMode(FlushMode.MANUAL) in familyService.addFamilyToUser(instance), the problem goes away and the statements are properly ordered. It seems that Hibernate flushes the session too early, returning the family id even though Postgres has not yet committed all the insert/update statements.
Looks like it was our fault: we messed around with the DirtyCheckable#hasChanged implementation, and the code generated a lot of UPDATE statements after the initial insert. This caused the session to be flushed several times, and the underlying PostgreSQL transactions were out of sync with the Hibernate transactions.
I have a pgbouncer.ini file with the configuration below:
[databases]
test_db = host=localhost port=5432 dbname=test_db
[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid
listen_addr = 0.0.0.0
listen_port = 5433
unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
#pool_mode = transaction
pool_mode = session
server_reset_query = RESET ALL;
ignore_startup_parameters = extra_float_digits
max_client_conn = 25000
autodb_idle_timeout = 3600
default_pool_size = 250
max_db_connections = 250
max_user_connections = 250
and I have in my postgresql.conf file
max_connections = 2000
Does the max_connections = 2000 in my postgresql.conf affect performance badly? Or does it not matter, since the connections are already handled by PgBouncer?
One more question: in the pgbouncer configuration, is listen_addr = 0.0.0.0 right, or should it be listen_addr = *?
Is it better to set default_pool_size on PgBouncer equal to the number of CPU cores available on this server?
Should default_pool_size, max_db_connections and max_user_connections all be set to the same value?
So the idea of using PgBouncer is to pool connections when you can't afford a higher max_connections in PG itself.
NOTE: Please DO NOT set max_connections to a number like 2000 just like that.
Let's start with an example: if you have a connection limit of 20, and your app or organization wants 1000 connections at a given time, that is where a pooler comes into the picture. In this specific case you want the 20 connections to serve those 1000 coming in from the application.
To understand how this actually works, let's take a step back and look at what happens when you do not have a connection pooler and rely only on the PG setting for max connections, which in our case is 20.
When a connection comes in from a client/application, the main PostgreSQL process (the postmaster) spawns a child process for it. So each new connection spawns a child process under the main postgres process, like so:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24379 postgres 20 0 346m 148m 122m R 61.7 7.4 0:46.36 postgres: sysbench sysbench ::1(40120)
24381 postgres 20 0 346m 143m 119m R 62.7 7.1 0:46.14 postgres: sysbench sysbench ::1(40124)
24380 postgres 20 0 338m 137m 121m R 57.7 6.8 0:46.04 postgres: sysbench sysbench ::1(40122)
24382 postgres 20 0 338m 129m 115m R 57.4 6.5 0:46.09 postgres: sysbench sysbench ::1(40126)
So once a connection request is sent, it is received by the postmaster process, which creates a child process at the OS level under the main parent process. This connection then has an "unlimited" life span, unless it is closed by the application or you have a timeout set for idle connections in PostgreSQL.
Now here comes the problem: managing the connections can become very costly for a given amount of compute once they exceed a certain limit. Serving n connections has a given compute cost, and past some point the OS cannot handle a huge number of connections and contention shows up at different levels (memory, CPU, I/O).
What if you could reuse the already-spawned child processes (backends) when they are not doing any work? You would save the time of creating a new backend, plus its additional cost (which can vary). This is where a pool of connections that are kept open to serve different client requests comes in, and it is called pooling.
So you have only n connections available, but the pooler can manage n+i client connections with them.
This is where PgBouncer helps reuse connections. It can be configured with 3 types of pooling: session pooling, statement pooling and transaction pooling. The bouncer returns a connection to the pool once the statement-level or transaction-level work is done; only with session pooling does it keep the connection until the client disconnects.
So basically: lower the number of connections at the PG conf level and tune the settings in pgbouncer.ini.
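The "n connections serving n+i clients" idea above can be sketched in a few lines. This is a toy illustration, not how PgBouncer is implemented; TinyPool and its queue are invented for the example:

```python
import queue
import threading

class TinyPool:
    """Toy pool: a fixed set of 'connections' is reused across many clients."""

    def __init__(self, make_conn, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(make_conn())  # open n real connections up front

    def acquire(self):
        return self._q.get()  # blocks until one of the n connections is free

    def release(self, conn):
        self._q.put(conn)  # returned to the pool, not closed

# 3 "connections" serve 30 client requests (n=3, n+i=30).
pool = TinyPool(lambda: object(), size=3)
served = []

def client(i):
    conn = pool.acquire()
    served.append(i)   # pretend to run a query on conn
    pool.release(conn)

threads = [threading.Thread(target=client, args=(i,)) for i in range(30)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(served))  # all 30 requests handled by only 3 connections
```

In real PgBouncer the moment of release depends on the pool mode: after each statement, after each transaction, or only when the client session ends.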
To answer the second part:
one more question: in the pgbouncer configuration, is listen_addr = 0.0.0.0 right, or should it be listen_addr = *?
It depends on your deployment: standalone, shared server, etc.
Basically, if PgBouncer is on the server itself and you want it to accept incoming connections from everywhere, use "*"; if you want only local connections to be allowed, use "127.0.0.1".
For the rest of your questions check this link: pgbouncer docs
I have tried to share a little of what I know; feel free to ask if anything was unclear, or to correct anything that was stated incorrectly.
I'm using the configuration below for Ebean, so normally there shouldn't be more than 20 open connections, which is the limit for the Hobby Basic plan I use on Heroku. Even so, Heroku throws the error FATAL: too many connections for role... from time to time. Any clues?
db.default.partitionCount=1
db.default.maxConnectionsPerPartition=20
db.default.minConnectionsPerPartition=2
db.default.acquireIncrement=1
db.default.acquireRetryAttempts=3
db.default.acquireRetryDelay=30 seconds
db.default.connectionTimeout=30 seconds
db.default.idleMaxAge=5 minutes
db.default.idleConnectionTestPeriod=0
db.default.maxConnectionAge=15 minutes
db.default.initSQL="SELECT 1"
db.default.releaseHelperThreads=0
We got an error from Postgres one day:
"connection limit exceeded for non-superusers"
In postgresql.conf, max_connections is set to 100.
At the time, I checked the activity with select * from pg_stat_activity; and the result showed only 17 connections.
We have used this application for almost 10 years and have never made any changes.
This is the first time we have received this kind of error, so I assume that "not closing the connections properly in the program" is not the cause.
Any tips?
I have a file with the simple query SELECT 1; repeated 1000 times. When I run it through time psql -f test.sql -o /dev/null, I get these results:
real 0m0.362s
user 0m0.064s
sys 0m0.060s
It's about 1000/0.362 = 2762 queries/sec?
In pgbench for this query I have:
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
tps = 12233.355663 (including connections establishing)
tps = 12239.560512 (excluding connections establishing)
Where does psql spend its time?
psql is simple, generic software, and the fact that the output goes to /dev/null does not disable the generation of formatted output. Parsing the generic result lines takes some time too. For very simple and very fast queries this overhead can be significant.
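To see that discarding output does not remove the cost of producing it, here is a small, hypothetical measurement in Python. raw() and formatted() only mimic the difference between skipping and performing psql-style row formatting; the numbers are illustrative, not a benchmark of psql itself:

```python
import io
import timeit

rows = [(1,)] * 1000     # stands in for 1000 SELECT 1 results
sink = io.StringIO()     # stands in for /dev/null: written to, never read

def raw():
    for r in rows:
        pass  # result received, nothing formatted

def formatted():
    for r in rows:
        # psql-style work per result: render the value, pad, print a footer
        sink.write(" %s\n(1 row)\n" % r[0])

t_raw = timeit.timeit(raw, number=200)
t_fmt = timeit.timeit(formatted, number=200)
print(t_fmt > t_raw)  # formatting dominates when the queries themselves are trivial
```

pgbench skips this per-row rendering entirely, which is consistent with the ~4x gap between the two measurements above.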