Cassandra giving TTransportException after some inserts/updates - nosql

Around 15 processes were inserting/updating unique entries in Cassandra. Everything was working fine but after sometime I get this error.
(When I restart the process everything is fine again till sometime)
An attempt was made to connect to each of the servers twice, but none
of the attempts succeeded. The last failure was TTransportException:
Could not connect to 10.204.81.77:9160
I did CPU/Memory Analysis of all the Cassandra machines. CPU usage sometimes goes around 110% and Memory Usage was between 60% - 77%. Not sure if this might be the cause, as it was working fine with such memory and cpu usage most of the time.
p.s.: How to ensure Cassandra update/insertion works error free?

Cassandra will throw an exception if anything goes wrong with your inserts; otherwise, you can assume it was error free.
Connection failures are a network problem, not a Cassandra problem. Some places to start: is the Cassandra process still alive? Does netstat show it still listening on 9160? Can you connect to non-Cassandra services on that machine? Is your server or router configured to firewall off frequent connection attempts?

Related

Jobrunr background server stopped polling

I had a number of jobs scheduled but seems none of the jobs were running. On further debugging, I found that there are no available servers, and in the jobrunr_backgroundjobservers table, it seems that there has not been a heart beat for any of the servers. What would cause this issue? How would I restart a heartbeat? And how would I know when such an issue occurs and the servers go down again, given that schedules are time sensitive?
It will stop polling if the connection to the database was lost or the database goes down for a while.
The JobRunr Pro version adds extra features and one of them is database fault tolerance - if such an issue occurs, JobRunr Pro will go in standby and will start processing again once the connection to the database is stable again.
See https://www.jobrunr.io/en/documentation/pro/database-fault-tolerance/ for more info.

CloudSQL Proxy intermittently refuses connection

I'm using a Cloud SQL proxy sidecar on my nodejs API service.
It appears to work great, except that approximately 1% of my API requests come back with an error indicating that the DB connection failed with:
connect ECONNREFUSED 127.0.0.1:3306
My backend logs show that this was thrown from my ORM when it attempted to connect to the DB.
Sidecar logs show nothing, and the CloudSQL instance in question shows nothing out of the ordinary (17/4000 connections, <1% CPU usage, 1.5/3.5GiB memory usage, <100KiB ingress/egress per time slice on 6 hour window).
What might be causing this?
Edit: additional information:
All my pods have been up for many hours with 0 restarts, so the intermittent failure isn't a transient startup failure.
Logs show that this has been occurring intermittently since 30 days ago.
Here are a few reasons that can cause a Cloud SQL instance to become inaccessible:
1) Connection failure between your instance and the agents Cloud SQL uses to monitor the health of your instance
2) Synchronization of operations between your instance and the Cloud SQL service
3) Underprovisioning of resources, such as CPU cores, RAM, and/or storage, to your Cloud SQL instance (see Cloud SQL's Operational Guidelines [1] for additional information).
Since there are several reasons which could cause connections to be dropped (many of which are intricately related to the specifics of your project's implementation and environment), it's extremely complex to diagnose abnormal connection rejection. Additionally, Cloud SQL continuously monitors for any issues that can make an instance inaccessible and automatically takes action to resolve these issues.
Under normal circumstances, the error rate will not fully go away, but should happen at a very low level [2]. There are, of course, some conditions that can make it worse - both production issues as well as certain combinations of operations.
In any case, the recommendation under such circumstances is to implement a retry strategy for reconnection to the instances with exponential backoff. Some of the client libraries already have supporting code in place, but it depends a bit on what you're exactly using.
[1] https://cloud.google.com/sql/docs/mysql/operational-guidelines
[2] https://cloud.google.com/sql/sla

pg_create_logical_replication_slot hanging indefinitely due to old walsender process

I am testing logical replication between 2 PostgreSQL 11 databases for use on our production (I was able to set it thanks to this answer - PostgreSQL logical replication - create subscription hangs) and it worked well.
Now I am testing scripts and procedure which would set it automatically on production databases but I am facing strange problem with logical replication slots.
I had to restart logical replica due to some changes in setting requiring restart - which of course could happen on replicas also in the future. But logical replication slot on master did not disconnect and it is still active for certain PID.
I dropped subscription on master (I am still only testing) and tried to repeat the whole process with new logical replication slot but I am facing strange situation.
I cannot create new logical replication slot with the new name. Process running on the old logical replication slot is still active and showing wait_event_type=Lock and wait_event=transaction.
When I try to use pg_create_logical_replication_slot to create new logical replication slot I get similar situation. New slot is created - I see it in pg_catalog but it is marked as active for the PID of the session which issued this command and command hangs indefinitely. When I check processes I can see this command active with same waiting values Lock/transaction.
I tried to activate parameter "lock_timeout" in postgresql.conf and reload configuration but it did not help.
Killing that old hanging process will most likely bring down the whole postgres because it is "walsender" process. It is visible in processes list still with IP of replica with status "idle wating".
I tried to find some parameter(s) which could help me to force postgres to stop this walsender. But settings wal_keep_segments or wal_sender_timeout did not change anything. I even tried to stop replica for longer time - no effect.
Is there some way to do something with this situation without restarting the whole postgres? Like forcing timeout for walsender or lock for transaction etc...
Because if something like this happens on production I would not be able to use restart or any other "brute force". Thanks...
UPDATE:
"Walsender" process "died out" after some time but log does not show anything about it so I do not know when exactly it happened. I can only guess it depends on tcp_keepalives_* parameters. Default on Debian 9 is 2 hours to keep idle process. So I tried to set these parameters in postgresql.conf and will see in following tests.
Strangely enough today everything works without any problems and no matter how I try to simulate yesterday's problems I cannot. Maybe there were some network communication problems in the cloud datacenter involved - we experienced some occasional timeouts in connections into other databases too.
So I really do not know the answer except for "wait until walsender process on master dies" - which can most likely be influenced by tcp_keepalives_* settings. Therefore I recommend to set them to some reasonable values in postgresql.conf because defaults on OS are usually too big.
Actually we use it on our big analytical databases (set both on PostgreSQL and OS) because of similar problems. Golang and nodejs programs calculating statistics from time to time failed to recognize that database session ended or died out in some cases and were hanging until OS ended the connection after 2 hours (default on Debian). All of it seemed to be always connected with network communication problems. With proper tcp_keepalives_* setting reaction is much quicker in case of problems.
After old walsender process dies on master you can repeat all steps and it should work. So looks like I just had bad luck yesterday...

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL postgres server to a bigger machine and the upgrade is not terminating. In our experience, this usually takes less than 5 minutes, but we'ven been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down(except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a CloudSQL server, the instance is rebooted. It can happen occasionally that rebooting takes more than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, on the GCP console, go to SQL, click on an instance, and "Edit maintenance schedule"->"Set maintenance schedule". Then choose a window.

Slow replication recovery due to communication problems

We had lately several times the same problems on Google compute engine environment with PostgreSQL streaming replication and I would like to understand reasons and if I can repair it in some smoother way.
From time to time we see some communication problems in Google's internal network in GCE datacenter and they always trigger replication lags between our PG master and its replicas. All machines are Debian-8 and PostgreSQL 9.5.
When situation happens everything seems to be OK - no errors in PG logs on master or replicas just communication between master and replicas seems to be incredibly slow or repeatedly failing so new WAL logs are transfered to replicas with big delays and therefore replication lag is still growing.
Restart of replication from within PostgreSQl or restart of PostgreSQL on replica does not really help - after several WAL logs copied using scp in recovery command communication is back in previous incredibly slow status. Only restart of the whole instance help. When whole VM is restarted communication is back to normal and recovery even from lag many hours long is done in a few minutes. So main reason for this behavior seems to be on OS level. I tried to check net traffic but without finding anything significant. I also do not see anything relevant in any OS log.
Could restart of some OS service help? So I do not need to restart the whole VM? Thank you very much for any ideas.