Using MySQL Workbench causes errors on Galera Cluster - mysql-workbench

I have a 5 cluster MariaDB/Galera cluster running in production environment.
I also have a monitor which checks every 20 seconds for cluster size changes. One of our other engineers has been running queries using MySQL Workbench, and when that application is running, I start seeing alerts coming from my monitor where cluster size is 1. It does recover in a few seconds back to the correct size of 5, however it's disconcerting that this client app is causing issues on the cluster. I've requested everyone on our team to not use this app... however I wonder if anyone else has seen this, or knows what it is doing to the cluster.

Related

First 10 long running transactions

I have a fairly small cluster of 6 nodes, 3 client, and 3 server nodes. Important configurations,
storeKeepBinary = true,
cacheMode = Partitioned (some caches's about 5-8, out of 25 have this as TRANSACTIONAL)
AtomicityMode = Atomic
backups = 1
readFromBackups = false
no persistence
When I run the app for some load/performance test on-prem on 2 large boxes, 3 clients on one box, and 3 servers on another box, all within docker containers, I get a decent performance.
However, when I move them over to AWS and run them in EKS, the only change I make is to change the cluster discovery from standard TCP (default) to Kubernetes-based discovery and run the same test.
But now the performance is very bad, I keep getting,
WARN [sys-#145%test%] - [org.apache.ignite] First 10 long-running transactions [total=3]
Here the transactions are running more than a min long.
In other cases, I am getting,
WARN [sys-#196%test-2%] - [org.apache.ignite] First 10 long-running cache futures [total=1]
Here the associated future has been running for > 3 min.
Most of the places 'google search' has taken me, talks flaky/inconsistent n/w as the cause.
The app and the test seem to be ok since on a local on-prem this works just fine and the performance is decent as well.
Wanted to check if others have faced this or when running on Kubernetes in the public cloud something else needs to be done. Like somewhere I read nodes need to be pinned to the host in a cloud/virtual environment, but it's not mandatory.
TIA

Two versions of fluentd fighting over port in my cluster

Somehow, I have 2 versions of fluentd running in my cluster:
They end up fighting over the same port, they just keep cranking away, trying to start up on that port, and it saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the daemon sets and deployments, they just keep coming back. Also tried ssh'ing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
I seem to have fixed it. I went to GCP dashboard cluster edit page, Kubernetes Engine Monitoring dropdown was blank. It seems not even the dropdown could decide what to display here.
It seems the automated agent, or whatever, seriously messed up here, and had 2 versions of the logging and monitoring system running, fighting over a port, and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the daemon sets, pods, or deployments. It seems Google treats these as special somehow, maybe with some kind of automated agent, I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything looking good so far, but this whole event has me worried, I didn't do anything. This just....happened.
This is a dev cluster, but if it was a production cluster...

Elastic Cloud APM Server - Queue is full

I have many Java microservices running in a Kubernetes Cluster. All of them are APM agents sending data to an APM server in our Elastic Cloud Cluster.
Everything was working fine but suddenly every microservice received the error below showed in the logs.
I tried to restart the cluster, increase the hardware power and I tried to follow the hints but no success.
Obs: The disk is almost empty and the memory usage is ok.
Everything is in 7.5.2 version
I deleted all the indexes related to APM and everything worked after some minutes.
for better performance u can fine tune these fields in apm-server.yml file
internal queue size increase queue.mem.events=output.elasticsearch.worker * output.elasticsearch.bulk_max_size
default is 4096
output.elasticsearch.worker (increase) default is 1
output.elasticsearch.bulk_max_size (increase) default is 50 very less
Example : for my use case i have used following stats for 2 apm-server nodes and 3 es nodes (1 master 2 data nodes )
queue.mem.events=40000
output.elasticsearch.worker=4
output.elasticsearch.bulk_max_size=10000

What to do when a Google Cloud SQL postgres server upgrade (to a bigger machine) takes considerably longer than "a few minutes"?

We upgraded our Google Cloud SQL postgres server to a bigger machine and the upgrade is not terminating. In our experience, this usually takes less than 5 minutes, but we'ven been waiting for about 1.5 hours now and nothing is happening. There are no logs after the server shut down(except for failed connection attempts). We cannot switch to the failover, because there is already an operation in progress (namely the upgrade that's causing the problem in the first place). Restarting is disabled because the operation is in progress. It seems like there's nothing we can do right now, except maybe apply the last backup, though we're not sure if that's even possible while an operation is in progress.
Is there anything we can do to restart the DB or fix the problem?
When you upgrade a CloudSQL server, the instance is rebooted. It can happen occasionally that rebooting takes more than expected, which seems to be what happened to your server, but this is not unexpected behaviour.
This being said, be sure to check the status of the CloudSQL service. And if upgrades get stuck too often or never finish, contact support.
To reduce the chances of having this issue again:
Configure High Availability for your instance, so it has failover capability.
Make sure that the maintenance window of failover replicas is different from that of the master instance. To change the maintenance schedule, on the GCP console, go to SQL, click on an instance, and "Edit maintenance schedule"->"Set maintenance schedule". Then choose a window.

Slow replication recovery due to communication problems

We had lately several times the same problems on Google compute engine environment with PostgreSQL streaming replication and I would like to understand reasons and if I can repair it in some smoother way.
From time to time we see some communication problems in Google's internal network in GCE datacenter and they always trigger replication lags between our PG master and its replicas. All machines are Debian-8 and PostgreSQL 9.5.
When situation happens everything seems to be OK - no errors in PG logs on master or replicas just communication between master and replicas seems to be incredibly slow or repeatedly failing so new WAL logs are transfered to replicas with big delays and therefore replication lag is still growing.
Restart of replication from within PostgreSQl or restart of PostgreSQL on replica does not really help - after several WAL logs copied using scp in recovery command communication is back in previous incredibly slow status. Only restart of the whole instance help. When whole VM is restarted communication is back to normal and recovery even from lag many hours long is done in a few minutes. So main reason for this behavior seems to be on OS level. I tried to check net traffic but without finding anything significant. I also do not see anything relevant in any OS log.
Could restart of some OS service help? So I do not need to restart the whole VM? Thank you very much for any ideas.