MySQL stops responding

In a production environment, our MySQL 5.6.26 master stops responding.
Our business handles about 1500 transactions per minute, but there are times when nothing gets processed for many seconds (18 seconds in the latest incident).
We log SHOW FULL PROCESSLIST output every few seconds, and during these stalls we see queries that normally take a fraction of a second hanging for many seconds, but with no indication of why.
In the past we had an issue with our storage provider where latency reached almost a second and everything fell apart, but that is not the case now; latency is a normal 5 to 20 milliseconds.
What should I look at?
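
For reference, the processlist logging mentioned above is roughly the following; this is a simplified sketch using PyMySQL with placeholder credentials, not our actual script:

    import time
    import pymysql

    # Simplified sketch of the periodic logging described above; host, user and
    # password are placeholders, not real monitoring credentials.
    conn = pymysql.connect(host="db-master", user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor, autocommit=True)

    while True:
        with conn.cursor() as cur:
            cur.execute("SHOW FULL PROCESSLIST")
            for row in cur.fetchall():
                # Keep only statements that have been running for 2+ seconds.
                if row["Time"] and row["Time"] >= 2:
                    print(row["Time"], row["State"], row["Info"])
            # Capturing SHOW ENGINE INNODB STATUS at the same moments would also
            # show lock waits and pending I/O, which this sketch omits.
        time.sleep(5)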

Related

Bad postgres connections + query never finishes

I have a Python job that drops and recreates a Postgres table every day. The table is roughly the same size each time. The job normally takes around 12 minutes, but on some days it runs for many hours and never finishes.
When I check the connections in pgAdmin, I see some shown in red.
When I kill these connections and restart the job, it works.
Note that I always use a context manager when interacting with the database, so no connections should be dangling.
What steps can I take to diagnose this problem?
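
One detail worth double-checking, assuming the job talks to Postgres through psycopg2 (an assumption on my part): the connection object's context manager only wraps a transaction. It commits or rolls back on exit, but it does not close the connection, so sessions can still pile up. A minimal sketch of the difference:

    import psycopg2

    # Placeholder DSN; the real job's connection settings will differ.
    conn = psycopg2.connect("dbname=mydb user=etl host=db-host")

    with conn:                        # commits or rolls back the transaction on exit...
        with conn.cursor() as cur:
            cur.execute("DROP TABLE IF EXISTS my_table")
            cur.execute("CREATE TABLE my_table (id int)")

    conn.close()                      # ...but the session stays open until this runs

Checking pg_stat_activity while the job is stuck will also show whether those red connections are idle in transaction or waiting on a lock.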

Slow transaction processing in PostgreSQL

I have been noticing bad behavior in my PostgreSQL database, but I still can't find a solution or improvement to apply.
The context is simple: say I have two tables, CUSTOMERS and ITEMS. On certain days the number of concurrent customers increases, and with it the number of requests against items; customers can consult items, or add to and remove from their quantities. In the APM I can see each new request getting slower than the previous one, with the query response from those tables flagged as the biggest time consumer.
A query that normally executes in about 200 milliseconds can take about 20 seconds a few moments later.
I understand how locking works in PostgreSQL: many users may be checking the same item and may even be changing its values. Even so, the response from the database is far too slow.
So I would like to know if there are ways to improve the performance of the database.
I initially used PGTune to generate the starting settings and that worked well. I am on version 11 with 20 GB of RAM, 4 vCPUs and SAN storage; the number of simultaneous customers (not sessions) can reach over 500.
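
One thing worth capturing the next time the slowdown happens is who is waiting on whom; since this is version 11, pg_blocking_pids() is available. A minimal sketch, assuming psycopg2 and a placeholder connection string:

    import psycopg2

    # Placeholder connection string; adjust for the real database.
    conn = psycopg2.connect("dbname=mydb user=monitor host=db-host")

    BLOCKED_SQL = """
        SELECT pid,
               pg_blocking_pids(pid) AS blocked_by,
               wait_event_type,
               now() - query_start   AS waiting_for,
               left(query, 80)       AS query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
        ORDER BY waiting_for DESC;
    """

    with conn, conn.cursor() as cur:
        cur.execute(BLOCKED_SQL)
        for row in cur.fetchall():
            print(row)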

Dataflow TextIO.write issues with scaling

I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes them to a text file in GCS. I found that with lower-traffic topics this worked perfectly; however, when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise.
When kicking off the pipeline I hadn't set the number of workers (as I'd imagined it would auto-scale as necessary). When ingesting this volume of data the number of workers stayed at 1, but TextIO.write() was taking 15+ minutes to write a 2-minute window. The pipeline kept falling further behind until it ran out of memory. Is there a good reason why Dataflow doesn't auto-scale when this step gets so backed up?
When I increased the number of workers to 6, the time to write the files started at around 4 minutes for a 5-minute window, then moved down to as little as 20 seconds.
Also, when using 6 workers, it seems like there might be an issue with how wall time is calculated. Mine never seems to go down even after the pipeline has caught up, and after running for 4 hours my summary for the write step looked like this:
Step summary
Step name: Write to output
System lag: 3 min 30 sec
Data watermark: Max watermark
Wall time: 1 day 6 hr 26 min 22 sec
Input collections: PT5M Windows/Window.Assign.out0
Elements added: 860,893
Estimated size: 582.11 GB
Job ID: 2019-03-13_19_22_25-14107024023503564121
So for each of your questions:
Is there a good reason why Dataflow doesn't auto scale when this step gets so backed up?
Streaming autoscaling is a beta feature and must be explicitly enabled for it to work, per the Dataflow documentation.
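
As far as I can tell from the docs, for the Java SDK that means launching with --autoscalingAlgorithm=THROUGHPUT_BASED together with a --maxNumWorkers cap. As a sketch of the same idea for a Python SDK pipeline (project, region and bucket names are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/region/bucket values; the real pipeline's will differ.
    options = PipelineOptions(
        runner="DataflowRunner",
        streaming=True,
        autoscaling_algorithm="THROUGHPUT_BASED",  # opt in to streaming autoscaling
        max_num_workers=10,                        # ceiling the service may scale up to
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
    )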
When using 6 workers, it seems like there might be an issue for calculating wall time?
My guess would be that you ran your 6-worker pipeline for about 5 hours and 4 minutes; the "Wall time" presented is workers × hours, i.e. compute time aggregated across all workers rather than elapsed time.
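
A quick check of that arithmetic against the numbers in your summary:

    # Reported wall time: 1 day 6 hr 26 min, summed across all workers.
    aggregate_hours = 24 + 6 + 26 / 60
    per_worker_hours = aggregate_hours / 6      # the pipeline ran with 6 workers
    print(round(per_worker_hours, 2))           # ~5.07, i.e. about 5 hr 4 min of elapsed time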

pathological spike in postgresql load average after large delete

I ran two deletes on a PostgreSQL 9.3.12 database against a fairly large table. Each one required a table scan and took roughly 10 minutes to complete.
While they were running, clients weren't impacted. Disk I/O was high, upwards of 70%, but that's fine.
After the second delete finished Disk I/O went to near zero and Load Average shot through the roof. Requests were not being completed in a timely manner and since new requests continued to arrive they all stacked up.
My two theories are:
Something with the underlying I/O layer that caused all I/O requests to block for some period of time, or
Postgres acquired (and held for a non-trivial period of time) a lock needed by clients. Either a global one or one related to the table from which rows were deleted. This table is frequently inserted into by clients; if someone were holding a lock that blocked inserts it would definitely explain this behavior.
Any ideas? Load was in excess of 40, which never happens in our environment even during periods of heavy load. Network I/O was high during/after the deletes but only because they were being streamed to our replication server.
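
If it happens again, the lock theory should be quick to confirm or rule out while load is high; on 9.3, pg_stat_activity still has the boolean waiting column. A minimal sketch, assuming psycopg2 and a placeholder connection string:

    import psycopg2

    # Placeholder connection string; point it at the affected database.
    conn = psycopg2.connect("dbname=mydb user=monitor host=db-host")

    WAITING_SQL = """
        SELECT pid, usename, waiting, state,
               now() - xact_start AS xact_age,
               left(query, 80)    AS query
        FROM pg_stat_activity
        WHERE waiting OR state <> 'idle'
        ORDER BY xact_age DESC NULLS LAST;
    """

    with conn, conn.cursor() as cur:
        cur.execute(WAITING_SQL)
        for row in cur.fetchall():
            print(row)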

PostgreSQL Performance Worsening

I'm running a PostgreSQL database on a Windows VM. Every morning I run some large customer-matching algorithms that complete in 10-15 minutes. Recently, however, the machine has begun to take longer and longer to run the job, creeping all the way up to 6 hours in some cases. When we restart the VM the run time drops back to about 10-15 minutes, but the creep begins again over the course of the next two weeks until we are forced to restart once more.
The machine operates comfortably within the allowed RAM in the first few days after a restart, but appears to consume more and more RAM as the processing time increases.
My question is: what are the most likely causes of this? We are confident that the code is not doing anything silly, but perhaps there is a cache somewhere that grows over time? Any suggestions would be hugely appreciated!
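
Since the growth is gradual and a VM restart clears it, one cheap thing to track over those two weeks is the number and age of backend sessions; each backend keeps its own caches and working memory, so sessions that accumulate or stay open for weeks are one possible source of creeping RAM use. A sketch of such a check, assuming psycopg2 (connection details are placeholders):

    import psycopg2

    # Placeholder connection string; point it at the affected instance.
    conn = psycopg2.connect("dbname=mydb user=monitor host=db-host")

    SESSIONS_SQL = """
        SELECT state,
               count(*)                   AS sessions,
               max(now() - backend_start) AS oldest_backend,
               max(now() - xact_start)    AS longest_transaction
        FROM pg_stat_activity
        GROUP BY state
        ORDER BY sessions DESC;
    """

    with conn, conn.cursor() as cur:
        cur.execute(SESSIONS_SQL)
        for row in cur.fetchall():
            print(row)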