redshift cluster update killing query speed - amazon-redshift

Every night I run a SQL build script in Redshift that takes about 30 minutes. This has been consistent for over a year.
After an Amazon cluster update last night, the script now takes 6 hours!
Has anyone had this problem before?
Any ideas?
thanks!

Based on the change list in https://forums.aws.amazon.com/ann.jspa?annID=3619 the only thing I see that might affect you is that you now have to specify a sort threshold on vacuums, so your tables might not be getting re-sorted, which could drastically hurt performance. Other than that, I'd run EXPLAIN on your queries and see which steps consume the most resources. It could also be that growing tables have crossed some threshold of skew across slices.
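If the sort threshold is the culprit, the nightly script could state it explicitly. A minimal sketch, assuming Redshift's `VACUUM ... TO <n> PERCENT` form; the helper names and the table names are hypothetical, and this only builds the SQL text, it doesn't run anything:

```python
import re

# Accept only plain, optionally schema-qualified identifiers so that
# table names can be interpolated into the statement text safely.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def vacuum_statement(table: str, threshold_percent: int = 100) -> str:
    """Build a VACUUM FULL statement with an explicit sort threshold."""
    if not _IDENT.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    return f"VACUUM FULL {table} TO {threshold_percent} PERCENT;"


def build_script(tables):
    """Emit a VACUUM followed by an ANALYZE for each table."""
    lines = []
    for table in tables:
        lines.append(vacuum_statement(table))
        lines.append(f"ANALYZE {table};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical table names, purely for illustration.
    print(build_script(["public.sales", "public.customers"]))
```

Forcing 100 percent on every table can be expensive, so it's probably worth checking which tables are actually unsorted before adding this to a nightly job.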

Related

Tracking Postgres Autovacuums

We've been experimenting with tweaking the autovacuum thresholds on some of our larger tables, because otherwise autovacuum never runs on them and they build up tens of thousands of dead tuples. Using a query I found somewhere on SO that looks at the pg_stat_user_tables view, I'm able to see the last run time and the number of runs for the autovacuum, but I can't seem to find a history of the events. We're trying to keep track of how often they are running to get some idea of where a good threshold is, so that sort of info would be useful. Is there another table available for this?
There is no table with the history (unless you have created one, of course, or deployed some monitoring system that did it for you, but I don't know of such a system). You can set log_autovacuum_min_duration to zero; then you will have a record going forward in your log files.
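Once the log has those entries, a small script can turn them into a per-table history. A rough sketch, assuming the entries contain the text `automatic vacuum of table "db.schema.table"`; the sample lines, their timestamp prefix (which depends on log_line_prefix) and the function name are made up for illustration, and variants such as wraparound vacuums are not matched:

```python
import re
from collections import Counter

# Matches the table name in an autovacuum log entry.
# "automatic analyze" entries are deliberately not counted.
AUTOVACUUM_RE = re.compile(r'automatic vacuum of table "([^"]+)"')


def count_autovacuums(log_lines):
    """Count autovacuum runs per table from an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = AUTOVACUUM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative lines only; real entries carry more detail after the name.
SAMPLE_LOG = [
    '2024-01-01 02:00:01 UTC LOG:  automatic vacuum of table "mydb.public.orders": index scans: 1',
    '2024-01-01 02:00:05 UTC LOG:  automatic analyze of table "mydb.public.orders"',
    '2024-01-01 08:30:12 UTC LOG:  automatic vacuum of table "mydb.public.users": index scans: 0',
    '2024-01-01 14:45:40 UTC LOG:  automatic vacuum of table "mydb.public.orders": index scans: 1',
]

if __name__ == "__main__":
    for table, runs in count_autovacuums(SAMPLE_LOG).most_common():
        print(table, runs)
```

To use it on a real file, pass the open file object instead of `SAMPLE_LOG`.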

When and why should I trigger pg_stat_reset()?

I am trying to understand how to monitor and tune PostgreSQL performance. I started by exploring pg_stat_all_tables and pg_stat_statements to gather information about live tuples, dead tuples, last autovacuum time, etc. There was some useful information in n_live_tup (close to the real row count of the table) and n_dead_tup until I ran pg_stat_reset(). After that I got some strange results: n_live_tup is lower than n_dead_tup. I can't find any articles or docs about why and when (some use cases) I should run pg_stat_reset(). Can somebody explain this to me or point me to some useful resources?
It is ok to run pg_stat_reset() occasionally, like once per month, to get a fairly up-to-date view on what is going on in your database.
But don't do that too often, as there is a downside to it: the autovacuum process, which the system depends on, relies on these statistics, so you will miss a couple of autovacuum (and autoanalyze) runs if you do that. That may or may not be a problem in your database, but at any rate I wouldn't do it too often. If you can, manually VACUUM and ANALYZE the database after calling pg_stat_reset().
There is no such problem with pg_stat_statements_reset(), so run that as often as you please.
The best thing for you would be to have monitoring software that checks the values of the statistics regularly and shows you how they develop (the differences from the previous run). Then you never have to reset the statistics and still have a good overview of what is going on.
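The "differences from the previous run" idea is simple enough to sketch. A minimal example, assuming you periodically save the cumulative counters from pg_stat_user_tables (for instance n_tup_upd and autovacuum_count) as a dict per table; the function name and the sample numbers are invented for illustration:

```python
def stat_deltas(previous, current):
    """Return per-table differences between two snapshots of cumulative counters.

    Each snapshot maps a table name to a dict of counter name -> value.
    Tables missing from the previous snapshot are treated as starting at zero.
    """
    deltas = {}
    for table, counters in current.items():
        before = previous.get(table, {})
        deltas[table] = {
            name: value - before.get(name, 0) for name, value in counters.items()
        }
    return deltas


# Made-up snapshots taken some time apart.
yesterday = {"orders": {"n_tup_upd": 1000, "autovacuum_count": 3}}
today = {
    "orders": {"n_tup_upd": 1800, "autovacuum_count": 5},
    "users": {"n_tup_upd": 40, "autovacuum_count": 0},
}

if __name__ == "__main__":
    for table, delta in stat_deltas(yesterday, today).items():
        print(table, delta)
```

This only makes sense for counters that keep growing; a value like n_dead_tup is a current estimate rather than a running total, so you'd chart it directly instead of diffing it.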

Run vacuum by schedule

I'm using Postgres version 9.6
Most of my tables are used for queries, updates and inserts.
Most of them have around 200K-700K rows.
Some are bigger (millions) and some are smaller.
Is it a good idea to run VACUUM (and ANALYZE?) once a day? Once a week? Regardless of whether autovacuum is running?
Advantages vs disadvantages?
Autovacuum runs when needed: it removes dead tuples and, through its analyze step, refreshes the statistics that are used when planning a query.
Basically you never need to do this manually, unless you have made vast changes to a table (filled it with data for example), and want to use it in another query within a few milliseconds. In that scenario, old statistics will result in the query planner coming up with a very bad query plan and will lead to a significantly slower query.
What you might want to do once per day / per week, or whatever, is to cluster tables, recreate degraded indexes, on tables that were modified a lot. Research these topics more to decide if / when / how to do it.
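To get a feel for what "when needed" means, the documented trigger is that a table is vacuumed once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (number of rows). A small sketch using the default settings (50 and 0.2); the function names are mine, and per-table overrides would change the numbers:

```python
def autovacuum_trigger_point(reltuples, threshold=50, scale_factor=0.2):
    """Dead tuples a table must exceed before autovacuum vacuums it.

    Defaults mirror autovacuum_vacuum_threshold and
    autovacuum_vacuum_scale_factor in a stock PostgreSQL configuration.
    """
    return threshold + scale_factor * reltuples


def needs_vacuum(n_dead_tup, reltuples, threshold=50, scale_factor=0.2):
    """True once the dead-tuple count exceeds the trigger point."""
    return n_dead_tup > autovacuum_trigger_point(reltuples, threshold, scale_factor)


if __name__ == "__main__":
    # Table sizes taken from the question: 200K, 700K and a few million rows.
    for rows in (200_000, 700_000, 5_000_000):
        print(f"{rows} rows -> vacuum after about "
              f"{autovacuum_trigger_point(rows):.0f} dead tuples")
```

With the defaults, a multi-million-row table has to accumulate a lot of dead tuples before anything happens, which is the usual reason for lowering the scale factor on big tables rather than scheduling manual vacuums.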

Postgres running slow after indexing finished

My Postgres has been running really slowly lately; an aggregation over a month usually took more than 1 minute (to be exact, the last one took 7 mins and 23 secs).
Last Friday I recreated the servers (master and replica) and reimported the database.
The first thing I noticed is that the database went from 133gb to 42gb (the actual data is around 12gb; I guess the rest is the indexes).
Everything was fast as hell for a day; after that the indexing finished (26gb of indexes) and now I'm back to square one.
A count on ~5 million rows takes 3 mins 42 secs.
I made the autovacuum more aggressive and it looks like it's doing its job now, but the DB is still slow.
I am using the DB for an API, so it's constantly growing. At the moment I have 2 tables, one with around 5 mil rows and the other with 28 mil.
I'd expect some performance loss on the master, since it has a lot of activity, but I don't expect it on the replica.
What's curious is that after a restart it's really fast for an hour or so.
Another thing I noticed is that on every query I run, the IO is at 100% while the memory and CPU are almost not used at all.
Any help would be greatly appreciated.
Update
Same database on a smaller machine works like a charm.
Same queries, same indexes.
The only difference is the traffic: the test machine isn't writing or updating that much.
Also, I forgot to mention one thing: one of my indexes is clustered.
The live machine is a 5 core with 64gb and 3k IO.
The test machine is a 2 core with 4gb and an SSD.
Update
Found my issue.
Apparently the autovacuum can't get a lock, and by the time it gets one the number of dead tuples has increased.
I made the autovacuum more aggressive for now and deleted a bunch of unused indexes.
I still don't know how to fix the lock issue, though.
Update
Looks like something is increasing the estimated row count.
Since my last update here the row count increased by 2 mil.
I guess that by tomorrow the row count will be again around 12 mil and the count will be slow as hell again.
Could this be related to autovacuum?
Update
Well found my issue.
Looks like postgres is losing a lot of speed on a write intensive database.
Had a column that was used as a flag and updated a lot of times per day.
Everything looks really good after the flag and update was removed.
Any clue on how to fix this issue on a write intensive table?
Maybe the following pointers will help:
Are you really sure you want to run a 5 mil row aggregation for an API? Every time? Can't you split the data into chunks such that only a small number of chunks actually receive most of the new rows (so the aggregation of all the previous chunks can be reused for the next query)? Time is one such measure, serial numbers could be another, etc. If so, partitioning the data is an obvious solution you should investigate; it has a good chance of giving you sub-second query times (assuming you store the aggregations for previous chunks smartly).
A hunch about that first-hour magic is that although this data fits in RAM, concurrent querying pushes the data set out, and then it's purely disk I/O. In that case, CPU and RAM being idle isn't a surprise.
Finally, I think this setup is asking for a redesign: there is only so much you can do with a single SQL query, and expecting sub-second query times on a 5 mil row data set that is not in RAM is probably too optimistic!
(Nonetheless, do post your findings, if possible)
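The chunking idea from the first pointer could look roughly like this. A minimal in-memory sketch, assuming monthly chunks and a simple sum as the aggregate; the class and method names are hypothetical, and in practice the cached totals would live in a summary table rather than a Python dict:

```python
from datetime import date


def chunk_key(day: date) -> str:
    """Month-sized chunk identifier, e.g. '2024-03'."""
    return f"{day.year:04d}-{day.month:02d}"


class ChunkedAggregate:
    """Caches the sum of closed chunks so only the open chunk is re-scanned."""

    def __init__(self):
        self.closed_totals = {}  # chunk key -> cached sum

    def close_chunk(self, key, rows):
        """Aggregate a chunk that will no longer receive rows, once."""
        self.closed_totals[key] = sum(value for _, value in rows)

    def total(self, open_rows):
        """Cached totals of closed chunks plus a scan of the open chunk."""
        return sum(self.closed_totals.values()) + sum(v for _, v in open_rows)


if __name__ == "__main__":
    agg = ChunkedAggregate()
    january = [(date(2024, 1, 5), 10), (date(2024, 1, 20), 15)]
    february = [(date(2024, 2, 3), 7)]
    agg.close_chunk(chunk_key(january[0][0]), january)
    agg.close_chunk(chunk_key(february[0][0]), february)

    # Only the current month's rows are scanned at query time.
    march_so_far = [(date(2024, 3, 1), 4), (date(2024, 3, 2), 6)]
    print(agg.total(march_so_far))
```

The point is that query cost then depends on the size of the open chunk, not on the whole table.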

Can CouchDB handle 15 million records daily?

I'm relatively new to NoSQL databases and I have to evaluate different NoSQL-Solutions for a monitoring tool.
The situation is the following:
One datum is only about 100 bytes, but there are really a lot of them. During a day we get about 15 million records... So I'm currently testing with 900 million records (about 15GB as a SQL insert script).
My question is: does CouchDB fit my needs? I need to do range queries (on the date the records were created) and sum up some of the columns according to groups defined by "secondary indexes" stored in the datum.
I know that MapReduce is probably the best solution to calculate that, but is the JavaScript of CouchDB able to do this in an acceptable time?
I already tried MongoDB, but its really poor MapReduce did a crappy job... I also read about HBase and Cassandra. But maybe CouchDB is also a good possibility.
I hope I gave you all the needed information... Thank you for your help!
andy
Frankly, at this time, unless you have very good hardware, Apache CouchDB may run into problems. Map/reduce will probably be fine. CouchDB's incremental map/reduce is ideal for your requirements.
As a developer, you will love it! Unfortunately as a sysadmin, you may notice more disk usage and i/o than expected.
I suggest trying it. Being HTTP and JavaScript, it's easy to do a feasibility test. Just remember, the initial view build will take a long time (let's assume for argument's sake that it takes longer than every other competing database). But that time will never be spent again. Map/reduce runs only once per document (actually per document update).
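To illustrate the "once per document update" point, here is a toy model of an incremental view. It's written in Python rather than CouchDB's JavaScript, it is not CouchDB's actual implementation (which keeps rows and partial reduce results in a B-tree), and the field names `created`, `group` and `value` are invented for the example:

```python
class IncrementalView:
    """Toy incremental map/reduce: map runs only for documents that change."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}  # doc id -> rows emitted by the map function
        self.map_calls = 0     # how many times the map function has run

    def update(self, doc_id, doc):
        """Index a new or changed document; other documents are untouched."""
        self.map_calls += 1
        self.rows_by_doc[doc_id] = list(self.map_fn(doc))

    def sum_by_group(self, start_day, end_day):
        """Range query on the date, summing values per group."""
        totals = {}
        for rows in self.rows_by_doc.values():
            for (day, group), value in rows:
                if start_day <= day <= end_day:
                    totals[group] = totals.get(group, 0) + value
        return totals


def by_day_and_group(doc):
    # Emits ((day, group), value), loosely like emit([day, group], value).
    yield (doc["created"], doc["group"]), doc["value"]


if __name__ == "__main__":
    view = IncrementalView(by_day_and_group)
    view.update("a", {"created": "2024-01-01", "group": "cpu", "value": 3})
    view.update("b", {"created": "2024-01-02", "group": "cpu", "value": 4})
    view.update("c", {"created": "2024-01-02", "group": "disk", "value": 9})
    # Changing one document re-runs map for that document only.
    view.update("b", {"created": "2024-01-02", "group": "cpu", "value": 5})
    print(view.sum_by_group("2024-01-01", "2024-01-02"), view.map_calls)
```

Unlike this sketch, which recomputes the sums at query time, the real thing also stores intermediate reduce values, so range queries with grouping stay cheap after the initial build.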