Do you need to VACUUM SORT on Redshift Materialized Views? - amazon-redshift

I cannot seem to find any documentation that indicates how the sort order of an incrementally updated materialized view is maintained. Given the lack of docs, I assume this is just taken care of on REFRESH.
Does anyone know whether you should be running VACUUM SORT on materialized views?

I would do so, to be safe. The Materialized View Refresh documentation also mentions that autorefresh can be stopped automatically by Redshift internal processes. We can also see some misleading information, such as the vacuum_sort_benefit column being NULL for the view.
But after running VACUUM SORT ONLY my-mv-view, where my-mv-view is the name returned by svv_table_info, the view did show improvements in its sort.
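A minimal sketch of those checks (my_mv_view stands in for whatever name svv_table_info reports for the view's backing table):
select "table", unsorted, vacuum_sort_benefit
from svv_table_info
where "table" like '%my_mv_view%';

vacuum sort only my_mv_view;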
Vacuuming PostgreSQL materialized views is also suggested because of frozen transaction IDs; this behaviour applies to versions after 9.4. More in the Routine Vacuuming section of the official Postgres documentation.
I hope this helps!
Cheers!

Related

REFRESH MATERIALIZED VIEW suddenly taking more time to complete

We have a materialized view in our Postgres DB (11.12, managed by AWS RDS). We have a scheduled task that updates it every 5 minutes using REFRESH MATERIALIZED VIEW <view_name>. At some specific point last week, the time needed to refresh the view suddenly went from ~1s to ~20s. The view contains ~70k rows, with around 15 columns, all of them being integers, booleans or UUIDs.
Prior to this:
There were no changes in the server configuration.
There were no changes to the view itself. In fact, running EXPLAIN ANALYZE <expression used to create the view> shows that the query still executes in less than a second. If the query is run with a client like Postico, it takes ~20s to fetch all the results (roughly consistent with the time needed to materialize it, although we assume this is due to network transmission).
There were no changes in the schema or any significant record increase in the contents of the tables needed to compute the view.
RDS Performance Insights indicates that the query is mostly using CPU resources.
I know this is probably not enough to get a solution, but:
Are there any server performance metrics or logs that could help us better understand this situation?
Is this just the normal time the server needs to persist the view to disk? If so, any idea why it recently started to take so long?
Here is a link to the execution plan.
EDIT: creating another materialized view with the same JOINs but selecting fewer columns performs as expected (~1s).
EDIT 2: setting enable_nestloop = false greatly speeds up the REFRESH operation (same performance as before). Would this indicate that refactoring the underlying query could solve the issue?
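For reference, the workaround from EDIT 2 can be scoped to just the refresh transaction with SET LOCAL (the view name here is a placeholder):
begin;
set local enable_nestloop = off;
refresh materialized view my_view;
commit;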
Try REFRESH MATERIALIZED VIEW CONCURRENTLY.
When you refresh data for a materialized view, PostgreSQL locks the entire table, so you cannot query data against it. To avoid this, you can use the CONCURRENTLY option.
REFRESH MATERIALIZED VIEW CONCURRENTLY view_name;
With the CONCURRENTLY option, PostgreSQL creates a temporary updated version of the materialized view, compares the two versions, and performs INSERT and UPDATE operations for only the differences.
You can query against the materialized view while it is being updated. One requirement for using the CONCURRENTLY option is that the materialized view must have a UNIQUE index.
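A rough sketch of that requirement (view_name and the id column are placeholders; the index has to be a plain unique index on columns, covering all rows):
create unique index on view_name (id);
refresh materialized view concurrently view_name;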
Original poster here. This is more than a year old, but here's what happened and how we eventually fixed it.
TLDR:
REFRESH MATERIALIZED VIEW <query> started to take much longer than executing the query used to construct the view (~1s vs ~20s).
A couple of weeks after this question was asked, the query itself started to behave similarly (taking ~20s to complete). At that point, EXPLAIN ANALYZE started to show indications of performance issues with the query. So we ended up optimising the underlying query (the biggest performance gain came from replacing some JOINs with a CTE).
After this, both the REFRESH MATERIALIZED VIEW <query> and the standalone query behaved correctly (execution time < 1s).
A question that remains open is why REFRESH MATERIALIZED VIEW <query> and the standalone query performed differently for a while. Was the query planner choosing different plans depending on whether it was going to materialize the view or not? If someone knows whether such a thing is possible, please comment.
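A generic illustration of that kind of rewrite, with invented table and column names (the actual query was different): aggregate one side in a CTE first, then join against the pre-aggregated result instead of joining and aggregating the raw tables directly.
with order_totals as (
    select customer_id, sum(amount) as total_amount
    from orders
    group by customer_id
)
select c.id, c.name, t.total_amount
from customers c
left join order_totals t on t.customer_id = c.id;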
Refreshing the materialized view every time (or every 5 minutes) is not a good way to do it; it defeats the purpose of using a materialized view in the first place. Let me explain one of the approaches I came up with, based on my own experience, so you can find a more optimal way later. Assume our materialized view uses two tables, and we only want to refresh it when data in one of those two tables has changed. To do this, whenever one of the tables is updated or rows are deleted, we insert one record into a bookkeeping table (for example refresh_materialized; you can also use a trigger, as sketched after the queries below), and that record drives the refresh of the materialized view.
For example:
insert into refresh_materialized
  (refresh_status, insert_date, executed_date)
values
  (false, now(), null);
And so in our scheduled task we can use this query:
select count(*) from refresh_materialized
where refresh_status = false
If the count(*) is greater than 0, we must refresh the materialized view; otherwise we do nothing. After refreshing the materialized view we must update this table:
update refresh_materialized
set
refresh_status = true,
executed_date = now()
where
refresh_status = false;
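A hypothetical sketch of the trigger variant mentioned above; refresh_materialized matches the table used in the queries, while source_table and the function name are placeholders:
create or replace function queue_mv_refresh() returns trigger as $$
begin
    insert into refresh_materialized (refresh_status, insert_date, executed_date)
    values (false, now(), null);
    return null;  -- AFTER statement-level trigger, return value is ignored
end;
$$ language plpgsql;

create trigger source_table_queue_mv_refresh
after insert or update or delete on source_table
for each statement
execute procedure queue_mv_refresh();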

Postgresql 12 needs manual analyzing for performance

I have a performance drop once every couple of days. When I manually analyze a table, the performance is back to normal. The table to be analyzed has 2.7M records. The query is very complex and uses a lot of different tables, unions and joins. The view I query just aggregates some data from another view. If I query the main (aggregating) view, it takes about 1.5-3.5 secs. If I query the view one level higher, it takes only 0.2s.
This issue appeared when migrating from 9.5 to 12.3. Analyzing one specific table solves it for a couple of days.
Autoanalyze never kicks in automatically, so Postgres apparently sees no need to analyze. But the query planner goes wrong somehow. I've increased default_statistics_target in the config to 1500.
We never had this issue on 9.5. Maybe it has something to do with JIT, but disabling it in the session does not seem to solve the issue.
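For reference, the manual fix and a per-column alternative to raising the global statistics target look roughly like this (table and column names are placeholders):
analyze my_big_table;

-- or raise the statistics target only for the column(s) the planner misestimates
alter table my_big_table alter column join_key set statistics 1500;
analyze my_big_table;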

PostgreSQL how to find what is causing Deadlock in vacuum when using --jobs parameter

How do you find, in PostgreSQL 9.5, what is causing a deadlock error/failure when running a full vacuumdb over the database with the --jobs option to run the vacuum in parallel?
I just get some process numbers and table names... How can I prevent this so I can successfully run a full vacuum over the database in parallel?
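For context, the invocation looks roughly like this (database name and job count are placeholders); the vacuumdb documentation notes that combining --jobs with --full can cause deadlock failures when certain system catalogs are processed in parallel:
vacuumdb --full --jobs=4 mydb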
Completing a VACUUM FULL under load is a pretty hard task. The problem is that Postgres is compacting the space taken by the table, so any concurrent data manipulation interferes with that.
To achieve a full vacuum you have these options:
Lock access to the vacuumed table. I'm not sure whether acquiring an exclusive lock will help, though; you may need to prevent access to the table at the application level.
Use a create-new-table / swap (rename tables) / move-data / drop-original technique, as sketched below. This way you do not compact space under the original table; you free it by simply dropping the old table. Of course you have to rebuild all indexes, redirect FKs, etc.
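A rough sketch of that second option, with bloated_table as a placeholder (foreign keys pointing at the table still have to be redirected by hand):
begin;
create table bloated_table_new (like bloated_table including all);
insert into bloated_table_new select * from bloated_table;
alter table bloated_table rename to bloated_table_old;
alter table bloated_table_new rename to bloated_table;
commit;
drop table bloated_table_old;  -- once nothing depends on the old copy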
Another question is: do you need VACUUM FULL at all? The only thing it does that VACUUM ANALYZE does not is compact the table on the file system. If you are not very short on disk space, you do not need a full vacuum that often.
Hope that helps.

ALTER query very slow on tiny table in PostgreSQL

I've got PostgreSQL 9.2 and a tiny database with just a bit of seed data for a website that I'm working on.
The following query seems to run forever:
ALTER TABLE diagnose_bodypart ADD description text NOT NULL;
diagnose_bodypart is a table with less than 10 rows. I've let the query run for over a minute with no results. What could be the problem? Any recommendations for debugging this?
Adding a column does not require rewriting the table (unless you specify a DEFAULT). Absent any locks, it is a quick operation. pg_locks is the place to check, as Craig pointed out.
In general the most likely cause is long-running transactions. I would look at which workflows are hitting these tables and how long their transactions stay open. Locks of this sort are typically transactional, so committing those transactions will usually fix the problem.
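A sketch of the kind of pg_locks check mentioned above, joining against pg_stat_activity to see which session holds the conflicting lock (works on 9.2; exact columns vary slightly across versions):
select blocked.pid        as blocked_pid,
       blocked_act.query  as blocked_query,
       blocking.pid       as blocking_pid,
       blocking_act.query as blocking_query
from pg_locks blocked
join pg_stat_activity blocked_act on blocked_act.pid = blocked.pid
join pg_locks blocking
  on blocking.locktype = blocked.locktype
 and blocking.database is not distinct from blocked.database
 and blocking.relation is not distinct from blocked.relation
 and blocking.pid <> blocked.pid
join pg_stat_activity blocking_act on blocking_act.pid = blocking.pid
where not blocked.granted
  and blocking.granted;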

How to discover what a Postgresql table's usage is?

Our team is working on a Postgresql database with lots of tables and views, without any referential constraints. The project is undocumented and there appears to be a great number of unused/temporary/duplicate tables/views dirtying the schema.
We need to discover what database objects have real value and are actually used and accessed.
My initial thought was to query the catalog / 'data dictionary'.
Is it possible to query the PostgreSQL catalog to find an object's last query time?
Any thoughts, alternative approaches and/or tool ideas?
Check the Statistics Collector
I'm not sure about the last query time, but you can adjust your postgresql.conf to log all SQL:
log_min_duration_statement = 0
That will at least give you an idea of current activity.
Reset the statistics with pg_stat_reset(), and then check the statistics views such as pg_stat_user_tables to see where activity shows up.
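A sketch of that reset-and-observe approach: reset once, let the application run for a while, then see which tables are actually being read or written.
select pg_stat_reset();

-- later:
select relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
from pg_stat_user_tables
order by seq_scan + coalesce(idx_scan, 0) desc;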