Refresh materialized view on PostgreSQL 11 on RDS - postgresql

We are currently on Postgres 11.13 on AWS RDS. I am trying to create a materialized view whose defining query takes about 6-7 minutes to run. What is the best way to keep this MV mostly up to date? I was thinking of pg_cron, but I believe that is only available on RDS for PG 12.5+. A trigger on the underlying tables could be an option, but these tables either do not get updated at all or receive a very large number of inserts in a short period of time, and I don't want to trigger excessive refreshes.
Any suggestions for this scenario?
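Since pg_cron isn't an option on that RDS version, one common approach is to schedule the refresh from outside the database (a cron job on an EC2 host, a Lambda on a schedule, etc.) and make the refresh itself non-blocking. A minimal sketch, assuming a hypothetical view my_matview with an id column; note that CONCURRENTLY requires a unique index on the view:

    -- One-time setup: REFRESH ... CONCURRENTLY requires a unique index on the view.
    CREATE UNIQUE INDEX ON my_matview (id);

    -- Run this on whatever schedule the external job provides; CONCURRENTLY lets
    -- readers keep querying the view while it is rebuilt.
    REFRESH MATERIALIZED VIEW CONCURRENTLY my_matview;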

Related

Create materialized views from Google Cloud PostgreSQL in BigQuery

We have a database in Google Cloud SQL which has partitioned tables created with the same naming format every month.
I would like to create and update materialized views from these partitioned tables for the last few days (updated every 5-10 mins) and for the last few weeks or months (updated every hour or daily).
Is using BigQuery the right way of doing this? I couldn't find a way to pull in the data in some of the ingestion guides.
If I do it in the PostgreSQL instance itself, what is the best way to update them (I have created materialized views in the past but didn't update them)?

performance of refreshing postgres materialized view

I am exploring materialized views to create a denormalized view, to avoid joining multiple tables for read performance. APIs will read from the materialized views to provide data to clients.
I am using Amazon Aurora Postgres (version 11).
I am using a unique index on the materialized view (MV) so that I can use the “refresh concurrently” option.
What I am noticing, though, is that when only a fraction of the rows get updated in one of the source tables and I try to refresh the view, it's pretty slow; in fact, slower than populating the view for the first time. E.g., populating the MV for the first time takes ~30 minutes, while a refresh takes more than an hour, even though less than 1% of the rows have been updated. The three main tables involved in generating the MV have ~18 million, 27 million and 40 million rows.
The timeliness of the materialized view refresh is important so that data is not stale for too long.
I could go with custom tables to store the denormalized data instead of materialized views, but I would have to implement the logic to refresh the data, so I am planning to avoid that if possible.
Is there anything that can be done to speed up the refresh process of the materialized views?
Please let me know if you need more details.
thanks
Kiran
You can create a second materialised view and refresh it (not concurrently), and then swap the names of the two views in a transaction.
I actually have no idea why postgres didn't implement CONCURRENTLY this way.
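For illustration, a minimal sketch of that name-swap approach; the view name mv_report and its defining query are hypothetical:

    -- Shadow copy with the same defining query as mv_report.
    CREATE MATERIALIZED VIEW mv_report_new AS
    SELECT o.customer_id, sum(o.amount) AS total
    FROM   orders o
    GROUP  BY o.customer_id
    WITH NO DATA;

    -- Plain (non-concurrent) refresh: a wholesale rebuild of the shadow copy.
    REFRESH MATERIALIZED VIEW mv_report_new;

    -- Swap the names in one transaction so readers always see a complete view.
    BEGIN;
    ALTER MATERIALIZED VIEW mv_report     RENAME TO mv_report_old;
    ALTER MATERIALIZED VIEW mv_report_new RENAME TO mv_report;
    COMMIT;

    DROP MATERIALIZED VIEW mv_report_old;

Keep in mind that indexes and grants belong to the individual views, so the shadow copy needs its own copies of them.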
Refreshing a materialized view is slow even if little has changed, because every time the view is refreshed, the defining query is run.
Using CONCURRENTLY makes the operation even slower, because it is not a wholesale replacement of the materialized view contents, but modification of the existing data.
Perhaps you could create a denormalized table that is updated by a trigger whenever the underlying tables are modified.
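A minimal sketch of that idea, assuming hypothetical orders and order_totals tables: a regular table is maintained incrementally by a row-level trigger instead of being rebuilt by a full refresh.

    -- Denormalized/aggregated table maintained by the trigger below.
    CREATE TABLE order_totals (
        customer_id bigint  PRIMARY KEY,
        total       numeric NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION maintain_order_totals() RETURNS trigger AS $$
    BEGIN
        -- Remove the old row's contribution on UPDATE/DELETE ...
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            UPDATE order_totals
               SET total = total - OLD.amount
             WHERE customer_id = OLD.customer_id;
        END IF;
        -- ... and add the new row's contribution on INSERT/UPDATE.
        IF TG_OP IN ('INSERT', 'UPDATE') THEN
            INSERT INTO order_totals (customer_id, total)
            VALUES (NEW.customer_id, NEW.amount)
            ON CONFLICT (customer_id)
            DO UPDATE SET total = order_totals.total + EXCLUDED.total;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_order_totals
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION maintain_order_totals();

The trade-off is extra work on every write to the source tables.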
Maybe I'm a bit late to the party, but if you were on Postgres 13 or later you could try this extension: https://github.com/sraoss/pg_ivm
They have some limitations but promise much faster rebuild times.
Here's a bit more on it from pganalyze: https://pganalyze.com/blog/5mins-postgres-15-beta1-incremental-materialized-views-pg-ivm (contrary to the project authors, they want you to be on v14 or later)
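For reference, a rough sketch of how the extension is used, assuming it is installed and the server meets its version requirements; mytab and the view name are hypothetical, and the exact function name/schema may differ between pg_ivm releases:

    CREATE EXTENSION IF NOT EXISTS pg_ivm;

    -- Creates an "incrementally maintainable materialized view"; later inserts,
    -- updates and deletes on mytab are applied to it automatically instead of
    -- requiring a full REFRESH.
    SELECT create_immv('mytab_immv', 'SELECT id, amount FROM mytab');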

How to see changes in a postgresql database

My PostgreSQL database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different from RDS snapshots, which are user-initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see the difference between RDS snapshots. But in the past we tested several solutions for a similar problem; maybe you can take some inspiration from them.
The obvious solution is of course an auditing system. This way you can see in a relatively simple way what was changed; depending on the granularity of your auditing system, down to individual column values. Of course there is an impact on your application due to the auditing triggers and the queries into the audit tables.
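As a rough illustration of that auditing option (all names hypothetical), a simple audit table plus a row-level trigger that records each change with the old and new row as JSON:

    CREATE TABLE audit_log (
        logged_at  timestamptz NOT NULL DEFAULT now(),
        table_name text        NOT NULL,
        operation  text        NOT NULL,
        old_row    jsonb,
        new_row    jsonb
    );

    CREATE OR REPLACE FUNCTION audit_changes() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO audit_log (table_name, operation, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO audit_log (table_name, operation, old_row, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
        ELSE  -- DELETE
            INSERT INTO audit_log (table_name, operation, old_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    -- One such trigger per audited table; "accounts" is a placeholder.
    -- (Use EXECUTE PROCEDURE instead on PostgreSQL 10 or older.)
    CREATE TRIGGER trg_audit
    AFTER INSERT OR UPDATE OR DELETE ON accounts
    FOR EACH ROW EXECUTE FUNCTION audit_changes();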
Another possibility: for tables with primary keys you can store the values of the primary key and of the 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before the update, and compare them with the values after the update. This way you can only identify changed / inserted / deleted rows, though, not which columns changed.
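A minimal sketch of that snapshot-and-compare idea, assuming a hypothetical accounts table with an id primary key:

    -- Before the nightly update: snapshot the key plus the hidden columns.
    CREATE TABLE accounts_before AS
    SELECT id, xmin AS xmin_before, ctid AS ctid_before
    FROM   accounts;

    -- After the update: classify rows as inserted / deleted / updated.
    SELECT COALESCE(a.id, b.id) AS id,
           CASE
               WHEN b.id IS NULL                        THEN 'inserted'
               WHEN a.id IS NULL                        THEN 'deleted'
               WHEN a.xmin::text <> b.xmin_before::text THEN 'updated'
           END AS change
    FROM   accounts a
    FULL JOIN accounts_before b ON b.id = a.id
    WHERE  b.id IS NULL
       OR  a.id IS NULL
       OR  a.xmin::text <> b.xmin_before::text;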
You can also set up a streaming replica using replication slots (and, to be on the safe side, WAL archiving as well). Then stop replication on the replica before the updates and compare the data after the updates using dblink queries. But these queries can be very heavy.
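And a rough sketch of that replica-comparison idea once WAL replay on the replica has been paused; the connection string and the accounts table are placeholders, and dblink must be installed where the query runs:

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- Run on the primary: rows that are new or changed relative to the frozen
    -- replica snapshot (swap the two sides of EXCEPT to find deletions).
    SELECT id, payload
    FROM   accounts                       -- current state on the primary
    EXCEPT
    SELECT *
    FROM   dblink('host=replica.example.com dbname=mydb user=readonly',
                  'SELECT id, payload FROM accounts')
           AS frozen(id bigint, payload text);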

Slow insert and update commands during mysql to redshift replication

I am trying to make a replication server from MySQL to Redshift; for this, I am parsing the MySQL binlog. For the initial replication I take a dump of the MySQL table, convert it into a CSV file, upload it to S3 and then use the Redshift COPY command. The performance of this part is efficient.
After the initial replication, for the continuous sync, when I read the binlog the inserts and updates have to be run sequentially, which is very slow.
Is there anything that can be done to increase the performance?
One possible solution I can think of is to wrap the statements in a transaction and send the transaction at once, to avoid multiple network calls. But that would not address the problem that single UPDATE and INSERT statements in Redshift run very slowly; a single UPDATE statement is taking 6 s. Knowing the limitations of Redshift (it is a columnar database and single-row insertion will be slow), what can be done to work around those limitations?
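For what it's worth, the usual way to work around that limitation is micro-batching: accumulate a batch of binlog changes, COPY them into a staging table, and apply them with one set-based merge instead of row-by-row statements. A rough sketch; the table, S3 path and IAM role are placeholders:

    -- Staging table with the same structure as the target.
    CREATE TEMP TABLE stage_orders (LIKE orders);

    -- Bulk-load the accumulated batch of changed rows from S3.
    COPY stage_orders
    FROM 's3://my-bucket/batches/orders_batch.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV;

    -- Merge (upsert) the batch in one pass: delete the old versions of the
    -- changed rows, then insert the new versions.
    BEGIN;
    DELETE FROM orders
    USING  stage_orders
    WHERE  orders.id = stage_orders.id;

    INSERT INTO orders
    SELECT * FROM stage_orders;
    COMMIT;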
Edit 1:
Regarding DMS: I want to use Redshift as a warehousing solution which just replicates our MySQL continuously; I don't want to denormalise the data, since I have 170+ tables in MySQL. During ongoing replication, DMS shows many errors multiple times a day and fails completely after a day or two, and it's very hard to decipher the DMS error logs. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates new tables and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table and then switch the old one with the new one and delete the old table.
Here is what you need to do to get DMS to work:
1) Create and run a DMS task with "migrate and ongoing replication" and "Drop tables on target".
2) This will probably fail; do not worry. "Stop" the DMS task.
3) On Redshift, make the following changes to the tables:
   - change all dates and timestamps to varchar (because the options used by DMS for the Redshift COPY cannot cope with the '0000-00-00 00:00:00' dates that you get in MySQL);
   - change all bool columns to varchar, due to a bug in DMS.
4) On DMS, modify the task to "Truncate" in "Target table preparation mode".
5) Restart the DMS task (full reload).
Now the initial copy and ongoing binlog replication should work.
Make sure you are on the latest replication instance software version.
Make sure you have followed the instructions here exactly
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is Aurora, also make sure you have set binlog_checksum to "none" (this is poorly documented).
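On the MySQL/Aurora source side, it can also save time to confirm the binlog settings DMS expects before restarting the task; a quick check from a MySQL session (expected values per the linked DMS documentation):

    SHOW VARIABLES LIKE 'binlog_format';     -- DMS requires ROW
    SHOW VARIABLES LIKE 'binlog_row_image';  -- should be FULL
    SHOW VARIABLES LIKE 'binlog_checksum';   -- NONE for Aurora sources, as noted above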

Postgres materialized views with hot standby

We are using Postgres (9.3) Hot Standby to build a read-only copy of a database. We have a UI which reads from a materialized view.
When we try to read from the materialized view in the standby database, the query hangs.
The materialized view takes ~10 seconds to rebuild in the master database. We've waited over 30 minutes for the query in the standby database and it seems to never complete.
Notably, the materialized view does exist in the standby database. We can't refresh it, of course, since the DB is read-only.
We can't find anything in the documentation which indicates that materialized views can't be used in standby databases, but that appears to be the case.
Has anyone got this to work, and/or what is the recommended work-around?
According to PostgreSQL Documentation - Hot Standby, there is a way to handle query conflicts by assigning proper values to max_standby_archive_delay and max_standby_streaming_delay, which define the maximum allowed delay in WAL application. In your case a high value may be preferable.
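For illustration, a minimal sketch of where those settings live on a self-managed standby (on a managed service they would be set through its parameter mechanism instead); the 30-minute value is only an example:

    -- postgresql.conf on the standby (example values; -1 means wait forever
    -- for conflicting standby queries to finish before applying WAL):
    --   max_standby_archive_delay   = 1800s
    --   max_standby_streaming_delay = 1800s

    -- Check the values currently in effect from a session on the standby:
    SHOW max_standby_archive_delay;
    SHOW max_standby_streaming_delay;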