Retention management for time-series table in Redshift - amazon-redshift

I have a table that I migrate from Aurora to Redshift using DMS. The table is insert-only and holds a lot of timestamped data.
I would like to keep a trimmed version of that table in Redshift.
The idea was to partition it and use a retention script to keep just the last 2 months. However, Redshift has no partitions; what I found instead is the time-series table pattern, which sounds similar. If I understand it correctly, my table should look like:
create table public."bigtable"(
"id" integer NOT NULL DISTKEY,
"date" timestamp,
"name" varchar(256)
)
SORTKEY(date);
However, I can't find good documentation on how retention is managed. I would appreciate any corrections and advice :)

A couple of ways this is typically done in Redshift.
For small to medium tables the data can just be DELETEd and the table VACUUMed (usually a delete-only vacuum). Redshift is very good at handling large amounts of data, so this works fine even for fairly large tables. There is some overhead for the delete and vacuum, but if these are scheduled during off hours it works just fine and is simple.
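A minimal sketch of such a retention script, assuming Python only builds the SQL and some Redshift client (not shown) runs it. The function names, the table name public.bigtable and the choice to cut at a month boundary are illustrative assumptions, not something the answer prescribes.

```python
from datetime import date
from typing import List


def month_cutoff(today: date, keep_months: int) -> date:
    """First day of the month that lies `keep_months` months before today's month."""
    total = today.year * 12 + (today.month - 1) - keep_months
    return date(total // 12, total % 12 + 1, 1)


def retention_statements(table: str, ts_column: str, today: date,
                         keep_months: int = 2) -> List[str]:
    """Build the DELETE and delete-only VACUUM statements for one retention run."""
    cutoff = month_cutoff(today, keep_months)
    return [
        # Remove every row older than the cutoff date.
        f"DELETE FROM {table} WHERE {ts_column} < '{cutoff.isoformat()}';",
        # Reclaim the space of the deleted rows without re-sorting the table.
        f"VACUUM DELETE ONLY {table};",
    ]


if __name__ == "__main__":
    for stmt in retention_statements("public.bigtable", '"date"', date.today()):
        print(stmt)
```

Note that Redshift does not allow VACUUM inside a transaction block, so the second statement has to be sent outside one (for example with autocommit enabled).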
When the table in question gets really big, or there are no low-workload windows in which to perform the delete and vacuum, people set up "month" tables for their data and use a view that UNION ALLs these tables together. Then "removing" a month is just redefining the view and dropping the unneeded table. This is very low effort for Redshift to perform but is a bit more complex to set up. Your incoming data needs to be put into the correct table based on month, so it is no longer just a copy from Aurora. This approach also simplifies UNLOADing the old tables to S3 for history capture purposes.
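The monthly rotation could be scripted roughly as follows. This is a sketch under assumptions: the month tables are named like public.bigtable_2024_01, they already exist with identical columns, and the view name public.bigtable_recent is hypothetical. Python only builds the statements; running them against the cluster is left out.

```python
from typing import List, Tuple


def month_window(year: int, month: int, count: int) -> List[Tuple[int, int]]:
    """The `count` most recent (year, month) pairs ending at the given month, oldest first."""
    last = year * 12 + (month - 1)
    return [(i // 12, i % 12 + 1) for i in range(last - count + 1, last + 1)]


def month_table(base: str, year: int, month: int) -> str:
    """Name of the table holding one month of data (naming scheme is an assumption)."""
    return f"{base}_{year:04d}_{month:02d}"


def rotate_statements(base: str, view: str, year: int, month: int,
                      keep: int = 3) -> List[str]:
    """Redefine the UNION ALL view over the kept months, then drop the expired month."""
    kept = month_window(year, month, keep)
    # The month that just fell out of the window is the one right before the oldest kept.
    expired = month_window(year, month, keep + 1)[0]
    selects = "\nUNION ALL\n".join(
        f"SELECT * FROM {month_table(base, y, m)}" for y, m in kept
    )
    return [
        # Redefine the view first so it no longer depends on the expired table.
        f"CREATE OR REPLACE VIEW {view} AS\n{selects};",
        f"DROP TABLE IF EXISTS {month_table(base, *expired)};",
    ]


if __name__ == "__main__":
    for stmt in rotate_statements("public.bigtable", "public.bigtable_recent", 2024, 1):
        print(stmt)
```

With keep=3 the view covers the current month plus the two before it. If the old month should be archived, an UNLOAD of the expired table would go before the DROP.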

Related

Creating denormalized tables with triggers too slow

Assume I'm doing everything in one PostgreSQL database. I have 10 source tables that I use to build one huge denormalized table. These source tables change frequently and have triggers that fire after insert/update/delete to modify the denormalized table in near real time. The problem is that some of the source tables I'm joining are huge (one has 120M rows and another 25M), and the statements that insert new rows into the denormalized table run for a long time (20+ minutes for 50-100k rows).
So, what would be the best way to apply changes (insert/update/delete) to this denormalized table, based on changes arriving in the source tables? Should I run these operations on a schedule, should I dedicate a specific database replica just for this, or should I keep trying to use triggers?
I'm open to using a totally different approach, as long as it's doable on the same database.
That sounds like there is no good and simple solution.
Perhaps you don't need that one huge denormalized table, and denormalizing a few attributes would be good enough for your query speed.
If not, you will probably need a kind of data warehouse for the denormalized data, and refresh that daily with increments. Ideally, tables there are already pre-aggregated.

Should I migrate to Redshift?

I'm currently struggling to query a big chunk of data that is stored in a partitioned table (one partition per date).
The data looks like this:
date, product_id, orders
2019-11-01, 1, 100
2019-11-01, 2, 200
2019-11-02, 1, 300
I have hundreds of date-partitions and millions of rows per date.
Now, if I want to query, for instance, the total orders for product IDs 1 and 2 over a period of 2 weeks, grouped by date (to show in a graph per date), the DB has to visit 2 weeks of partitions and fetch the data from them.
That process can take a long time when the number of products is big or the required time frame is long.
I have read that AWS Redshift is suitable for this kind of task. I'm considering moving my partitioned tables (aggregated analytics per date) to that technology, but I wonder whether that is really what I should do to make those queries run much faster.
Thanks!
For your use case, Redshift is a good choice.
To get the best performance out of Redshift, it is very important to set a proper distribution key and sort key. In your case the "date" column should be the distribution key and "product_id" should be the sort key. Another important note: do not encode the "date" and "product_id" columns.
You should get better performance.
If you are struggling with traditional SQL databases, then Amazon Redshift is certainly an option. It can handle tables with billions of rows.
This would involve loading the data from Amazon S3 into Redshift. This will allow Redshift to optimize the way that the data is stored, making it much faster for querying.
Alternatively, you could consider using Amazon Athena, which can query the data directly from Amazon S3. It understands data that is partitioned into separate directories (eg based on date).
Which version of PostgreSQL are you using?
Are you using native partitioning or trigger-based inheritance partitioning?
Recent versions of PostgreSQL have improved partition management.
For your case Amazon Redshift can be a good choice, and so can Amazon Athena. But it is also important to consider your application framework. Are you moving to Amazon only for the database, or do you plan to use other Amazon services too?
Also, before making the decision, please check the cost of Redshift.

Postgres table partitioning based on table name

I have a table that stores information about weather for specific events and for specific timestamps. I do insert, update and select (more often than delete) on this table. All of my queries query on timestamp and event_id. Since this table is blowing up, I was considering doing table partitioning in postgres.
I could also think of having multiple tables and naming them "table_< event_id >_< timestamp >" to store specific timestamp information, instead of using postgres declarative/inheritance partitioning. But, I noticed that no one on the internet has done or written about any approach like this. Is there something I am missing?
I see that in postgres partitioning, the data is both kept in master as well as child tables. Why keep in both places? It seems less efficient to do inserts and updates to me.
Is there a generic limit on the number of tables when postgres will start to choke?
Thank you!
re 1) Don't do it. Why reinvent the wheel when the Postgres devs have already done it for you by providing declarative partitioning?
re 2) You are mistaken. The data is only kept in the partition to which it belongs. It just looks as if it is stored in the "master".
re 3) There is no built-in limit, but anything beyond a "few thousand" partitions is probably too much. It will still work, but query planning in particular will be slower. And sometimes query execution might also suffer because runtime partition pruning is not as efficient any more.
Given your description, you probably want hash partitioning on the event ID, with range sub-partitions on the timestamp (so each event partition is in turn partitioned by timestamp range).
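To make the hash-then-range layout concrete, here is a small sketch that generates the declarative partitioning DDL (PostgreSQL 11 or later, where hash partitioning is available). The table name weather, the columns event_id, ts and payload, and the partition naming scheme are all assumptions made for illustration; the original question does not give the schema.

```python
from typing import List


def partition_ddl(parent: str, hash_parts: int, bounds: List[str]) -> List[str]:
    """Build DDL for a table hash-partitioned on event_id, each part range-partitioned on ts.

    `bounds` is an ordered list of range boundaries (e.g. month starts);
    every consecutive pair becomes one range sub-partition.
    """
    stmts = [
        # Hypothetical column list; replace with the real schema.
        f"CREATE TABLE {parent} (event_id bigint NOT NULL, ts timestamp NOT NULL, "
        f"payload text) PARTITION BY HASH (event_id);"
    ]
    for remainder in range(hash_parts):
        hash_part = f"{parent}_h{remainder}"
        # First level: one partition per hash remainder, itself partitioned by range.
        stmts.append(
            f"CREATE TABLE {hash_part} PARTITION OF {parent} "
            f"FOR VALUES WITH (MODULUS {hash_parts}, REMAINDER {remainder}) "
            f"PARTITION BY RANGE (ts);"
        )
        for lo, hi in zip(bounds, bounds[1:]):
            # Second level: leaf partition covering [lo, hi).
            leaf = f"{hash_part}_{lo[:7].replace('-', '_')}"
            stmts.append(
                f"CREATE TABLE {leaf} PARTITION OF {hash_part} "
                f"FOR VALUES FROM ('{lo}') TO ('{hi}');"
            )
    return stmts


if __name__ == "__main__":
    for stmt in partition_ddl("weather", 2, ["2024-01-01", "2024-02-01", "2024-03-01"]):
        print(stmt)
```

The total number of leaf partitions is the number of hash partitions times the number of ranges, so both should stay small enough to remain well under the "few thousand" mentioned above.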

Full Load in Redshift - DROP vs TRUNCATE

As part of a daily load in Redshift, I have a couple of tables that I drop and fully reload (the data size is small, less than 1 million).
My question is which of the below two strategies is better in terms of CPU utilization and memory in Redshift:
1) Truncate data
2) DROP and Recreate Table.
If I truncate the tables, should I vacuum them every day? I have read that frequently dropping and recreating tables in the database causes fragmentation of pages.
Also, I would like to enable compression on one of the tables. Is there any downside to running DDL with encoding every day?
Please advise! Thank you!
If you drop the tables, you will lose the permissions assigned to them. If you have views on these tables, they will become obsolete.
Truncate is the better option: it does not require vacuum or analyze, and it is built for use cases like this.
For further info, see the Redshift TRUNCATE documentation.

No merge tables in MariaDB - options for very large tables?

I've been happily using MySQL for years, and have followed the MariaDB fork with interest.
The server for one of my projects is reaching end of life and needs to be rehosted, likely to CentOS 7, which includes MariaDB.
One of my concerns is the lack of the merge table feature, which I use extensively. We have a very large (at least by my standards) data set with on the order of 100M records/20 GB (with most data compressed) and growing. I've split this into read-only compressed MyISAM "archive" tables organized by data epoch, and a regular MyISAM table for current data and inserts. I then span all of these with a merge table.
The software working against this database is then written such that it figures out which table to retrieve data from for the timespan in question, and if the timespan spans multiple tables, it queries the overlying merge table.
This does a few things for me:
Queries are much faster against the smaller tables (unfortunately, the index needed for the most typical query and for preventing duplicate records is relatively complicated)
Frees the user from having to query multiple tables and assemble the results when a query spans multiple tables
Allowing > 90% of the data to reside in the compressed tables saves a lot of disk space
I can back up the archive tables once - this saves tremendous time, bandwidth and storage on the nightly backups
Any suggestions for how to handle this without merge tables? Does any other table type offer the compressed, read-only option that MyISAM does?
I'm thinking we may have to go with separate tables, and live with the additional complication and changing all the code in the multiple projects which use this database.
MariaDB 10 introduced the CONNECT storage engine that does a lot of different things. One of the table types it provides is TBL, which is basically an expansion of the MERGE table type. The TBL CONNECT type is currently read only, but you should be able to just insert into the base tables as needed. This is probably your best option but I'm not very familiar with the CONNECT engine in general and you will need to do a bit of experimentation to decide if it will work.