How to prevent duplicate entries in Amazon Redshift

I am working on a Redshift project. I don't want to load the same row twice, but Redshift does not enforce uniqueness constraints. Is there another way to prevent duplicates?
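(Not part of the original question, but for context: a common Redshift workaround is to land incoming rows in a staging table and only insert the ones whose key is not already in the target. The table and column names below are made up for illustration.)

-- Hypothetical tables: public.target and public.stage share the same layout and a key column "id"
BEGIN;
INSERT INTO public.target
SELECT s.*
FROM public.stage s
LEFT JOIN public.target t ON t.id = s.id
WHERE t.id IS NULL;   -- skip rows whose key already exists in the target
COMMIT;
TRUNCATE public.stage;   -- TRUNCATE commits implicitly in Redshift, so keep it outside the transaction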

Related

CDC Migration from AWS RDS to AWS Redshift

How can I migrate my whole database, currently in AWS RDS Postgres, to AWS Redshift, and also keep the two databases in sync? If any column is updated in RDS, the change must be reflected in Redshift as well.
I know we can achieve this with AWS Glue, but the above scenario is mandatory in my case. The migration task itself is easy; the CDC part is a bit challenging. I am also aware of the bookmark key, but my situation is a bit different: I do not have any sequential column in the tables, only an updated_at field in every table. This column is the only field I can use to check whether a record has already been processed, so that duplicate processing does not occur and any newly inserted data is also replicated to Redshift.
So, would anyone help me do this, even with a PySpark script?
Thanks.
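(Not part of the original question, but for reference: when only an updated_at column is available, the Redshift-side apply step is often a staged upsert keyed on the primary key, with updated_at used as the extraction watermark on the source side. The table and column names below are illustrative.)

-- Hypothetical tables: public.orders (target) and public.orders_stage (same layout), both with "id" and "updated_at"
BEGIN;
-- Drop the target rows that have a newer version in the staging table
DELETE FROM public.orders
USING public.orders_stage s
WHERE public.orders.id = s.id;
-- Insert the fresh versions
INSERT INTO public.orders
SELECT * FROM public.orders_stage;
COMMIT;
TRUNCATE public.orders_stage;   -- TRUNCATE commits implicitly in Redshift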

Is there a recommended table size for partitioning in PostgreSQL?

I use Amazon Aurora PostgreSQL on AWS (RDS).
The data I manage is very large (about 7 billion rows, 4 TB), so I am considering table partitioning.
(I also considered the Citus extension for PostgreSQL... but unfortunately it is not available on AWS...)
Queries against that table are very slow...
So I applied table partitioning (10 partitions) and query performance improved somewhat, but it is still slow.
The site below recommends 'Tables bigger than 2GB should be considered.', but at that threshold I would end up with far too many partitions, which seems difficult to manage.
https://hevodata.com/learn/postgresql-partitions/#t10
What would be the appropriate table size?
And is the pg_partman extension required in this case?
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL_Partitions.html
Is there any other way to improve query performance other than partitioning if there is too much data in the table?
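(For reference, not from the question: declarative range partitioning by month in PostgreSQL looks roughly like this; the table and column names are made up, and pg_partman simply automates creating and dropping such partitions.)

-- Hypothetical parent table partitioned by a timestamp column
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

-- One partition per month
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Index the partition key (and frequently filtered columns) on each partition
CREATE INDEX ON events_2024_01 (created_at);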

Retention management for time-series table in Redshift

I have a table that I migrate from Aurora to Redshift with DMS. The table is insert-only, with a lot of data keyed by timestamp.
I would like to keep a trimmed version of that table in Redshift.
The idea was to partition it and use a retention script to keep only the last 2 months. However, Redshift has no partitions; what I found instead is the time-series table pattern, which sounds similar. If I understand it correctly, my table should look like:
create table public."bigtable" (
    "id"   integer NOT NULL DISTKEY,
    "date" timestamp,
    "name" varchar(256)
)
SORTKEY(date);
However, I can't find good documentation on how the retention is managed. I would welcome any corrections and advice :)
There are a couple of ways this is typically done in Redshift.
For small to medium tables, the data can simply be DELETEd and the table VACUUMed (usually a delete-only vacuum). Redshift is very good at handling large amounts of data, and even for fairly large tables this works fine. There is some overhead for the delete and vacuum, but if these are scheduled during off hours it works well and is simple.
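A minimal sketch of that first approach, using the table from the question (the 2-month window is just an example):

-- Remove rows older than the retention window, then reclaim the space
DELETE FROM public."bigtable"
WHERE "date" < dateadd(month, -2, current_date);
VACUUM DELETE ONLY public."bigtable";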
When the table in question gets really big, or there is no low-workload window in which to run the delete and vacuum, people set up per-month tables and a view that UNION ALLs them together. "Removing" a month is then just redefining the view and dropping the unneeded table. This is very cheap for Redshift to perform, but it is a bit more complex to set up: incoming data has to be routed into the correct table based on its month, so the load is no longer a plain copy from Aurora. This approach also simplifies UNLOADing the old tables to S3 for history-keeping purposes.
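A rough sketch of that second approach, assuming per-month tables named bigtable_YYYY_MM (the names are illustrative; a late-binding view created WITH NO SCHEMA BINDING avoids dependency errors when a table is dropped):

-- The view presents the retained months as one logical table
CREATE OR REPLACE VIEW public.bigtable_v AS
SELECT * FROM public.bigtable_2024_05
UNION ALL
SELECT * FROM public.bigtable_2024_06
WITH NO SCHEMA BINDING;

-- To age out a month: redefine the view without it, then UNLOAD (optional) and drop the table
CREATE OR REPLACE VIEW public.bigtable_v AS
SELECT * FROM public.bigtable_2024_06
WITH NO SCHEMA BINDING;
DROP TABLE public.bigtable_2024_05;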

Slow INSERT and UPDATE commands during MySQL to Redshift replication

I am building a replication pipeline from MySQL to Redshift by parsing the MySQL binlog. For the initial load, I take a dump of the MySQL table, convert it to a CSV file, upload it to S3, and then use the Redshift COPY command. The performance of this step is fine.
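For context, that initial-load step typically looks roughly like this (the bucket, key and IAM role below are placeholders, not from the question):

-- Bulk-load the CSV dump from S3 into the target table
COPY public.mytable
FROM 's3://my-bucket/mysql-dump/mytable.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV
GZIP;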
After the initial load, during continuous sync, the inserts and updates read from the binlog have to be run sequentially, which is very slow.
Is there anything that can be done to increase performance?
One possible solution I can think of is to wrap the statements in a transaction and send the whole transaction at once, to avoid multiple network round trips. But that would not address the underlying problem that single UPDATE and INSERT statements in Redshift are very slow; a single UPDATE statement takes about 6 s. Knowing the limitations of Redshift (it is a columnar database, so single-row inserts are slow), what can be done to work around them?
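One hedged illustration of the batching idea mentioned above: instead of issuing one statement per binlog event, changes can be accumulated and applied as one delete-and-insert per batch inside a single transaction (table, columns and values are made up):

BEGIN;
-- Remove the current versions of the keys touched by this batch
DELETE FROM public.mytable WHERE id IN (101, 102, 103);
-- Insert the latest versions as one multi-row statement
INSERT INTO public.mytable (id, name, updated_at) VALUES
    (101, 'alice', '2024-06-01 10:00:00'),
    (102, 'bob',   '2024-06-01 10:00:05'),
    (103, 'carol', '2024-06-01 10:00:09');
COMMIT;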
Edit 1:
Regarding DMS: I want to use Redshift as a warehousing solution that simply replicates our MySQL continuously; I don't want to denormalise the data, since I have 170+ tables in MySQL. During ongoing replication, DMS reports many errors several times a day and fails completely after a day or two, and its error logs are very hard to decipher. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates new ones, and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table, switch the old one with the new one, and then delete the old table.
Here is what you need to do to get DMS to work:
1) Create and run a DMS task with "migrate and ongoing replication" and "Drop tables on target".
2) This will probably fail; do not worry. "Stop" the DMS task.
3) On Redshift, make the following changes to the tables (a sketch of how to change a column's type appears after these notes):
Change all dates and timestamps to varchar (because the options DMS uses for the Redshift COPY cannot cope with the '00:00:00 00:00' dates you get from MySQL).
Change all bools to varchar, due to a bug in DMS.
4) On DMS, modify the task to "Truncate" in "Target table preparation mode".
5) Restart the DMS task (full reload).
Now the initial copy and ongoing binlog replication should work.
Make sure you are on the latest replication instance software version.
Make sure you have followed the instructions here exactly:
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is Aurora, also make sure you have set binlog_checksum to "none" (this is poorly documented).
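Regarding step 3: Redshift cannot change a column's type in place (other than widening a varchar), so the usual pattern is to add a new column, copy the values, drop the old column, and rename. A hedged sketch with a made-up table and column:

-- Convert a timestamp column to varchar before re-running the DMS task
ALTER TABLE public.mytable ADD COLUMN created_at_str varchar(32);
UPDATE public.mytable SET created_at_str = created_at::varchar;
ALTER TABLE public.mytable DROP COLUMN created_at;
ALTER TABLE public.mytable RENAME COLUMN created_at_str TO created_at;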

Most efficient way to extract tables from Redshift?

I have a large table (~1e9 rows, ~20 columns) in an AWS Redshift instance. I would like to extract the entire table through the PostgreSQL interface in order to pipe the data into another columnar store. Ideally, columns would be extracted one at a time while maintaining identical row ordering, as that would simplify a lot of the work on the receiving (columnar) end.
How can I ensure that the series of SQL queries stays exactly aligned row-for-row? Thanks!
PS: I am aware of the UNLOAD-through-S3 option, but I am looking for a PostgreSQL option.
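(Not from the thread, but the usual way to keep separate per-column extractions aligned is to order every query by the same unique key, since Redshift gives no ordering guarantee without ORDER BY. The table and column names below are illustrative and assume a unique id column exists.)

-- Each per-column extraction orders by the same unique key so row positions match
SELECT id, col_a FROM public.big_table ORDER BY id;
SELECT id, col_b FROM public.big_table ORDER BY id;
-- Or page through consistently with key-range predicates
SELECT id, col_a FROM public.big_table WHERE id > 0 AND id <= 1000000 ORDER BY id;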