How to properly handle control records in Aurora-DMS-Kinesis-Redshift pipeline?

I am really stuck here and have spent the last 2 days researching this topic. I have the following data sync pipeline with the full-load-and-cdc migration type configured:
[Aurora MySQL RDS]->[DMS]->[Kinesis Streams]->[Kinesis Firehose]->S3 (intermediate)->[Redshift]
When I start the DMS migration task, besides the JSON data files for the source table, the pipeline also delivers JSON control records to the intermediate S3 bucket, including even the JSON for creating the awsdms_apply_exceptions control table. Redshift then, in turn, tries to load these JSON files from S3 and fails with this error:
Error 1213: "Missing data for not-null field"
This happens, I believe, because Redshift tries to parse the JSON control records as source-table data records. My questions are:
1. Is it correct that the JSON for control tables (and other tables' DDL) is delivered by Firehose to the intermediate S3 bucket? When I had an [Aurora MySQL RDS]->[DMS]->[Firehose] pipeline before, I didn't see any DDL delivered to S3 - only data CSV files.
2. If #1 is not correct, how can I ensure that only the source table data files (which Redshift can successfully load) are pushed to the intermediate S3 bucket?
3. And if the JSON with DDL is not going through a Kinesis Stream, how would DMS communicate to Redshift that, for example, a new column was added?
I appreciate any input as I have completely run out of ideas at this point. Thank you very much.
Here is an example of the control records delivered by Firehose to S3 that make Redshift error out with error 1213 above:
{
  "metadata": {
    "timestamp": "2023-01-09T18:59:13.214656Z",
    "record-type": "control",
    "operation": "create-table",
    "partition-key-type": "task-id",
    "schema-name": "public",
    "table-name": "awsdms_apply_exceptions"
  }
}{
  "metadata": {
    "timestamp": "2023-01-09T18:59:13.872312Z",
    "record-type": "control",
    "operation": "create-table",
    "partition-key-type": "task-id",
    "schema-name": "epulse",
    "table-name": "add_voter_preference_options"
  }
}
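One workaround I'm considering, but haven't verified end to end, is a Firehose data-transformation Lambda that drops anything with "record-type": "control" before it reaches S3. A rough sketch (the handler and details are mine, not from my actual setup):

import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: keep DMS data records, drop control records."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        try:
            # Each incoming record should be a single DMS JSON message
            record_type = json.loads(payload).get("metadata", {}).get("record-type")
        except json.JSONDecodeError:
            record_type = None
        if record_type == "control":
            # e.g. the create-table records for awsdms_apply_exceptions shown above
            output.append({"recordId": record["recordId"], "result": "Dropped"})
        else:
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                # newline-delimit the kept records in the resulting S3 object
                "data": base64.b64encode((payload + "\n").encode("utf-8")).decode("utf-8"),
            })
    return {"records": output}

That said, if control records can be kept out of the stream on the DMS/endpoint side, that would be cleaner.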

Related

REST V2 ingest of JSON events to a S3 bucket (avoid duplicates)

I would like to ask you for help.
I am trying to ingest events in JSON from a source using a REST API (REST V2 connector) in raw format.
The source allows me to pass the parameters "take" and "days" in the headers. The "take" parameter specifies how many records to take; "days" specifies how far back to request events.
The job I have created works fine for Data Ingestion to a database, where I map fields to columns in the database.
I have tried a million things, and these are the two recent problems I am facing when I try to ingest files into a bucket or database in raw format:
For mass ingestion: there are no incremental job options available (for the REST V2 source), so I am getting duplicate records and the ingestion never stops.
Is there a way to stop mass ingestion and avoid duplicates when all records are ingested?
For Data Integration to a DB: Each record/event I attempt to ingest has multiple fields. Since I DON'T want to separate the records (I WANT entire documents in JSON), I pack all fields into an array. The problem is that when I request ten records (or N records), all records get ingested into a single row in the table.
Here is what I mean:
TABLE DB:
ROW1: "array packed" JSON1, JSON2 .... JSNO_N...JSON10 "/array packed"
ROW2: empty
This is what I need (each record in a separate row, in raw format):
TABLE DB:
ROW1: JSON1
ROW2: JSON2
ROWN: JSON_N
ROW10: JSON_10
I was also trying to accomplish this using a Lambda function. The problem with Lambda is that I will have to make sure there are no duplicates (Informatica has this cool option "upsert" that allows me to avoid duplicates).
At the end of the day, I don't care whether this is accomplished using Data Integration, Mass Ingestion, or a Lambda, or whether the ingest goes directly into a DB or S3. For now, I am just trying to find a working solution.
If somebody can come up with some ideas, I would appreciate the help.
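To illustrate the Lambda idea: splitting the array and writing each event under a key derived from its unique id makes re-ingestion idempotent, which is roughly what Informatica's "upsert" gives you. A minimal sketch, assuming every event carries a unique id field (the bucket name and key layout are placeholders):

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-raw-events-bucket"  # placeholder

def write_events(array_packed_json: str) -> int:
    """Split the array-packed payload and write each event as its own S3 object."""
    events = json.loads(array_packed_json)
    for item in events:
        # Keying by the event id means a duplicate overwrites the same object
        # instead of creating a new one - an upsert-like behaviour
        key = f"raw/{item['id']}.json"  # assumes a unique 'id' per event
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(item))
    return len(events)

Loading from that prefix then naturally gives one document per row.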

Accessing Aurora Postgres Materialized Views from Glue data catalog for Glue Jobs

I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres and Redshift databases are in the Glue Catalog, and while I can see the Postgres views as selectable tables, the crawler does not pick up the materialized views.
Currently exploring two options to get the data to Redshift:
Output to Parquet and use COPY to load
Point the materialized view to a JDBC sink specifying Redshift
Wanted recommendations on the most efficient approach if anyone has done a similar use case.
Questions:
In option 1, would I be able to handle incremental loads?
Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions even if through Glue?
Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
Went with option 2. The Redshift COPY/load process writes CSV with a manifest to S3 in any case, so duplicating that is pointless.
Regarding the Questions:
N/A
Job bookmarking does work. There are some gotchas though - ensure connections to both RDS and Redshift are present in the Glue PySpark job, IAM self-referencing rules are in place, and identify a column that is unique [I chose the primary key of the underlying table as an additional column in my materialized view] to use as the bookmark.
Using the primary key of the core table may buy efficiencies in pruning materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using aws glue get-job-bookmark --job-name yourjobname and then use that value in the WHERE clause of the materialized view, as in where id >= idinbookmark.
# Pull JDBC connection details from the Glue catalog connection
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")
# jobBookmarkKeys must be a unique, orderable column (e.g. the underlying table's primary key)
connection_options_source = {"url": conn['url'] + "/yourdB", "dbtable": "table in dB", "user": conn['user'], "password": conn['password'], "jobBookmarkKeys": ["unique identifier from source table"], "jobBookmarkKeysSortOrder": "asc"}
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="postgresql", connection_options=connection_options_source, transformation_ctx="datasource0")
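If you'd rather look the bookmark up from Python instead of the CLI, boto3's Glue client exposes get_job_bookmark; a rough sketch (how you parse the bookmark payload depends on the job and source, so treat the last lines as illustrative):

import json
import boto3

glue = boto3.client("glue")

# Same job name as in the CLI example above
entry = glue.get_job_bookmark(JobName="yourjobname")["JobBookmarkEntry"]

# JobBookmark is a JSON string whose contents depend on the job/source;
# inspect it to pull out the latest value of your bookmark key
state = json.loads(entry["JobBookmark"]) if entry.get("JobBookmark") else {}
print(entry.get("RunId"), state)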
That's all, folks

How do I identify modified rows using AWS DMS to Redshift?

I thought I had a simple question, but I am having a hard time finding an answer. This makes me suspicious that I'm asking the wrong question...
I am new to Redshift... I am using a DMS migration task to pull data from a SQL Server database sitting on an EC2 instance into my Redshift database. I have it set to do a Full Load with Ongoing Replication. This is working.
However, I want to know specifically which rows have changed after ongoing replication makes its updates. It is replicating to my staging tables, and I do further transformations from there depending on certain changes to columns (e.g. history tracking), which is why I need to know what changed. I compare the staging tables to the existing facts and dimensions, but I don't want to compare the entire table, just the modified rows.
The source database is older and I can't trust the modification timestamp columns are always updated. I thought that setting the migration task to truncate the table, then ingesting ongoing changes would leave my staging table with just changed rows. In hindsight, maybe that was a silly thought.
The other route I am considering is to turn on CDC on the source tables, load staging tables on the SQL Server side with the net changes, and point DMS at those tables instead. I was hoping that extra step would not be necessary.
Help is appreciated!
Not sure if you're still looking for an answer on this, but you could always use a transformation rule to flag rows that have changed from DMS, e.g.:
{
  "rule-type": "transformation",
  "rule-id": "1",
  "rule-name": "1",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%"
  },
  "rule-action": "add-column",
  "value": "dms_load_ts",
  "expression": "current_timestamp",
  "data-type": {
    "type": "datetime"
  }
}
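Downstream you can then drive the incremental compare off that column, for example from Python (redshift_connector here, and the cluster/schema/table/watermark values are only placeholders):

import redshift_connector

# Pull only the rows DMS has touched since the last transform run
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="awsuser",
    password="...",  # placeholder
)
cur = conn.cursor()
cur.execute(
    "SELECT * FROM staging.my_table WHERE dms_load_ts > %s",
    ("2023-01-09 00:00:00",),  # last successful ETL watermark, tracked elsewhere
)
changed_rows = cur.fetchall()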

Issue in loading data with Azure Data Factory

I am trying to load a lot of CSV files from Blob Storage to Azure SQL Data Warehouse through Azure Data Factory. As I am dealing with a massive number of rows, the desired approach is to use PolyBase to bulk load the data. When I point the source to one single file, SQL DW PolyBase is displayed as true, but when I point to all CSV files, SQL DW PolyBase is displayed as false. Has anyone experienced this issue?
You could always change "allow PolyBase" to true in the UI.
Or specify this property in the JSON:
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}

Replicate views from Postgres to Postgres on AWS DMS

I'm using AWS DMS to migrate data from one Postgres database to another Postgres database. Everything works fine, except one thing: the views are not replicated on my target database.
I've read that this cannot be done between heterogeneous databases (e.g. from Oracle to Postgres) using DMS, but I imagine that this is possible somehow when we're using the same database engine.
Does someone know how to replicate the views using AWS DMS from Postgres to Postgres?
DMS is a data migration service. A view is a virtual table (represented by SQL code/an object) and it does not contain any data by itself like a table does.
If you aren't looking to migrate the actual DDL, but instead are happy to replicate the data from views in the source to tables in the target (akin to materialising the views in the target), then you can do this, but you need to jump into the JSON configuration and there are some caveats.
Per the Selection rules and actions portion of the DMS documentation, you can add to the object-locator object a table-type field, setting its value to one of view or all to migrate views. It defaults to table which skips views, and there's no configuration field for it in the console aside from modifying the JSON.
For example:
{
  "rule-type": "selection",
  "rule-id": "123456",
  "rule-name": "some-rule",
  "object-locator": {
    "schema-name": "my_schema",
    "table-name": "%",
    "table-type": "view"
  },
  "rule-action": "include",
  "filters": []
}
There are a couple of caveats though:
Not all sources support migrating views. Aurora PostgreSQL doesn't, for instance, and saving the task will give an error if you try, so you may need to replace the endpoint with a vanilla PostgreSQL one.
If you're migrating views, your only option is a full load - CDC isn't going to work, so partial migrations aren't on the cards.