AWS Redshift: How to run copy command from Apache NiFi without using firehose? - amazon-redshift

I have flow files with data records in it. I'm able to place it on S3 bucket. From there on I want to run COPY command and update command with joins to achieve MERGE / UPSERT operation. Can anyone suggest ways to solve this as firehose only executes copy command and I can't make UPSERT / MERGE operation as prescribed by AWS docs directly, so has to copy into staging table and update or insert using some conditions.

There are a number of ways to do this but I usually go with a lambda function run every 5 minutes or so that takes the data put in Redshift from firehose and merges it with existing data. Redshift likes to run on larger "chunks" of data and it is most efficient if you build up some size before performing these operations. The best practice is to move the data from the firehose target in an atomic operation like ALTER TABLE APPEND and use this new table as the source for merging. This is so firehose can keep adding data while the merge is in process.

Related

PySaprk- Perform Merge in Synapse using Databricks Spark

We are having a tricky situation while performing ACID operation using Databricks Spark .
We want to perform UPSERT on a Azure Synapse table over a JDBC connection using PySpark . We are aware of Spark providing only 2 mode for writing data . APPEND and OVERWRITE (only these two use full in our case) . So based these two mode we thought of below options:
We will write whole dataframe into a stage table . And we will use this stage table to perform MERGE operation( ~ UPSERT )with final Table .Stage table will be truncated / dropped after that .
We Will bring target table data into Spark also. Inside Spark We will perform MERGE using Delta lake and will generate a final Dataframe .This dataframe will be written back to Target table in OVERWRITE mode.
Considering the cons. sides..
in Option 1 , We have to use two table just to write the final data. And In,case both Stage and target tables are big , then performing MERGE operation inside Synapse is another herculean task and May take time .
in option 2 ,We have to bring the Target table into Spark in-memory. Even though network IO is not much of our concern as both Databricks and Synpse will be in same Azure AZ, It may leads to memory issue in Spark side.
Is there any other feasible options ?? Or any recommendation ??
Answer would depend on many factors not listed in your question. It's a very open ended question.
(Given the way your question is phrased I'm assuming you're using Dedicated SQL Pools and not an On-demand Synapse)
Here are some thoughts:
You'll be using spark cluster's compute in option 1 and Synapse' compute in option 2. Compare cost.
Pick the lower cost.
Read and write to/from Spark to/from Synapse using their driver uses Datalake as stage. I.e. while reading a table from Synapse into a datafrmae in Spark, driver will first make Synapse export data to Datalake (as parquet IIRC) and then read the files in Datalake to create the Dataframe. This scales nicely if you're talking about 10s or million or billions of rows. But the overhead could become a performance overhead if row counts are low (10-100s of thousands).
Test and pick the faster one.
Remember that Synapse is not like a traditional MySQL or SQL-Server. It's an MPP DB.
"performing MERGE operation inside Synapse is another herculean task and May take time" is a wrong statement. It scales just like a Spark cluster.
It may leads to memory issue in Spark side, yes and no. One one hand all data isn't going to be loaded into a single worker node. OTOH yes, you do need enough memory for each node to do it's own part.
Although Synapse can be scaled up and down dynamically, I've seen it take up to 40 minutes to complete a scale up. Databricks on the other hand is fully on-demand and you can probably get away with turning on cluster, do upsert, shutdown cluster. With Synapse you'll probably have other clients using it, so may not be able to shut it down.
So with Synapse either you'll have to live with 40-80 minutes down time for each upsert (scale up, upsert, scale down), OR
pay for high DWU flat-rate all the time, though your usage is high only when you upsert but otherwise it's pretty low.
Lastly, remember that MERGE is in preview at the time of writing this. Means no Sev-A support cases/immediate support if something breaks in your prod because you're using MERGE.
You can always use DELETE + INSERT instead. Assumes the delta you receive has all columns from target table and not just updated ones.
Did you try creating checksum to do merge upsert only for rows that have actual data change?

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (postgres) to a CSV file periodically (once a day for example) and each time the DB rows are exported it must not be exported in the next export task.
I'm currently using a POST request to perform the export task using cloud scheduler. The problem here (or at least until I know) is that it won't be able to export and delete (or update the rows to mark them as exported) in a single http export request.
Is there any possibility to delete (or update) the rows which have been exported automatically with any Cloud SQL parameter in the http export request?
If not, I assume it should be done it a cloud function triggered by a pub/sub (using scheduler to send data once a day to pub/sub) but, is there any optimal way to take all the ID of the rows retrieved from the select statment (which will be use in the export) to delete (or update) them later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been writing and flushed to disk, before the DELETE is allowed to commit? Or the reverse, foo.txt is partially (or fully) written, but a crash prevents DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
You could use a set up to achieve what you are looking for: 
1.Create a Cloud Function to extract the information from the database that subscribes to a Pub/Sub topic.
2.Create a Pub/Sub topic to trigger that function.
3.Create a Cloud Scheduler job that invokes the Pub/Sub trigger.
4.Run the Cloud Scheduler job.
5.Then create a trigger which activate another Cloud Function to delete all the data require from the database once the csv has been created.
Here I leave you some documents which could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function:https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions:https://cloud.google.com/sql/docs/mysql/connect-functionsCloud
Storage Tutorial:https://cloud.google.com/functions/docs/tutorials/storage
Another method aside from #jjanes would be to partition your database by date. This would allow you to create an index on the date, making exporting or deleting a days entries very easy. With this implementation, you could also create a Cron Job that deletes all tables older then X days ago.
The documentation provided will walk you through setting up a Ranged partition
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
Thank you for all your answers. There are multiples ways of doing this, so I'm goint to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
I used a cloud scheduler with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler will be triggered once a day and it will export only the data of the previous day
Also, I have to noticed that in the query I used in cloud scheduler you can choose which columns you want to export, doing this you can avoid to export the column which include the Insertion_Date and use this column only an auxiliary.
Finally, the cloud scheduler will create automatically the csv file in a bucket

AWS Glue Scala Upsert

I am trying to Upsert data into an existing S3 bucket from another using AWS Glue in Scala. Is there a standard way to use this? One of the methods that I found was to use SQL's MERGE method. What are the advantages and disadvantages of using that?
Thanks
You can't really implement 'SQL MERGE' method in s3 since it's not possible to update existing data objects.
A workaround is to load existing rows in a Glue job, merge it with incoming dataset, drop obsolete records and overwrite all objects on s3. If you have a lot of data it would be more efficient to partition it by some columns and then override those partitions that should contain new data only.
If you goal is preventing duplicates then you can do similar: load existing, drop those records from incoming dataset that already exist in s3 (loaded on previous step) and then write to s3 new records only.

How to push a big file data in talend?

I have created a table where I have a text input file which is 7.5 GB in size and there are 65 million records and now I want to push that data into an Amazon RedShift table.
But after processing 5.6 million records it's no longer moving.
What can be the issue? Is there any limitation with tFileOutputDelimited as the job has been running for 3 hours.
Below is the job which I have created to push data in to Redshift table.
tFileInputDelimited(.text)---tMap--->tFilOutputDelimited(csv)
|
|
tS3Put(copy output file to S3) ------> tRedShiftRow(createTempTable)--> tRedShiftRow(COPY to Temp)
The limitation comes from Tmap component, its not the good choice to deal with large amount of data, for your case, you have to enable the option "Store temp data" to overcome the memory consumption limitation of Tmap.
Its well described in Talend Help Center.
Looks like, tFilOutputDelimited(csv) is creating the problem. Any file can't handle after certain amount of data. Not sure thought. Try to find out a way to load only portion of the parent input file and commit it in redshift. Repeat the process till your parent input file gets completely processed.
use AWS Glue to push your file data from S3 to Redshift. AWS Glue will easily push the large data into redshift without any issue.
Steps:
1: Create a connection with your Redshift
2: create a database and two tables.
a: Data-from-S3 (this will use to crawl file data from S3)
b: data-to-redshift ( add redshift connection)
3: Create a Job:
a: In Data source, select the "Data-from-S3" table
b: In Data Target, select the "data-to-redshift" table
4: Run the job.
Note: You can also automate this with lambda and SNS trigger.
You can use copy command option to load large data into aws redshift, if the copy command doesn't support txt file, then we need to have csv file. processing 65 million records will create issue. so we need to perform split and run. for that create 65 iterations and do process 1 million data a a time . To implement this use tloop and set the values inside the component. take the global variables of tloop in header and limit values of tinputdelimited component
job:
tloop----->tinputfiledelimited---->tmap(if needed)--------> tfileoutdelimited
also enable the option "Store temp data" to handle the memory issue

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what redshift and MongoDB is, so I decided I'll take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table. And the records in this temp table get their ids from the originating MongoDB collection. This allows us to then run a delete against the target table, in our example, the employees table, where the id is present in the temp table. If any of the tables don't exist, it gets created on the fly, based on a schema provided by the plugin. And so we get to insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so it'll be updated with an is_deleted flag in redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a redshift table, in order to keep track of when the migration last run successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it
from the engine, in order to ensure that only deltas are moved
across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
A schema file, this is a SQL file containing the schema for the table at redshift
Redshift is a data ware housing product and Mongo DB is a NoSQL DB. Clearly, they are not a replacement of each other and can co-exist and serve different purpose. Now how to save and update records at both places.
You can move all Mongo DB data to Redshift as a one time activity.
Redshift is not a good fit for real time write. For Near Real Time Sync to Redshift, you should Modify program that writes into Mongo DB.
Let that program also writes into S3 locations. S3 location to redshift movement can be done on regular interval.
Mongo DB being a document storage engine, Apache Solr, Elastic Search can be considered as possible replacements. But they do not support SQL type querying capabilities.They basically use a different filtering mechanism. For eg, for Solr, you might need to use the Dismax Filter.
On Cloud, Amazon's Cloud Search/Azure Search would be compelling options to try as well.
You can use AWS DMS to migrate data to redshift now easily , you can also realtime ongoing changes with it.