Background: I want to build an ETL with the drag-and-drop functionality in Talend.
Question: Can Talend compile an ETL built with drag and drop into Redshift SQL and run it directly in Redshift, instead of running the ETL in the Talend server runtime environment?
Thank you!
Yes, Talend has a SQL module but you will need to write the SQL for the module.
Related
This might be a stupid question, but it seems very difficult to find information about Synapse with multiple environments.
We have a dev/test/prod environment setup and need to create partially automated CI/CD pipelines between them. The problem is that we cannot build dynamic SQL scripts that query the respective storage accounts, so that the scripts could be identical regardless of environment: dev Synapse using data from dev storage, and so on. A dedicated SQL pool can benefit from stored procedures, and I could pass parameters there if that works. But what about the serverless pool? What is the correct way?
I've tried looking at options such as OPENROWSET with the DATA_SOURCE argument, as well as the EXTERNAL DATA SOURCE expression, without any luck. Also, no one seems to offer any information about this, so I'm beginning to wonder whether this whole approach is wrong.
This kind of "external" file reading is new to me; I may have been trying to fit it into a SQL Server context in my head.
Thank you for your time!
The serverless pool does support both stored procedures and dynamic SQL, but you currently cannot call them directly from Synapse Pipelines.
You have to either trigger those procedures via Spark notebooks, or create a separate Synapse Analytics linked service for each of your databases in the Synapse serverless pool and work from there.
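If you go the Spark-notebook route, a minimal sketch could look like the following, assuming pyodbc is available in your Spark pool; the serverless endpoint, database, procedure name, and authentication method are all placeholders for your own setup:

import pyodbc

# Connect to the workspace's serverless SQL endpoint (values below are assumed).
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"   # hypothetical serverless endpoint
    "Database=analytics;"                                  # hypothetical database
    "Authentication=ActiveDirectoryMsi;",                  # e.g. managed identity auth
    autocommit=True,
)

# Pass the environment so the procedure can build dynamic SQL that points
# at the matching storage account (dev/test/prod).
env = "dev"
conn.cursor().execute("EXEC dbo.usp_refresh_views @environment = ?", env)
conn.close()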
I have a Databricks notebook in PySpark/Python, and I have an Azure Synapse database. I would like to update a single record in a Synapse table. It seems the original
df.write \
    .format("com.databricks.spark.sqldw") \
doesn't have that option; it just has append, overwrite, and so on. Will I need another library to help?
I believe you should load to a staging table in Synapse and then use .option("postActions",postActionsSQL) to insert/update/delete into the final table. Here is a full example.
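For context, a rough sketch of that pattern, assuming a hypothetical dbo.orders target and dbo.orders_staging staging table; the JDBC URL, storage account, and column names are placeholders for your own workspace:

# SQL that the connector runs against Synapse after the staging load finishes.
# Implicit-join UPDATE is used here; the exact UPDATE/MERGE syntax depends on your pool.
post_actions_sql = """
    UPDATE t
    SET t.status = s.status
    FROM dbo.orders t, dbo.orders_staging s
    WHERE t.order_id = s.order_id;
    DROP TABLE dbo.orders_staging;
"""

(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=analytics")  # your JDBC URL
   .option("tempDir", "abfss://staging@<account>.dfs.core.windows.net/tmp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.orders_staging")          # load the staging table first
   .option("postActions", post_actions_sql)          # then update the final table
   .mode("overwrite")
   .save())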
I would load the output temporarily to a file in ADLS (most probably Parquet) and then use PolyBase or OPENROWSET to update the records (UPDATE with a join, or MERGE using the external table mentioned). You can create a stored procedure and sync it with the creation of the Parquet file.
I guess Microsoft would suggest not using Databricks as a separate resource, but rather utilizing Spark pools within Synapse Studio.
I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?
You can check the link below to automate a pipeline that uses S3 and Lambda to trigger your Talend job when files arrive.
Automate S3 File Push
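Roughly, the Lambda side of that pattern is sketched below; the Talend trigger URL is a placeholder, since the actual call depends on whether you go through TAC's MetaServlet or the Talend Cloud API:

import json
import urllib.request

TALEND_TRIGGER_URL = "https://talend.example.com/executeJob"  # hypothetical endpoint

def lambda_handler(event, context):
    # The S3 event tells us which file just landed.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Pass the file location to the Talend job as parameters.
    payload = json.dumps({"bucket": bucket, "key": key}).encode("utf-8")
    req = urllib.request.Request(
        TALEND_TRIGGER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": resp.status, "body": resp.read().decode("utf-8")}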
We are planning to go with PostgreSQL RDS in an AWS environment. There are some files in S3 that we will need to load every week. I don't see any option in the AWS documentation for loading data from S3 into PostgreSQL RDS. I see it is possible for Aurora, but I cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow it and swap out the MySQL parameters for those of your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
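If you later prefer to drive the pipeline from code instead of the console, a small boto3 sketch like the one below can activate it on demand; the pipeline name is a placeholder for whatever you called yours:

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Find the pipeline you built from the template in the console
# (the name "s3-to-postgres-weekly" is hypothetical).
pipelines = dp.list_pipelines()["pipelineIdList"]
pipeline_id = next(p["id"] for p in pipelines if p["name"] == "s3-to-postgres-weekly")

# Activate it; the schedule defined in the pipeline takes over from there.
dp.activate_pipeline(pipelineId=pipeline_id)

# Check the pipeline's status fields.
status = dp.describe_pipelines(pipelineIds=[pipeline_id])
print(status["pipelineDescriptionList"][0]["fields"])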
I have migrated data between Amazon Redshift clusters using unload/copy commands via S3 interactively. The next step is to automate the process, and I'm looking for the best approach to do so.
You can use Java or any other language to perform the steps below and automate the process:
1) connect to cluster 1
2) unload data to amazon s3
3) connect to cluster 2
4) copy data from amazon s3 to redshift cluster
A shell script, PHP, or a simple Java program will do.
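For example, a rough Python version of those four steps with psycopg2; the endpoints, credentials, table names, bucket, and IAM roles below are placeholders:

import psycopg2

# UNLOAD from the source cluster to S3 (role and bucket are placeholders).
UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM public.sales')
    TO 's3://my-staging-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftUnloadRole'
    GZIP ALLOWOVERWRITE;
"""

# COPY from S3 into the destination cluster.
COPY_SQL = """
    COPY public.sales
    FROM 's3://my-staging-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::222222222222:role/RedshiftCopyRole'
    GZIP;
"""

# 1) connect to cluster 1 and 2) unload data to S3
with psycopg2.connect(host="cluster1.xxxx.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="warehouse", user="admin", password="***") as conn:
    conn.cursor().execute(UNLOAD_SQL)

# 3) connect to cluster 2 and 4) copy data from S3 into it
with psycopg2.connect(host="cluster2.xxxx.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="warehouse", user="admin", password="***") as conn:
    conn.cursor().execute(COPY_SQL)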
Here are the two ways that you can try:
1) Use a Python or bash script to unload and copy data from one Redshift cluster to another. In this approach the staging area will be S3. If you are trying to unload and copy between separate accounts, you need appropriate IAM roles and trust policies, which can be a little challenging. You can automate this process by using AWS Data Pipeline.
2) Take a snapshot and restore a Redshift cluster from the snapshot. If you want to share the snapshot with another account, go to Manage Access and enter the Account ID of the destination Redshift cluster. This is very simple, and there is no need to write any code.
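If you do end up wanting to script the snapshot route too, a boto3 sketch of the same flow could look like this; cluster names, the snapshot identifier, and the account IDs are placeholders:

import boto3

src = boto3.client("redshift", region_name="us-east-1")

# 1) Take a manual snapshot of the source cluster and wait for it to complete.
src.create_cluster_snapshot(
    SnapshotIdentifier="migration-snapshot",
    ClusterIdentifier="source-cluster",
)
src.get_waiter("snapshot_available").wait(SnapshotIdentifier="migration-snapshot")

# 2) Share it with the destination account (the "Manage Access" step in the console).
src.authorize_snapshot_access(
    SnapshotIdentifier="migration-snapshot",
    AccountWithRestoreAccess="222222222222",
)

# 3) From the destination account, restore a new cluster from the shared snapshot.
dst = boto3.client("redshift", region_name="us-east-1")  # use the destination account's credentials
dst.restore_from_cluster_snapshot(
    ClusterIdentifier="destination-cluster",
    SnapshotIdentifier="migration-snapshot",
    OwnerAccount="111111111111",
)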