I have a Databricks notebook in PySpark/Python, and I have an Azure Synapse database. I would like to update a single record in a Synapse table. It seems the original

df.write \
    .format("com.databricks.spark.sqldw") \

doesn't have that option; it only has append, overwrite, and so on. Will I need another library to help?
I believe you should load to a staging table in Synapse and then use .option("postActions", postActionsSQL) to insert/update/delete into the final table. Here is a full example.
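For reference, a minimal sketch of that staging-table pattern with the com.databricks.spark.sqldw connector might look like the following. The table names, JDBC URL, tempDir, and the SQL inside postActions are placeholders for your own environment:

# Hypothetical staging and final tables; postActions is a ";"-separated list of
# statements the connector runs after the write succeeds.
staging_table = "dbo.orders_staging"
post_actions_sql = (
    "UPDATE t SET t.status = s.status "
    "FROM dbo.orders AS t "
    "JOIN dbo.orders_staging AS s ON t.order_id = s.order_id; "
    "DROP TABLE dbo.orders_staging"
)

(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>;user=<user>;password=<pwd>")
   .option("tempDir", "abfss://<container>@<storage-account>.dfs.core.windows.net/tmp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", staging_table)          # write lands in the staging table
   .option("postActions", post_actions_sql)   # then the update runs against the final table
   .mode("overwrite")
   .save())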
I would temporarily load the output to a file in ADLS (most probably Parquet) and then use PolyBase or OPENROWSET to update the records (an UPDATE with a join, or a MERGE using the external table just mentioned). You can wrap that in a stored procedure and run it in sync with the creation of the Parquet file.
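On the Databricks side this is just a Parquet write to ADLS; the update itself then happens inside Synapse (an UPDATE with a join, or a MERGE, over an external table or OPENROWSET pointing at that path, for example wrapped in a stored procedure). A rough sketch of the notebook side, with a hypothetical abfss path:

# Hypothetical ADLS location; Synapse reads it back via an external table or OPENROWSET.
updates_path = "abfss://staging@<storage-account>.dfs.core.windows.net/orders_updates"

(df.coalesce(1)            # optional: keep the update extract as a single file
   .write
   .mode("overwrite")
   .parquet(updates_path))

# The Synapse-side stored procedure (UPDATE with a join, or MERGE, over the external
# table pointing at updates_path) would be triggered after this write completes.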
I guess Microsoft would suggest not using Databricks as a separate resource, but rather utilizing Spark pools within Synapse Studio.
I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?
Can you check the link below? It automates the pipeline using S3 and Lambda to trigger the Talend job when files arrive.
Automate S3 File Push
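The general shape of that approach is an S3 event notification invoking a Lambda function, which in turn kicks off the Talend job (for example by calling a Talend Management Console/Cloud task-execution endpoint, or whatever endpoint your job exposes). A hedged Python sketch of such a handler; the endpoint, task id, token, and payload shape are assumptions about your Talend setup:

import json
import os
import urllib.request

# Placeholders: a Talend task-execution endpoint and an API token, passed in as env vars.
TALEND_EXECUTIONS_URL = os.environ["TALEND_EXECUTIONS_URL"]
TALEND_TOKEN = os.environ["TALEND_TOKEN"]
TASK_ID = os.environ["TASK_ID"]

def lambda_handler(event, context):
    # S3 event notification delivers one record per created object
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Ask Talend to run the task, passing the file location along
        payload = json.dumps({"executable": TASK_ID,
                              "parameters": {"bucket": bucket, "key": key}}).encode()
        req = urllib.request.Request(
            TALEND_EXECUTIONS_URL,
            data=payload,
            headers={"Authorization": f"Bearer {TALEND_TOKEN}",
                     "Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            print(f"Triggered Talend task for s3://{bucket}/{key}: {resp.status}")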
Can you help me create a DAG for AWS Managed Airflow (MWAA) to copy data from one schema to another (they are in the same database) in Redshift, without an S3 bucket?
Thanks.
INSERT INTO {target_schema}.{table} select * from {source_schema}.{table}
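As a concrete sketch, that statement can be run straight from MWAA with a SQL operator and an Airflow connection pointing at Redshift. The connection id, schema names, and table list below are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Placeholders: Airflow connection for the Redshift cluster, plus schemas and tables to copy.
REDSHIFT_CONN_ID = "redshift_default"
SOURCE_SCHEMA = "staging"
TARGET_SCHEMA = "analytics"
TABLES = ["customers", "orders"]

with DAG(
    dag_id="copy_schema_to_schema",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        # One INSERT ... SELECT per table, executed directly on Redshift (no S3 involved)
        PostgresOperator(
            task_id=f"copy_{table}",
            postgres_conn_id=REDSHIFT_CONN_ID,
            sql=f"INSERT INTO {TARGET_SCHEMA}.{table} SELECT * FROM {SOURCE_SCHEMA}.{table};",
        )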
I want to join more than one CSV file from AWS S3 and move the result into Redshift using AWS Glue.
I have tried moving a single file to Redshift and it works. I have seen solutions via PySpark; can I do the same via the GUI without writing code?
I'm looking for the best way to synchronise an on-premise MongoDB with an Azure DocumentDB. The idea is that this synchronisation can run at a predetermined time, for example every 2 hours.
I'm using .NET and C#.
I was thinking that I could create a Windows Service that retrieves the documents from the Azure DocumentDB collections and inserts them into my on-premise MongoDB.
But I'm wondering if there is any better way.
Per my understanding, you could use the Azure Cosmos DB Data Migration Tool to export docs from collections to a JSON file, then pick up the exported file(s) and insert/update them into your on-premise MongoDB. Moreover, there is a tutorial about using the Windows Task Scheduler to back up DocumentDB that you could follow here.
When executing the export operation, you could export to a local file or to Azure Blob Storage. For exporting to a local file, you could leverage the FileTrigger from Azure WebJobs SDK Extensions to monitor file additions/changes under a particular directory, then pick up the newly added local file and insert it into your MongoDB. For exporting to Blob Storage, you could also work with the WebJobs SDK and use the BlobTrigger to trigger on the new blob file and do the insertion. For the blob approach, you could follow How to use Azure blob storage with the WebJobs SDK.
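Whichever trigger you use, the "pick up the exported file and insert/update" step is just an upsert by id against your on-premise MongoDB. Your stack is .NET/C#, but here is a small sketch of that step in Python with pymongo for brevity (the same logic maps to the .NET MongoDB driver); the file path, database, and collection names are placeholders:

import json
from pymongo import MongoClient, ReplaceOne

# Placeholders: exported JSON file from the Data Migration Tool and the target collection.
EXPORT_FILE = "C:/exports/documents.json"
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["documents"]

with open(EXPORT_FILE, encoding="utf-8") as f:
    docs = json.load(f)   # the tool exports a JSON array of documents

# Upsert each document by its "id" field so re-running the sync is idempotent
ops = [ReplaceOne({"id": d["id"]}, d, upsert=True) for d in docs]
if ops:
    result = collection.bulk_write(ops)
    print(f"matched={result.matched_count} upserted={len(result.upserted_ids)}")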
We are planning to go for PostgreSQL RDS in AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in AWS documentation where we can load data from S3 to PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON-defined workflow that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters for those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
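If you'd rather script the weekly load yourself instead of using the Data Pipeline template, a common alternative is a small job that pulls the file from S3 and bulk-loads it into Postgres with COPY. A rough Python sketch; the bucket, key, table, and connection details are placeholders:

import boto3
import psycopg2

# Placeholders for your environment
BUCKET, KEY = "my-bucket", "weekly/data.csv"
conn = psycopg2.connect(host="mydb.xxxx.rds.amazonaws.com",
                        dbname="mydb", user="loader", password="...")

# Download the weekly file from S3
s3 = boto3.client("s3")
local_path = "/tmp/data.csv"
s3.download_file(BUCKET, KEY, local_path)

# Bulk-load it with COPY (assumes a CSV with a header row); the connection
# context manager commits the transaction on success.
with conn, conn.cursor() as cur, open(local_path) as f:
    cur.copy_expert("COPY my_schema.my_table FROM STDIN WITH CSV HEADER", f)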