How to get Talend to wait for a file to land in S3

I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?

You can check the link below, which automates a pipeline using S3 and Lambda to trigger a Talend job when files arrive.
Automate S3 File Push
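If you go the Lambda route, a minimal sketch of the glue code might look like this (Python). The TALEND_TRIGGER_URL environment variable here is an assumption: it stands in for whatever endpoint actually starts your job, e.g. a TAC MetaServlet call or a Talend Cloud trigger, so adapt it to your deployment.

```python
# Hypothetical AWS Lambda handler: fired by an S3 ObjectCreated event and
# forwards the bucket/key to whatever endpoint starts your Talend job.
# TALEND_TRIGGER_URL is an assumption, not a real Talend endpoint.
import json
import os
import urllib.request

TALEND_TRIGGER_URL = os.environ["TALEND_TRIGGER_URL"]  # assumed env var

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        payload = json.dumps({"bucket": bucket, "key": key}).encode("utf-8")
        req = urllib.request.Request(
            TALEND_TRIGGER_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(f"Triggered Talend job for s3://{bucket}/{key}: {resp.status}")
    return {"statusCode": 200}
```

Wire the Lambda to the bucket with an S3 ObjectCreated event notification so it fires each time the file lands, instead of having Talend poll with tWaitForFile.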

Related

Is there a way to deploy scheduled queries to GCP directly through a github action, with a configurable schedule?

Currently using the GCP BigQuery UI for scheduled queries; everything is done manually.
Wondering if there's a way to automatically deploy them to GCP through GitHub Actions, using a config JSON that contains each scheduled query's parameters and schedule?
So far, this is one option I've found that makes it more "automated" (see the sketch after this list):
- store the query in a file on Cloud Storage; when the Cloud Function is invoked, read the file content and run a BigQuery job on it
- to update the query, you only have to update the file content
- con: reading the file from Storage and then calling BQ means two API calls, plus a query file to manage
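As a rough illustration of that Cloud Function option (not a drop-in solution), here is a minimal Python sketch; the bucket name, object path, and function name are assumptions:

```python
# Sketch of an HTTP Cloud Function that reads a SQL file from Cloud Storage
# and runs it as a BigQuery job. QUERY_BUCKET / QUERY_FILE are hypothetical;
# schedule the function with Cloud Scheduler or trigger it however you prefer.
import os

from google.cloud import bigquery, storage

QUERY_BUCKET = os.environ.get("QUERY_BUCKET", "my-query-bucket")   # assumed
QUERY_FILE = os.environ.get("QUERY_FILE", "scheduled/report.sql")  # assumed

def run_scheduled_query(request):
    # API call 1: fetch the query text from Cloud Storage
    sql = (
        storage.Client()
        .bucket(QUERY_BUCKET)
        .blob(QUERY_FILE)
        .download_as_text()
    )
    # API call 2: submit the query to BigQuery and wait for it to finish
    job = bigquery.Client().query(sql)
    job.result()
    return f"BigQuery job {job.job_id} finished"
```

You could deploy this (and push the query file) from a GitHub Action, then invoke it on your chosen schedule with Cloud Scheduler, which keeps the query text itself versioned in the repo.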
Currently using dbt in other repos to automate and speed this process up: https://docs.getdbt.com/docs/introduction
Would prefer the GitHub Actions version though, just haven't found good documentation yet :)

How does one synchronise an on-premises MongoDB with an Azure DocumentDB?

I'm looking for the best way to synchronise an on-premises MongoDB with an Azure DocumentDB. The idea is that this synchronisation runs at a predetermined interval, for example every 2 hours.
I'm using .NET and C#.
I was thinking that I could create a Windows Service that retrieves the documents from the Azure DocumentDB collections and inserts them into my on-premises MongoDB.
But I'm wondering if there is any better way.
Per my understanding, you could use the Azure Cosmos DB Data Migration Tool to export docs from collections to a JSON file, then pick up the exported file(s) and insert / update them into your on-premises MongoDB. Moreover, here is a tutorial about using the Windows Task Scheduler to back up DocumentDB that you could follow.
When executing the export operation, you could export to a local file or to Azure Blob Storage. For exporting to a local file, you could leverage the FileTrigger from the Azure WebJobs SDK Extensions to monitor file additions / changes under a particular directory, then pick up the newly added local file and insert it into your MongoDB. For exporting to Blob storage, you could also work with the WebJobs SDK and use the BlobTrigger to trigger on the new blob file and do the insertion. For the blob approach, you could follow How to use Azure blob storage with the WebJobs SDK.
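For illustration only (you're on .NET, but the same idea translates to C#), here is a minimal Python/pymongo sketch of the import side, assuming the Data Migration Tool exported a JSON array file; the file path, connection string, and database / collection names are made up:

```python
# Illustrative sketch: load documents exported by the Data Migration Tool
# (a JSON array file) and upsert them into an on-premises MongoDB.
# File path, connection string, database and collection names are assumptions.
import json

from pymongo import MongoClient, ReplaceOne

EXPORT_FILE = "exported_docs.json"        # assumed export location
MONGO_URI = "mongodb://localhost:27017"   # assumed on-premises MongoDB

def import_export_file():
    with open(EXPORT_FILE, encoding="utf-8") as f:
        docs = json.load(f)  # assumes the export is a single JSON array

    client = MongoClient(MONGO_URI)
    collection = client["warehouse"]["documents"]  # assumed db/collection

    # Upsert by id so re-running the sync every 2 hours doesn't duplicate docs
    ops = [ReplaceOne({"id": d["id"]}, d, upsert=True) for d in docs]
    if ops:
        result = collection.bulk_write(ops)
        print(f"Upserted {result.upserted_count}, modified {result.modified_count}")

if __name__ == "__main__":
    import_export_file()
```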

Loading data from S3 to PostgreSQL RDS

We are planning to use PostgreSQL RDS in the AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in the AWS documentation for loading data from S3 into PostgreSQL RDS; I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters with those for your Postgres instance; Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
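If you'd rather script the weekly load yourself instead of (or alongside) Data Pipeline, a minimal sketch with boto3 and psycopg2 could look like the following; the bucket, key, table name, and connection string are all assumptions:

```python
# Alternative scripted load: pull a CSV from S3 and bulk-load it into
# PostgreSQL RDS with COPY. Bucket, key, table name, and connection details
# are assumptions for the sake of the example.
import boto3
import psycopg2

BUCKET = "my-bucket"                    # assumed
KEY = "weekly/data.csv"                 # assumed
LOCAL_PATH = "/tmp/data.csv"
DSN = "host=my-rds-endpoint dbname=mydb user=etl password=secret"  # assumed

def load_s3_csv_to_postgres():
    # Download the weekly file from S3
    boto3.client("s3").download_file(BUCKET, KEY, LOCAL_PATH)

    # Bulk-load with COPY, which is much faster than row-by-row INSERTs
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur, open(LOCAL_PATH) as f:
        cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)

if __name__ == "__main__":
    load_s3_csv_to_postgres()
```

A script like this could be scheduled with cron, a Lambda, or Data Pipeline's ShellCommandActivity, depending on where you want the compute to run.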

How to reduce the time for file copy to S3 using Talend

I have created a small job to copy a CSV file of 3 million records (350 MB, zipped) to Amazon S3 via Talend Data Integration, using the tS3Put component. The job took around 2 hrs 20 min to complete, but when I copy the same file via the AWS CLI or Informatica it completes within an hour.
Does anyone have an idea how to reduce the copy time to S3 using the Talend Data Integration tool?

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post would help on how to trigger Dataflow pipelines from App Engine or Cloud Functions: https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
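As a rough sketch of the approach in that post (a GCS-triggered Cloud Function that launches a templated Dataflow job for the new file), here it is in Python; the project ID, region, template path, and the inputFile parameter are assumptions about your own template:

```python
# Sketch of a GCS-triggered Cloud Function that launches a templated Dataflow
# job for the file that just landed. Project ID, region, template path, and
# the template's expected parameters are assumptions -- adjust to your pipeline.
import time

from googleapiclient.discovery import build

PROJECT = "my-project"                              # assumed
REGION = "us-central1"                              # assumed
TEMPLATE = "gs://my-bucket/templates/my_template"   # assumed

def on_new_file(event, context):
    bucket = event["bucket"]
    name = event["name"]

    dataflow = build("dataflow", "v1b3")
    body = {
        "jobName": f"gcs-load-{int(time.time())}",
        "parameters": {
            # hypothetical template parameter: the file that just landed
            "inputFile": f"gs://{bucket}/{name}",
        },
    }
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body=body,
    ).execute()
    print(f"Launched Dataflow job: {response['job']['id']}")
```

Attach the function to the bucket's object-finalize event so each new file kicks off a run, with the Dataflow job writing its output to BigQuery.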