How to reduce the time for file copy to S3 using Talend

I have created a small job to copy a CSV file of 3 million records (350 MB, zipped) to Amazon S3 via Talend Data Integration, using the tS3Put component. The job took around 2 hours 20 minutes to complete. But when I copied the same file via the AWS CLI or Informatica, it completed within an hour.
Does anyone have an idea how to reduce the copy time to S3 using the Talend Data Integration tool?
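One common reason the AWS CLI is faster is that it splits large objects into parts and uploads several parts in parallel (multipart upload), whereas a single-stream PUT sends the file serially. As a rough, hedged illustration of that difference, here is a minimal boto3 sketch with placeholder bucket, key, and local path; in Talend, a comparable workaround is to call the AWS CLI from a tSystem step if tS3Put itself cannot be tuned enough.

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart upload with parallel parts -- this is roughly what the AWS CLI
# does by default and is usually why it beats a single-stream PUT.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,    # switch to multipart above 8 MB
    multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
    max_concurrency=10,                     # upload up to 10 parts in parallel
)

s3 = boto3.client("s3")
s3.upload_file(
    "/data/records.zip",      # hypothetical local path
    "my-bucket",              # hypothetical bucket name
    "incoming/records.zip",   # hypothetical object key
    Config=config,
)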

Related

How to get Talend to wait for a file to land in S3

I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?
You can check the link below to automate a pipeline that uses S3 events and a Lambda function to trigger the Talend job when a file lands.
Automate S3 File Push
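As a hedged sketch of the Lambda side of that approach: an S3 ObjectCreated notification invokes a function, which extracts the bucket and key and can then start the Talend job. How the job is started depends on your setup (TAC, Talend Cloud, or an exported job script), so the endpoint call below is only a hypothetical placeholder.

import json
import urllib.parse

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event notification.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    print(f"New file landed: s3://{bucket}/{key}")

    # Hypothetical: call whatever starts your Talend job, e.g. a TAC /
    # Talend Cloud endpoint or a remote command.
    # requests.post("https://tac.example.com/executeTask", json={"file": key})

    return {"statusCode": 200, "body": json.dumps({"bucket": bucket, "key": key})}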

Apache Nifi missing files when using ListGCSBucket and FetchGCSObject processors

We have a NiFi process group to transfer files between two Google Cloud Storage buckets, say BucketA and BucketB (in different Google Cloud projects). For this we are using the following NiFi processors, in order:
ListGCSBucket
FetchGCSObject
PutGCSObject
At the end of the day we have found that some files are not present in BucketB. Files are uploaded to BucketA every minute, and NiFi ListGCSBucket is scheduled to run every 30 seconds.
Could it be a race condition, where a file is uploaded to BucketA while ListGCSBucket is running, so it is missed in that listing and not picked up in the next run either?
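One way to narrow this down is to diff the object listings of the two buckets and see exactly which files never arrived, and whether their upload times cluster around the ListGCSBucket runs. A minimal sketch with the google-cloud-storage Python client, with placeholder bucket names:

from google.cloud import storage

client = storage.Client()

def object_names(bucket_name, prefix=None):
    # Collect all object names in the bucket (optionally under a prefix).
    return {blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)}

# Hypothetical bucket names: find objects present in BucketA but missing in BucketB.
missing = object_names("bucket-a") - object_names("bucket-b")
for name in sorted(missing):
    print("missing in BucketB:", name)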

Import Firestore backup to BigQuery programmatically

I have a Firestore backup file in GCS with the name: all_namespaces_kind_Rates.export_metadata. I have set up a cron job to update this file every 24 hours. What I need now is a way to send this export_metadata file to BigQuery programmatically. BigQuery can schedule data transfers from GCS, but only for files in CSV, JSON, AVRO, PARQUET, and ORC format. How can I transfer my Firestore backup files programmatically into BigQuery?
If your cron job can access the bq command-line tool, have you tried:
bq load --source_format=DATASTORE_BACKUP [DATASET].[TABLE] [PATH_TO_SOURCE]
See more about the command:
https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore#loading_cloud_firestore_export_service_data
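If you would rather stay in code than shell out to bq, the same load can be done with the BigQuery client library using the DATASTORE_BACKUP source format (which is also used for Firestore exports). A minimal Python sketch, where the GCS path, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
)

# Hypothetical export path -- point it at your export_metadata file.
uri = "gs://my-backup-bucket/exports/all_namespaces_kind_Rates.export_metadata"

load_job = client.load_table_from_uri(uri, "my_dataset.rates", job_config=job_config)
load_job.result()  # wait for the load to finish
print("Loaded", client.get_table("my_dataset.rates").num_rows, "rows")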

Pentaho job that reads from S3 through VFS is not failing when the file is not available

I have a job in Pentaho which reads data from S3 through the virtual file system (VFS). The extract from the source is not regular, as it arrives on an ad hoc basis, so ideally I had to write a loop condition from today's date back to the date that matches a file in S3.
When the file is not available in S3, I added a failure condition so it can be handled by the loop condition, but the step is not failing. Could anyone suggest a way to do this?
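One workaround, if the job can run a script step before the VFS read, is to test for the object explicitly and exit non-zero when it is absent, so that step fails and your loop condition can take over. A minimal boto3 sketch, with placeholder bucket and key:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def s3_object_exists(bucket, key):
    # HEAD the object: True if it exists, False on a 404, re-raise otherwise.
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

# Hypothetical bucket/key: a non-zero exit code marks the step as failed.
if not s3_object_exists("my-bucket", "exports/2024-01-15/data.csv"):
    raise SystemExit(1)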

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post on how to trigger Dataflow pipelines from App Engine or Cloud Functions would help:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
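For the Cloud Functions route in that post, the usual pattern is a function bound to the bucket's finalize event that launches a Dataflow template through the projects.templates.launch API. A minimal Python sketch (the post's samples use Node.js, but the API call is the same; the project ID, template path, and parameter names below are placeholders):

from googleapiclient.discovery import build

def launch_dataflow(event, context):
    # Background Cloud Function triggered by a GCS object-finalize event.
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId="my-project",                                  # hypothetical project
        gcsPath="gs://my-bucket/templates/gcs_to_bq_template",   # hypothetical template
        body={
            "jobName": "gcs-to-bq-trigger",
            "parameters": {
                "inputFile": f"gs://{event['bucket']}/{event['name']}",
            },
        },
    )
    print(request.execute())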