Pentaho job that reads from S3 through VFS does not fail when the file is not available - pentaho-spoon

I have a job in Pentaho that reads data from S3 through the virtual file system (VFS). The extract from the source does not arrive on a regular schedule; it lands on an ad hoc basis. So I need to loop backwards from today's date until I reach the date that matches a file in S3.
When the file is not available in S3, I added a failure condition so the loop can move on to the previous date, but the step never fails. Could anyone suggest a way to do this?
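Outside of Spoon itself, the file-existence check such a loop relies on can be expressed with the Apache Commons VFS library that Pentaho's VFS layer is built on. A minimal sketch, assuming an S3 VFS provider is on the classpath and a hypothetical bucket layout and date-based file naming:
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class S3FileProbe {
    public static void main(String[] args) throws Exception {
        FileSystemManager fsManager = VFS.getManager();
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");

        LocalDate date = LocalDate.now();
        FileObject found = null;

        // Walk backwards from today until a matching file is found in S3.
        // "s3://my-bucket/exports/data_" is a hypothetical layout; adjust the
        // URI scheme to whatever your VFS S3 provider registers (s3, s3a, ...).
        for (int i = 0; i < 30 && found == null; i++) {
            String uri = "s3://my-bucket/exports/data_" + date.format(fmt) + ".csv";
            FileObject candidate = fsManager.resolveFile(uri);
            if (candidate.exists()) {
                found = candidate;
            } else {
                date = date.minusDays(1);
            }
        }

        if (found == null) {
            // Exit non-zero so the calling job sees an explicit failure.
            System.err.println("No export found in the last 30 days");
            System.exit(1);
        }
        System.out.println("Latest export: " + found.getName().getURI());
    }
}
In a Spoon job the same idea typically maps to a file-exists job entry inside the date loop, with its failure hop driving the next iteration rather than ending the job.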

Related

How to get Talend to wait for a file to land in S3

I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?
Can you check the link below? It automates the pipeline with S3 and a Lambda function that triggers the Talend job when files arrive.
Automate S3 File Push
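If you go the S3 + Lambda route, the Lambda side is just an "ObjectCreated" handler that calls whatever endpoint starts your Talend job. A hedged sketch using the AWS Lambda Java runtime; JOB_TRIGGER_URL is a hypothetical placeholder for your TMC/TAC trigger:
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

import java.net.HttpURLConnection;
import java.net.URL;

// Fired by an S3 "ObjectCreated" event notification configured on the bucket.
public class FileLandedHandler implements RequestHandler<S3Event, String> {

    // Hypothetical endpoint that starts the Talend job - replace with however
    // your job is actually exposed (Talend Management Console, TAC, etc.).
    private static final String JOB_TRIGGER_URL = "https://example.com/start-talend-job";

    @Override
    public String handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            context.getLogger().log("New file landed: s3://" + bucket + "/" + key);
            try {
                // Fire-and-forget call telling the ETL side to start the job.
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(JOB_TRIGGER_URL).openConnection();
                conn.setRequestMethod("POST");
                context.getLogger().log("Trigger responded: " + conn.getResponseCode());
            } catch (Exception e) {
                context.getLogger().log("Trigger failed: " + e.getMessage());
            }
        });
        return "ok";
    }
}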

Is there a way to dump a TSV file from a Storage Bucket to Cloud MySQL in GCP?

Is there a way to dump a TSV file from a Storage Bucket to Cloud MySQL in GCP? I have a large TSV file with 4M rows, and I couldn't convert it into CSV.
As of today, Cloud SQL only supports importing CSV and SQL files. Nonetheless, I suggest that you take a look at the following solution. I used Python so the process can be automated in case you need to run it more than once. I tried to reproduce your issue and wrote a script that basically:
Downloads the TSV file from the specified Cloud Storage bucket.
Converts the TSV file to a CSV file.
Uploads the CSV file to the specified Cloud Storage bucket.
Imports the newly added CSV file into Cloud SQL.
You can find the code as well as the requirements for running this script here. Take into account that you will need to replace the placeholder values in brackets, such as [BUCKET_NAME], before running it. Also keep in mind that the script does not delete the downloaded TSV file or the generated CSV file, so you will need to delete them manually, or you can modify the code so it deletes the files automatically.
Finally, if you would like to read more about the APIs used in the script, I attach the relevant documentation here & here.
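Since the linked script is not shown here, below is a rough equivalent of steps 1-3 using the Cloud Storage Java client. The bucket and object names are hypothetical, and the TSV-to-CSV conversion is deliberately naive (it quotes every field and assumes no embedded quotes or tabs):
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;

public class TsvToCsvInGcs {
    public static void main(String[] args) {
        // Hypothetical names - replace with your own bucket and objects.
        String bucket = "my-bucket";
        String tsvObject = "exports/data.tsv";
        String csvObject = "exports/data.csv";

        Storage storage = StorageOptions.getDefaultInstance().getService();

        // 1. Download the TSV from Cloud Storage.
        Blob tsvBlob = storage.get(BlobId.of(bucket, tsvObject));
        String tsv = new String(tsvBlob.getContent(), StandardCharsets.UTF_8);

        // 2. Convert TSV -> CSV (every field quoted; no escaping handled).
        StringBuilder csv = new StringBuilder();
        for (String line : tsv.split("\n")) {
            String[] fields = line.split("\t", -1);
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) csv.append(',');
                csv.append('"').append(fields[i]).append('"');
            }
            csv.append('\n');
        }

        // 3. Upload the CSV back to the bucket; Cloud SQL can then import it,
        //    e.g. with: gcloud sql import csv INSTANCE gs://my-bucket/exports/data.csv
        //               --database=mydb --table=mytable
        storage.create(
                BlobInfo.newBuilder(bucket, csvObject).setContentType("text/csv").build(),
                csv.toString().getBytes(StandardCharsets.UTF_8));
    }
}
For a 4M-row file you would stream line by line rather than buffer the whole object in memory, but the sketch keeps the flow visible.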

Import Firestore backup to BigQuery programmatically

I have a Firestore backup file in GCS named all_namespaces_kind_Rates.export_metadata. I have set up a cron job that updates this file every 24 hours. What I need now is a way to programmatically load this export_metadata file into BigQuery. BigQuery can schedule data transfers from GCS, but only for files in CSV, JSON, AVRO, PARQUET and ORC format. How can I transfer my Firestore backup files into BigQuery programmatically?
If your cron job can access bq command line tool, have you tried:
bq load --source_format=DATASTORE_BACKUP [DATASET].[TABLE] [PATH_TO_SOURCE]
See more about the command:
https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore#loading_cloud_firestore_export_service_data
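If the cron job runs somewhere with the BigQuery Java client available instead of the bq CLI, the same load can be expressed through the API. A sketch, with a hypothetical dataset, table, and export path:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadFirestoreExport {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical dataset, table, and GCS path - substitute your own values.
        TableId table = TableId.of("my_dataset", "rates");
        String sourceUri = "gs://my-bucket/backups/all_namespaces_kind_Rates.export_metadata";

        // Equivalent of: bq load --source_format=DATASTORE_BACKUP my_dataset.rates gs://...
        LoadJobConfiguration config =
                LoadJobConfiguration.newBuilder(table, sourceUri)
                        .setFormatOptions(FormatOptions.datastoreBackup())
                        .build();

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        if (job == null || job.getStatus().getError() != null) {
            throw new RuntimeException("Load failed: "
                    + (job == null ? "job no longer exists" : job.getStatus().getError()));
        }
        System.out.println("Loaded Firestore export into " + table);
    }
}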

How can I identify processed files in a Dataflow job?

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is a sample of the TextIO read that I am using.
PCollection<String> filePCollection = pipeline.apply("Read files from Cloud Storage", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see the list of files that match your wildcard, you can use gsutil, the Cloud Storage command-line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing those files, you could do either of the following:
Define a streaming job and use TextIO's watchForNewFiles functionality. You would have to leave the job running for as long as you want to keep processing files.
Find a way to provide your pipeline with the list of files that have already been analyzed. For this, every time you run your pipeline you could generate the list of files to analyze, put it into a PCollection, read each one with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run the pipeline again, you can use this list as a blacklist of files that you don't need to read again (a sketch of this approach follows below).
Let me know in the comments if you want to work out a solution around one of these two options.
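A rough sketch of that second option with the Beam Java SDK. listMatchingFiles, loadProcessedFiles, and saveProcessedFiles are hypothetical helpers standing in for however you list the bucket and persist the processed-file state; note also that newer Beam releases prefer FileIO.match() plus TextIO.readFiles() over TextIO.readAll():
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

import java.util.List;
import java.util.stream.Collectors;

public class IncrementalRead {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Hypothetical helpers: the first could wrap "gsutil ls" or the Cloud
        // Storage client, the second could read a state file or table where
        // previously processed paths were recorded.
        List<String> allFiles = listMatchingFiles("gs://bucketName/TrafficData*.txt");
        List<String> alreadyProcessed = loadProcessedFiles();

        // Keep only the files that have not been analyzed yet (the "blacklist" idea).
        List<String> newFiles = allFiles.stream()
                .filter(f -> !alreadyProcessed.contains(f))
                .collect(Collectors.toList());

        PCollection<String> lines = pipeline
                .apply("New file names", Create.of(newFiles).withCoder(StringUtf8Coder.of()))
                .apply("Read new files", TextIO.readAll());

        // ... downstream transforms on `lines` ...

        pipeline.run().waitUntilFinish();
        // After a successful run, persist the file names so the next run skips them.
        saveProcessedFiles(newFiles);
    }

    // Stubs for the hypothetical helpers mentioned above.
    static List<String> listMatchingFiles(String pattern) { throw new UnsupportedOperationException(); }
    static List<String> loadProcessedFiles() { throw new UnsupportedOperationException(); }
    static void saveProcessedFiles(List<String> files) { throw new UnsupportedOperationException(); }
}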

How to reduce the time for file copy to S3 using Talend

I have created a small job to copy a CSV file of 3 million records (350 MB) in zip format to Amazon S3 via Talend Data Integration, using the tS3Put component. The job took around 2 hours 20 minutes to complete, but when I copy the same file via the AWS CLI or Informatica, it completes within an hour.
Does anyone have an idea how to reduce the copy time to S3 using the Talend Data Integration tool?