How can I identify processed files in a Dataflow job - google-cloud-storage

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is the sample TextIO read that I am using.
PCollection<String> filePCollection = pipeline.apply("Read files from Cloud Storage", TextIO.read().from("gs://bucketName/TrafficData*.txt"));

To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command-line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files from previous runs, you could do either of the following:
Define a streaming job and use TextIO's watchForNewFiles functionality. You would have to leave the job running for as long as you want to keep processing files.
Find a way to keep track of the files that have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each one with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again, you can use this list to skip the files you don't need to process again (see the sketch below).
Let me know in the comments if you want to work out a solution around one of these two options.
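For the second option, a minimal sketch of the pipeline side might look like the following. The two helpers (loadProcessedFiles and listMatchingFiles) are hypothetical placeholders for however you choose to persist the record of processed files and to list the objects currently matching the wildcard; they are not part of TextIO or the GCS API, and the file names inside them are made up for illustration.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ReadOnlyNewFiles {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Files recorded as done by previous runs, and files currently matching
    // the wildcard. Both helpers below are placeholders.
    Set<String> alreadyProcessed = loadProcessedFiles();
    List<String> candidates = listMatchingFiles();

    // Keep only the files that no previous run has processed.
    List<String> toProcess = candidates.stream()
        .filter(f -> !alreadyProcessed.contains(f))
        .collect(Collectors.toList());

    PCollection<String> lines = pipeline
        .apply("File names", Create.of(toProcess).withCoder(StringUtf8Coder.of()))
        .apply("Read files", TextIO.readAll());

    // ... apply the rest of the transforms to `lines` as before, and once the
    // job succeeds, append `toProcess` to the stored record of processed files.

    pipeline.run().waitUntilFinish();
  }

  // Placeholder: load the record from wherever you persist it (GCS, a database, ...).
  private static Set<String> loadProcessedFiles() {
    return new HashSet<>(Arrays.asList("gs://bucketName/TrafficData_old.txt"));
  }

  // Placeholder: list the objects matching gs://bucketName/TrafficData*.txt,
  // e.g. with gsutil ls or the GCS client library.
  private static List<String> listMatchingFiles() {
    return Arrays.asList("gs://bucketName/TrafficData_old.txt",
        "gs://bucketName/TrafficData_new.txt");
  }
}
The streaming route in the first option replaces this bookkeeping entirely: you keep the pipeline running and let TextIO.read().from(...).watchForNewFiles(...) pick up new files as they arrive.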

Related

Is there a way to deploy scheduled queries to GCP directly through a GitHub Action, with a configurable schedule?

Currently using the GCP BigQuery UI for scheduled queries; everything is done manually.
Wondering if there's a way to automatically deploy to GCP, through GitHub Actions, using a config JSON that contains the scheduled query's parameters and scheduled run times?
So far, this is one option I've found that makes it more "automated":
- store the query in a file on Cloud Storage. When the Cloud Function is invoked, you read the file content and run a BigQuery job on it.
- to update the query, you update the file content
- con: read the file from storage, then call BigQuery: two API calls and a query file to manage
Currently using DBT in other repos to automate and make this process quicker: https://docs.getdbt.com/docs/introduction
Would prefer the GitHub Actions version though; I just haven't found good documentation for it yet :)
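For what it's worth, here is a minimal sketch of the Cloud Storage + Cloud Function option described above, using the standard Java client libraries for GCS and BigQuery. It is shown as a plain main method rather than a full Cloud Function, and the bucket, object name, and class name are made up for illustration.
import java.nio.charset.StandardCharsets;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class RunStoredQuery {
  public static void main(String[] args) throws InterruptedException {
    // Hypothetical bucket and object names; replace with your own.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get("my-queries-bucket", "scheduled/daily_report.sql");
    String sql = new String(blob.getContent(), StandardCharsets.UTF_8);

    // Run the query; the SQL itself decides where the results go
    // (e.g. an INSERT or CREATE TABLE AS SELECT statement).
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql).build();
    bigquery.query(config);
  }
}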

How to avoid uploading a file that is being partially written - Google Storage rsync

I'm using Google Cloud Storage with the rsync option.
I created a cron job that syncs files every minute.
But there's a problem:
when a file is still being partially written as the cron job runs, it syncs the incomplete part of the file even though the write isn't done.
Is there a way to solve this problem?
The gsutil rsync command doesn't have any way to check whether a file is still being written. You will need to coordinate your writing and rsync'ing jobs so that they operate on disjoint parts of the file tree. For example, you could arrange for your writing job to write to directory A while your rsync job rsyncs from directory B, and then switch pointers so your writing job writes to directory B while your rsync job rsyncs from directory A. Another option would be to set up a staging area into which you copy all the files that have finished being written before running your rsync job. If you put it on the same file system as where they were written, you could use hard links so the link operation works quickly (without byte copying).
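As a rough illustration of the staging-area idea, here is a minimal sketch. It assumes that a file untouched for a minute can be treated as finished, which is only a heuristic (gsutil has no such notion), and the directory paths are placeholders.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class StageFinishedFiles {
  public static void main(String[] args) throws IOException {
    Path writeDir = Paths.get("/data/incoming");  // where the writer works (placeholder)
    Path stageDir = Paths.get("/data/staging");   // what gsutil rsync reads (placeholder)
    Files.createDirectories(stageDir);

    // Heuristic: treat files that haven't been modified for a minute as finished.
    Instant cutoff = Instant.now().minus(1, ChronoUnit.MINUTES);
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(writeDir)) {
      for (Path src : entries) {
        if (Files.isRegularFile(src)
            && Files.getLastModifiedTime(src).toInstant().isBefore(cutoff)) {
          Path dst = stageDir.resolve(src.getFileName());
          if (!Files.exists(dst)) {
            Files.createLink(dst, src);  // hard link: fast, no byte copying
          }
        }
      }
    }
  }
}
The cron job would then run gsutil rsync against the staging directory rather than the directory being written into.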

How can one view the partial output of a job in PBS that has exceeded its walltime?

I'm new to using cluster computers to run experiments. I have a script running in python that should be regularly printing out information, but I find that when my job exceeds its walltime, I get no output at all except the notification that the job has been killed.
I've tried regularly flushing the buffer to no avail, and was wondering if there was something more basic that I'm missing.
Thanks!
I'm guessing you are having issues with a job cleanup script in the epilogue. You may want to ask the admins about it. You may also want to try a different approach.
If you were to redirect your output to a file in a shared filesystem you should be able to avoid data loss. This assumes you have a shared filesystem to work with and you aren't required to stage in and stage out all of your data.
If you reuse your submission script you can avoid clobbering the output of other jobs by including the $PBS_JOBID environment variable in the output filename.
python -u script.py > "$PBS_JOBID.out" 2>&1
I'm on mobile, so check the qsub man page for a list of job environment variables.

Job is not failing in Pentaho when the file I read through VFS is not available in S3

I have a job in Pentaho which reads data from S3 through the virtual file system (VFS). The data extracted from the source does not arrive regularly; it comes on an ad hoc basis. Ideally, I have to write a loop condition from today's date back to the date that matches a file in S3.
When the file is not available in S3, I added a failure condition to handle it through the loop condition, but the job is not failing. Could anyone suggest a way to do this?

Intermittent file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run; partway through, those files would disappear and be replaced by the new ones. I have changed the script that runs the jobs to clear the work area before starting them, but I still have problems where it sometimes starts reading before all the parts have been fully written.
I could change the code to simply store the intermediate files on the cluster, though I like having them available outside the cluster for diagnosing other problems. I could also just write to both locations, with the cluster copy for working and the GCS copy for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, with the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even though my immediate problem is (I believe) fixed, I would still like to know what's happening and why GCS seems out of sync for a while.
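For context, the dual-write workaround described above might look roughly like this, assuming the jobs are Spark jobs and the spark-avro module is on the classpath; the paths and application name are hypothetical.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DualWriteIntermediate {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dual-write-intermediate")
        .getOrCreate();

    // Hypothetical input path; Dataproc's GCS connector handles gs:// URIs.
    Dataset<Row> intermediate = spark.read().format("avro")
        .load("gs://my-bucket/input/*.avro");

    // Cache so the two writes below don't recompute the whole job.
    intermediate.cache();

    // Primary copy on the cluster's HDFS; the next job in the chain reads this.
    intermediate.write().format("avro").save("hdfs:///work/intermediate");

    // Secondary copy in GCS, kept only for diagnostics.
    intermediate.write().format("avro").save("gs://my-bucket/diagnostics/intermediate");

    spark.stop();
  }
}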