How to start a DataStage Sequence job when a file comes to the server - scheduler

I'm looking to build a process that triggers a DataStage Sequencer job when any file arrives in the server's landing zone. CA7 is the scheduler, and the file naming conventions come in many different flavors, including the file extensions. Some file names also contain a date/timestamp. I'm new to this activity, so please bear with me if I ask silly follow-on questions.
Thanks in advance for any help.

Check out the Wait For File stage in the Sequence.
It has options to wait for a file to appear (or disappear) and a time limit before it times out. So you still have to start the job at a certain time, but the processing will only start once the file appears.
The stage expects a specific filename though - but you could do an ls or similar command to get the filename and send that as a parameter to your job, for example along the lines of the sketch below.
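A minimal sketch of that idea, assuming the landing zone path, project name, job name and parameter name are all placeholders (and that dsjob is available; check the dsjob options for your version):

import glob
import os
import subprocess

LANDING_ZONE = "/data/landing"  # hypothetical landing zone path

# Pick the most recently modified file, whatever its name or extension.
files = glob.glob(os.path.join(LANDING_ZONE, "*"))
if files:
    newest = max(files, key=os.path.getmtime)
    # Hand the filename to the Sequence as a job parameter (illustrative call).
    subprocess.run(
        ["dsjob", "-run", "-param", "SourceFile=" + newest, "MyProject", "MySequence"],
        check=True,
    )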

If you only need to process a few infrequent files just in time, you can use the Wait For File stage and schedule the sequence in advance. If it's okay to process the files in bigger intervals, you can just schedule a job to run at a fixed interval, like once a day, every hour or every minute, and then process all files in the folder.
You mentioned that you have to deal with many different file names and extensions. I assume they're also of different structure. Beware of trying to build jobs that can handle anything and everything.
Depending on the frequency, type and amount of files you expect to process, there are several ways to get the best performance: either loop over a few files in a sequence, file by file, and do complex stuff on each file, or read many files at once in a parallel job. Looping over hundreds of files in a sequence with several jobs inside the loop can end up in very long coffee breaks.
If the task is to just move the files, maybe a shell script (-> command stage) is your friend.
But if you have tons of files (no matter what name) of the same structure (like CSV files) and you need the content in a database, then you can read them all at once in a parallel job using the Sequential File stage and save them directly into a DataSet. That stage lets you select the files by pattern (meaning that * is your friend in this case) and it can output the filename into a new field. So you'd end up with a DataSet containing your data and the corresponding filenames.
Even if the files do not have the same structure, you can output the whole file content into one LOB column and still do all the reading in one job.
If you name the DataSet dynamically, you can schedule another independent job to work through the queue of DataSets in parallel for further processing.
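This isn't DataStage, but the same pattern-plus-filename idea can be sketched in PySpark just to make the concept concrete (paths are made up; the Sequential File stage does the equivalent with its file-pattern read method and filename column):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Read every file matching the pattern in one go and keep the source
# filename as an extra column, analogous to the DataSet described above.
df = (
    spark.read
    .option("header", "true")
    .csv("/landing/zone/*.csv")                 # * selects the files by pattern
    .withColumn("filename", input_file_name())  # filename travels with the data
)
df.write.mode("overwrite").parquet("/staging/all_files")  # rough stand-in for a DataSet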

Related

Retention scripts to container data

I'm trying to do something to apply data retention policies to my data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
So basically a new set of files is added every day, using the copy task from Azure Data Factory. In reality there are more than 3 files per day.
I want to start applying different retention policies to different files. For example, for the emails.parquet files, I want to delete the entire file after it is 30 days old. For the customer files, I want to anonymise them by replacing the contents of certain columns with some placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iterate through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I plan to use PySpark notebooks for this, but I don't know how to iterate through the folder structures in a container.
As for making date comparisons to decide whether my data should be retained, I can either use the folder structure for the dates (but I don't know how to do that), or there's a "RowStartDate" column in every parquet file that I can use instead.
Can anybody point me in the right direction for how to achieve this, either via the route I'm alluding to above (a PySpark script to iterate through the container folders, load the data into a data frame, transform it, then overwrite the original file) or by any other means?
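For illustration only, a minimal PySpark sketch of the folder-date idea described above - the storage path is a made-up example, the deletion step is platform specific and left as a placeholder, and this is a sketch rather than a tested solution:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

BASE = "abfss://mycontainer@myaccount.dfs.core.windows.net"  # hypothetical

# Read all emails files at once and keep the source path so the folder date
# (yyyy/MM/dd) can be extracted from it.
emails = (
    spark.read.parquet(BASE + "/*/*/*/emails.parquet")
    .withColumn("source_path", F.input_file_name())
    .withColumn(
        "folder_date",
        F.to_date(F.regexp_extract("source_path", r"(\d{4}/\d{2}/\d{2})", 1), "yyyy/MM/dd"),
    )
)

cutoff = datetime.utcnow().date() - timedelta(days=30)
old_paths = [
    row.source_path
    for row in emails.filter(F.col("folder_date") < F.lit(cutoff))
                     .select("source_path").distinct().collect()
]
# Actually deleting or overwriting those files depends on your platform
# (e.g. dbutils.fs.rm on Databricks, mssparkutils.fs.rm on Synapse), so that
# step is intentionally left out here.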

Is Apache Beam the right tool for feature pre-processing?

So this is a bit of a weird question, as it isn't about how to use the tool but more about why to use it.
I'm deploying a model and thinking of using Apache Beam to run the feature-processing tasks using its Python API. The documentation is pretty big and complex, and although I went through most of it and even built a small working pipeline, it is still not clear to me whether this is the right tool for the job.
An example of what I need is the following:
Input data structure:
ID | Timestamp | category
output needed:
category | category count for last 30 minutes (feature example)
This process needs to run every 5 minutes and update the counts.
===> What I fail to understand is whether Apache Beam can run this pipeline every 5 minutes, read whatever new input data was generated, and update the counts from the previous time it ran. And if so, can someone point me in the right direction?
Thank you!
When you run a Beam pipeline manually, it's expected to be started only once. It can be either a bounded (batch) or an unbounded (streaming) pipeline. In the first case, it will stop after all of your bounded data has been processed; in the second case, it will run continuously and expect new data to arrive (until it is stopped manually).
Usually, the type of pipeline depends on the data source that you have (Beam IO connectors). For example, if you read from files, then by default it's assumed to be a bounded source (a limited number of files), but it can be an unbounded source as well if you expect more new files to arrive and want to read them in the same pipeline.
Also, you can run your batch pipeline periodically with automated tools, like Apache Airflow (or just a unix crontab). So, it all depends on your needs and the type of data source. I could probably give more specific advice if you could share more details of your data pipeline - the type of your data source and environment, an example of input and output results, how often your input data can be updated, and so on.
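To make the streaming option concrete, here is a minimal Beam Python sketch of "category counts over the last 30 minutes, refreshed every 5 minutes" using sliding windows. The Pub/Sub topic, the message format and the text sink are assumptions, not part of the question; a production streaming sink would be runner specific:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import SlidingWindows

def run():
    options = PipelineOptions(streaming=True)  # unbounded/streaming mode
    with beam.Pipeline(options=options) as p:
        (
            p
            # Placeholder unbounded source; it yields "ID|Timestamp|category" bytes.
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
            | "ToKeyValue" >> beam.Map(lambda msg: (msg.decode().split("|")[2], 1))
            # 30-minute windows that advance every 5 minutes.
            | "Window" >> beam.WindowInto(SlidingWindows(size=30 * 60, period=5 * 60))
            | "CountPerCategory" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: "%s,%d" % kv)
            | "Write" >> beam.io.WriteToText("category_counts")  # placeholder sink
        )

if __name__ == "__main__":
    run()

The alternative mentioned above - a plain batch pipeline over the newest files, rerun every 5 minutes by Airflow or cron - avoids the streaming machinery entirely and may be simpler if a few minutes of latency is acceptable.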

How to process multiple large files in Talend, one after another?

I want to process multiple files using Talend, one after another. The files are large, and if another file arrives in the directory while one file is being processed, that file has to be processed as well.
Is there any possible way to do this? Could you please suggest an approach?
You can use the tFileList component, which will iterate over all the files in a given directory.
You can check the component's functionality in the Talend documentation.
The simple concept would be (see the sketch after these steps):
When there is a file in a directory, say Folder1, move that file to another location, say Folder2.
After processing the file in Folder2, check Folder1 again to see whether any new files have arrived.
If so, move that file to Folder2 as well and process it.
If there is no new file, end the job.
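Outside Talend, the same control flow looks roughly like this (a sketch only; the folder names and process_file are placeholders):

import os
import shutil

FOLDER1 = "/data/incoming"    # hypothetical landing folder
FOLDER2 = "/data/processing"  # hypothetical work folder

def process_file(path):
    ...  # your actual per-file processing goes here

while True:
    pending = sorted(os.listdir(FOLDER1))
    if not pending:
        break                      # no new file: end the job
    for name in pending:
        dst = os.path.join(FOLDER2, name)
        shutil.move(os.path.join(FOLDER1, name), dst)  # Folder1 -> Folder2
        process_file(dst)          # process, then the outer loop re-checks Folder1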
A great way to do this in Talend is to set up a file-watcher job, which is simple to do. Talend provides the tWaitForFile component, which will watch a directory for files. You can configure the maximum number of iterations in which it will look for files and the time between polls/scans. Since you said you are loading large files, leave enough time between scans to avoid DB concurrency issues.
In my example below I am watching a directory for new files, scanning every 60 seconds over an 8-hour period. You would want to schedule the job in either the TAC or whatever scheduling tool you use. In my example I simply join to a tJavaRow and display the information about the file that was found.
You can see the output from my tJavaRow here, which shows the file info.

Spark: save a simple string to a text file

I have a Spark job that needs to store the last time it ran to a text file.
This has to work both on HDFS and on the local fs (for testing).
However, this is not nearly as straightforward as it seems.
I have been trying to delete the directory first and keep getting "can't delete" error messages.
I have also tried storing a simple string value in a dataframe, writing it to parquet and reading it back.
This is all so convoluted that it made me take a step back.
What's the best way to just store a string (the timestamp of the last execution, in my case) in a file, overwriting it each time?
EDIT:
The nasty way I do it now is as follows:
sqlc.read.parquet(lastExecution).map(t => "" + t(0)).collect()(0)
and
sc.parallelize(List(lastExecution)).repartition(1).toDF().write.mode(SaveMode.Overwrite).save(tsDir)
This sounds like storing simple application/execution metadata. As such, saving a text file shouldn't need to be done by "Spark" (i.e., it shouldn't be done in distributed Spark jobs, by workers).
The ideal place to put it is in your driver code, typically after constructing your RDDs. That being said, you wouldn't be using the Spark API to do this; you'd rather do something as trivial as using a writer or a file output stream. The only catch here is how you'll read it back. Assuming that your driver program runs on the same computer, there shouldn't be a problem.
If this value is to be read by workers in future jobs (which is possibly why you want it in HDFS), and you don't want to use the Hadoop API directly, then you will have to ensure that you have only one partition so that you don't end up with multiple files containing the trivial value. This, however, cannot be said for local storage (the file gets stored on the machine where the worker executing the task is running); managing that would simply be going overboard.
The best option, in my view, is to use the driver program and create the file on the machine running the driver (assuming it is the same one that will be used next time), or, even better, to put the value in a database. If the value is needed in jobs, the driver can simply pass it through.
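A minimal sketch of the driver-side approach, assuming a local marker file (for HDFS you would go through the Hadoop FileSystem API instead of open(); the path is a placeholder):

from datetime import datetime, timezone

MARKER_PATH = "/tmp/last_execution.txt"  # hypothetical location

def save_last_execution(path=MARKER_PATH):
    # Overwrite the marker file with the current UTC timestamp.
    with open(path, "w") as f:
        f.write(datetime.now(timezone.utc).isoformat())

def load_last_execution(path=MARKER_PATH):
    # Return the stored timestamp string, or None on the very first run.
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None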

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partially through a large job (~36%) when Hadoop hit some files with issues, and for some reason that seemed to crash the entire job. If the job had completed successfully, the output on standard out would have been a line for each file as it was completed, along with some stats from my program that processes each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there any way to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures System.out data here:
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs don't usually use standard out for debugging; rather, the convention is to use counters.
For each of your documents, you can summarize document characteristics: length, number of normal ASCII chars, number of new lines, and so on.
Then, you can have two counters: a counter for "good" files, and a counter for "bad" files.
It will probably be pretty easy to see that the bad files have something in common [no data, too much data, or maybe some non-printable chars].
Finally, you obviously will have to look at the results after the job is done running.
The problem, of course, with System.out statements is that the jobs running on various machines can't integrate their data. Counters get around this problem - they are easily aggregated into a clear and accurate picture of the overall job.
Of course, the problem with counters is that the information content is entirely numeric, but, with a little creativity, you can easily find ways to describe the data quantitatively in a meaningful way.
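For concreteness, here is a rough sketch of a streaming mapper that maintains the good/bad counters described above; file names arrive one per line on stdin, as in the question, and process_file is a placeholder for your existing per-file logic:

#!/usr/bin/env python
import sys

def count(group, name, amount=1):
    # Hadoop Streaming picks these lines up from stderr and turns them
    # into job counters visible in the job client output and web UI.
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, name, amount))

def process_file(filename):
    ...  # placeholder for your real per-file processing

for line in sys.stdin:
    filename = line.strip()
    if not filename:
        continue
    try:
        process_file(filename)
        count("files", "good")
        print("%s\tOK" % filename)       # normal map output
    except Exception:
        count("files", "bad")
        print("%s\tFAILED" % filename)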
WORST CASE SCENARIO: you really need text debugging, and you don't want it in a temp file.
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs: it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text - I find, quite often, that much of my debugging print statements, when cut down to their raw information content, are basically just counters...