How do you treat a sequential file stage that cannot find its file the same way as an empty table? - datastage

I have several DataStage jobs that will run, but the source file MIGHT not be there. If it isn't, I want the DataStage job to complete the same way it would if I were using a source DB Connector and the source table had zero rows.
How can this be done?
Thanks

The SequentialFile stage in DataStage expects the file to exist - even if it might be zero bytes in size.
One option would be to place a WaitForFile stage in front of your job to avoid running the job when no file exists. This saves the effort of loading lookup data etc., but it is not 100% the behavior of an empty table. You could also touch an empty file in that case to get the behavior you want, but I doubt this is a good design.
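If you do go the touch-a-file route, the whole idea is a pre-job check that creates a zero-byte placeholder when the real source is missing. A minimal sketch of that check (the path is hypothetical, and a before-job subroutine or shell script would do the same job):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Pre-job helper: if the expected source file is missing, create an empty
// placeholder so the SequentialFile stage reads zero rows instead of failing.
public class TouchSourceFile {

    public static void main(String[] args) throws IOException {
        Path source = Paths.get("/data/input/source_file.txt"); // hypothetical path
        if (Files.notExists(source)) {
            Files.createFile(source); // zero-byte file, i.e. "empty table" behavior
            System.out.println("Created empty placeholder: " + source);
        }
    }
}
```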

Related

How is the step execution context loaded from RDBMS batch metadata tables on restarting?

I understand that the step execution context is loaded into an in-memory map and then into the batch_step_execution_context table's short_context column, and that when the job is restarted, the same in-memory execution context map is loaded automatically for the restarted job. But when the restart is triggered after the in-memory map has been wiped out (e.g. an application restart), I learned it is loaded from the batch metadata RDBMS tables (precisely, the batch_step_execution_context table). My question is: since the column length is 2500, Spring Batch truncates the data and appends an ellipsis to the content, so what if the data I put in is more than 2500 characters? How is it able to load the original data (not the truncated version with the ellipsis)?
PS: I use this step execution context to pass my intended partition's identifiers to my readers, as shown in most of the examples.
Please help me understand how this is taken care of in the framework.
Thanks.
The execution context is deserialized from the full version of the context first, see here. Restart metadata for your partitioned step should be saved and loaded automatically by default if you use a persistent job repository and restart the same job instance.
After careful observation, I found that Spring Batch is intelligent enough to put the full content into the serialized_context column of the batch_step_execution_context table when the content is longer than 2500 characters. If the in-memory maps are cleared, they are restored from either the short_context or the serialized_context column.
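For context, the partition identifiers usually land in those columns through a Partitioner. A minimal sketch of that pattern (the key name and partition values are made up): each ExecutionContext written here is what Spring Batch persists to short_context, or to serialized_context once it grows past 2500 characters, and reloads on restart.

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Illustrative partitioner: the identifiers stored in each ExecutionContext end up
// in BATCH_STEP_EXECUTION_CONTEXT and are read back automatically on restart.
public class RegionPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        String[] regions = {"NORTH", "SOUTH", "EAST", "WEST"}; // hypothetical identifiers
        for (int i = 0; i < regions.length; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putString("region", regions[i]); // read in the worker step, e.g. #{stepExecutionContext['region']}
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}
```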

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue where I have set up a pipeline that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity, which copies the data in parallel and saves it as gzipped files. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options, as most of the tables are small, and I wanted to keep it flexible.
Pipeline
COPY activity within ForEach activity
99% of the tables ran without issues and were copied as .gz files into blob storage, but two tables in particular run for a long time (approximately 4 to 6 hours) without any of the data being written to the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that it had the run time described above, but still no data was being written. This affects only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at "Teradata Viewpoint" (the admin tool), checked the query monitor, and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
Looking at the issue you mention, it appears the data size of the table is more than a single blob can hold (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Link
Just in case someone else comes across this, the way I solved it was to create a new dataset called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but simply to accept an "@item().TableName" value.
This dataset contains two main values. The 1st is @dataset().TeradataName:
Dataset property
I only came up with that after a little bit of digging on Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the two main activities remain the same: I have a Lookup and then a ForEach activity (where the ForEach gets the item values):
However, in the Copy activity inside the ForEach I updated the source. Instead of getting "item().Name", I am passing through @item().TableName:
This then enabled me to select the "Table" option, and because I am using Table instead of Query I can use the "Hash" partition option. I left the partition column blank because, according to the Microsoft documentation, it will automatically find the Primary Key to use for this.
The only issue I ran into was that if a table does not have a Primary Key, that item will fail and will need to be handled either by a different process or manually outside of this job.
With this change, the files that previously just hung there and did not copy now copy successfully into our blob storage account.
Hope this helps someone else who wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.

Working with PowerShell and file based DB operations

I have a scenario where a lot of files are listed in a CSV file that I need to do operations on. The script needs to be able to handle being stopped or failing and then continue from where it left off. In a database scenario this would be fairly simple: I would have an "updated" column and update it when the operation for that line has completed. I have looked at whether I could somehow update the CSV on the fly, but I don't think that is possible. I could start keeping multiple files, but that is not very elegant. Can anyone recommend some kind of simple file-based DB-like framework, where from PowerShell I could create a new database file (maybe JSON), read from it, and update it on the fly?
If your problem is really so complex that you actually need something of a local database solution, then consider going with SQLite, which was built for such scenarios.
In your case, since you process the CSV row by row, I assume storing the info for the current row only will be enough (line number, status, etc.).
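If SQLite does end up being the fit, the whole design can be one progress table keyed by line number. Here is a sketch of that idea, written in Java with the sqlite-jdbc driver purely to show the table and the queries (the same statements work from PowerShell through any SQLite module; the file and column names are invented):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Restart tracking for row-by-row CSV processing: one row per CSV line,
// marked DONE as each operation completes, so a rerun can skip finished work.
public class ProgressTracker {

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:progress.db")) {
            try (Statement s = db.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS progress ("
                        + " line_no INTEGER PRIMARY KEY,"
                        + " file_name TEXT,"
                        + " status TEXT DEFAULT 'PENDING')");
            }

            // Find where to resume (assumes lines are processed in order).
            int resumeFrom = 0;
            try (Statement s = db.createStatement();
                 ResultSet rs = s.executeQuery(
                         "SELECT COALESCE(MAX(line_no), 0) FROM progress WHERE status = 'DONE'")) {
                if (rs.next()) {
                    resumeFrom = rs.getInt(1);
                }
            }
            System.out.println("Resuming after line " + resumeFrom);

            // After processing each CSV line, record it so a crash can resume here.
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT OR REPLACE INTO progress (line_no, file_name, status) VALUES (?, ?, 'DONE')")) {
                ps.setInt(1, resumeFrom + 1);
                ps.setString(2, "example.txt"); // hypothetical file from the CSV row
                ps.executeUpdate();
            }
        }
    }
}
```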

Import data from few csv files to database in Spring Batch

What is the best way to import data from a few CSV files in Spring Batch? I mean, one CSV file corresponds to one table in the database.
I created one batch configuration class for each table, and every table has its own job and step.
Is there any way to do this more elegantly?
There are a variety of ways you could tackle the problem, but the simplest job would look something like:
A FlatFileItemReader with a DelimitedLineTokenizer and BeanWrapperFieldSetMapper to read the file
An ItemProcessor if you need to do any additional validation/filtering/transformation
A JdbcBatchItemWriter to insert/update the target table
Here's an example that includes more information around specific dependencies, config, etc. The example uses context file config rather than annotation-based, but it should be sufficient to show you the way.
A more complex solution might be a single job with a partitioned step that scans the input folder for files and, leveraging reference table/schema information, creates a reader/writer step for each file that it finds.
You also may want to consider what to do with the files once you're done... Delete them? Compress them?
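For what it's worth, here is a rough annotation-based sketch of the reader and writer for one such file (file, column, and table names are invented; wire the two beans into a chunk-oriented step as usual):

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class PersonCsvConfig {

    // Reader: DelimitedLineTokenizer splits each CSV line,
    // BeanWrapperFieldSetMapper maps the fields onto a Person bean.
    @Bean
    public FlatFileItemReader<Person> personReader() {
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames("firstName", "lastName");                 // hypothetical CSV columns

        BeanWrapperFieldSetMapper<Person> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Person.class);

        DefaultLineMapper<Person> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);

        FlatFileItemReader<Person> reader = new FlatFileItemReader<>();
        reader.setName("personReader");
        reader.setResource(new FileSystemResource("input/person.csv")); // hypothetical file
        reader.setLineMapper(lineMapper);
        return reader;
    }

    // Writer: batches the INSERTs for each chunk into the matching table.
    @Bean
    public JdbcBatchItemWriter<Person> personWriter(DataSource dataSource) {
        JdbcBatchItemWriter<Person> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        writer.setSql("INSERT INTO person (first_name, last_name) VALUES (:firstName, :lastName)");
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
        return writer;
    }

    // Minimal target bean for the sketch.
    public static class Person {
        private String firstName;
        private String lastName;
        public String getFirstName() { return firstName; }
        public void setFirstName(String firstName) { this.firstName = firstName; }
        public String getLastName() { return lastName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
    }
}
```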

Extract Active Directory into SQL database using VBScript

I have written a VBScript to extract data from Active Directory into a recordset. I'm now wondering what the most efficient way is to transfer the data into a SQL database.
I'm torn between:
Writing it to an Excel file and then firing an SSIS package to import it, or...
Within the VBScript, iterating through the dataset in memory and submitting 3000+ INSERT commands to the SQL database
Would the latter option result in 3000+ round trips communicating with the database and therefore be the slower of the two options?
Sending an INSERT row by row is always the slowest option. This is what is known as Row By Agonizing Row, or RBAR. You should avoid it if possible and take advantage of set-based operations.
Your other option, writing to an intermediate file, is a good one; I agree with @Remou in the comments that you should probably pick CSV rather than Excel if you go this route.
I would propose a third option. You already have the design in VB contained in your VBScript, so you should be able to convert it easily to a Script Component in SSIS. Create an SSIS package, add a Data Flow task, add a Script Component (as a data source {example here}) to the flow, write your fields out to the output buffer, and then add a SQL destination, saving yourself the step of writing to an intermediate file. This is also more secure, as your AD data is never on disk in plaintext during the process.
You don't mention how often this will run or if you have to run it within a certain time window, so it isn't clear that performance is even an issue here. "Slow" doesn't mean anything by itself: a process that runs for 30 minutes can be perfectly acceptable if the time window is one hour.
Just write the simplest, most maintainable code you can to get the job done and go from there. If it runs in an acceptable amount of time then you're done. If it doesn't, then at least you have a clean, functioning solution that you can profile and optimize.
If you already have it in a dataset, and if it's SQL Server 2008+, create a user-defined table type and send the whole dataset in as an atomic unit.
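To illustrate the shape of that approach (shown here with the Microsoft JDBC driver rather than VBScript, and with invented type, table, and column names), the whole set travels to the server in a single round trip as a table-valued parameter:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Types;

import com.microsoft.sqlserver.jdbc.SQLServerDataTable;
import com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement;

// Assumes a table type and target table created up front, e.g.:
//   CREATE TYPE dbo.AdUserType AS TABLE (SamAccountName NVARCHAR(256), DisplayName NVARCHAR(256));
//   CREATE TABLE dbo.AdUser (SamAccountName NVARCHAR(256), DisplayName NVARCHAR(256));
public class TvpInsertSketch {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=Directory;integratedSecurity=true"; // hypothetical
        try (Connection conn = DriverManager.getConnection(url)) {
            // Build the in-memory "table" - one addRow call per AD record.
            SQLServerDataTable users = new SQLServerDataTable();
            users.addColumnMetadata("SamAccountName", Types.NVARCHAR);
            users.addColumnMetadata("DisplayName", Types.NVARCHAR);
            users.addRow("jdoe", "Jane Doe");     // illustrative rows
            users.addRow("bsmith", "Bob Smith");

            // Send all rows as one table-valued parameter - a single round trip.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO dbo.AdUser (SamAccountName, DisplayName) SELECT * FROM ?")) {
                ps.unwrap(SQLServerPreparedStatement.class)
                  .setStructured(1, "dbo.AdUserType", users);
                ps.executeUpdate();
            }
        }
    }
}
```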
And if you go the SSIS route, I have a post covering Active Directory as an SSIS Data Source.