Assigning an SQL result to a Job Parameter in DataStage

I just started using Datastage (version 11.5) and I am trying to assign the value of a simple SQL query (select max(date_col) from Table) into a Job Parameter so that I can use it as a part of a file produced from the job.
Can anyone point out a simple approach to this, since I am rather lost on how to include SQL queries in parameter values.
Thanks in advance.

There are a few options for doing this. The one I recommend is:
Write the result of your query into a sequential file.
Use an Execute Command stage (in a Sequence) to read the file.
Use its output in one of the following Job Activity stages as a job parameter.
An alternative is to use Parameter Sets with value files. These value files are real files in the OS and their structure is simple, so they can be written by the DataStage job itself (see the sketch below). The drawback is that values supplied this way cannot be used in Conditions within the Sequence.
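For illustration, here is a minimal sketch of the value-file alternative, written in Python only as a stand-in for whatever writes the file (normally the DataStage job itself, e.g. via a Sequential File stage). The parameter set name PS_LOAD, the parameter MAX_DATE, the value file name and the project path are all hypothetical; check the ParameterSets directory of your own project for the exact location and format.

from pathlib import Path

max_date = "2024-01-31"   # result of: select max(date_col) from Table

# Value files typically live under <project dir>/ParameterSets/<set name>/<value file>
# and are plain text with one NAME=VALUE pair per line (paths below are assumptions).
value_file = Path("/opt/IBM/InformationServer/Server/Projects/MyProject"
                  "/ParameterSets/PS_LOAD/latest_run")
value_file.parent.mkdir(parents=True, exist_ok=True)
value_file.write_text(f"MAX_DATE={max_date}\n")

A Sequence or job that references the parameter set with this value file would then pick up MAX_DATE at run time.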

Upserting and maintaining a Postgres table using Apache Airflow

I'm working on an ETL process that requires me to pull data from one Postgres table and update data in another Postgres table in a separate environment (same column names). Currently, I am running the Python job on a Windows EC2 instance, and I am using the pangres upsert library to update existing rows and insert new rows.
However, my organization wants me to move the Python ETL script to Managed Apache Airflow on AWS.
I have been learning DAGs, and most of the tutorials and articles are about querying data from a Postgres table using hooks or operators.
However, I am looking to understand how to update existing table A incrementally (i.e., upsert) using new records from table B (and ignore/overwrite existing matching rows).
Any chunk of code (DAG) that explains how to perform this simple task would be extremely helpful.
In Apache Airflow, operations are done using operators. You can package any Python code into an operator, but your best bet is always to use a pre-existing open source operator if one already exists. There is an operator for Postgres (https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html).
It is hard to provide a complete example for your exact situation, but it sounds like the best approach for you here is to take any SQL present in your Python ETL script and use it with the Postgres operator. The documentation I linked should be a good example.
They demonstrate inserting data, reading data, and even creating a table as a prerequisite step. Just as lines in a Python script execute one at a time, operators in a DAG execute in a particular order, depending on how they're wired up, as in their example:
create_pet_table >> populate_pet_table >> get_all_pets >> get_birth_date
In their example, populating the pet table won't happen until the create pet table step succeeds, etc.
Since your use case is about copying new data from one table to another, a few tips I can give you:
Use a scheduled DAG to copy the data over in batches. Airflow isn't meant to be used as a streaming system for many small pieces of data.
Use the "logical date" of the DAG run (https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html) in your DAG to know which interval of data that run should process. This works well for your requirement that only new data should be copied over during each run. It also gives you repeatable runs: if you need to fix the code, you can re-run each run (one batch at a time) after pushing your fix. A minimal sketch follows below.

Extract the SQL query that was run in the Oracle connector from a DataStage job in order to add it to a file/table

I have a parallel DataStage job that uses a particular SQL query with some parameters. Once the job is running, I can see in the Director log the exact SQL query that was triggered on the database.
My question is: is there any way to get this SQL query, with all parameters substituted, inside the Designer job, so I can write it into a column of a table (a metadata column that will contain the exact query that was used for that particular run)? In my job I could have a Transformer that puts the query from the Oracle connector as the derivation for a column in the target table.
Thank you!
You can retrieve the query from the log, either with the log-reading functions in a server routine or with the log-reading options of the dsjob command (see the sketch below). Design the writing part however you want.
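As a rough illustration of the dsjob route, here is a Python sketch that shells out to the CLI, assuming dsjob is on the PATH of the engine tier and that the Oracle connector writes the executed SQL into the job log as an event; the project name, job name and event id are hypothetical.

import subprocess

project, job = "MyProject", "my_parallel_job"

# One line per log event of the most recent run, including each event's id.
print(subprocess.run(["dsjob", "-logsum", project, job],
                     capture_output=True, text=True, check=True).stdout)

# Full text of a single event; replace 42 with the id of the event that
# contains the generated SQL.
detail = subprocess.run(["dsjob", "-logdetail", project, job, "42"],
                        capture_output=True, text=True, check=True).stdout
print(detail)

From there you could parse out the statement and feed it to whatever writes your metadata column.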

How do you treat a Sequential File stage that cannot find its file the same way as an empty table?

I have several DataStage jobs that will run but might not have the source file present. If the file is missing, I want the DataStage job to complete as if I were using a source DB connector and the source table had zero rows.
How can this be done?
Thanks
The Sequential File stage in DataStage expects the file to exist, even if it is zero bytes in size.
One option is to place a Wait For File activity (in a Sequence) in front of your job so that the job is not run if no file exists. This saves the effort of loading lookup data etc., but it is not exactly the behavior of an empty table. You could also touch an empty file in that case to get the behavior you want (see the sketch below), but I doubt this is a good design.
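If you do go the touch route, a minimal sketch (e.g. run from an Execute Command activity or a small script before the Job Activity) could look like this; the path is hypothetical.

from pathlib import Path

src = Path("/data/landing/daily_extract.csv")
if not src.exists():
    # Create a zero-byte file so the Sequential File stage reads zero rows
    # instead of aborting because the file is missing.
    src.touch()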

MongoDB input in Pentaho

I have a time field in a MySQL table. Based on this time field, I need to import data from a MongoDB collection.
So in my Pentaho transformation I first have a Table Input step which gets the required date.
Next I have a MongoDB Input step. How do I filter records here based on the output of the previous step?
I saw that the MongoDB Input query accepts parameters only if they are environment variables or are defined in another transformation, but it does not recognize a variable from the previous step.
How do I pass the value from the previous step? Please help me; I am new to Pentaho and have been trying to find a solution for a week.
Thank you,
Deepthi
You've already answered your own question:
I saw that the MongoDB Input query accepts parameters only if they are environment variables or are defined in another transformation, but it does not recognize a variable from the previous step.
If there is no way for a step to accept an input stream, you'll have to do exactly what you describe. In one transformation, access the MySQL table to get the time and store it in a variable. Then in another transformation access that variable in your MongoDB step.
Note that you will have to do this in two transformations to ensure that the variable is set by the time the MongoDB step runs.
Take a look at Optiq. It is bleeding edge, but it allows SQL access to MongoDB, so in theory you could use it in a Table Input step rather than a MongoDB Input step:
http://julianhyde.blogspot.co.uk/2013/06/efficient-sql-queries-on-mongodb.html
It can be achieved by passing the query as a parameter.
In the transformation settings, add a parameter (e.g. MONGO_QRY).
Then, in the MongoDB Input step's query expression (JSON), use:
${MONGO_QRY}
This works fine for us; try that, and if it doesn't, let us know.
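For illustration, the value you pass in for MONGO_QRY could be JSON built from the MySQL date. This is only a sketch: the field name created_at and the assumption that dates are stored as ISO-8601 strings are mine, not from the original posts.

import json

cutoff = "2023-01-01T00:00:00Z"   # the date obtained from the MySQL Table Input step
mongo_qry = json.dumps({"created_at": {"$gte": cutoff}})
print(mongo_qry)                  # {"created_at": {"$gte": "2023-01-01T00:00:00Z"}}

Note that $gte on ISO-8601 strings works because their lexicographic order matches chronological order; if the collection stores real BSON dates, the query value has to be built accordingly.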

Extract Active Directory into SQL database using VBScript

I have written a VBScript to extract data from Active Directory into a record set. I'm now wondering what the most efficient way is to transfer the data into a SQL database.
I'm torn between:
Writing it to an Excel file and then firing an SSIS package to import it, or...
Within the VBScript, iterating through the dataset in memory and submitting 3000+ INSERT commands to the SQL database
Would the latter option result in 3000+ round trips communicating with the database and therefore be the slower of the two options?
Sending an insert row by row is always the slowest option. This is what is known as Row By Agonizing Row, or RBAR. You should avoid that if possible and take advantage of set-based operations.
Your other option, writing to an intermediate file, is a good one; I agree with @Remou in the comments that you should probably pick CSV rather than Excel if you go that way.
I would propose a third option. You already have the design contained in your VBScript. You should be able to convert it easily to a Script Component in SSIS. Create an SSIS package, add a Data Flow task, add a Script Component (as a data source) to the flow, write your fields out to the output buffer, and then add a SQL destination, saving yourself the step of writing to an intermediate file. This is also more secure, as your AD data is never on disk in plaintext during the process.
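The question is about VBScript/ADO, but purely to illustrate the row-by-row versus batched point in code, here is a minimal Python/pyodbc sketch (a stand-in, not the SSIS approach recommended above) that sends all rows in one batched call instead of 3000+ individual round trips; the driver name, connection string, table and columns are hypothetical.

import pyodbc

# Rows extracted from the AD record set (username, given name, surname).
rows = [("jdoe", "John", "Doe"), ("asmith", "Alice", "Smith")]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.fast_executemany = True   # send the rows as one parameter array, not one by one
cur.executemany(
    "INSERT INTO dbo.ad_users (sam_account_name, given_name, surname) VALUES (?, ?, ?)",
    rows,
)
conn.commit()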
You don't mention how often this will run or if you have to run it within a certain time window, so it isn't clear that performance is even an issue here. "Slow" doesn't mean anything by itself: a process that runs for 30 minutes can be perfectly acceptable if the time window is one hour.
Just write the simplest, most maintainable code you can to get the job done and go from there. If it runs in an acceptable amount of time then you're done. If it doesn't, then at least you have a clean, functioning solution that you can profile and optimize.
If you already have the data in a record set and you're on SQL Server 2008+, create a user-defined table type and send the whole set in as an atomic unit (a table-valued parameter).
And if you go the SSIS route, I have a post covering Active Directory as an SSIS Data Source.