Extract the SQL query that was run in the Oracle connector from a DataStage job in order to add it to a file/table

I have a parallel DataStage job that uses a particular SQL query with some parameters. Once the job is running, I can see in the Director log the exact SQL query that was triggered on the database.
My question is: is there any way to get this SQL query, with all parameters substituted, from within the Designer job, so that I can write it to a column in a table (a metadata column that will contain the exact query that was used for that particular run)? In my job I could have a Transformer stage that takes the query from the Oracle connector and uses it as the derivation for a column in the target table.
Thank you!

You can retrieve the query from the job log, either with the log-reading functions available in a server routine or with the log options of the dsjob command-line client. Design the writing part however you want.
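If you go down the dsjob route, a rough sketch of the idea is below (Python driving dsjob through subprocess, then scanning the entries for the generated statement). The install path, project and job names, and the "SELECT" marker are assumptions to adapt, and the exact -logsum output layout can differ between versions, so treat the parsing as approximate.

# Hedged sketch: pull the SQL the Oracle connector logged, using dsjob.
# DSJOB path, project/job names and the SELECT marker are assumptions.
import subprocess

DSJOB = "/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob"  # assumed install path
PROJECT, JOB = "MyProject", "MyParallelJob"                     # assumed names

# -logsum lists event ids with summaries; -logdetail prints one event in full.
summary = subprocess.run([DSJOB, "-logsum", PROJECT, JOB],
                         capture_output=True, text=True, check=True).stdout

sql_text = None
for line in summary.splitlines():
    parts = line.split(None, 1)
    if not parts or not parts[0].isdigit():
        continue                                      # skip non-event lines
    detail = subprocess.run([DSJOB, "-logdetail", PROJECT, JOB, parts[0]],
                            capture_output=True, text=True).stdout
    if "SELECT" in detail.upper():                    # crude marker for the generated SQL
        sql_text = detail
        break

if sql_text:
    print(sql_text)  # feed this to whatever loads your metadata column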

Related

Upserting and maintaining a Postgres table using Apache Airflow

Working on an ETL process that requires me to pull data from one Postgres table and update data in another Postgres table in a separate environment (same column names). Currently, I am running the Python job on a Windows EC2 instance, and I am using the pangres upsert library to update existing rows and insert new rows.
However, my organization wants me to move the Python ETL script to Managed Apache Airflow on AWS.
I have been learning DAGs and most of the tutorials and articles are about querying data from postgres table using hooks or operators.
However, I am looking to understand how to update existing table A incrementally (i.e. upsert) using new records from table B (and ignore/overwrite existing matching rows).
Any chunk of code (DAG) that explains how to perform this simple task would be extremely helpful.
In Apache Airflow, operations are done using operators. You can package any Python code into an operator, but your best bet is always to use a pre-existing open source operator if one already exists. There is an operator for Postgres (https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html).
It will be hard to provide a complete example of what you should write for your situation, but it sounds like the best approach for you here is to take any SQL present in your Python ETL script and use it with the Postgres operator. The documentation I linked should be a good example.
They demonstrate inserting data, reading data, and even creating a table as a prerequisite step. Just like lines in a Python script execute one at a time, operators in a DAG execute in a particular order, depending on how they're wired up, as in their example:
create_pet_table >> populate_pet_table >> get_all_pets >> get_birth_date
In their example, populating the pet table won't happen until the create pet table step succeeds, etc.
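For orientation only, a minimal sketch of that wiring in a DAG file might look like the following; the connection id postgres_default, the table and the SQL are placeholders taken from the linked how-to guide rather than from your environment.

# Minimal sketch of chaining PostgresOperator tasks; connection id,
# table name and SQL are placeholders based on the linked guide.
from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="pet_table_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_pet_table = PostgresOperator(
        task_id="create_pet_table",
        postgres_conn_id="postgres_default",
        sql="CREATE TABLE IF NOT EXISTS pet (id SERIAL PRIMARY KEY, name TEXT);",
    )
    populate_pet_table = PostgresOperator(
        task_id="populate_pet_table",
        postgres_conn_id="postgres_default",
        sql="INSERT INTO pet (name) VALUES ('Rex');",
    )
    create_pet_table >> populate_pet_table  # strictly ordered, as in their example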
Since your use case is about copying new data from one table to another, a few tips I can give you:
Use a scheduled DAG to copy the data over in batches. Airflow isn't meant to be used as a streaming system for many small pieces of data.
Use the "logical date" of the DAG run (https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html) in your DAG to know the interval of data that run should process. This works well for your requirement that only new data should be copied over during each run. It also gives you repeatable runs in case you need to fix code and then re-run each run (one batch at a time) after pushing your fix.

Need to join Oracle and SQL Server tables in an OLE DB source without using a linked server

My SSIS package has an OLE DB source that joins Oracle and SQL Server to get the source data and loads it into a SQL Server OLE DB destination. Earlier we were using a linked server for this purpose, but we cannot use a linked server anymore.
So I am taking the data from SQL Server and want to feed it into the IN clause of the Oracle query, which I am keeping as the SQL command of the OLE DB source.
I tried parsing an object-type variable from SQL Server and putting it into the IN clause of the Oracle query in the OLE DB source, but I get an error that Oracle cannot have more than 1000 literals in the IN statement. So basically I think I have to do something like this:
select * from oracle.db where id in (select id from sqlserver.db).
Since I cannot use a linked server, I was thinking I could have a temp table which can be used throughout the package.
I also tried using a merge join in SSIS, but my source data set is really large and the merge join is returning fewer rows than expected. I am badly stuck at this point. I have tried a number of things and nothing seems to be working.
Can someone please help. Any help will be greatly appreciated.
A couple of options to try.
Lookup:
My first instinct was a Lookup Task, but that might not be a great solution depending on the size of your data sets, since all of the records from both tables have to be pulled over the wire and stored in memory on the SSIS server. But if you were able to pull off a Merge Join, then a Lookup should also work, though it might be slow.
Set an OLE DB Source to pull the Oracle data, without the WHERE clause.
Set a Lookup to pull the id column from your SQL Server table.
On the General tab of the Lookup, under Specify how to handle rows with no matching entries, select Redirect rows to no-match output.
The output of the Lookup will just be the Oracle rows that found a matching row in your SQL Server query.
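Outside SSIS, what the Lookup is doing is essentially a hash match, and a rough Python sketch of that idea may make the memory trade-off clearer; the driver modules, connection strings and table/column names below are placeholders, not anything from your package.

# Conceptual sketch of the Lookup pattern: cache the SQL Server keys in memory,
# then keep only the Oracle rows whose id is in that cache.
# Drivers, connection details and names are placeholders.
import pyodbc     # assumed ODBC DSN for SQL Server
import oracledb   # assumed python-oracledb client for Oracle

sql_ids = set()
with pyodbc.connect("DSN=SqlServerDsn") as mssql:
    for (id_,) in mssql.cursor().execute("SELECT id FROM dbo.Ids"):
        sql_ids.add(id_)  # the whole key set ends up in memory, like the Lookup cache

matches = []
with oracledb.connect(user="scott", password="pw", dsn="orahost/service") as ora:
    cur = ora.cursor()
    cur.execute("SELECT * FROM big_oracle_table")  # no WHERE clause, as above
    for row in cur:
        if row[0] in sql_ids:   # row[0] assumed to be the id column
            matches.append(row)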
Working Table on the Oracle server
If you have the option of creating a table in the Oracle database, you could create a Data Flow Task to pipe the results of your SQL Server query into a working table on the Oracle box. Then, in a subsequent Data Flow, just construct your Oracle query to use that working table as a filter.
Probably follow that up with an Execute SQL Task to truncate that working table.
Although this requires write access to Oracle, it has the advantage of off-loading the heavy lifting of the query to the database machine, and only pulling the rows you care about over the wire.
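If you want to prototype that flow outside SSIS first, the same pattern looks roughly like the Python sketch below: stage the SQL Server ids in an Oracle working table, let Oracle do the filtering, then truncate. The connection details, table names and the working table stg_ids are all assumptions.

# Rough sketch of the working-table pattern: push the SQL Server ids into an
# Oracle staging table, filter on the Oracle side, then clean up.
# Connection details, table names and stg_ids are assumptions.
import pyodbc
import oracledb

with pyodbc.connect("DSN=SqlServerDsn") as mssql, \
     oracledb.connect(user="scott", password="pw", dsn="orahost/service") as ora:
    ids = [(row[0],) for row in mssql.cursor().execute("SELECT id FROM dbo.Ids")]

    cur = ora.cursor()
    cur.executemany("INSERT INTO stg_ids (id) VALUES (:1)", ids)  # stage the keys

    cur.execute("""
        SELECT o.*
        FROM   big_oracle_table o
        WHERE  o.id IN (SELECT id FROM stg_ids)
    """)
    rows = cur.fetchall()  # only the rows you care about cross the wire

    cur.execute("TRUNCATE TABLE stg_ids")  # the follow-up cleanup step
    ora.commit()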

Assigning an SQL result to a Job Parameter in DataStage

I just started using DataStage (version 11.5) and I am trying to assign the value of a simple SQL query (select max(date_col) from Table) to a Job Parameter so that I can use it as part of a file produced by the job.
Can anyone point out a simple approach to this, since I am rather lost on how to include SQL queries in parameter values.
Thanks in advance.
There are some options to do this. The one I recommend is:
Write the result of your query into a sequential file
Use an Execute Command stage (in a Sequence) to read the file
Use its value in one of the following Job Activity stages (as a job parameter)
An alternative could be the use of Parameter Sets with value files. These value files are real files in the OS and their structure is simple, so they could be written by a DataStage job. In that case, however, the value cannot be used in Conditions in the Sequence.
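For the value-file alternative, the general idea is sketched below: read the single value the job wrote to the sequential file and rewrite a parameter set value file with it. The file paths, the project directory layout and the MAX_DATE / psDates names are assumptions; check where your installation actually keeps its ParameterSets directory before relying on this.

# Sketch: turn a query result written to a sequential file into a parameter
# set value file. Paths, project layout and names are assumptions.
from pathlib import Path

SEQ_FILE = Path("/data/extract/max_date.txt")                    # written by the job
VALUE_FILE = Path("/opt/IBM/InformationServer/Server/Projects/"  # assumed project dir
                  "MyProject/ParameterSets/psDates/defaults")

max_date = SEQ_FILE.read_text().strip()          # e.g. "2024-01-31"
VALUE_FILE.write_text(f"MAX_DATE={max_date}\n")  # simple NAME=value lines
print(max_date)  # if run from an Execute Command stage, the Sequence can also
                 # pick the value up from that activity's output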

How to use Apache Apex to ingest data in batch from DB2 to Vertica

Use Case: Ingest transaction data (e.g. rows = 10,000) in a single batch from DB2 and insert it into a Vertica database.
Question:
Should I fetch a single row at a time from the source database or a batch of 10k rows, process them, and then insert them into the destination database?
Is there any sample code which reads from one database and writes into another database?
You should always prefer batch execution; it minimizes your network round trips and improves your load into Vertica.
You can use the JDBC input and output operators to fetch from the origin database and write to the destination database. They should have configurable batch sizes. In general, batching is faster than going tuple by tuple.
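The Apex operators themselves are Java, but the batching point is general. A small Python sketch of chunked fetch-and-insert (not Apex code; the ODBC DSNs, table and column names are placeholders) shows the shape of it:

# Illustration of batch loading versus row-at-a-time (not Apex code).
# Both connections are assumed to go through ODBC DSNs; names are placeholders.
import pyodbc

BATCH_SIZE = 1000

src = pyodbc.connect("DSN=Db2Dsn")
dst = pyodbc.connect("DSN=VerticaDsn")

src_cur = src.cursor()
dst_cur = dst.cursor()
src_cur.execute("SELECT id, amount, txn_ts FROM transactions")

while True:
    rows = src_cur.fetchmany(BATCH_SIZE)  # one fetch round trip per batch
    if not rows:
        break
    dst_cur.executemany(
        "INSERT INTO transactions (id, amount, txn_ts) VALUES (?, ?, ?)",
        [tuple(r) for r in rows])         # one batched insert per chunk
    dst.commit()

src.close()
dst.close()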
Check https://github.com/apache/incubator-apex-malhar/tree/master/library/src/main/java/com/datatorrent/lib/db/jdbc
You can add multiple XML configuration files at src/site/conf in your project and select one of them at launch time.
This is described briefly at http://docs.datatorrent.com/application_packages/ under the section entitled "Adding pre-set configurations"

Advanced Job Scheduler for iSeries

I use the iSeries Navigator for ad-hoc data extraction from the DB2 database. The only issue is automation: is there a way I could schedule the SQL code to run at a specific time? I know there is the Advanced Job Scheduler, but I'm not sure how the SQL can be added to the Scheduler. Anyone who can help?
IBM added a Run SQL Statements (RUNSQL) CL command at v7.1.
Prior to that, you could store SQL statements in source files and run them with the Run SQL Statements (RUNSQLSTM) command.
Neither of the above allow an SQL Select to be run by itself. For data extraction, you'd want INSERT INTO tbl (SELECT <...> FROM <...>)
For reporting SELECTs, your best bet is to create a Query Manager query (*QMQRY object) and form (*QMFORM object) via Start DB2 UDB Query Manager (STRQM); which can then be run by the Start Query Management Query (STRQMQRY) command. Query Manager (QM) is SQL based, unlike the older Query/400 product. QM manual is here
One last option, is the db2 utility available via QShell.
Don't waste effort creating a data extract that is a day late, and don't go out of business because the job scheduler hasn't updated the file yet.
Real businesses need real-time data.
Just create an SQL view on the iSeries that pulls together the information you need.
Query the view externally in real time. Even if you need the last 30 days, the last month, or year to date, these are all simple views to create.