Pentaho transformation Switch/Case doesn't work as it's supposed to - PostgreSQL

In Pentaho, I have a transformation step.
set_useful_life runs a SELECT from the table and assigns the result to a variable.
Then a Switch/Case step checks the result and is supposed to run the SQL dwh_adhoc_useful_life step only if the variable is 2.
I am 100% sure that the variable is 1, but it still executes the SQL dwh_adhoc_useful_life step, and at this point I don't understand how to work around it.
Why am I doing this check here and not in the job? Because this transformation receives variables from the previous step, and if I do the check
in the job with an Execute Transformation step, I lose all the variables and the step doesn't execute for each value passed from the previous transformation step.

I think your problem comes from how PDI works conceptually.
In a transformation, every step is initialized at the beginning and waits to process the rows it receives from the stream, so the SQL dwh_adhoc_useful_life step is going to be initialized at the beginning even if it doesn't receive any rows.
Usually that's not a problem, because a step normally expects something from the stream of rows it receives, but in your case the step doesn't need anything from its input rows, so it generates its own stream of rows and proceeds with it.
One way to fix it would be to make the client_ssr_config column work as an argument to your SQL script, so it only produces results if the correct value is passed (something along the lines of adding an AND 1=client_ssr_config to the filters in your SQL WHERE clause).
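A minimal sketch of that idea, assuming variable substitution is enabled on the SQL step; the variable, table, and column names below are placeholders, not taken from the original job:

    -- Hypothetical body of the SQL dwh_adhoc_useful_life step. The extra
    -- predicate turns the statement into a no-op unless the value passed
    -- into the transformation is 2.
    INSERT INTO dwh_adhoc_useful_life (asset_id, useful_life)
    SELECT asset_id, recalculated_useful_life
    FROM   staging_useful_life
    WHERE  2 = ${CLIENT_SSR_CONFIG};  -- assumed PDI variable holding client_ssr_config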

AZURE DATA FACTORY - Can I set a variable from within a CopyData task or by using the output?

I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return 1 row.
The problem I am having is that I want to reuse the value from one of the columns (the batch number) to set a variable, so that at the end of the pipeline I can use a Stored Procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and then mapping it to something like #BatchNo, but that fails. I have even tried to add a Set Variable task, but I can't figure out how to take a single column; #{activity('Populate Aleprstw').output} does not error, but I'm not sure what it will actually do in this case.
Thanks, and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
Generate a batch number (usually with a proc)
Use a lookup to grab it into a variable
Use the batch number in all activities (there might be multiple copies, procs, etc.)
Write the batch completion
From your description it seems you have the batch number embedded in the data copy from the start, which is not typical.
If you must do it this way, is there really an issue with running a lookup again?
Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same Server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.
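If Source and Sink are on the same server, a rough sketch of that stored-procedure approach could look like the following (procedure, table, and column names are invented for illustration):

    -- Hypothetical procedure: does the population the Copy activity was doing
    -- and returns the batch number as a single-row result set that a Lookup
    -- activity can capture for the completion logging at the end of the pipeline.
    CREATE PROCEDURE dbo.usp_PopulateAndReturnBatch
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.TargetTable (BatchNo, SomeColumn)
        SELECT BatchNo, SomeColumn
        FROM   dbo.SourceQueryView;          -- stands in for the original source query

        SELECT TOP (1) BatchNo               -- the single data point to reuse later
        FROM   dbo.TargetTable
        ORDER  BY BatchNo DESC;
    END;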

Cloud Dataflow: Once trigger not working

I have a Dataflow pipeline reading from an unbounded source. My window size is 10 hours, and I am trying to test my trigger using a TestStream. My trigger will emit early results if the element count reaches at least 2 for the same key within a window. I have the following trigger to achieve this:
input.apply(Window.into(FixedWindows.of(Duration.standardHours(12)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(2))))
    .apply(Count.perElement());
We also tried:
Repeatedly.forever(AfterPane.elementCountAtLeast(2)).orFinally(AfterWatermark.pastEndOfWindow())
I expect early firings when asserting the result; however, I don't get all the results in
PAssert.that(pipeline).inWindow(..)..
What am I doing wrong? Also, running the same test repeatedly yields different results, meaning different values are returned by the trigger.
Triggering is non-deterministic. It will give you an early firing some time after the trigger condition is satisfied. It will then give you another early firing some time after the trigger condition is satisfied again.
The actual choice to emit after the trigger is determined by the runner. If you are using a batch runner, it may wait until all the data is available. How much input are you expecting for each key/window? Which runner are you using?

How to pass output from a DataStage Parallel job as input to another job?

My requirement is:
Parallel Job1 -- I extract data from a table; the row count may be more than 0.
Parallel Job2 should be triggered in the sequencer only when the row count from the source query in Job1 is greater than 0.
I want to achieve this without creating any intermediate file in Job1.
So basically what you want to do is use information from a data stream (of your Job1) in the "above" sequence as a parameter.
In your case you want to decide at sequence level whether to run the subsequent job (only if more than 0 rows are returned) or not.
Two options for that:
Job1 writes the information to a file which is a value file of a parameter set. These files are stored in a fixed directory. The parameter from the value file can then be used in your sequence to decide on the further processing. Details on parameter sets can be found in the IBM documentation.
You could use a server job for Job1 and set a user status (BASIC function DSSetUserStatus) in a transformer. This is passed back to the sequence and can be referenced in subsequent stages of the sequence. See the documentation, but you will find plenty of other information on this topic on the internet as well.
There are more solutions to this problem, or let us call it a challenge. Another way would be a script called at sequence level which queries the database directly and avoids Job1 altogether...
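A minimal sketch of the kind of count query such a sequence-level script might run (the table and filter are placeholders; the sequence would compare the returned number against 0 to decide whether to trigger Job2):

    -- Returns a single row the sequence-level script can evaluate.
    SELECT COUNT(*) AS row_cnt
    FROM   source_table
    WHERE  load_date = CURRENT_DATE;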

How to increment a number from a csv and write over it

I'm wondering how to increment a number "extracted" from a field in a CSV file, and then rewrite the file with the incremented number.
I need this counter in a tMap.
Is the design below a good way to do it?
EDIT: I'm trying a new method. See the design of my subjob below, but I have an error when I link the tJavaRow to my main tMap in the main job:
Exception in component tMap_1
java.lang.NullPointerException
at mod_file_02.file_02_0_1.FILE_02.tFileList_1Process(FILE_02.java:9157)
at mod_file_02.file_02_0_1.FILE_02.tRowGenerator_5Process(FILE_02.java:8226)
at mod_file_02.file_02_0_1.FILE_02.tFileInputDelimited_2Process(FILE_02.java:7340)
at mod_file_02.file_02_0_1.FILE_02.runJobInTOS(FILE_02.java:12170)
at mod_file_02.file_02_0_1.FILE_02.main(FILE_02.java:11954)
2014-08-07 12:43:35|bm9aSI|bm9aSI|bm9aSI|MOD_FILE_02|FILE_02|Default|6|Java
Exception|tMap_1|java.lang.NullPointerException:null|1
[statistics] disconnected
(screenshot of the subjob design)
You should be able to do this mid flow in a tMap or a tJavaRow.
Simply read the number in as an integer (or other numeric data type) and then add your increment to it.
A really simple example might look like this:
Here we have a tFixedFlowInput that has some hard coded values for the job:
And we run it through a tMap where we add 1 to the age column:
And finally, we output it to the console in a table:
EDIT:
As Gabriele B has pointed out, this doesn't exactly work when reading and writing to the same flat file, because Talend claims an exclusive read-write lock on the file when reading and keeps it open throughout the job.
Instead you would have to write the incremented data somewhere else, such as a temporary file, a database, or even just the buffer, and then read that data into a separate job which would then output the file you want and clean up anything temporary.
The problem with that is that you can't do the output in the same process. I've just tried testing it by reading the file in one child job, passing the data back to the parent job using a tBufferOutput, passing that data to another child job as a context variable and then trying to output to the file. Unfortunately the file lock remains on it, so you can't do this all in one self-contained job (even using a parent job and several child jobs).
If this sounds horrible to you (it is) and you absolutely need this to happen (I'd suggest a database table sounds like a better match for this functionality than a flat file) then you could raise a feature request on the Talend Jira for the tFileInputDelimited to not hold the file open or to not insist on an exclusive read-write lock on the file.
Once again, I strongly recommend that you move to using a database table for this because even without the file lock issue, this is definitely not the right use of a flat file and this use case perfectly fits a database, even something as lightweight as an embedded H2 database.
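As a rough sketch of the database alternative (the table and column names are invented, and the SQL is plain enough to run on an embedded H2 database), the counter could live in a one-row-per-counter table and be incremented atomically:

    -- Hypothetical counter table replacing the flat file.
    CREATE TABLE IF NOT EXISTS job_counter (
        counter_name  VARCHAR(64) PRIMARY KEY,
        counter_value INT NOT NULL
    );

    -- Atomic increment, then read the new value back for use in the tMap.
    UPDATE job_counter
    SET    counter_value = counter_value + 1
    WHERE  counter_name = 'file_counter';

    SELECT counter_value
    FROM   job_counter
    WHERE  counter_name = 'file_counter';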

Combining multiple TSQL change scripts with skipping

I have a bunch of TSQL change scripts, all named appropriately in sequence.
I want to combine these into one big script with a few twists. I include a version-number function in the script, which I update for each change, so that once change 1 has run it returns 1, once change 2 has run it returns 2, and so on. This function remains in the database and always returns the version of the schema/database.
I want to wrap each change script with a few lines that prevent a script that has already been run from running again; likewise, it should prevent a change script from running on a schema version that is too "low". This allows me to bundle all the change scripts into one, and only the missing change scripts will be applied when it runs.
This is all fine, but I can't find a way to make osql / Query Analyzer / SQL Server Studio skip the parts that have already been run.
GOTO won't work across batches (the scripts contain "GO").
Likewise, IF ... BEGIN ... END won't work, because of GO.
Update: To reiterate, I don't need help remembering the current version number; I need a way to skip parts of the script to prevent already applied updates from reapplying.
I have tried a number of methods:
I can wrap each batch in an sp_executesql call or EXECUTE, but this leads to scoping problems.
I can wrap each batch in an IF dbo.DB_VERSION() != REQUIRED_VERSION BEGIN ... END construct, but this is messy and makes handling errors difficult.
Encountering situations where the change should not be applied is expected, not exceptional, so simply RETURNing when not applicable is not OK.
Any other suggestions?
You can keep the version number in a table, and have the function read the value from that table so you can compare it in an IF statement to the version of the change you wish to apply.
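A minimal sketch of that pattern (object names are illustrative, and each guarded change still has to fit inside a single batch between GO separators, which is the limitation discussed above):

    -- One-row table holding the current schema version.
    CREATE TABLE dbo.SchemaVersion (CurrentVersion INT NOT NULL);
    INSERT INTO dbo.SchemaVersion (CurrentVersion) VALUES (0);
    GO

    CREATE FUNCTION dbo.DB_VERSION()
    RETURNS INT
    AS
    BEGIN
        RETURN (SELECT CurrentVersion FROM dbo.SchemaVersion);
    END;
    GO

    -- Change script 1: only applied when the database is exactly at version 0.
    IF dbo.DB_VERSION() = 0
    BEGIN
        ALTER TABLE dbo.SomeTable ADD SomeColumn INT NULL;  -- placeholder change
        UPDATE dbo.SchemaVersion SET CurrentVersion = 1;
    END;
    GO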