Compare current value with previous value in DataStage

I have input like below:
empid salary
10 1000
20 2000
30 3000
40 4000
The output I require in a sequential file is like below; that is, prevsal should hold the salary of the previous row:
empid salary prevsal
10 1000 null
20 2000 1000
30 3000 2000
40 4000 3000
I tried using a transformer, giving a stage variable as prevsal = inputlink.salary and then defining an output column as inputlink.salary = prevsal. I know that doesn't work logically, and indeed it didn't work. Can anyone find me a solution for this?

You are on the right track - a transformer with stage variables is the way to go.
Remember that within the transformer the data is processed top down. This means the first (topmost) stage variable is evaluated first, then the second and so on, and finally the data is put on the output links.
Given your input column inputlink.salary, assume two stage variables: svPrevSalary (topmost) and a second one, svCurrentSalary.
Try the following assignments in the stage variable section:
1. svCurrentSalary (=) svPrevSalary
2. inputlink.salary (=) svCurrentSalary
Use svPrevSalary as the derivation of the output link / field.
Please note that the (=) just shows the idea: you specify only svCurrentSalary as the derivation of the first stage variable, and inputlink.salary as the derivation of the second.
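If it helps to see why the evaluation order matters, here is a minimal sketch of the same row-by-row logic in Python (illustration only, not DataStage syntax; the variable names simply mirror the stage variables above):

rows = [(10, 1000), (20, 2000), (30, 3000), (40, 4000)]

sv_current_salary = None   # svCurrentSalary starts out as null
sv_prev_salary = None      # svPrevSalary starts out as null

for empid, salary in rows:
    # topmost stage variable: svCurrentSalary still holds the previous row's salary here
    sv_prev_salary = sv_current_salary
    # second stage variable: now pick up the current row's salary
    sv_current_salary = salary
    # output derivation: prevsal = svPrevSalary
    print(empid, salary, sv_prev_salary)

# prints: 10 1000 None, 20 2000 1000, 30 3000 2000, 40 4000 3000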

I was facing the same problem when I started and was not getting the expected result.
For this question, we have to note down two things:
1. Which kind of job you are building - server, sequential, or parallel. I am working in the parallel environment.
2. Remember the order of execution of the stage variables and links, i.e. the input/output order.
Code:
curr_Salary -> Prev_Sal
link.salary -> curr_Salary
Prev_Sal -> prev_salary on the output link
Note
If you are working in the parallel environment then you have to set the execution mode to Sequential in every stage.
Go to Stage -> Advanced -> Execution mode -> Sequential.
I think it should work; I did this practically.
Thanks

Related

SPSS aggregate on 2 variables

I am trying to compute an N_BREAK that has to "satisfy" a condition. I have a variable which indicates 1 or 0; let's call that variable "HT". Every lopnr is also repeated across multiple rows, so the first 10 rows can be ID nr 1, the next 20 can be ID nr 2, and so on.
My question is: how do I create an N_BREAK with lopnr as break variable that only includes rows with HT = 1? I am not allowed to simply select the 1s on variable HT beforehand, since I need the 0s in the file.
A few simple ways to do this:
1 - USE FILTER
filter cases by HT.
aggregate ....
when you get back to the original dataset, use:
filter off.
use all.
2 - COPY DATASET
dataset name orig.
dataset copy foragg.
dataset activate foragg.
select if HT.
aggregate....
3 - TEMPORARY SELECTION
temporary.
select if HT.
aggregate....
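For illustration only, here is the same end result sketched in Python/pandas (the column names lopnr and HT come from the question; the data and the N_BREAK name are made up):

import pandas as pd

# hypothetical example data: lopnr identifies a case, HT is the 0/1 flag
df = pd.DataFrame({'lopnr': [1, 1, 1, 2, 2, 3],
                   'HT':    [1, 0, 1, 0, 0, 1]})

# count only the HT = 1 rows per lopnr, but keep the full file (0s included)
n_break = df[df['HT'] == 1].groupby('lopnr').size().rename('N_BREAK')
df = df.merge(n_break, on='lopnr', how='left')
df['N_BREAK'] = df['N_BREAK'].fillna(0).astype(int)
print(df)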

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the output of the process.
You can do this, but it does take a bit of extra work in your recipe - set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter the rows to keep only those whose random integer equals one chosen value between 1 and 1,000, and your output will contain only that randomized subset. Just have the last part of the recipe remove this temporary column.
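A minimal sketch of that idea in plain Python (illustration only; in Dataprep the same thing is done with a RANDBETWEEN column plus a filter step):

import random

def keep_row(chosen=1, buckets=1000):
    # temporary "column": a random integer between 1 and 1,000 per row
    bucket = random.randint(1, buckets)
    # keep only the rows that landed in the chosen bucket (~1/1000 of the data)
    return bucket == chosen

rows = range(1_000_000)                    # stand-in for the real data
sample = [r for r in rows if keep_row()]
print(len(sample))                         # roughly 1,000 rows survive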
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number from a range: RANDBETWEEN or RAND.
These can be used to slice an "even" portion of your data. As suggested, add a randbetween(1,1000) column and then filter on one specific value x (with 1 <= x <= 1000), because that keeps roughly 1/1000 of the data (a million out of a billion).
Alternatively, if you just want a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, agnostic to how many rows there are -
you can just use the positional row-filtering methods (top rows / range).
P.S.
By using the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as per the first scenario) in one step, i.e. without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the "search transformation" pane (accessed via Ctrl-K). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

Zabbix trigger on a drop in users

My question: is it possible to have a trigger on that item that will be activated if there's a difference of xx% between the two last queries?
Example :
Query at 01:00 -> 2000 users connected
Query at 01:10 -> 2100 users, difference is positive, we don't care
Query at 01:20 -> 2050 users, -50 users, around 2-3%, no big deal
Query at 01:30 -> 800 users, around 60% less connections, there's something wrong here
Is it possible to have a trigger that activates when the difference is, let's say, 20% negative ?
You can use the abschange function:
The amount of absolute difference between last and previous values
to alert for both positive and negative changes.
Or you can use the last function to get the latest values you need:
For example:
last() is always equal to last(#1)
last(#3) - third most recent value (not three latest values)
In both cases you need to compute the % value in your trigger with the usual proportion:
older_value : 100 = newer_value : x, i.e. x = 100 * newer_value / older_value (so a drop of more than 20% means x < 80).
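A small sketch of that computation in Python (illustration only; in Zabbix the equivalent comparison goes into the trigger expression using last() / last(#2) or abschange):

def should_trigger(previous, latest, max_drop_pct=20):
    # older_value : 100 = newer_value : x  ->  x = 100 * latest / previous
    x = 100.0 * latest / previous
    return x < (100 - max_drop_pct)       # fire only on a drop of more than 20%

print(should_trigger(2100, 2050))   # False: only ~2% fewer users
print(should_trigger(2050, 800))    # True: ~61% fewer users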

Execute Pentaho job completely for first row before starting job for second row

My root job has two steps:
a Transformation Executor (to copy rows to results) and a Job Executor (executing for each input row).
What I want is that my sub-job should execute completely for the first incoming row before it starts executing for the second row.
Click on the Job executor step and check the box Execute for every input row.
Tell me if it is not what you need.
Unless you specify a value other than 1 in Change Number Of Copies To Start (right-click on any transformation entry to see that option), that will always be the expected behavior.
If the number is greater than 1, the Job Executor will have that number of copies running in parallel, distributing the input rows (for example, with 100 input rows and 10 copies, each copy will execute 10 rows no matter what).

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a field DATE. Before, the dates that came as input (after some process) were put in a table as output. Now, I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come with 0, or equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field set to 0, and the same for the 20th and the 21st and so on.
I know how to do this programmatically, but in PowerCenter I do not. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields, each holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, such as sorting the dates and a union between them, to get the data ready for a joiner. The idea of the joiner is to do a full outer join between the original data and the data that has all fields set to 0, which we got from the previous aggregator.
Then a router picks with one of its groups the data that had actual dates (and fields without nulls), and with another group the rows where all fields are null; those fields are then given a 0 to finally be written to a table.
I am wondering how this can be achieved by, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function, that would cut the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates, a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do?
The same with the 'full outer' join? A normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the joiner? In that case a lookup configured to return 'all rows' and a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use our time dimension in the warehouse, which has one row per day from 1753-01-01 and the next 200,000 days, and a primary integer column with values from 1 and up ...
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means: just feed the 2 dates into a Java transformation and apply some code to produce all dates between them, outputting a record for each. The Java transformation is ideal for record generation.
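As a rough illustration of the logic such a transformation would implement (sketched here in Python rather than Java, purely for readability; the dates are taken from the question):

from datetime import date, timedelta

start = date(2017, 3, 18)
end = date(2018, 3, 18)

# emit one record per day from start to end, inclusive
current = start
while current <= end:
    print(current.strftime('%d/%m/%Y'))   # the generated DATE field
    current += timedelta(days=1)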
OK... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT MIN(tablea.DATEFIELD) + levquery.n - 1 AS Port1
FROM tablea,
     (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
GROUP BY levquery.n
(Check if the query works for you - I don't have access to a machine to test it at the minute.)