How can I merge two streams with identical columns in Pentaho?

I am a new user of Pentaho, so my question may be very simple. I have two streams with identical columns, e.g. stream S1 has the columns A, B, C, and stream S2 has the columns A, B, C (same names, same order, same data types). I want to merge or append these two streams into a single stream containing the columns A, B, C. However, when I use Merge join (with the FULL OUTER JOIN option), the result is a stream with the columns A, B, C, A_1, B_1, C_1, which is not what I want. I tried the Append streams step, but in that case nothing appeared in the preview.

As per your requirement, first create the two streams. Here we have taken two streams, "stream1.xls" and "stream2.xls". Then build the transformation using the "Sorted merge" step, which merges the rows of both (sorted) inputs into a single stream with one set of columns.

Related

Using merge in sas

I am a little bit confused about merging in SAS. For example, when people use the merge statement, sometimes (in=a) or (in=b) follows a dataset name. What does that do exactly?
To elaborate on vknowles' answer, in=a and in=b are useful in different types of merges. Let's say we have the following data step:
data inner left right outer;
    merge have1(in=a) have2(in=b);
    by ...;
    if a and b then output inner;
    else if a and not b then output left;
    else if not a and b then output right;
    else if not (a or b) then output outer;
run;
The data step will create four different datasets, which are the basis of an inner join, a left join, and a right join (the outer dataset is explained below).
The statement if a and b then output inner; will output only records in which the key is found in the datasets have1 and have2 which is the equivalent of a SQL inner join.
else if a and not b then output left; will output only the records that occur in the have1 dataset and not in have2, i.e. the rows a SQL left outer join adds beyond the inner join. If you wanted a full left join you could either append the left dataset to the inner dataset or just change the statement to if (a and b) or (a and not b) then output left; (which simplifies to if a then output left;).
The third else if is just the opposite of the previous. Here you can perform a right join on the data.
The last else if will output to the outer dataset. In a merge this condition can never be true, since every observation comes from at least one dataset, so a non-empty outer dataset signals a problem; that makes it useful for debugging. Thanks to Robert for this addition.
When you see a dataset referenced in SAS with () following it the items inside are called dataset options. The IN= dataset option is valid when reading datasets using statements like SET, MERGE, UPDATE. The word following IN= names a variable that SAS will create that will be true (1) when that dataset contributes to the current observation and false (0) otherwise.
A good example would be if you want to use one dataset to subset another. For example, you might merge on data from a master lookup table to add an extra variable like an address or account type, but not want to add in every id from the lookup table.
data want;
    merge my_data(in=in1) master_lookup(in=in2);
    by id;
    if in1;
run;
Or if you are stacking or interleaving data from more than one table and wanted to take action depending on which table this record is from.
data want;
    set one(in=in1) two(in=in2);
    by id;
    if in1 then source='ONE';
    if in2 then source='TWO';
run;
Let's say you have MERGE A (in=INA) B (in=INB);
When merging two datasets with a BY statement, you can have the following situations:
Dataset A has an observation with a given by-value and dataset B does not. In this case, INA will be true and INB will be false.
Dataset A and B both have an observation with a given by-value. In this case, INA and INB will be true.
Dataset A and B have different numbers of observations with a given by-value. Let's say A has more than B. Even after B runs out of observations, both INA and INB will be true. If you want to know whether B still has observations, you need something like the following code. As @Tom pointed out, you often want to know which dataset is contributing observations.
data want;
    ina=0;
    inb=0;
    merge a (in=ina) b (in=inb);
    by mybyvariable;
    ...
run;
The above code takes advantage of the fact that SAS retains the variables from the last observation that the dataset with fewer observations contributed to the by-group. If you reinitialize the in= variables before the MERGE statement and there is no new observation, they keep their reinitialized values; but if there is a new observation, SAS sets them again. I hope that makes sense.
Dataset A does not have an observation with a given by-value and B does. In this case, INA will be false and INB will be true.
Note that this does not cover the situation of merging without a BY statement.
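The in= classification logic above can be illustrated outside SAS. The following is a minimal plain-JavaScript sketch (not SAS, hypothetical function name) that classifies keys from two tables into inner/left/right buckets the same way the data step's a and b flags do, assuming one observation per key:

```javascript
// Hypothetical illustration: classify keys the way in=a / in=b do,
// assuming one observation per key in each input table.
function classifyMerge(have1, have2) {
  const keys1 = new Set(have1.map(r => r.id));
  const keys2 = new Set(have2.map(r => r.id));
  // All distinct keys, in by-variable order (numeric keys in this demo)
  const all = [...new Set([...keys1, ...keys2])].sort((x, y) => x - y);
  const out = { inner: [], left: [], right: [] };
  for (const id of all) {
    const a = keys1.has(id); // in=a: have1 contributes this key
    const b = keys2.has(id); // in=b: have2 contributes this key
    if (a && b) out.inner.push(id);
    else if (a && !b) out.left.push(id);
    else if (!a && b) out.right.push(id);
    // (!a && !b) can never happen: every key comes from some input,
    // which is why the SAS "outer" dataset should stay empty.
  }
  return out;
}

const r = classifyMerge(
  [{ id: 1 }, { id: 2 }],
  [{ id: 2 }, { id: 3 }]
);
// r → { inner: [2], left: [1], right: [3] }
```

Key 2 appears in both inputs (inner), key 1 only in the first (left), key 3 only in the second (right), mirroring the four-way output split of the data step.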

Pair Rx sequences with one sequence as the master that controls when a new output is published

I'd like to pair two sequences D and A with Reactive Extensions in .NET. The resulting sequence R should pair D and A in a way that whenever new data appears on D, it is paired with the latest value from A as visualized in the following diagram:
D -1--2---3---4---
A ---a------b-----
R ----2---3---4---
      a   a   b
Neither CombineLatest nor Zip does exactly what I want. Any ideas on how this can be achieved?
Thanks!
You want Observable.MostRecent:
var R = A.Publish(_A => D.SkipUntil(_A).Zip(_A.MostRecent(default(char)), Tuple.Create));
Replace char with whatever the element type of your A observable is.
Conceptually, the query above is the same as the following query.
var R = D.SkipUntil(A).Zip(A.MostRecent(default(char)), Tuple.Create);
The problem with this query is that subscribing to R subscribes to A twice, which is undesirable. In the first (better) query above, Publish is used to avoid subscribing to A twice: it provides a shared proxy for A, called _A, that you can subscribe to many times in the lambda passed to Publish while the real observable A is only subscribed to once.
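The SkipUntil-plus-MostRecent semantics can be illustrated without Rx. Here is a minimal plain-JavaScript sketch (not Rx.NET, hypothetical helper name) that replays a merged event timeline and pairs each D value with the most recent A value, skipping D values that arrive before A has produced anything:

```javascript
// Hypothetical sketch: events is the merged timeline as [source, value]
// pairs in arrival order; D values are paired with the latest A value.
function pairWithLatest(events) {
  let latestA;                          // undefined until A fires
  const out = [];
  for (const [source, value] of events) {
    if (source === 'A') latestA = value;          // remember most recent A
    else if (latestA !== undefined) out.push([value, latestA]);
    // D values before the first A are dropped, like SkipUntil(A)
  }
  return out;
}

// The marble diagram from the question: D emits 1 2 3 4, A emits a then b
const r = pairWithLatest([
  ['D', 1], ['A', 'a'], ['D', 2], ['D', 3], ['A', 'b'], ['D', 4],
]);
// r → [[2, 'a'], [3, 'a'], [4, 'b']]
```

D's value 1 is dropped because A has not produced anything yet; every later D value is emitted immediately, paired with whatever A most recently produced, which is exactly the R row in the diagram.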

How to merge two streams (without nulls) and apply conditions on pairs?

Suppose I have two streams of data. Is there a way to merge them and apply conditions on data between these two streams? For example:
Stream A : A, B, C, D....
Stream B : -, A, -, -....
Composed : (A,-),(B,A),(C,-),(D,-)....
How do I get the composed stream above using RxJS? I would like to apply conditions on the composed stream to raise some notifications. Also, would it be possible to use the last known non-null data? For example, see the composed stream below:
Stream A : A, B, C, D....
Stream B : 1, null, 2, null....
Composed : (A,1),(B,1),(C,2),(D,2)....
I've just started playing with reactive streams idea, so please correct me if I've misunderstood the idea of reactive streams.
There are two operators that may serve your purpose.
Zip:
Reference for RxJs: https://github.com/Reactive-Extensions/RxJS/blob/master/doc/api/core/operators/zip.md
CombineLatest:
Reference for RxJs: https://github.com/Reactive-Extensions/RxJS/blob/master/doc/api/core/operators/combinelatest.md
The images in those references explain the differences between the two. Once you have combined the observables, you just need to filter out pairs in which one of the values is null, using the where (filter) operator.
Unfortunately neither operators can get this behavior that you describe:
Stream A : A, B, C, D, E....
Stream B : 1, null, 2, null, 3....
Composed : (A,1),(B,1),(C,2),(D,2)....
If you use Zip and Where (filtering null values after), the result will be:
Composed: (A,1),(C,2),(E,3)
If you use Where (filtering null values previously) and Zip, the result will be:
Composed: (A,1),(B,2),(C,3)
If you use CombineLatest, the result will depend on the order in which the events happen in the streams and, of course, on where you put the where operator; the result can be different from what you showed, e.g.:
Stream A : A, B, C, D....
Stream B : 1, null, 2, null....
Composed : (A,1),(B,1),(C,1),(C,2),(D,2).... // OR
Composed : (A,1),(B,1),(B,2),(C,2),(D,2)....
Unless you have more specific requirements, I think one of the options I mentioned is what you are looking for; feel free to add more information.
There are several ways to compose observables; other operators not mentioned are:
distinctUntilChanged, which can be added at the end of the composition, using the key selector function to compare just part of the zipped or latest value.
switch, used to flatten an observable emitted inside another observable.
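That said, the exact composed stream in the question is achievable if you carry the last non-null value forward yourself rather than filtering nulls out. Here is a minimal plain-JavaScript sketch (not RxJS, hypothetical function name) over arrays; in RxJS the same idea is roughly a zip followed by a scan that remembers the last non-null value:

```javascript
// Hypothetical sketch: zip stream A with stream B, replacing each null
// in B with the last non-null value seen so far.
function zipWithLastNonNull(a, b) {
  let last;                                // last non-null value from B
  return a.map((av, i) => {
    if (b[i] !== null && b[i] !== undefined) last = b[i];
    return [av, last];                     // pair A's value with it
  });
}

const r = zipWithLastNonNull(['A', 'B', 'C', 'D'], [1, null, 2, null]);
// r → [['A', 1], ['B', 1], ['C', 2], ['D', 2]]
```

Unlike filtering with where before or after zip, no pairs are dropped and no A values are shifted; the nulls are simply patched with the previous value, which matches the composed stream in the question.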

Use Spring Batch to write in different Data Sources

For a project I need to process items from one table and generate 3 different items for 3 different tables, all 3 in a second data source different from that of the first item. The implementation is done with Spring Batch over Oracle DB. I think this question covers something similar to what I need, but there only one different item is written at the end.
To illustrate the situation:
DataSource 1    DataSource 2
------------    ---------------------------
Table A         Table B   Table C   Table D
The reader should read one item from table A. In the processor, using the information from the item in A, 3 new items will be created of type B, C and D. In addition, the item from table A will be updated.
The writer should be able to write at the same time all 4 items. My first implementation is using a JpaItemWriter to update the item A, but I don't know how the processor could give the other 3 items to the writer in order to save all at the same time.
Can a processor return several items of different types? Would I need to create 4 steps, each one writing one of the items? And in that case, would it be error safe (if there is an error writing D, would A, B, and C be rolled back)?
Thanks in advance for your support!
Your question is really two questions. Let's look at each individually:
Can an ItemProcessor return multiple items
An ItemProcessor can only return one item at a time for each item that is passed in. Because of this, in your specific scenario, you'll need your ItemProcessor to return a wrapper object that wraps items A, B, C, and D.
How can I write different types in the same step
Spring Batch relies heavily on composition in its programming model. Since your ItemProcessor will be returning a wrapper object, you'll end up writing an ItemWriter that unwraps items A, B, C, and D and delegates the writing of each to the appropriate writer. So in the final solution, you'll end up with 5 ItemWriters: one for each item type and one that wraps all of those. Take a look at our CompositeItemWriter as an example here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/support/CompositeItemWriter.java
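Spring Batch itself is Java, but the wrap-then-delegate shape is framework-agnostic. Here is a tiny plain-JavaScript sketch (hypothetical names, not the Spring Batch API) of the idea: the processor emits one wrapper per input item, and a composite writer unwraps it and hands each part to its own writer:

```javascript
// Hypothetical sketch of the composite-writer pattern: one delegate
// writer per item type, and a composite that unwraps each item.
function makeCompositeWriter(writers) {
  // writers: { a, b, c, d } — one write function per item type
  return (wrappedItems) => {
    for (const item of wrappedItems) {
      writers.a(item.a);   // e.g. update item A in data source 1
      writers.b(item.b);   // e.g. insert item B in data source 2
      writers.c(item.c);
      writers.d(item.d);
    }
  };
}

// Fake delegate writers that just record what they were given
const written = { a: [], b: [], c: [], d: [] };
const writer = makeCompositeWriter({
  a: x => written.a.push(x),
  b: x => written.b.push(x),
  c: x => written.c.push(x),
  d: x => written.d.push(x),
});
writer([{ a: 'A1', b: 'B1', c: 'C1', d: 'D1' }]);
// written → { a: ['A1'], b: ['B1'], c: ['C1'], d: ['D1'] }
```

In the real Spring Batch solution the four delegates would be real ItemWriters (e.g. your JpaItemWriter for A), and chunk-level transaction handling around the step governs rollback if one of the delegate writes fails.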

Postgres: n:m intermediate table with type

I have a table called "Tag" which consists of an Id, Name and Description column.
Now let's say I have the tables Character (C), Movie (M), Series (S), etc.
And I want to be able to tag entries in C, M, S with multiple tags and one tag may be used for multiple entries.
So I could realize it like this:
T -> TC <- C
T -> TM <- M
T -> TS <- S
Where TC, TM, TS are the intermediate tables.
I was wondering if I could combine TC, TM, TS into one table with a type column added and still use foreign keys.
As of yet I haven't found a way to do it.
Or is this something I shouldn't be doing?
As the comments above suggested, you can't combine multiple tables into a single one while keeping proper foreign keys. If you want a single view of the tag relationships, you can pull the needed information into a view. That way, you only need to write the longer query once and can then use the view like a single table. Keep in mind that you can't insert data into a view (there are ways to do so, but they are a little advanced).
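As a rough sketch of that view, assuming hypothetical table and column names (tag_character, tag_movie, tag_series as the TC, TM, TS intermediate tables), each underlying table keeps its own real foreign keys while the view adds the type column:

```sql
-- Hypothetical schema: each intermediate table has tag_id plus the
-- entity's own id, with normal foreign keys to Tag and to C/M/S.
CREATE VIEW tagged_entities AS
SELECT tag_id, character_id AS entity_id, 'character' AS entity_type FROM tag_character
UNION ALL
SELECT tag_id, movie_id,    'movie'      FROM tag_movie
UNION ALL
SELECT tag_id, series_id,   'series'     FROM tag_series;
```

Queries like "all entities with tag X" then go against tagged_entities, while inserts still target the specific intermediate table, so referential integrity is preserved.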