Talend sequence generation with data

I have generated my own sequence based on the data. I need to compare the current sequence with the previous sequence generated from the data.
If both sequences are the same, I should not increment the value. If the sequences are different, I need to increment the value using the Numeric.sequence system routine. How can I do that?
Example:
Generated sequence --1234567890 --1
Next sequence --1234567890 --1
If both rows have the same generated sequence, the value should remain the same.

Store the previous sequence in a variable so that you can compare against it. The next row is not available yet while the current one is being processed, so in Talend you compare the current row with the previous one (now == previous) rather than with the next one (now == next).
A tJavaRow should be enough for this: store the previous sequence in a global variable and compare it on the next iteration.
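
The tJavaRow idea above can be sketched in plain Java. This is only an illustration that runs outside Talend: the `globalMap` field and the `sequence` method are local stand-ins for Talend's globalMap and Numeric.sequence, and the names `PreviousKeySequence`, `idFor`, `previousKey`, `previousId` and the third sample key are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class PreviousKeySequence {
    // Stand-in for Talend's globalMap (assumption: used here only to hold two values).
    private final Map<String, Object> globalMap = new HashMap<>();
    // Stand-in for the named counters behind Numeric.sequence.
    private final Map<String, Integer> counters = new HashMap<>();

    // Hypothetical local replacement for Numeric.sequence(name, start, step):
    // the first call returns start, later calls add step.
    int sequence(String name, int start, int step) {
        Integer current = counters.get(name);
        int next = (current == null) ? start : current + step;
        counters.put(name, next);
        return next;
    }

    // Logic that would sit in the tJavaRow: reuse the previous id when the
    // generated key is unchanged, otherwise take the next sequence value.
    int idFor(String key) {
        String previousKey = (String) globalMap.get("previousKey");
        Integer previousId = (Integer) globalMap.get("previousId");
        int id;
        if (previousId != null && key.equals(previousKey)) {
            id = previousId;
        } else {
            id = sequence("s1", 1, 1);
        }
        // Remember the current row for the next iteration.
        globalMap.put("previousKey", key);
        globalMap.put("previousId", id);
        return id;
    }

    public static void main(String[] args) {
        PreviousKeySequence job = new PreviousKeySequence();
        String[] keys = {"1234567890", "1234567890", "1234567891"};
        for (String key : keys) {
            System.out.println(key + " --" + job.idFor(key));
        }
        // Prints:
        // 1234567890 --1
        // 1234567890 --1
        // 1234567891 --2
    }
}
```

In a real job the comparison only works as intended if the rows arrive sorted by the generated key, since only the immediately preceding row is remembered.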

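Use a lookup on the target and filter out rows whose sequence is the same:
SOURCE (row1) -> tMap (filter: row1.sequence != row2.sequence) -> insert out
                  ^
                  |
        Target (lookup row2)
If sequence is a String column, write the filter as !row1.sequence.equals(row2.sequence), because != on Java objects compares references rather than values.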

Azure Data Factory DataFlow md5 specific columns

First of all, I have an array-of-columns parameter called $array_merge_keys
$array_merge_keys = ['Column1', 'Column2', 'NoColumnInSomeCases']
I then want to hash these columns. If the third one, NoColumnInSomeCases, does not exist, I would like to treat it as null or as some placeholder string; otherwise I want to use its value.
But when I pass them to byNames(), it returns NULL because the last column does not exist, even though the first and second still have values. I would expect byNames($array_merge_keys) to always return a value so that I can hash them.
Since that problem cannot be solved, I fell back to filtering for only the columns that exist:
filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item)) => ['Column1', 'Column2']
But that leads to another problem: byNames() cannot be computed on the fly. The error says 'byNames' does not accept column or argument parameters:
array(byNames(filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item))))
Spark job failed: { "text/plain":
"{"runId":"649f28bf-35af-4472-a170-1b6ece50c551","sessionId":"a26089f4-b0f4-4d24-8b79-d2a91a9c52af","status":"Failed","payload":{"statusCode":400,"shortMessage":"DF-EXPR-030
at Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function
'byNames' does not accept column or argument
parameters","detailedMessage":"Failure 2022-04-13 05:26:31.317
failed DebugManager.processJob,
run=649f28bf-35af-4472-a170-1b6ece50c551, errorMessage=DF-EXPR-030 at
Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function
'byNames' does not accept column or argument parameters"}}\n" } -
RunId: 649f28bf-35af-4472-a170-1b6ece50c551
I have tried many approaches, including creating a new derived column (in an earlier stream) to store ['Column1', 'Column2'], but then it says that a column cannot be referenced within the byNames() function.
Is there an elegant solution?
It is true that byName() cannot be evaluated with late binding. You need to either use a Select transformation to set the columns you wish to hash in the stream first, or send in the column names via a parameter. Since that is "early column binding", byName() will work.
You can use a Get Metadata activity in the pipeline to inspect which columns are present in your source before calling the data flow, allowing you to send a pipeline parameter with just the columns you wish to hash.
Alternatively, you can create a new branch, use a Select with a matching rule, then hash the row based on those columns.

Azure Data Flow - First row field value as a custom field for the remaining rows

I am creating a Data Flow in ADF. My requirement is to read one field value from the first row and use it as the session ID for the rest of the rows. I looked into the expressions but didn't find many functions that help with this.
Example source file in blob:
time,phone
2020-01-31 10:00:00,1234567890
2020-01-31 10:10:00,9876543219
Target should be:
SessionID,time,phone
20200131100000,2020-01-31 10:00:00,1234567890
20200131100000,2020-01-31 10:10:00,9876543219
SessionID is a derived column. I need to read the time from the first row, remember it, and apply it to all rows as the SessionID.
How can I read the first row's time value and keep it in a global variable?
Any input is appreciated.
You can use a Lookup activity in the pipeline (check the "First row only" option) and pass the time value to a Data Flow parameter. Then use a Derived Column transformation in the Data Flow to add the SessionID column.
Details:
1. Check the "First row only" option in the Lookup activity.
2. Use this expression to get your expected value:
@replace(replace(replace(activity('Lookup1').output.firstRow.time,'-',''),' ',''),':','')
3. Pass the resulting value to a parameter in the Data Flow.
4. Add the SessionID column in the Data Flow.
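
As a sanity check on the nested replace in step 2, the same string transformation can be reproduced in plain Java. This is only an equivalent sketch of what the expression does to the sample value, not ADF code; the class and method names are made up for the example.

```java
class SessionIdDemo {
    // Mirrors the three nested replace calls: strip '-', ' ' and ':' from the timestamp.
    static String toSessionId(String time) {
        return time.replace("-", "").replace(" ", "").replace(":", "");
    }

    public static void main(String[] args) {
        // First row of the sample source file from the question.
        System.out.println(toSessionId("2020-01-31 10:00:00")); // prints 20200131100000
    }
}
```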

Sequence is giving already given numbers

I'm using a hand-created sequence, datasource_id_seq, to create unique table names by concatenating a string with the numbers returned by select nextval('datasource_id_seq').
The code that creates it is in my very first migration:
create sequence datasource_id_seq;
There is nothing else like that anywhere in the code base.
I recently stumbled onto a bug that turned out to be the sequence returning numbers it had already given out. It was returning values in the 6xx range (six hundreds) while we already have tables with names above 3xxx (three thousands).
Reading the docs (https://www.postgresql.org/docs/12/functions-sequence.html) the only thing I could catch that points to a reset in the sequence is:
If a sequence object has been created with default parameters, successive nextval calls will return successive values beginning with 1
So the only ways to reset a sequence are to recreate it or to use setval(), neither of which appears in the code base.
My question is: how can a sequence get reset? What other ways are there to reset a sequence?
With a properly working, bug-free PostgreSQL server, it is not possible for the same sequence number to be allocated two or more times. A number is not even "returned" to the sequence in case of a rollback. The only ways to change the next value are the ones you've already mentioned, plus ALTER SEQUENCE ... RESTART.
Look for the problem in the deployment of your app. I'd speculate that the sequence got dropped and recreated.

Autoincrement using Sequences is not working as expected

I am currently working on a job that looks something like this.
The design is to extract some customer data (say first name and last name) to one Excel file, while other data (say address) goes to another Excel file. I added an identity column in the tMap using Numeric.sequence("s1",1,1), but in one file it produces 1,3,5,7,9,11,13,... and in the other Excel file it produces 2,4,6,8,10,12,...
But I need both Excel files to have the same identity 1,2,3,4,5,6,...,N
so that I can map the records.
Can somebody guide me on this?
Edit:
The auto-increment returns 1,2,3,4,5,6,... which is fine when there is only one tMap component in the job, but not when two tMaps are used. Why?
This is because the numeric sequence is static. Since you have only one sequence called "s1", it is incremented twice per row (once for each tMap it's invoked in).
Just use unique labels (e.g. "s1" and "s2") to force the use of two independent sequences, which solves your problem.
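
The effect can be reproduced with a small stand-alone Java sketch. The `sequence` method below is a local stand-in that assumes Numeric.sequence keeps one static counter per name; the class name, the `reset` helper and the labels "a" and "b" are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class SharedSequenceDemo {
    // One static counter per sequence name, shared by every caller in the job.
    private static final Map<String, Integer> COUNTERS = new HashMap<>();

    // Stand-in for Numeric.sequence(name, start, step).
    static int sequence(String name, int start, int step) {
        Integer current = COUNTERS.get(name);
        int next = (current == null) ? start : current + step;
        COUNTERS.put(name, next);
        return next;
    }

    // Helper for the demo only: forget all counters.
    static void reset() {
        COUNTERS.clear();
    }

    public static void main(String[] args) {
        // Both tMaps ask for "s1": each row consumes two values.
        for (int row = 0; row < 3; row++) {
            int firstTMap = sequence("s1", 1, 1);
            int secondTMap = sequence("s1", 1, 1);
            System.out.println(firstTMap + " / " + secondTMap);
        }
        // Prints 1 / 2, 3 / 4, 5 / 6

        // Each tMap uses its own label: both outputs get the same numbering.
        for (int row = 0; row < 3; row++) {
            int firstTMap = sequence("a", 1, 1);
            int secondTMap = sequence("b", 1, 1);
            System.out.println(firstTMap + " / " + secondTMap);
        }
        // Prints 1 / 1, 2 / 2, 3 / 3
    }
}
```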

DB2 Auto generated Column / GENERATED ALWAYS pros and cons over sequence

Earlier we were using 'GENERATED ALWAYS' to generate the values for a primary key. But now it is suggested that, instead of 'GENERATED ALWAYS', we should use a sequence to populate the primary key. What do you think could be the reason for this change? Is this just a matter of choice?
Earlier Code:
CREATE TABLE SCH.TAB1
(TAB_P INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1, NO CACHE),
.
.
);
Now it is
CREATE TABLE SCH.TAB1
(TAB_P INTEGER,
.
.
);
Now, while inserting, the value for TAB_P is generated via a sequence.
I tend to use identity columns more than sequences, but I'll compare the two for you.
Sequences can generate numbers for any purpose, while an identity column is strictly attached to a column in a table.
Since a sequence is an independent object, it can generate numbers for multiple tables (or anything else), and is not affected when any table is dropped. When a table with an identity column is dropped, there is no memory of what value was last assigned by that identity column.
A table can have only one identity column, so if you want to record multiple sequential numbers in different columns of the same table, sequence objects can handle that.
The most common requirement for a sequential number generator in a database is to assign a technical key to a row, which is handled well by an identity column. For more complicated number generation needs, a sequence object offers more flexibility.
This is probably to handle ids in case there are lots of deletes on the table.
For example, with an identity column, if your ids are
1
2
3
and you delete record 3, your table will have
1
2
If you then insert a new record, the ids will be
1
2
4
By contrast, if you are not using an identity column and are generating the id in code, then after the delete you can calculate the id for the new insert as max(id) + 1, so the ids will be in order
1
2
3
I can't think of any other reason why an identity column should not be used.
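
The difference between the two allocation styles described above can be sketched in a few lines of Java. This only models the arithmetic, not DB2 itself; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class IdAllocationDemo {
    // Identity-style: the generator only remembers the last value it issued.
    static int nextFromCounter(int lastIssued) {
        return lastIssued + 1;
    }

    // Application-style: derive the next id from the rows that still exist.
    static int nextFromMax(List<Integer> ids) {
        return ids.isEmpty() ? 1 : Collections.max(ids) + 1;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>(Arrays.asList(1, 2, 3));
        int lastIssued = 3;

        ids.remove(Integer.valueOf(3)); // delete record 3

        System.out.println(nextFromCounter(lastIssued)); // prints 4
        System.out.println(nextFromMax(ids));            // prints 3
    }
}
```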
Here's something I found on the publib site:
Comparing IDENTITY columns and sequences
While there are similarities between IDENTITY columns and sequences, there are also differences. The characteristics of each can be used when designing your database and applications.
An identity column has the following characteristics:
An identity column can be defined as part of a table only when the table is created. Once a table is created, you cannot alter it to add an identity column. (However, existing identity column characteristics might be altered.)
An identity column automatically generates values for a single table.
When an identity column is defined as GENERATED ALWAYS, the values used are always generated by the database manager. Applications are not allowed to provide their own values during the modification of the contents of the table.
A sequence object has the following characteristics:
A sequence object is a database object that is not tied to any one table.
A sequence object generates sequential values that can be used in any SQL or XQuery statement.
Since a sequence object can be used by any application, there are two expressions used to control the retrieval of the next value in the specified sequence and the value generated previous to the statement being executed. The PREVIOUS VALUE expression returns the most recently generated value for the specified sequence for a previous statement within the current session. The NEXT VALUE expression returns the next value for the specified sequence. The use of these expressions allows the same value to be used across several SQL and XQuery statements within several tables.
While these are not all of the characteristics of these two items, these characteristics will assist you in determining which to use depending on your database design and the applications using the database.
I don't know why anyone would EVER use an identity column rather than a sequence.
Sequences accomplish the same thing and are far more straightforward. Identity columns are much more of a pain, especially when you want to unload data and load it into other environments. I am not going to go into all the differences, as that information can be found in the manuals, but I can tell you that the DBAs almost always have to get involved whenever a user wants to migrate data from one environment to another and a table with an identity column is involved, because it can get confusing for the users. We have no issues when a sequence is used. We allow the users to update any schema objects, so they can alter their sequences if they need to.