I'm making an ETL job in Talend that adds data to a table.
The table belongs to an old system and its primary key is not auto-incremented.
So I have to get the maximum id plus one every time I insert a row into the table.
I'm trying to use a variable in the Expression Builder:
First I get the max and store it in a context variable,
then in the Expression Builder:
Context.Max += 1
The problem is that I get the same id every time; the incremented value is never saved back.
Finally I found what I was looking for:
Numeric.sequence("var2", Context.Max, 1)
This returns Context.Max on the first call and then increments by 1 on every subsequent call, keeping the counter state under the name "var2".
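For illustration only, here is a minimal plain-Java sketch of how such a named sequence behaves (this is not Talend's source; the class name and the value 100 standing in for Context.Max are just assumptions for the example):

import java.util.HashMap;
import java.util.Map;

// Rough sketch of a named sequence like Talend's Numeric.sequence:
// one counter per sequence name, the start value is returned on the
// first call and then incremented by `step` on every later call.
public class SequenceSketch {
    private static final Map<String, Integer> counters = new HashMap<>();

    public static int sequence(String name, int start, int step) {
        Integer current = counters.get(name);
        int next = (current == null) ? start : current + step;
        counters.put(name, next);
        return next;
    }

    public static void main(String[] args) {
        int max = 100; // stands in for Context.Max
        System.out.println(sequence("var2", max, 1)); // 100
        System.out.println(sequence("var2", max, 1)); // 101
        System.out.println(sequence("var2", max, 1)); // 102
    }
}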
I am creating a Data Flow in ADF. My requirement is to read one field value from the first row and use it as the session id for the rest of the rows. I looked into the expressions but didn't find functions that would help with this.
Example source file in blob:
time,phone
2020-01-31 10:00:00,1234567890
2020-01-31 10:10:00,9876543219
Target should be:
SessionID,time,phone
20200131100000,2020-01-31 10:00:00,1234567890
20200131100000,2020-01-31 10:10:00,9876543219
SessionID is a derived column. I need to read the time from the first row, remember it, and apply it to all rows as the SessionID.
How can I read the first row's time value and keep it in a global variable?
Any inputs are appreciated.
You can use a Lookup activity in the pipeline (check the First row only option) and pass the time value to a Data Flow parameter. Then use a Derived Column transformation in the Data Flow to add the SessionID column.
Details:
1. Check the First row only option in the Lookup activity.
2. Use this expression to get the expected value:
@replace(replace(replace(activity('Lookup1').output.firstRow.time,'-',''),' ',''),':','')
3. Pass this value to a parameter in the Data Flow.
4. Add the SessionID column in the Data Flow.
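For example, assuming the Data Flow parameter is named sessionId (use whatever name you created in step 3), the Derived Column expression for the new SessionID column can simply be $sessionId, since Data Flow parameters are referenced with a leading $.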
I have a Talend job where I require a lookup on the target table.
Naturally the target table is large (a fact table), so I don't want to wait for the whole thing to load before running lookups, as in the picture below:
Is there a way to have the lookup work DURING the pull from the main source?
The goal is to speed up the initial loads and to save on memory. As you can see, the lookup has already passed 3 million rows.
The tLogRow represents the same table as the lookup.
You can achieve what you're looking for by configuring the lookup in your tMap to use the "Reload at each row" lookup model instead of "Load once". This lookup model re-executes your lookup query for each incoming row instead of loading the whole lookup table at once, which is useful for lookups on large tables.
When you select the "Reload at each row" model, you have to specify a lookup key in the global map section that appears under the settings. Create a key with a name like "ORDER_ID" and map it to the FromExt.ORDER_ID column. Then modify your lookup query so that it returns a single match for that ORDER_ID, like so:
"SELECT col1, col2, ... FROM lookup_table WHERE id = '" + (String)globalMap.get("ORDER_ID") + "'"
This assumes your id column is a string.
What this does is create a global variable called "ORDER_ID" containing the order id for every incoming row from your main connection, then executes the lookup query filtering for that id.
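For illustration only, here is a plain-Java sketch of what happens for each main-flow row (the "ORDER_ID" key and the column names come from the example above; in the real job Talend manages globalMap for you inside the tMap):

import java.util.HashMap;
import java.util.Map;

// Sketch of the per-row behaviour of "Reload at each row": the key from the
// main flow is put into globalMap, then the lookup query is rebuilt and run
// with that value for the current row.
public class ReloadAtEachRowSketch {
    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();

        // For the current incoming row, FromExt.ORDER_ID is stored under the key "ORDER_ID".
        globalMap.put("ORDER_ID", "A-1001");

        // The lookup query is then built exactly as written in the tMap lookup settings.
        String query = "SELECT col1, col2 FROM lookup_table WHERE id = '"
                + (String) globalMap.get("ORDER_ID") + "'";
        System.out.println(query);
        // -> SELECT col1, col2 FROM lookup_table WHERE id = 'A-1001'
    }
}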
I am pretty new to Pentaho, so my question might sound very novice.
I have written a transformation in which I am using a CSV file input step and a Table input step.
Steps I followed:
Initially, I created a parameter in the transformation properties. The parameter birthdate doesn't have any default value set.
I used this parameter in the PostgreSQL query in the Table input step in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using the CSV file input step. How do I assign the birthdate value that is present in my CSV file to the parameter I created in the transformation?
(OR)
Could you guide me through the process of assigning the CSV field value directly to the SQL query used in the Table input step, without using a parameter?
TL;DR:
I recommend using a "Database join" step, as in my third suggestion below.
See the last image for reference.
First idea - Using Table Input as originally asked
Well, you don't need a parameter for that at all, unless you are going to provide its value when you ask the transformation to run. If you need to read the value from a CSV, you can do it with this approach.
First, read your CSV and make sure your rows are ok.
After that, use a select values to keep only the columns to be used as parameters.
In the table input, use a placeholder (?) to determine where to place the data and ask it to run for each row that it receives from the source step.
Just keep in mind that the order of the columns received by the Table input step (the columns coming out of the Select values step) is the same order in which they are used for the placeholders (?). This isn't a problem for your question, which uses only one placeholder, but keep it in mind as you ramp up with Pentaho.
Second idea, using a Database Lookup
This is another approach where you can't customize the query sent to the database, but you may get better performance because you can set the "Enable cache" flag. If you don't need to use a function in your WHERE clause, this approach is really recommended.
Third idea, using a Database Join
That is my recommended approach if you need a function in your WHERE clause. It looks a lot like the Table input approach, but you can skip the Select values step and choose which columns to use, repeat the same column as many times as needed, and enable an "outer join" flag that also returns the rows for which the query finds no result.
Pro tip: if the transformation runs too slowly, try using multiple copies of the step (documentation here) and, obviously, make sure the table has the appropriate indexes in place.
Yes, there's a way of assigning it directly without the use of a parameter. Do as follows.
Use a "Block this step until steps finish" step to halt the Table input step until the CSV file input step completes.
Following is how you configure each step.
Note:
The Postgres query should be: select * from person where EXTRACT(YEAR FROM birthdate) > ?::integer
Check "Execute for each row" and "Replace variables" in the Table input step.
Select only the birthdate column in the CSV file input step.
I have a db as follows:
score:0
timeScore:86400
totalScore:0
time:1234567777 // Any time stamp
Now, every time the user votes up:
I increment ($inc) score by +1.
Then I update timeScore to be, e.g., 86400 / (nowTimestamp() - time + 1).
Then I update totalScore to be (timeScore + score).
E.g., the final values after the 2nd update:
score:1
timeScore:86400
totalScore:86401
time:1234567777
The problem is that during my external calculation, another user might add +1 to the score, calculate the total, and write their values before I even update my data, so there would be data corruption.
Now how do I solve this, or how do I make it thread safe?
You should be able to use the MongoDB findAndModify function, which allows you to make atomic updates, meaning that while your document is being changed its value cannot be updated by another query. Docs available here.
You may also wish to look at doing as many calculations before you push your data to storage as this will remove the need to read then write.
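As a rough sketch only, using the current MongoDB Java driver (where findOneAndUpdate is the equivalent of findAndModify; the connection string, database, collection, and _id below are placeholders, not values from the question):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class AtomicVoteSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> scores =
                    client.getDatabase("mydb").getCollection("scores");

            // Atomically increment score and get the updated document back in one
            // operation, so no other writer can slip in between the read and the write.
            Document updated = scores.findOneAndUpdate(
                    Filters.eq("_id", "someDocId"),
                    Updates.inc("score", 1),
                    new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER));

            // timeScore and totalScore can then be recomputed from `updated`
            // (or, better, included as additional operators in the same update).
            System.out.println(updated);
        }
    }
}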
One good way is to save with optimistic concurrency: keep a version field in every document so that thread safety can be implemented (an update only succeeds if the version has not changed since the document was read; otherwise you re-read and retry).
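A minimal sketch of that idea with the MongoDB Java driver (the collection, field names, and numeric types are assumptions for the example):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;

public class OptimisticConcurrencySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> scores =
                    client.getDatabase("mydb").getCollection("scores");

            // Read the document and remember the version we saw.
            Document doc = scores.find(Filters.eq("_id", "someDocId")).first();
            long expectedVersion = doc.getLong("version");
            long newTotal = doc.getLong("timeScore") + doc.getLong("score");

            // The update only matches if nobody has bumped the version in the meantime.
            UpdateResult result = scores.updateOne(
                    Filters.and(Filters.eq("_id", "someDocId"),
                                Filters.eq("version", expectedVersion)),
                    Updates.combine(Updates.set("totalScore", newTotal),
                                    Updates.inc("version", 1)));

            if (result.getMatchedCount() == 0) {
                // Another writer won the race: re-read the document and retry.
            }
        }
    }
}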
How can I calculate the total number of records in a table? I want to show all table names in a DB along with the number of records in each table.
The fastest method is:
proutil dbname -C tabanalys > dbname.tab
This is an external utility that analyzes the db.
You can also, of course, read every record and count them, but that tends to be a lot slower.
The way to get the number of records depends on the application you are planning.
Our DBAs just use the Progress utilities. On Unix, /usr/dlc/bin/proutil -C dbanalys (or some variation) gets the database information, and they just dump that to a file.
To get the schema information from Progress itself you can use the VST tables. Specifically, within a particular database you can use the _file table to retrieve all of the table names.
Once you have the table names you can use queries to get the number of records in each table. The fastest way to query a particular table for a record count is to use a PRESELECT.
This will require the usage of a dynamic buffer and query.
So you can do something like the following.
CREATE WIDGET-POOL.

DEF VAR h_predicate AS CHAR   NO-UNDO.
DEF VAR h_qry       AS HANDLE NO-UNDO.
DEF VAR h_buffer    AS HANDLE NO-UNDO.

/* Loop over the schema table _file to get every table name in the database. */
FOR EACH _file NO-LOCK:
    /* PRESELECT builds the result list up front, so NUM-RESULTS gives the row count. */
    h_predicate = "PRESELECT EACH " + _file._file-name + " NO-LOCK".

    CREATE BUFFER h_buffer FOR TABLE _file._file-name.
    CREATE QUERY h_qry.
    h_qry:SET-BUFFERS( h_buffer ).
    h_qry:QUERY-PREPARE( h_predicate ).
    h_qry:QUERY-OPEN().

    DISP _file._file-name h_qry:NUM-RESULTS.

    DELETE OBJECT h_qry.
    DELETE OBJECT h_buffer.
END.
An easy one:
SELECT COUNT(*) FROM tablename.

A bit more complex:
DEF VAR i AS INT NO-UNDO.
FOR EACH tablename NO-LOCK:
    i = i + 1.
END.
DISPLAY i.

(Replace tablename with the table you want to count.)
For a more complex answer, you've got the others.
Use the CURRENT-RESULT-ROW function with DEFINE QUERY and GET LAST to get the total number of records:
e.g.
DEFINE QUERY qCustomer FOR Customer SCROLLING.
OPEN QUERY qCustomer FOR EACH Customer NO-LOCK.
GET LAST qCustomer.
DISPLAY CURRENT-RESULT-ROW("qCustomer") LABEL "Total number of rows".
...
CLOSE QUERY qCustomer.