OnComponentOrder flow and tMap connections in Talend

I have the following flow:
1 component that needs to be executed first to extract a certain timestamp from MySQL
3 MySQL inputs that need to use that timestamp
1 tMap which needs to receive the 3 MySQL inputs
However, I am not allowed to connect the 3 MySQL inputs to the single tMap, because they all depend on the first component (through OnComponentOk) but in a different order. How do I orchestrate this sort of situation?

You could execute a query and set a global variable using the tSetGlobalVar component (referencing row1.mydate, for example), then, in each of your queries going into tMap, reference the global variable like:
"SELECT ...
FROM ...
WHERE mydate >= '" + (String) globalMap.get("myDate") + "'"
That means two subjobs: one for getting the variable and storing it, and another for running your three queries into tMap, etc.
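A minimal sketch of that two-subjob pattern, assuming the first subjob reads the timestamp into a flow named row1 with a column mydate, and the subjobs are linked via OnSubjobOk (component and variable names here are illustrative):

// Subjob 1 - tJavaRow after the MySQL input that fetches the timestamp
// (equivalent to tSetGlobalVar with the key "myDate")
globalMap.put("myDate", row1.mydate);

// Subjob 2 - query of each of the three tMysqlInput components,
// started from subjob 1 via a single OnSubjobOk trigger
"SELECT ...
FROM ...
WHERE mydate >= '" + (String) globalMap.get("myDate") + "'"

Linking the two subjobs with a single OnSubjobOk avoids having three OnComponentOk triggers with conflicting orders.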

Related

How do I query PostgreSQL with IDs from a parquet file in a Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in parquet files. I need to join some data from a PostgreSQL database that is in an AWS tenancy by a unique ID. I am using a dataflow to create the unique ID I need from two separate columns using a concatenation. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query to the database, but I can't figure out how to create the list of IDs in a parameter that I can use in the select statement out of the dataflow output. I tried using a Set Variable activity and was going to put that into a ForEach, but the Set Variable activity doesn't like the output of the dataflow (object instead of array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform it to an array, but I think the sink operation is putting it back into JSON?
What is a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The parquet file has a small number of unique IDs compared to the total unique IDs in the database.
If this were an Azure PostgreSQL database I could just use a join in the dataflow to do the join, but the generic PostgreSQL driver isn't available in dataflows. I can't copy the entire database over to Azure just to do the join, and I need the dataflow in Azure for non-technical reasons.
Edit:
For clarity's sake, I am trying to replace local Python code that does the following:
query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(conn, query + str(tuple(df.id_number))
I'd like to replace this locally executed code with ADF. In the ADF process, I have to replace the transformation of the IDs, which seems easy enough in a couple of different ways. Once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict the query to only the IDs I need, so I don't bring down the entire database.
ADF variables can only store simple types, but ADF parameters support elements of any type, including complex JSON structures. So we can define an Array type parameter in ADF and set its default value.
For example:
Define a JSON array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array type parameter and set the JSON array as its default value:
This way we will not get the type error.

tExtractRegexField unable to act as lookup to tMap in Talend DI

I have a tExtractRegexField which extracts a date from a string of text coming from an ExcelFileInput and outputs the dates to tLogRow, but I can't connect the same output as a lookup column to a tMap that has a 2nd ExcelFileInput as its main input.
If I connect the tExtractRegexField to the tMap first, I can't then connect the 2nd ExcelFileInput, and vice versa.
I'm using Talend 6.3.1, and for testing I am able to connect 2 x ExcelFileInput to a tMap, so I don't think it's a problem with my system setup.
I have also tried tJoin instead of tMap, but I encounter the same issue (I can't connect both inputs together, but I can connect "A" or "B" first).
Overview of Process
Problem Area
The tExcelFileInput uses globalMap to get the path to the Excel file from the preceding tFlowToIterate.
Based on discussions on the Talend forum, the issue may have been down to a desire by Talend DI to avoid circular references.
An alternative solution is to extract the regex field from the header row and store it in a global variable using a tJavaRow with globalMap.put("MyVal", row.Data);, then OnComponentOk read the remaining data from the body rows, and in the tMap recall the global variable MyVal and include it as needed in your tMap output, as in the sketch below.
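A minimal sketch of that alternative, assuming the header-row flow into the tJavaRow is named row1 and carries the extracted value in a column called Data (names here are illustrative):

// tJavaRow on the header-row flow: stash the extracted value
globalMap.put("MyVal", row1.Data);

// tMap output expression on the body-row flow, which is started
// from the first subjob via OnComponentOk
(String) globalMap.get("MyVal")

Because the second flow only reads the globalMap instead of taking a lookup row, there is no circular reference for Talend to reject.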

talend - put logic before tOutput

I have a Talend job that has the following components:
- 1 database input component that fetches data from the database
- 1 output component that writes the fetched data into a flat file
Now I have a scenario wherein, of two fields in the fetched data, only one should be put into the output, based on some if/else logic.
Can anyone assist me in this matter? Is it done using tMap?
Your if/else logic can be put in the tMap, as a ternary operator in the output of your tMap:
condition ? resultIfOK : resultIfKO
For example, say you want to insert into your file the value of ColumnA if it matches "MY CONDITION", and otherwise insert the value of ColumnB.
You'll have, directly in the output row of the tMap:
"MY CONDITION".equals(row1.ColumnA) ? row1.ColumnA : row1.ColumnB
You can't directly use an if/else statement in a tMap expression. If you have several conditions, consider using a routine instead of multiple nested ternary operators, as sketched below.
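A minimal sketch of such a routine, assuming a user routine class named MyRoutines under Code > Routines (the class and method names are illustrative):

package routines;

public class MyRoutines {

    /**
     * Returns ColumnA when it matches the condition, otherwise
     * falls back to ColumnB: plain if/else instead of nested
     * ternary operators, so extra branches stay readable.
     */
    public static String pickValue(String columnA, String columnB) {
        if ("MY CONDITION".equals(columnA)) {
            return columnA;
        }
        return columnB;
    }
}

In the tMap output row you would then simply call MyRoutines.pickValue(row1.ColumnA, row1.ColumnB).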

How to assign a CSV field value to a SQL query written inside the Table input step in Pentaho Spoon

I am pretty new to Pentaho, so my query might sound very novice.
I have written a transformation in which I am using a CSV file input step and a Table input step.
Steps I followed:
Initially, I created a parameter in the transformation properties. The parameter birthdate doesn't have any default value set.
I have used this parameter in the PostgreSQL query in the Table input step in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using the CSV file input step. How do I assign the birthdate value which is present in my CSV file to the parameter which I created in the transformation?
(OR)
Could you guide me through the process of assigning the CSV field value directly to the SQL query used in the Table input step, without the use of a parameter?
TL;DR:
I recommend using a Database join step, as in my third suggestion below.
See the last image for reference.
First idea - Using Table Input as originally asked
Well, you don't need any parameter for that, unless you are going to provide the value for that parameter when asking the transformation to run. If you need to read data from a CSV, you can do that with this approach.
First, read your CSV and make sure your rows are ok.
After that, use a Select values step to keep only the columns to be used as parameters.
In the Table input step, use a placeholder (?) to determine where to place the data, and ask it to run for each row that it receives from the source step.
Just keep in mind that the order of columns received by the Table input step (the columns out of the Select values step) is the same order in which they will be used for the placeholders (?). This should not be a problem for your question, which uses only one placeholder, but keep it in mind as you ramp up using Pentaho.
Second idea - Using a Database lookup
This is another approach, in which you can't personalize the query made to the database, but you may experience better performance because you can set the "Enable cache" flag; if you don't need to use a function in your where clause, this is really recommended.
Third idea - Using a Database join
This is my recommended approach if you need a function in your where clause. It looks a lot like the Table Input approach, but you can skip the Select values step and select which columns to use, repeat the same column a bunch of times, and enable an "Outer join" flag that returns the rows without a result from the query.
ProTip: If you feel the transformation is running too slowly, try to use multiple copies of the step (documentation here) and, obviously, make sure the table has the appropriate indexes in place.
Yes, there's a way of assigning the value directly, without the use of a parameter. Do as follows.
Use a Block this step until steps finish step to halt the Table input step till the CSV file input step completes.
Following is how you configure each step.
Note:
The Postgres query should be select * from person where EXTRACT(YEAR FROM birthdate) > ?::integer
Check Execute for each row and Replace variables in script in the Table input step.
Select only the birthdate column in the CSV file input step.

Passing a value from tPostgresqlInput to a context variable

I need to pass a value from tPostgresqlInput to a context variable, so that the context variable's value can be used in other components.
The query used in the tPostgresqlInput is:
select max(started_on) started_on from etl_log
I have created a context variable started_on_date (Date datatype).
In the tJavaRow:
context.started_on_date = row1.started_on;
But it throws an error:
created_on variable cannot be resolved or is not a field
Have you defined the schema in the tPostgresqlInput component? If not, that needs to be done first. Afterward, synchronize the schema of the tJavaRow. You can use the tJavaRow's code-generation feature, if appropriate.
Question: if you want to do row-based processing in the same job, there is likely no need to put the started date in the context.
If you want to do non-row-based processing, you can use the tJavaRow component to put the value in the globalMap. This assumes there is only one row of data, or that you only care about the last row. Then you can use that value in other components which are not processing a flow (rows); tJava is an example of that. A sketch follows.
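A minimal sketch of that globalMap variant, assuming the flow into the tJavaRow is named row1 and its schema has a Date column started_on (names here are illustrative):

// tJavaRow: the max() query returns a single row, so stash its value
globalMap.put("started_on", row1.started_on);

// tJava later in the job (for example after an OnSubjobOk trigger):
java.util.Date startedOn = (java.util.Date) globalMap.get("started_on");
System.out.println("Max started_on: " + startedOn);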