Comparing records in the same column and performing concatenation - DataStage

My sample file is:
101,name1,gold
102,name2,gold
101,name1,house
I need to compare the names; if they are the same, then the third column has to be concatenated using a pipe delimiter.
For example: 101,name1,gold|house
I need to achieve this in a DataStage Transformer.
Please help with this.

Sort by Col1 and Col2 before you enter the Transformer.
Use a stage variable to concatenate Col3 to the previous Col3 (stored in another stage variable) and reset it when LastRowInGroup is reached.
Use the LastRowInGroup functionality as a condition to output data.
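Here is a minimal Python sketch of that stage-variable logic (illustrative only, not DataStage syntax); it assumes the input is already sorted by Col1 and Col2, exactly as the Transformer would receive it:
rows = [
    ("101", "name1", "gold"),
    ("101", "name1", "house"),
    ("102", "name2", "gold"),
]
output = []
concat = ""
for i, (col1, col2, col3) in enumerate(rows):
    # running stage variable: append Col3 to the group's value, pipe-delimited
    concat = col3 if concat == "" else concat + "|" + col3
    # LastRowInGroup: true when the next row starts a new (Col1, Col2) group
    last_in_group = i == len(rows) - 1 or rows[i + 1][:2] != (col1, col2)
    if last_in_group:
        output.append((col1, col2, concat))  # output condition
        concat = ""  # reset for the next group
print(output)  # [('101', 'name1', 'gold|house'), ('102', 'name2', 'gold')]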

Related

Azure Data Factory: return an array of dates from a specified range

I'm trying to return an array of dates in Data Factory, but I just want the user to specify a date range with two parameters, startDate and endDate.
I want to return this array by specifying "12-08-2020" and "12-13-2020" in the trigger:
["12-08-2020","12-09-2020","12-10-2020","12-11-2020","12-12-2020","12-13-2020"]
I did not find a simple way to do it yet.
One way I thought about would be:
add a lookup activity on a date dimension,
then add two filters to select only items greater than startDate and lower than endDate.
But this seems cumbersome and overkill. Is there a simpler way to do it?
EDIT:
This answer seems to be relevant (I did not see it at first): Execute azure data factory foreach activity with start date and end date
I think we can use a recursive query in a Lookup activity.
The pseudo code is as follows:
In SQL we can use this query to get a table:
;with temp as
(
select CONVERT(varchar(100),'12-08-2020', 110) as dt
union all
select CONVERT(varchar(100), DATEADD(day,1,dt), 110) from temp
where datediff(day,CONVERT(varchar(100), DATEADD(day,1,dt), 110),'12-13-2020')>=0
) select * from temp
The result is a table with one row per date from the start date through the end date.
So in ADF, I think we can use a Lookup SQL query to return the result you want.
According to this official document, we only need to replace the parameters of the SQL statement.
Next, I will use '#{pipeline().parameters.startDate}' to return a date string. Note that there is a pair of single quotes outside.
I set two pipeline parameters, startDate and endDate.
Type the following code into a Lookup activity:
;with temp as
(
select CONVERT(varchar(100),'#{pipeline().parameters.startDate}', 110) as dt
union all
select CONVERT(varchar(100), DATEADD(day,1,dt), 110) from temp
where datediff(day,CONVERT(varchar(100), DATEADD(day,1,dt), 110),'#{pipeline().parameters.endDate}')>=0
) select * from temp
Don't select First row only.
The debug result returns one row per date in the range.
I had a similar use case and ended up using Until with a few changes.
The pipeline takes two parameters, start_day and end_day.
You also have to introduce two variables to implement the counter logic; more details can be found at how-to-increment-parameter-data-factory.
Finally, the expression in the Until block is
@less(int(adddays(pipeline().parameters.end_day, 0, 'yyyyMMdd')), int(adddays(pipeline().parameters.start_day, int(variables('counter')), 'yyyyMMdd')))
A final note: the Until block executes while the expression returns false and exits the loop once it returns true.
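For clarity, here is a minimal Python sketch of that counter logic (illustrative only, not ADF code); start_day, end_day, and the counter mirror the pipeline parameters and variable:
from datetime import datetime, timedelta

start_day = datetime(2020, 12, 8)
end_day = datetime(2020, 12, 13)
counter = 0
dates = []
while True:
    dates.append((start_day + timedelta(days=counter)).strftime("%m-%d-%Y"))
    counter += 1
    # Until semantics: keep looping while the expression is false,
    # exit once end_day < start_day + counter days
    if end_day < start_day + timedelta(days=counter):
        break
print(dates)  # ['12-08-2020', ..., '12-13-2020']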
I managed to get something similar to work with a combination of a Derived Column transformation using the mapLoop() function followed by a Flatten transformation.
The derived column expression first calculates an array of dates in a single column:
mapLoop(toInteger((To_Date - From_Date)/86400000), toDate(addDays(From_Date, #index)))
where 86400000 is the number of milliseconds in 24 hours.
The Flatten transformation then uses this column to unroll the array into separate rows.
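Here is a small Python sketch of those two steps, with assumed dates (illustrative only, not ADF expression code):
from datetime import date, timedelta

from_date, to_date = date(2020, 12, 8), date(2020, 12, 13)
# derived column / mapLoop(): build an array of dates in a single column
date_array = [from_date + timedelta(days=i)
              for i in range((to_date - from_date).days + 1)]
# flatten: unroll the array column into one row per date
for d in date_array:
    print(d.strftime("%m-%d-%Y"))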

Fetch adjacent column word?

I want a condition like "if max(col1) then col2 end" in a Tableau calculated field; the output will be "e".
Thanks in advance.
Simply use
IF [col1] = {MAX([col1])} THEN [col2] END
Drop this in the view and hide nulls, and you'll get the desired value.
I've tried it with these steps and it looks to be working :)
# first find the max, using a fixed LOD calculation to do it across the whole dataset. The 1 is the same value across all rows, so it's doing a group max where the group is the full data
MaxCol1: {FIXED 1: MAX([Col1])}
# pick out rows where the max is found in Col1
Col1Match: [Col1] = [MaxCol1]
# find Col2 where there's a match and fill the rest with NULLs
Col2Value: IF [Col1Match] THEN [Col2] ELSE NULL END
# this will be a bunch of NULLs and "e"
# finally, do the same as in the first calculation to get the result
Output: {FIXED 1: MAX([Col2Value])}
You can try combining some of these steps to clean things up a bit. Does this suffice, or is your real data more complicated?
Best,
Jonny

Discard blanks in DataStage

I have a DataStage job which takes data from a file to a Dataset, and for one column I would like to make a transformation in order to exclude the rows where that column has no value.
For example, I use the following rule in the Transformer, where I put 0 every time I find no value in the column lcvInstalmentOriginalStr, but I need such rows to be discarded from the beginning:
If lcvInstalmentOriginalStr <> "" Then StringToDecimal(lcvInstalmentOriginalStr) Else 0
Thank you
You can use a constraint within a Transformer stage (for example) to put only those rows that hold data on the output link (to the Dataset).
The constraint could look like this:
lcvInstalmentOriginalStr <> ""

In a data flow task, how do I restrict rows flowing using a value from another source?

I have an Excel sheet with many tabs. Say one is called wsMain and the other is called wsDate.
In my data flow transformation I am able to successfully load the data from wsMain to my table.
Now I have to update this transformation where I have to fetch the maximum date from the worksheet wsDate and only load data from wsMain where the date is less than or equal to the maximum date in wsDate (that is the only column available).
So far I have figured out that I need to create a new Excel connection manager to read the data from wsDate, and I have used the Aggregate transformation to get the maximum date.
Now the question is how do I use this date to restrict the rows coming from wsMain?
I understand from the link below that you can store the value in a variable, but what do I do next?
SSIS set result set from data flow to variable
I have tried using a merge join, but I am not sure if I am doing it right.
I could not achieve the above but would be interested to know if that is possible. As a workaround, I have created a separate data flow where I have stored the value in a variable and then used the variable in a conditional split to filter the required rows.
Here is a step by step guide I followed to write the variable:
https://www.proteanit.com/2008/12/11/ssis-writing-to-a-package-variable-in-a-dataflow/
You can obtain the maximum value of the wsDate column first, then use this as a filter to avoid introducing unnecessary records into the data flow that would otherwise be discarded by the Conditional Split. An overview of this process is below. I'd also recommend confirming the data types of all columns involved.
Create an SSIS DateTime variable and name this something descriptive such as MaxDate.
Create a Data Flow Task before the current one with an Excel Source component. Use the SQL command option for the Data Access Mode and enter a SQL statement to return the max value of the wsDate column. In the following example, ExcelSource is the name of the sheet that you're pulling from. I'd suggest confirming the query with the Preview button on the Excel Source as well.
Add a Script Component (not Task) after the Excel Source. Add the MaxDate variable in the ReadWriteVariables field on the main page of the Script Component. On the Inputs and Outputs pane, add the output column from the Excel Source as an Input Column with the ReadOnly Usage Type. Example C# code for this is below. Note that variables can only be written to in the PostExecute method. The Input0_ProcessInputRow method is called once for each row that passes through; however, there will only be the single row in this case. In the following code, MaxExcelDate is the name of the output column from the Excel Source.
On the Excel Source component in the Data Flow Task where the records are imported from Excel, change the Data Access Mode to SQL command and enter a SQL statement to return records that have a date less than or equal to the maximum wsDate value. This is the last example, and the ? is a placeholder for the parameter. After entering this SQL, click the Parameters button and select Parameter0 for the Parameters field, the MaxDate variable for the Variables field, and a direction of Input. The Conditional Split can then be removed, since these records will now be filtered out.
Excel MAX wsDate SELECT:
SELECT MAX(wsDate) AS MaxExcelDate FROM ExcelSource
C# Script Component:
DateTime maxDate;

public override void PostExecute()
{
    base.PostExecute();
    // variables can only be written to here, after all rows have been processed
    Variables.MaxDate = maxDate;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // called once per row; the MAX query returns a single row
    maxDate = Row.MaxExcelDate;
}
Excel Command with Date Filter:
SELECT
Column1,
Column2,
Column3
FROM ExcelSheet
WHERE DateColumn <= ?
Yes, it is possible. In the data flow, you will need to determine the max date, which you already have. Next, you will need to MERGE JOIN the two data flows on the date column. From there, you will feed it into a CONDITIONAL SPLIT and split where the date columns match [i.e., !ISNULL()] versus do not match [i.e., ISNULL()]. In your case, you only want the matches. The non-matches will be disregarded.
Note: if you use an INNER JOIN on the MERGE JOIN where there is only one date (i.e., MaxDate) to join on, then this will take care of the row filtering for you. You will not need a CONDITIONAL SPLIT.
Welcome to ETL.
Update
It is a real pain that SSIS's MERGE JOINs only perform joins on EQUAL operations as opposed to LESS THAN and GREATER THAN operations. You will need to separate the data flows.
Use a Script Component to scan the Excel file for the MAX date and assign that value to a package variable in SSIS. Alternatively, you can have a dates table in SQL Server and then use an Execute SQL Task in SSIS to retrieve the MAX date from the table and assign that value to a package variable.
Modify your existing data flow to remove the reading of the Excel date file completely. Then add a DERIVED COLUMN transformation and add a new column that is mapped to the SSIS package variable that stores the MAX date. You can name the derived column 'MaxDate'.
Add a conditional split transformation with the following CONDITION logic: [AsOfDt] <= [MaxDate]
Set the Output Name to Insert Records
Note: The CONDITIONAL SPLIT creates a new output data flow with restricted/filtered rows. It does not create a new column within the existing data flow. Think of this as a transposition of data flow output from column modification to row modification. Only those rows that match the condition will be sent to the output that you desire. I assume you only want to Insert these records, so I named it that. You can choose whatever naming convention you prefer
Note 2: Sorry for not making the Update my original answer - I haven't used the AGGREGATE transformation before, so I was not aware that it restricts row output as opposed to reading a value in the data flow and then assigning it to a variable. That would be a terrific transformation for Microsoft to add to SSIS. It appears that the ROWCOUNT and SCRIPT COMPONENT transformations are the only ones that can set a package variable value within the data flow.
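To summarize the Update's approach, here is a small Python-only sketch of the Derived Column + Conditional Split logic (illustrative, not SSIS code; AsOfDt and the sample rows are assumed names mirroring the condition above):
from datetime import date

max_date = date(2020, 12, 10)  # package variable populated by the script/Execute SQL step
rows = [
    {"AsOfDt": date(2020, 12, 1), "Col1": "a"},
    {"AsOfDt": date(2020, 12, 15), "Col1": "b"},
]
# Derived Column: attach MaxDate to every row; Conditional Split: keep AsOfDt <= MaxDate
insert_records = [dict(r, MaxDate=max_date) for r in rows if r["AsOfDt"] <= max_date]
print(insert_records)  # only the 2020-12-01 row goes to the Insert Records output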

Splitting a column data as per delimiter

I have a Spark (1.4) DataFrame where the data in a column is like "1-2-3-4-5-6-7-8-9-10-11-12". I want to split the data into multiple columns. Please note that the number of fields can vary from 1 to 12; it's not fixed.
P.S. We are using the Scala API.
Edit:
Editing the original question: I have the delimited string as below.
"ABC-DEF-PQR-XYZ"
From this string I need to create delimited strings in separate columns as below. Please note that this string is in a column of the DataFrame.
Original column: ABC-DEF-PQR-XYZ
New col1 : ABC
New col2 : ABC-DEF
New col3 : ABC-DEF-PQR
New col4 : ABC-DEF-PQR-XYZ
Please note that there can be up to 12 such new columns which need to be derived from the original field. Also, the string in the original column might vary, i.e., sometimes 1 part, sometimes 2, but at most 12.
Hope I have articulated the problem statement clearly.
Thanks!
You can use explode and pivot. Here is some sample data:
import pyspark.sql.functions as f  # needed for the functions used below

df = sc.parallelize([["1-2-3-4-5-6-7-8-9-10-11-12"], ["1-2-3-4"], ["1-2-3-4-5-6-7-8-9-10"]]).toDF(schema=["col"])
Now add a unique id to rows so that we can keep track of which row the data belongs to:
df=df.withColumn("id", f.monotonically_increasing_id())
Then split the columns by delimiter - and then explode to get a long-form dataset:
df=df.withColumn("col_split", f.explode(f.split("col", "\-")))
Finally pivot on id to get back to wide form:
df.groupby("id")
.pivot("col_split")
.agg(f.max("col_split"))
.drop("id").show()