DataStage: Looping with multiple values

I have multiple Date and Store values in an Excel file, and I need to loop a DataStage parallel job based on Date and Store. The parallel job has a SQL query based on Date and Store, so I need to pass these values from the sequence job.
I developed a sequence job with a looping condition, but I was only able to loop over one column (either Date or Store). Is there any way I can pass both Date and Store to the parallel job?
I clubbed both Date and Store into a single column and tried to pass that to the parallel job, but I am not able to split the parameter value and run the SQL query.
Any suggestions on this, please?

I assume you 'clubbed' your values by concatenating them with a split character of your choice. In the sequence job, use a User Variables stage in the loop (between the StartLoop stage and your parallel job) to split the pair into two variables, e.g. by using the Field function. You can then pass them as two separate parameters to the parallel job.
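As a sketch, assuming the loop counter carries a value such as 2023-01-15|Store42 (the pipe delimiter and the StartLoop activity name are assumptions, not from the question), the two user variables could be defined as:

svDate:  Field(StartLoop.$Counter, '|', 1)
svStore: Field(StartLoop.$Counter, '|', 2)

Each variable can then be mapped to its own parameter of the parallel job in the Job Activity stage.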

Related

Group by on multiple columns one by one

I am new to pyspark, so I wanted to know: is there any better way to do a group by on multiple columns one by one instead of using a loop over all columns? Currently, I am using a loop to iterate over all required group-by columns, but it is taking a very long time. I have around 50-60 columns which I need to group by one by one, using aggregation on fixed columns.
Current code, using a loop:
from pyspark.sql.functions import mean, count

for name in req_string_columns:
    tmp = (Selected_data.groupBy(name)
           .agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ"))
           .withColumnRenamed(name, "Category"))
Is there any better way to do it?

Data Factory / Data Flow - conditional split based on a number of records

I need to split a huge dataset into multiple files, and each file must not have more than 100,000 rows.
Is this possible with Data Flow and the Conditional Split?
If you simply want to split by a fixed number of rows, I've created a simple test.
Declare a parameter inside the data flow to store the row count of your source dataset. If your source dataset is Azure SQL, you can use a Lookup activity to get the max Row_No. If your source dataset is Azure Storage, you can use an Azure Function activity to get the max Row_No. Then pass the value to the parameter.
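For example, if the source is an Azure SQL table, the Lookup activity's query could be a simple count (the table name here is hypothetical):

SELECT COUNT(*) AS RowCount FROM dbo.SourceTable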
Here, for the test, I set a static default value.
Then we can set the Number of partitions expression to $RowCount/10 if you want 10 lines per file.
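For the question's requirement of at most 100,000 rows per file, the same pattern would be a partition-count expression along these lines, with ceil() rounding up so that no file exceeds the limit:

ceil($RowCount / 100000)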
We can set the file names for the split output here.
My source dataset contains 50 lines, so ADF will split it into 5 files. Judging by the Id column, it has randomly taken 10 rows of data for each file.
You can achieve this with 2 data flows: one to get the row count and another to partition. In the future this may also be achievable in 1 data flow using a cache sink.

How to add a date range in Azure Data Factory data flow

Working info
I have two different source data sets, so I have created a data flow in Data Factory in which, for the first data set (A), I do some transformations and load into a sink; for the second data set (B), I similarly perform some transformations and load into another sink.
Issue
Now I have a requirement in which a date column DT_COLUMN_A (11-04-2020 01:17:40) in the first data set (A) needs to be compared with a date column DT_COLUMN_B (01-01-2020 16:32:00) in the second data set (B), with the compared output stored as a column in the second data set (B).
So I need the min and max (the date range) of the date column from data set A, apply them to the min and max of the date column in data set B, find the dates which match between A and B, and store YES if they match and NO if not.
Code approach thought
Logic needed:
if (min(DT_COLUMN_A) == min(DT_COLUMN_B) and max(DT_COLUMN_A) == max(DT_COLUMN_B)) then YES else NO
I am trying to achieve this in ADF data flow but unable to do it.
To get the MIN and MAX of a dataset in ADF, you will need the Aggregate transformation. Create new columns called MinA, MaxA, MinB, MaxB from the respective streams in your data flow using Aggregate. Set the aggregate function to MIN or MAX as appropriate for each. Then you'll be able to set an iif() expression afterward, or use a Filter or Conditional Split transformation that uses those stored min & max values.
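As a sketch, once the aggregated values are available in a single stream (the column names MinA, MaxA, MinB, MaxB are taken from above; how you join the two streams is up to your flow), the comparison could be an iif() expression along these lines:

iif(MinA == MinB && MaxA == MaxB, 'YES', 'NO')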
I managed to get something similar to work by using a mapLoop() expression to first build an array of dates in a Derived Column transformation, followed by a Flatten transformation:
https://stackoverflow.com/a/73453351/12592985

In a data flow task, how do I restrict rows flowing using a value from another source?

I have an excel sheet with many tabs. Say one is called wsMain and the other is called wsDate.
In my data flow transformation I am able to successfully load the data from wsMain to my table.
Now I have to update this transformation: I have to fetch the maximum date from the worksheet wsDate and only load data from wsMain where the date is less than or equal to the maximum date in wsDate (that is the only column available).
So far I have figured out that I need to create a new Excel connection manager to read the data from wsDate, and I have used the Aggregate transformation to get the maximum date.
Now the question is how do I use this date to restrict the rows coming from wsMain?
I understand from the link below that you can store the value in a variable, but what do I do next?
SSIS set result set from data flow to variable
I have tried using a merge join but not sure if I am doing it right.
Here is what it looks like now:
I could not achieve the above but would be interested to know if it is possible. As a workaround I have created a separate data flow where I have stored the value in a variable and then used the variable in the Conditional Split to filter the required rows:
Here is a step by step guide I followed to write the variable:
https://www.proteanit.com/2008/12/11/ssis-writing-to-a-package-variable-in-a-dataflow/
You can obtain the maximum value of the wsDate column first, then use this as a filter to avoid introducing unnecessary records into the data flow which would then be discarded by the Conditional Split. An overview of this process is below. I'd also recommend confirming the data types for all columns involved.
Create an SSIS DateTime variable and name this something descriptive such as MaxDate.
Create a Data Flow Task before the current one with an Excel Source component. Use the SQL command option for the Data Access Mode and enter a SQL statement to return the max value of the wsDate column. In the following example, ExcelSource is the name of the sheet that you're pulling from. I'd suggest confirming the query with the Preview button on the Excel Source as well.
Add a Script Component (not Task) after the Excel Source. Add the MaxDate variable in the ReadWriteVariables field on the main page of the Script Component. On the Inputs and Outputs pane, add the output column from the Excel Source as an Input Column with the ReadOnly Usage Type. Example C# code for this is below. Note that variables can only be written to in the PostExecute method. The Input0_ProcessInputRow method is called once for each row that passes through; however, there will only be the single row in this case. In the following code, MaxExcelDate is the name of the output column from the Excel Source.
On the Excel Source component in the Data Flow Task where the records are imported from Excel, change the Data Access Mode to SQL command and enter a SQL statement to return records that have a date less than or equal to the maximum wsDate value. This is the last example below, and the ? is a placeholder for the parameter. After entering this SQL, click the Parameters button and select Parameter0 for the Parameters field, the MaxDate variable for the Variables field, and a direction of Input. The Conditional Split can then be removed since these records will now be filtered out.
Excel MAX wsDate SELECT:
SELECT MAX(wsDate) AS MaxExcelDate FROM ExcelSource
C# Script Component:
DateTime maxDate;

public override void PostExecute()
{
    base.PostExecute();
    // Variables can only be written to in PostExecute.
    Variables.MaxDate = maxDate;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Only the single MAX row passes through, so this runs once.
    maxDate = Row.MaxExcelDate;
}
Excel Command with Date Filter:
SELECT
Column1,
Column2,
Column3
FROM ExcelSheet
WHERE DateColumn <= ?
Yes, it is possible. In the data flow, you will need to determine the max date, which you already have. Next, you will need to MERGE JOIN the two data flows on the date column. From there, you will feed it into a CONDITIONAL SPLIT and split where the date columns match [i.e., !ISNULL()] versus do not match [i.e., ISNULL()]. In your case, you only want the matches. The non-matches will be disregarded.
Note: if you use an INNER JOIN on the MERGE JOIN where there is only one date (i.e., MaxDate) to join on, then this will take care of the row filtering for you. You will not need a CONDITIONAL SPLIT.
Welcome to ETL.
Update
It is a real pain that SSIS's MERGE JOINs only perform joins on EQUAL operations as opposed to LESS THAN and GREATER THAN operations. You will need to separate the data flows.
Use a Script Component to scan the Excel file for the MAX date and assign that value to a package variable in SSIS. Alternatively, you can have a dates table in SQL Server and then use an Execute SQL Task in SSIS to retrieve the MAX date from the table and assign that value to a package variable.
Modify your existing data flow to remove the reading of the Excel date file completely. Then add a DERIVED COLUMN transformation and add a new column that is mapped to the package variable in SSIS that stores the MAX date, as shown in the sketch below. You can set the Derived Column Name to 'MaxDate'.
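For example, the Derived Column's expression would simply reference the package variable (the variable name follows the earlier step):

@[User::MaxDate]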
Add a conditional split transformation with the following CONDITION logic: [AsOfDt] <= [MaxDate]
Set the Output Name to Insert Records
Note: The CONDITIONAL SPLIT creates a new output data flow with restricted/filtered rows. It does not create a new column within the existing data flow. Think of this as a transposition of data flow output from column modification to row modification. Only those rows that match the condition will be sent to the output that you desire. I assume you only want to Insert these records, so I named it that. You can choose whatever naming convention you prefer
Note 2: Sorry for not making the Update my original answer - I haven't used the AGGREGATE transformation before so I was not aware that it restricts row output as opposed to reading a value in the data flow and then assigning it to a variable. That would be a terrific transformation for Microsoft to add to SSIS. It appears that the ROWCOUNT and SCRIPT COMPONENT transformations are the only ones that have the ability to set a package variable value within the data flow.

Annualize data - Tableau

I'm trying to annualise my data in tableau, but get an error in the Calculated Field.
"Cannot mix aggregate and non-aggregate arguments to function"
My formula is:
sum(profit)/month(selected date) *12
How do I get an integer for the current month? That seems to be the problem, it tries to aggregate the month as well.
Thanks.
Short answer: wrap the call to month in a call to min() -- which works well if you have MONTH([selected date]) on the visualization as a dimension.
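As a minimal sketch, assuming the field names are [Profit] and [Selected Date], the corrected calculated field would be:

SUM([Profit]) / MIN(MONTH([Selected Date])) * 12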
There are three types of calculated fields in Tableau:
row level calculations which act on a single data row. They can read from values of other fields in the same row and return a single value per row.
aggregate calculations which act on a partition or block of data rows. They can reference the result of aggregating the values for a field across the entire partition, using an aggregate function like SUM() or MIN().
table calculations which act on an entire table of aggregated results.
You can't mix and match. Everything in a calculated field must be all at one level or another -- either all referenced fields must use aggregation functions (for aggregate calculated fields) or no referenced fields must use aggregation functions (for data row level calculated fields).
Hence the error message you saw.
Sometimes you know that all values for a field will be the same in a partition based on your visualization, so the aggregation function seems unnecessary. But Tableau still requires you to be explicit about how to turn a block of values into a single value, because the calculation must be defined even when the visualization is partitioned differently. In these cases, you can use min(), max(), avg(), or perhaps attr() because they all return the same value for a list of identical values.
The first two types are typically executed on the server (i.e. they are implemented by Tableau emitting SQL to send to the database server). Table calculations are executed by Tableau on the client side to post-process the results from the database server.
Table calcs are the most complicated type, but can be very useful. Explaining them is a post for another day.