I have problem in getting data perfectly in adf - azure-data-factory

Cconsider one CSV file with employee details and attendance marking attendance with 0 and 1. For example, 1 indicates the employee is present, 0 indicates employee is absent. My problem is to get the working date of employee if they are present (1). It should be the same day where the employee is absent (0). It should be next working day by reading the previous row.
emp id
working
working day
123
1
11/14/2022
123
0
11/15/2022
123
1
11/14/2022
I have tried using data flow in ADF, but it is not getting. Please provide solution for me in Azure Data Factory.

To manipulate the csv data in data factory we have to use Data Flow activity in Azure data factory
I reproduce your scenario in my environment please follow below steps to get issue resolved:
I took one sample file as source in Data flow activity similar to data you provided as below (assuming you don't have date values when employee is absent):
Then I took windows transformation activity over emp id and sort by emp id and created windows column working date with updating Date based on previous row value with expression:
lag(addDays({working day}, 1),1)
Windows transformation data preview:
Now I took derived column transformation to get date working date of employee if they are present or absent. I am updating the column working day if working is 0 then value should be working date else value will working day.
iif(working==0,{working date},{working day})
Derived column transformation data preview:
Now with select activity delete unnecessary columns and store the data in sink.
Select transformation data preview:

Related

Selecting record based on dates

I have a sample source table here and currently, it is returning multiple records due to multiple tutors.
I want to select only the tutor_name if the exam_date is between the effective date and the next record's effective date.
Sample desired output:

How to Create Date Incrementing Trigger in Postgresql

I'm not very experienced in Postgresql, and in a personal project I have the table column ID (Primary Key) being a date with format yyyy-mm-dd, like this
ID
---
2020-07-12
2020-07-13
2020-07-14
I want to create a trigger so that ID value gets automatically incremented (based on latest ID value) when a new record is inserted. Also what is the recommended data type for the column ID in this case?
My goal
The table stores Daily Report summaries, it's already known that new record will be the next day and I want to eliminate the need to specify the report's Date manually.
Report data is inserted using service call to application, and reducing the payload params to the minimum is important.

Redshift - Extracting data from table based on filter condition

I have some sales data that shows if a store has done a sale or not. I am trying to pull out all stores that have no sale done till date. Given below is the query and the sample data I am working with.
store_name,sale_made,count
store_a,0,100
store_a,1,23
store_b,1,18
store_c,0,32
store_d,0,50
store_d,1,70
Expected output:
store_name,sale_made,count
store_c,0,32
Reason being only store_c in that list has sale_made = 0 and no sale_made = 1
you need something like this
SELECT store_name FROM table
GROUP BY 1
HAVING SUM(sale_made)=0

Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT) returns the same value as COUNT():
My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce:
My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:
SELECT
userid,
submissionid,
date
FROM mytable
Workaround
If I stop using the default date field for the Date Dimension and aggregate my own dates directly in within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:
SELECT
userid,
submissionid,
to_char(date,'YYYY-IW') as date_byweek
FROM mytable
While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention reducing my confidence at what else may be "buggy" within the product 😞
How to Reproduce
If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:
> SELECT * FROM mytable
userid submissionid
-------- -------------
1 1
2 2
1 3
1 4
3 5
> COUNT(DISTINCT userid) -- ERROR: Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
I'm happy to report that as of Sep 17 2020, there's a workaround.
DataStudio added the DATETIME_TRUNC function (see here https://support.google.com/datastudio/answer/9729685?), that allows you to add a custom field that truncs the original date to whatever granularity you want, without causing the distinct bug.
Attempting to set the display granularity in the report still causes the bug (i.e., you'll still set Oct 1 2020 12:00:00 instead of Oct 2020).
This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.
I have the same issue with MySQL Connector. But my problem is solved, when I change date field format in DB from DATETIME (YYYY-MM-DD HH:MM:SS) to INT (Unixtimestamp). After connection this table to the Googe Datastudio I set type for this field as Date (YYYYMMDD) and all works, as expected. Hope, this may help you :)
In this Google forum there is a curious solution by Damien Choizit that involves combining your data source with itself. It works well for me.
https://support.google.com/datastudio/thread/13600719?hl=en&msgid=39060607
It says:
I figured out a solution in my case: I used a Blend Data joining twice the same data source with corresponding join key(s), then I specified a data range dimension only on the left side and selected the columns I wanted to CTD aggregate as "dimensions" (and not metric!) on the right side.

In a data flow task, how do I restrict rows flowing using a value from another source?

I have an excel sheet with many tabs. Say one is called wsMain and the other is called wsDate.
In my data flow transformation I am able to successfully load the data from wsMain to my table.
Now I have to update this transformation where I have to fetch the maximum date from the worksheet wsDate and only load data from wsMain where the date is less than on equal to the maximum date in wsDate (that is the only column available).
So for I have figured out that I need to create a new Excel connection manager to read the data from wsDate and I have used the Aggregate transformatioin to get the maximum date.
Now the question is how do I use this date to restrict the rows coming from wsMain?
I understand from the link below that you can store the value in a variable but what do I do next?:
SSIS set result set from data flow to variable
I have tried using a merge join but not sure if I am doing it right.
Here is what it looks like now:
I could not achieve the above but would be interested to know if that is possible. As a work around I have created a separate dataflow where I have stored the valued in a variable and then used the variable in the conditional split to filter the required rows:
Here is a step by step guide I followed to write the variable:
https://www.proteanit.com/2008/12/11/ssis-writing-to-a-package-variable-in-a-dataflow/
You can obtain the maximum value of the wsDate column first, this use this as a filter to avoid introducing unnecessary records into the data flow which which would be discarded by the Conditional Split. An overview of this process is below. I'd also recommend confirming the data types for all columns involved.
Create an SSIS DateTime variable and name this something descriptive such as MaxDate.
Create a Data Flow Task before the current one with an Excel Source component. Use the SQL command option for the Data Access Mode and enter a SQL statement to return the max value of the wsDate column. In the following example ExcelSource is the name of the sheet that you're pulling from. I'd suggested confirming the query with the Preview button on the Excel Source as well.
Add a Script Component (not Task) after the Excel Source. Add the MaxDate variable in the ReadWriteVariables field on the main page of the Script Component. On the Inputs and Outputs pane add the output column from the Excel Source as an Input Column with the ReadOnly usage Type. Example C# code for this is below. Note that variables can only be written to in the PostExecute method. The Input0_ProcessInputRow method is called once for each row that passes through, however there will only be the single row in this case. On the following code MaxExcelDate is the name of the output column from the Excel Source.
On the Excel Source component in the Data Flow Task where the records are imported from Excel, change the Data Access Mode to SQL command and enter a SQL statement to return records that have a date less than or equal to the maximum wsDate value. This is the last example and the ? is a placeholder for the parameter. After entering this SQL, click the Parameters button and select Parameter0 for the Parameters field, the MaxDate variable for Variables field, and a direction of Input. The Conditional Split can then be removed since these records will now be filtered out.
Excel MAX wsDate SELECT:
SELECT MAX(wsDate) AS MaxExcelDate FROM ExcelSource
C# Script Component:
DateTime maxDate;
public override void PostExecute()
{
base.PostExecute();
Variables.MaxDate = maxDate;
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
maxDate = Row.MaxExcelDate;
}
Excel Command with Date Filter:
SELECT
Column1,
Column2,
Column3
FROM ExcelSheet
WHERE DateColumn <= ?
Yes, it is possible. In the data flow, you will need to determine the max date, which you already have. Next, you will need to MERGE JOIN the two data flows on the date column. From there, you will feed it into a CONDITIONAL SPLIT and split where the date columns match [i.e., !ISNULL()] versus do not match [i.e., ISNULL()]. In your case, you only want the matches. The non-matches will be disregarded.
Note: if you use an INNER JOIN on the MERGE JOIN where there is only one date (i.e., MaxDate) to join on, then this will take care of the row filtering for you. You will not need a CONDITIONAL SPLIT.
Welcome to ETL.
Update
It is a real pain that SSIS's MERGE JOINs only perform joins on EQUAL operations as opposed to LESS THAN and GREATER THAN operations. You will need to separate the data flows.
Use a script component to scan the excel file for the MAX Date and assign that value to a package variable in SSIS. Alternatively, you can have a dates table in SQL Server and then use an Execute SQL Command in SSIS to retrieve the MAX Date from the table and assign that value to a package variable
Modify your existing data flow to remove the reading of the Excel date file completely. Then add a DERIVED COLUMN transformation and add a new column that is mapped to the package variable in SSIS that stores the MAX date. You can name the Derived Column Name 'MaxDate'
Add a conditional split transformation with the following CONDITION logic: [AsOfDt] <= [MaxDate]
Set the Output Name to Insert Records
Note: The CONDITIONAL SPLIT creates a new output data flow with restricted/filtered rows. It does not create a new column within the existing data flow. Think of this as a transposition of data flow output from column modification to row modification. Only those rows that match the condition will be sent to the output that you desire. I assume you only want to Insert these records, so I named it that. You can choose whatever naming convention you prefer
Note 2: Sorry for not making the Update my original answer - I haven't used the AGGREGATE transformation before so I was not aware that it restricts row output as opposed to reading a value in the data flow and then assigning it to a variable. That would be a terrific transformation for Microsoft to add to SSIS. It appears that the ROWCOUNT and SCRIPT COMPONENT transformations are the only ones that have the ability to set a package variable value within the data flow.