I need some expert opinion on the scenario below.
I have the following dataframe df1:
+------------+------------+-------+-------+
| Date1      | OrderDate  | Value | group |
+------------+------------+-------+-------+
| 10/10/2020 | 10/01/2020 | hostA | grp1  |
| 10/01/2020 | 09/30/2020 | hostB | grp1  |
| NULL       | 09/15/2020 | hostC | grp1  |
| 08/01/2020 | 08/30/2020 | hostD | grp1  |
| NULL       | 10/01/2020 | hostP | grp2  |
| NULL       | 09/28/2020 | hostQ | grp2  |
| 07/11/2020 | 08/08/2020 | hostR | grp2  |
| 07/01/2020 | 08/01/2020 | hostS | grp2  |
| NULL       | 07/01/2020 | hostL | grp2  |
| NULL       | 08/08/2020 | hostM | grp3  |
| NULL       | 08/01/2020 | hostN | grp3  |
| NULL       | 07/01/2020 | hostO | grp3  |
+------------+------------+-------+-------+
Each group is ordered by OrderDate in descending order. After ordering, each Value must be flagged as Valid while Current_date < (Date1 + 31 days) or Date1 is NULL; from the first row in the group where Current_date > (Date1 + 31 days), that Value and every Value after it should be marked Invalid, irrespective of its own Date1.
If all the Date1 records in a group are NULL, every Value in that group should be tagged Valid.
My output df should look like below:
+------------+------------+-------+-------+---------+
| Date1      | OrderDate  | Value | group | Flag    |
+------------+------------+-------+-------+---------+
| 10/10/2020 | 10/01/2020 | hostA | grp1  | Valid   |
| 10/01/2020 | 09/30/2020 | hostB | grp1  | Valid   |
| NULL       | 09/15/2020 | hostC | grp1  | Valid   |
| 08/01/2020 | 08/30/2020 | hostD | grp1  | Invalid |
| NULL       | 10/01/2020 | hostP | grp2  | Valid   |
| NULL       | 09/28/2020 | hostQ | grp2  | Valid   |
| 07/11/2020 | 08/08/2020 | hostR | grp2  | Invalid |
| 07/01/2020 | 08/01/2020 | hostS | grp2  | Invalid |
| NULL       | 07/01/2020 | hostL | grp2  | Invalid |
| NULL       | 08/08/2020 | hostM | grp3  | Valid   |
| NULL       | 08/01/2020 | hostN | grp3  | Valid   |
| NULL       | 07/01/2020 | hostO | grp3  | Valid   |
+------------+------------+-------+-------+---------+
My approach:
I created a row_number for each group after ordering by OrderDate.
Then, for each group, I take the min(row_number) over the rows having Current_date > (Date1 + 31 days) and save it as a new dataframe, dfMin.
I then join df1 with dfMin on group and filter on row_number (row_number < min(row_number)).
This approach works for most cases, but it fails when all values of Date1 in a group are NULL.
Is there a better approach that covers that scenario as well?
Note: I am on a fairly old version of Spark (1.5), and window functions won't work in my environment (it's a custom framework with many restrictions in place). For row_number I used the zipWithIndex method.
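If it helps, here is a minimal PySpark sketch of a join-based fix under your constraints (no window functions). It assumes rn is the per-group row number you already derive via zipWithIndex and that Date1 is a date column; the key change is using a left outer join, so groups that have no expired row at all (including all-NULL groups) keep every row Valid:

from pyspark.sql import functions as F

# Rows past their validity window: Current_date > Date1 + 31 days.
# NULL Date1 gives a NULL comparison, so such rows are never treated as expired.
expired = df1.filter(F.datediff(F.current_date(), F.col("Date1")) > 31)

# Earliest expired position per group
dfMin = expired.groupBy("group").agg(F.min("rn").alias("min_rn"))

# Left outer join: groups with no expired row get min_rn = NULL and stay Valid
flagged = (df1.join(dfMin, "group", "left_outer")
              .withColumn("Flag",
                          F.when(F.col("min_rn").isNull() |
                                 (F.col("rn") < F.col("min_rn")), "Valid")
                           .otherwise("Invalid"))
              .drop("min_rn"))

The left outer join replaces the inner join in your current approach, which is exactly where the all-NULL groups were being dropped.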
I'm working on a transformation and am stuck on a common problem. Any assistance is much appreciated.
Scenario:
Step-1: Reading from a delta table.
+--------+------------------+
| emp_id | str              |
+--------+------------------+
| 1      | name=qwerty.     |
| 2      | age=22           |
| 3      | job=googling     |
| 4      | dob=12-Jan-2001  |
| 5      | weight=62.7.     |
+--------+------------------+
Step-2: I'm refining the data and outputting it into another delta table dynamically (no predefined schema). Let's say I add null if the column name is not found.
+--------+--------+------+----------+-------------+--------+
| emp_id | name   | age  | job      | dob         | weight |
+--------+--------+------+----------+-------------+--------+
| 1      | qwerty | null | null     | null        | null   |
| 2      | null   | 22   | null     | null        | null   |
| 3      | null   | null | googling | null        | null   |
| 4      | null   | null | null     | 12-Jan-2001 | null   |
| 5      | null   | null | null     | null        | 62.7   |
+--------+--------+------+----------+-------------+--------+
Is there a way to apply validation in step-2 based on the column name? I'm splitting on = while deriving the above table. Or do I have to do the validation in step-3 while working on the new df?
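On the first question, one hedged option (a sketch, not your actual code: the validators map, its date format, and the validated helper are all illustrative) is to keep a per-column check and apply it while deriving the wide table:

from pyspark.sql import functions as F

# Illustrative checks, keyed by the column name split out of "str"
validators = {
    "age":    lambda c: c.cast("int").isNotNull(),
    "weight": lambda c: c.cast("double").isNotNull(),
    "dob":    lambda c: F.to_date(c, "dd-MMM-yyyy").isNotNull(),
}

def validated(name):
    # Null out values that fail their column's check; pass unchecked columns through
    col = F.col(name)
    check = validators.get(name)
    return F.when(check(col), col) if check else col

# usage on the derived wide table, e.g.: refined.withColumn("age", validated("age"))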
Second question: Is there a way to achieve the following table?
+--------+--------+------+----------+-------------+--------+---------------------+
| emp_id | name   | age  | job      | dob         | weight | missing_attributes  |
+--------+--------+------+----------+-------------+--------+---------------------+
| 1      | qwerty | null | null     | null        | null   | age,job,dob,weight  |
| 2      | null   | 22   | null     | null        | null   | name,job,dob,weight |
| 3      | null   | null | googling | null        | null   | name,age,dob,weight |
| 4      | null   | null | null     | 12-Jan-2001 | null   | name,age,job,weight |
| 5      | null   | null | null     | null        | 62.7   | name,age,job,dob    |
+--------+--------+------+----------+-------------+--------+---------------------+
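For the second question, here is one hedged sketch (df, k, v, wide, attrs are illustrative names): split str on =, pivot the key into columns, then build missing_attributes from whichever pivoted columns are null in each row. concat_ws skips NULLs, so only the missing names survive:

from pyspark.sql import functions as F

# Split "key=value" into separate key and value columns
parsed = (df.withColumn("k", F.split("str", "=").getItem(0))
            .withColumn("v", F.split("str", "=").getItem(1)))

# One row per emp_id, one column per distinct key; first() picks the single value
wide = parsed.groupBy("emp_id").pivot("k").agg(F.first("v"))

# For each attribute column, emit its name when the value is NULL, then join them
attrs = [c for c in wide.columns if c != "emp_id"]
wide = wide.withColumn(
    "missing_attributes",
    F.concat_ws(",", *[F.when(F.col(c).isNull(), F.lit(c)) for c in attrs]))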
I'm trying to use a CTE to do two update statements in Postgres and I'm not sure if there is a better approach to what I'm trying to do.
-- This script will update user records where the duplicate record value will overwrite the old value
WITH
update_users_Cte
AS
(
SELECT
new.id new_user_id,
old.id old_user_id,
new.employee_id new_employee_id,
old.employee_id old_employee_id,
new.display_name new_display_name,
old.display_name old_display_name,
new.first_name new_first_name,
old.first_name old_first_name,
new.last_name new_last_name ,
old.last_name old_last_name,
new.phone_number new_phone_number,
old.phone_number old_phone_number,
new.job_type new_job_type,
old.job_type old_job_type,
new.department_id new_department_id,
old.department_id old_department_id
FROM users old
JOIN users new ON (CONCAT(0, new.employee_id) = old.employee_id)
WHERE LENGTH(new.employee_id) = 5 AND
new.display_name != old.display_name OR
new.first_name != old.first_name OR
new.last_name != old.last_name OR
new.phone_number != old.phone_number OR
new.job_type != old.job_type OR
new.department_id != old.department_id
)
UPDATE
users
SET
phone_number = NULL
FROM
update_users_Cte
WHERE
employee_id = update_users_Cte.new_employee_id
UPDATE
users
SET
display_name = update_users_Cte.new_display_name,
first_name = update_users_Cte.new_first_name,
last_name = update_users_Cte.new_last_name,
phone_number = update_users_Cte.new_phone_number,
job_type = update_users_Cte.new_job_type,
department_id = update_users_Cte.new_department_id
FROM
update_users_Cte
WHERE
employee_id = update_users_Cte.old_employee_id
This is the error:
ERROR: syntax error at or near "UPDATE"
LINE 41: UPDATE
I would like to be able to do both UPDATEs and use the CTE as I need to check it in both cases. I'm not sure if I have to wrap the whole thing in a transaction.
Any help would be appreciated.
Here is some sample output from the join, for reference:
new_user_id | old_user_id | new_employee_id | old_employee_id | new_display_name | old_display_name | new_first_name | old_first_name | new_last_name | old_last_name | new_phone_number | old_phone_number | new_updated_at | old_updated_at | new_job_type | old_job_type | new_department_id | old_department_id
-------------+-------------+-----------------+-----------------+------------------+------------------+----------------+----------------+---------------+---------------+------------------+------------------+---------------------+---------------------+--------------+---------------+-------------------+-------------------
474 | 19710 | 35275 | 035275 | | | David | David | Coyle | Coyle | +447584208902 | | 2017-06-22 17:09:43 | 2021-01-27 15:14:43 | | | 418 | 418
19701 | 432 | 21239 | 021239 | | | Piotr | Piotr | Mierniczek | Mierniczek | | +447404050330 | 2021-02-08 14:36:59 | 2017-06-22 17:09:42 | | | 249 | 73
19702 | 479 | 35568 | 035568 | | | Manjita | Manjita | Kunwar | Kunwar | | +447847370860 | 2021-01-15 15:51:44 | 2021-01-15 15:45:20 | | | 317 | 317
19707 | 19680 | 11111 | 011111 | | Sarika | Sarika | Sarika | Sharma | Sharma | | +447700000000 | 2021-01-20 12:46:09 | 2021-01-20 12:45:12 | | C.S. Employee | |
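In case it is useful, here is a hedged sketch of one way around the error: PostgreSQL attaches a WITH clause to exactly one statement, so the first UPDATE can run as a data-modifying CTE and the second UPDATE becomes the outer statement. The SELECT is abbreviated to the columns the updates actually need, the ORs are parenthesized (AND binds tighter than OR, so your original WHERE did not group the way it reads), and IS DISTINCT FROM is used so the blank/NULL values in your sample data compare sanely:

WITH update_users_cte AS (
    SELECT new.employee_id   AS new_employee_id,
           old.employee_id   AS old_employee_id,
           new.display_name  AS new_display_name,
           new.first_name    AS new_first_name,
           new.last_name     AS new_last_name,
           new.phone_number  AS new_phone_number,
           new.job_type      AS new_job_type,
           new.department_id AS new_department_id
    FROM users old
    JOIN users new ON CONCAT('0', new.employee_id) = old.employee_id
    WHERE LENGTH(new.employee_id) = 5
      AND (new.display_name  IS DISTINCT FROM old.display_name
        OR new.first_name    IS DISTINCT FROM old.first_name
        OR new.last_name     IS DISTINCT FROM old.last_name
        OR new.phone_number  IS DISTINCT FROM old.phone_number
        OR new.job_type      IS DISTINCT FROM old.job_type
        OR new.department_id IS DISTINCT FROM old.department_id)
),
cleared AS (
    -- first UPDATE, run as a data-modifying CTE
    UPDATE users
    SET phone_number = NULL
    FROM update_users_cte
    WHERE users.employee_id = update_users_cte.new_employee_id
    RETURNING users.id
)
-- second UPDATE: the statement the WITH attaches to; both updates run in the
-- same implicit transaction and touch disjoint rows (new ids vs old ids),
-- which is what data-modifying CTEs require to behave predictably
UPDATE users
SET display_name  = update_users_cte.new_display_name,
    first_name    = update_users_cte.new_first_name,
    last_name     = update_users_cte.new_last_name,
    phone_number  = update_users_cte.new_phone_number,
    job_type      = update_users_cte.new_job_type,
    department_id = update_users_cte.new_department_id
FROM update_users_cte
WHERE users.employee_id = update_users_cte.old_employee_id;

No explicit transaction wrapper is needed for atomicity, since a single statement (including its CTEs) is already atomic.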
I am trying to convert the trace table below into a result table in Postgres. I have a huge amount of data in the table.
I have a table named Trace:
entity_id                       | ts            | key         | bool_v | dbl_v | str_v   | long_v |
----------------------------------------------------------------------------------------------------
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | temperature |        |       |         | 45     |
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | operation   |        |       | Normal  |        |
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | period      |        |       |         | 6968   |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | temperature |        |       |         | 44     |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | operation   |        |       | Reverse |        |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | period      |        |       |         | 3535   |
Trace Table
I want to convert the above table into the following table in PostgreSQL.
Output Table: Result
entity_id                       | ts            | temperature | operation | period |
-------------------------------------------------------------------------------------
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | 45          | Normal    | 6968   |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | 44          | Reverse   | 3535   |
Result Table
Have you tried this yet?
select entity_id, ts,
       -- each (entity_id, ts) pair has one row per key, so max() simply
       -- picks the single matching value for that key
       max(long_v) filter (where key = 'temperature') as temperature,
       max(str_v)  filter (where key = 'operation')  as operation,
       max(long_v) filter (where key = 'period')     as period
from trace
group by entity_id, ts;
I am trying to understand how to pivot data within T-SQL but can't seem to get it working. I have the following table structure
+-------------------+-----------------------+
| Name              | Value                 |
+-------------------+-----------------------+
| TaskId            | 12417                 |
| TaskUid           | XX00044497            |
| TaskDefId         | 23                    |
| TaskStatusId      | 4                     |
| Notes             |                       |
| TaskActivityIndex | 0                     |
| ModifiedBy        | Orange                |
| Modified          | /Date(1554540200000)/ |
| CreatedBy         | Apple                 |
| Created           | /Date(2121212100000)/ |
| TaskPriorityId    | 40                    |
| OId               | 2                     |
+-------------------+-----------------------+
I want to pivot the Name column values into columns. Expected output:
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| TASKID | TASKUID                | TASKDEFID | TASKSTATUSID | NOTES | TASKACTIVITYINDEX | MODIFIEDBY | MODIFIED              | CREATEDBY | CREATED               | TASKPRIORITYID | OID |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| 12417  | XX00044497             | 23        | 4            |       | 0                 | Orange     | /Date(1554540200000)/ | Apple     | /Date(2121212100000)/ | 40             | 2   |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
Is there an easy way of doing it? The columns are fixed (not dynamic).
Any help appreciated
Try this:
select * from yourtable
pivot
(
    min(Value)
    for Name in ([TaskId],[TaskUid],[TaskDefId],[TaskStatusId],[Notes],[TaskActivityIndex],
                 [ModifiedBy],[Modified],[CreatedBy],[Created],[TaskPriorityId],[OId])
) as pivotable
You can also use CASE expressions; a sketch of that variant follows the fiddle link below.
You must use an aggregate function in the PIVOT clause.
If you want to learn more, here is the reference:
https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
Output (I only tried three columns):
DB<>Fiddle
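For completeness, a minimal sketch of the CASE-expression alternative mentioned above, using the column names from the question (yourtable is the same placeholder as in the PIVOT version):

select
    min(case when Name = 'TaskId'            then Value end) as TaskId,
    min(case when Name = 'TaskUid'           then Value end) as TaskUid,
    min(case when Name = 'TaskDefId'         then Value end) as TaskDefId,
    min(case when Name = 'TaskStatusId'      then Value end) as TaskStatusId,
    min(case when Name = 'Notes'             then Value end) as Notes,
    min(case when Name = 'TaskActivityIndex' then Value end) as TaskActivityIndex,
    min(case when Name = 'ModifiedBy'        then Value end) as ModifiedBy,
    min(case when Name = 'Modified'          then Value end) as Modified,
    min(case when Name = 'CreatedBy'         then Value end) as CreatedBy,
    min(case when Name = 'Created'           then Value end) as Created,
    min(case when Name = 'TaskPriorityId'    then Value end) as TaskPriorityId,
    min(case when Name = 'OId'               then Value end) as OId
from yourtable;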
I have two objects within a SQL Server 2008 R2 database which I am trying to join together with a left join, but I am unable to get the left join to return all records from the view.
1 table - tt_activityoccurrence
1 view - vw_academicweeks
vw_academicweeks is a view that contains, for each academic year, a week number plus the first and last day of each week; it holds 52 records per academic year.
tt_activityoccurrence is a table that contains occurrences of lessons within a year; lessons will not occur in all 52 weeks of the year.
With my query I am trying to return all rows from the vw_academicweeks view, producing the following information:
+------------+------------+------------+------------+---------+
| ActivityID | WeekStart  | StartTime  | EndTime    | week_no |
+------------+------------+------------+------------+---------+
| 59936      | 04/09/2017 | 05/09/2017 | 05/09/2017 | 6       |
| 59936      | 11/09/2017 | 12/09/2017 | 12/09/2017 | 7       |
| 59936      | 18/09/2017 | 19/09/2017 | 19/09/2017 | 8       |
| 59936      | 25/09/2017 | 26/09/2017 | 26/09/2017 | 9       |
| 59936      | 02/10/2017 | 03/10/2017 | 03/10/2017 | 10      |
| 59936      | 09/10/2017 | 10/10/2017 | 10/10/2017 | 11      |
| 59936      | 16/10/2017 | 17/10/2017 | 17/10/2017 | 12      |
| 59936      | Null       | Null       | Null       | 13      |
| 59936      | 30/10/2017 | 31/10/2017 | 31/10/2017 | 14      |
| 59936      | 06/11/2017 | 07/11/2017 | 07/11/2017 | 15      |
| 59936      | 13/11/2017 | 14/11/2017 | 14/11/2017 | 16      |
| 59936      | 20/11/2017 | 21/11/2017 | 21/11/2017 | 17      |
| 59936      | 27/11/2017 | 28/11/2017 | 28/11/2017 | 18      |
| 59936      | 04/12/2017 | 05/12/2017 | 05/12/2017 | 19      |
| 59936      | 11/12/2017 | 12/12/2017 | 12/12/2017 | 20      |
| 59936      | 18/12/2017 | 19/12/2017 | 19/12/2017 | 21      |
| 59936      | Null       | Null       | Null       | 22      |
| 59936      | Null       | Null       | Null       | 23      |
+------------+------------+------------+------------+---------+
With the left join I can return all values except the nulls, so the week_no column is missing rows 13, 22 and 23. I have also tried this with an outer join but receive the same result.
I feel I am missing something obvious, but it is escaping me at the moment.
select
ttao.ActivityID
,dateadd(dd,datediff(dd,0,DATEADD(dd, -(DATEPART(dw, ttao.StartTime)-1), ttao.StartTime)),0) WeekStart
,ttao.StartTime
,ttao.EndTime
,aw.week_no
from
vw_AcademicWeeks AW
left join TT_ActivityOccurrence TTAO on
(dateadd(dd,datediff(dd,0,DATEADD(dd, -(DATEPART(dw, ttao.StartTime)-1), ttao.StartTime)),0))=aw.ay_start
where
ay_code='1718' and
TTAO.ActivityID='59936'
order by aw.week_no asc
Your WHERE clause makes it an inner join: filtering on TTAO.ActivityID in the WHERE throws away the NULL rows the left join produced. Move the predicate on the right-hand table up into your join condition; the ay_code filter on the view can stay in the WHERE, since it only restricts the left-hand side. Note: I didn't validate your join condition (the dateadd...datediff logic).
select
ttao.ActivityID
,dateadd(dd,datediff(dd,0,DATEADD(dd, -(DATEPART(dw, ttao.StartTime)-1), ttao.StartTime)),0) WeekStart
,ttao.StartTime
,ttao.EndTime
,aw.week_no
from
vw_AcademicWeeks AW
left join TT_ActivityOccurrence TTAO on
(dateadd(dd,datediff(dd,0,DATEADD(dd, -(DATEPART(dw, ttao.StartTime)-1), ttao.StartTime)),0)) = aw.ay_start
and TTAO.ActivityID = '59936' -- right-table filter must sit in the ON clause, or unmatched weeks disappear
where
aw.ay_code = '1718' -- left-table filter is safe in the WHERE; it only restricts the view
order by aw.week_no asc