Overwriting a group of values in a column with another group's values, based on a grouping in another column - pyspark

Input:
Name GroupId Processed NewGroupId NgId
Mike 1 N 9 NULL
Mikes 1 N 9 NULL
Miken 5 Y 9 5
Mikel 5 Y 9 5
Output:
Name GroupId Processed NewGroupId NgId
Mike 1 N 9 5
Mikes 1 N 9 5
Miken 5 Y 9 5
Mikel 5 Y 9 5
The query below works in SQL Server, but it does not work in Spark SQL because of the correlated subquery.
Is there an alternative, either with Spark SQL or with a PySpark DataFrame?
SELECT Name, groupid, IsProcessed, ngid,
       CASE WHEN ngid IS NULL THEN
            COALESCE((SELECT TOP 1 ngid FROM temp D
                      WHERE D.NewGroupId = T.NewGroupId AND
                            D.ngid IS NOT NULL), NULL)
       ELSE ngid
       END AS ngid
FROM temp T

The following worked in Spark SQL:
spark.sql("select LKUP,groupid,IsProcessed,NewGroupId ,coalesce((select Max(D.ngid) from test2 D where D.NewGroupId = T.NewGroupId AND D.ngid is not null),null) as ngid from test2 T")
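As an alternative to the correlated subquery, a window aggregate can spread each group's non-null NgId across the whole NewGroupId partition. The sketch below is only an illustration: it runs the SQL in SQLite (via Python's built-in sqlite3, assuming SQLite 3.25 or later for window functions) as a stand-in for Spark SQL, and the table name people is hypothetical.

```python
import sqlite3

# SQLite stands in for Spark SQL here; "people" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people "
    "(Name TEXT, GroupId INT, Processed TEXT, NewGroupId INT, NgId INT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
    [
        ("Mike", 1, "N", 9, None),
        ("Mikes", 1, "N", 9, None),
        ("Miken", 5, "Y", 9, 5),
        ("Mikel", 5, "Y", 9, 5),
    ],
)

# MAX ignores NULLs, so every row in a NewGroupId partition sees the
# group's non-null NgId; COALESCE keeps a row's own value when it has one.
filled = conn.execute(
    """
    SELECT Name, GroupId, Processed, NewGroupId,
           COALESCE(NgId, MAX(NgId) OVER (PARTITION BY NewGroupId)) AS NgId
    FROM people
    ORDER BY rowid
    """
).fetchall()

for row in filled:
    print(row)
```

The same SELECT expression should be usable in a spark.sql(...) call, since it relies only on a standard window aggregate; that has not been verified against the asker's data here.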

Related

How to collect rows in a batch

I have a table that looks like this:
id values
1 a
2 b
3 c
4 d
5 e
6 f
and I need to generate a group_id column so that I can collect the rows in batches using
select collect_list(values) from table group by group_id
For example, for batchSize = 2
id values group_id
1 a 1
2 b 1
3 c 2
4 d 2
5 e 3
6 f 3
to get this output:
group_id collect_list(values)
1 [a, b]
2 [c, d]
3 [e, f]
or, for batchSize = 3
id values group_id
1 a 1
2 b 1
3 c 1
4 d 2
5 e 2
6 f 2
with the output:
group_id collect_list(values)
1 [a, b, c]
2 [d, e, f]
How do I generate this column group_id so I can collect the values and group by group_id?
You could use ROW_NUMBER and DIV to generate group_id.
To expand on my answer: we use the properties of integer division to get the group id.
ROW_NUMBER gives consecutive numbers from 1 to N.
We need the numbers to start at 0, so we subtract 1 from the row number and divide by the batch size (3 here):
row_number-1 DIV 3
0 0
1 0
2 0
3 1
4 1
5 1
6 2
Every run of three consecutive integers starting at a multiple of 3 has the same quotient, so this holds for any number of rows.
Since group_id should start at 1 (not strictly necessary), we add 1 to the result.
With the resulting group_id you can then collect_list(values) to get your arrays.
SELECT
    id, `values`,
    ((ROW_NUMBER() OVER (ORDER BY id) - 1) DIV 3) + 1 group_id
FROM tab1
id values group_id
1 a 1
2 b 1
3 c 1
4 d 2
5 e 2
6 f 2
7 g 3
8 h 3
9 i 3
SELECT
    id, `values`,
    ((ROW_NUMBER() OVER (ORDER BY id) - 1) DIV 2) + 1 group_id
FROM tab1
id values group_id
1 a 1
2 b 1
3 c 2
4 d 2
5 e 3
6 f 3
7 g 4
8 h 4
9 i 5
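The row-number-and-integer-division idea can be tried end to end with a small sketch. It assumes SQLite (through Python's sqlite3, version 3.25 or later for window functions) in place of the asker's engine, so the / operator on two integers plays the role of DIV, and the final grouping is done in Python rather than with collect_list. The helper name batch_rows and the column name val are hypothetical.

```python
import sqlite3


def batch_rows(values, batch_size):
    """Assign group_id = (row_number - 1) / batch_size + 1 and collect per group."""
    conn = sqlite3.connect(":memory:")
    # INTEGER PRIMARY KEY assigns ids 1..N in insertion order.
    conn.execute("CREATE TABLE tab1 (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO tab1 (val) VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        """
        SELECT val,
               (ROW_NUMBER() OVER (ORDER BY id) - 1) / ? + 1 AS group_id
        FROM tab1
        ORDER BY id
        """,
        (batch_size,),  # bound as an integer, so "/" is integer division
    ).fetchall()
    conn.close()

    # Stand-in for collect_list(values) ... GROUP BY group_id.
    batches = {}
    for val, group_id in rows:
        batches.setdefault(group_id, []).append(val)
    return batches


print(batch_rows(list("abcdef"), 2))
print(batch_rows(list("abcdefghi"), 2))
```

With nine rows and a batch size of 2, the last batch holds only the ninth value, matching the group_id of 5 in the table above.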
From my understanding what you're trying to do is chunking your selection.
Then this should do the trick: https://stackoverflow.com/a/29975781/21188126

SQL Renumbering index after group by

I have the following input table:
Seq Group GroupSequence
1 0
2 4 A
3 4 B
4 4 C
5 0
6 6 A
7 6 B
8 0
Output table is:
Line NewSeq GroupSequence
1 1
2 2 A
3 2 B
4 2 C
5 3
6 4 A
7 4 B
8 5
The rules for the input table are:
Any positive integer in the Group column indicates that the rows are grouped together. The entire field may be NULL or blank. A NULL or 0 indicates that the row is processed on its own. In the above example there are two groups and three 'single' rows.
The GroupSequence column is a single character that sorts within the group. NULL, blank, 'A', 'B', 'C' and 'D' are the only values allowed.
If Group has a positive integer, there must be an alphabetic character in GroupSequence.
I need a query that creates the output table with a new column that sequences as shown.
External apps need to iterate through this table in either Line or NewSeq order (same order, different values).
I've tried variations on GROUP BY, PARTITION BY, OVER(), etc. with no success.
Any help much appreciated.
Perhaps this will help
The only trick here is Flg which will indicate a new Group Sequence (values will be 1 or 0). Then it is a small matter to sum(Flg) via a window function.
Edit - Updated Flg method
Example
Declare @YourTable Table ([Seq] int,[Group] int,[GroupSequence] varchar(50))
Insert Into @YourTable Values
 (1,0,null)
,(2,4,'A')
,(3,4,'B')
,(4,4,'C')
,(5,0,null)
,(6,6,'A')
,(7,6,'B')
,(8,0,null)
Select Line = Row_Number() over (Order by Seq)
      ,NewSeq = Sum(Flg) over (Order By Seq)
      ,GroupSequence
 From (
        Select *
              ,Flg = case when [Group] = lag([Group],1) over (Order by Seq) then 0 else 1 end
         From @YourTable
      ) A
 Order By Line
Returns
Line NewSeq GroupSequence
1 1 NULL
2 2 A
3 2 B
4 2 C
5 3 NULL
6 4 A
7 4 B
8 5 NULL
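The flag-then-running-sum approach can be reproduced outside SQL Server as a quick check. The sketch below assumes SQLite 3.25 or later through Python's sqlite3, and uses hypothetical lower-case names (your_table, seq, grp, group_sequence) because GROUP is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical names; grp stands for the question's Group column.
conn.execute("CREATE TABLE your_table (seq INT, grp INT, group_sequence TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [
        (1, 0, None), (2, 4, "A"), (3, 4, "B"), (4, 4, "C"),
        (5, 0, None), (6, 6, "A"), (7, 6, "B"), (8, 0, None),
    ],
)

# flg is 1 whenever grp differs from the previous row's grp (or there is
# no previous row); the running SUM of flg then numbers the groups.
renumbered = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (ORDER BY seq) AS line,
           SUM(flg) OVER (ORDER BY seq)     AS new_seq,
           group_sequence
    FROM (
        SELECT *,
               CASE WHEN grp = LAG(grp) OVER (ORDER BY seq)
                    THEN 0 ELSE 1 END AS flg
        FROM your_table
    ) AS a
    ORDER BY line
    """
).fetchall()

for row in renumbered:
    print(row)
```

One caveat of comparing against the previous row: two adjacent single rows that both carry Group 0 would compare equal and share a NewSeq. The sample data has no such pair, so whether that case matters depends on the real table.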

Recursive Cumulative Sum up to a certain value Postgres

I have my data that looks like this:
user_id touchpoint_number days_difference
1 1 5
1 2 20
1 3 25
1 4 10
2 1 2
2 2 30
2 3 4
I would like to create one more column that would create a cumulative sum of the days_difference, partitioned by user_id, but would reset whenever the value reaches 30 and starts counting from 0. I have been trying to do it, but I couldn't figure it out how to do it in PostgreSQL, because it has to be recursive.
The outcome I would like to have would be something like:
user_id touchpoint_number days_difference cum_sum_upto30
1 1 5 5
1 2 20 25
1 3 25 0 --- new count all over again
1 4 10 10
2 1 2 2
2 2 30 0 --- new count all over again
2 3 4 4
Do you have any cool ideas how this could be done?
This should do what you want (a, b and c stand for user_id, touchpoint_number and days_difference, and t is your table):
with recursive cte as (
      select t.a, t.b, t.c, t.c as sumc
      from t
      where b = 1
      union all
      select t.a, t.b, t.c,
             (case when t.c + cte.sumc > 30 then 0 else t.c + cte.sumc end)
      from t join
           cte
           on t.b = cte.b + 1 and t.a = cte.a
     )
select *
from cte
order by a, b;
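To check the recursive query against the sample data, here is a minimal sketch using the question's own column names. It assumes SQLite (via Python's sqlite3) as a stand-in for PostgreSQL; both accept WITH RECURSIVE, but this has only been reasoned through for the rows shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (user_id INT, touchpoint_number INT, days_difference INT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 5), (1, 2, 20), (1, 3, 25), (1, 4, 10),
     (2, 1, 2), (2, 2, 30), (2, 3, 4)],
)

# Anchor: each user's first touchpoint. Recursive step: add the next
# touchpoint's days to the running total, resetting to 0 once it exceeds 30.
result = conn.execute(
    """
    WITH RECURSIVE cte AS (
        SELECT user_id, touchpoint_number, days_difference,
               days_difference AS cum_sum_upto30
        FROM t
        WHERE touchpoint_number = 1
        UNION ALL
        SELECT t.user_id, t.touchpoint_number, t.days_difference,
               CASE WHEN t.days_difference + cte.cum_sum_upto30 > 30 THEN 0
                    ELSE t.days_difference + cte.cum_sum_upto30 END
        FROM t
        JOIN cte
          ON t.touchpoint_number = cte.touchpoint_number + 1
         AND t.user_id = cte.user_id
    )
    SELECT * FROM cte
    ORDER BY user_id, touchpoint_number
    """
).fetchall()

for row in result:
    print(row)
```

The recursion is needed because each row's value depends on the previous row's already-reset total, which a plain SUM ... OVER cannot express.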

how to add the previous values using subquery in Oracle

Hi, I have a scenario where I need to add up all the previous values of a column.
The input is this column of a table:
Col
3
5
4
6
9
7
8
And I need output in this manner:
Col Col2
3 3
5 8
4 12
6 18
9 27
7 34
8 42
Kindly reply asap
Regards,
Neeraj
As long as you have a field to order by, you can use SUM ... OVER to compute the running sum:
SELECT Col, SUM(Col) OVER (ORDER BY id) Col2
FROM Table1
ORDER BY id;
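A quick way to try the running sum is the sketch below. It assumes SQLite 3.25 or later (through Python's sqlite3) rather than Oracle, and it adds an id column to fix the row order, which the question's single-column input does not show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The id column is an assumption: a running sum needs some ordering key.
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, col INT)")
conn.executemany(
    "INSERT INTO table1 (col) VALUES (?)",
    [(v,) for v in (3, 5, 4, 6, 9, 7, 8)],
)

# With ORDER BY in the window, SUM covers all rows up to the current one.
running = conn.execute(
    """
    SELECT col, SUM(col) OVER (ORDER BY id) AS col2
    FROM table1
    ORDER BY id
    """
).fetchall()

for row in running:
    print(row)
```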

Querying sql table with multiple values

I would like to query the SQL table below
ID Val
-------------
1 5
1 7
1 8
1 9
2 5
2 7
2 9
3 1
3 5
that would return the following set of results
query > select distinct ID from dbo.table where val in (5,7,9)
result
--------
ID
1
2
I run into a problem where a single row can match only one val from the subset and not all of them...
Assuming the rows are distinct:
SELECT ID
FROM your_table
WHERE Val IN (5,7,9)
GROUP BY ID
HAVING COUNT(*) = 3
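The GROUP BY ... HAVING filter can be checked against the sample rows with the sketch below, which assumes SQLite through Python's sqlite3. It swaps COUNT(*) for COUNT(DISTINCT Val), a variant that should also tolerate duplicate (ID, Val) rows; with distinct rows, as the answer assumes, the two give the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (ID INT, Val INT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [(1, 5), (1, 7), (1, 8), (1, 9),
     (2, 5), (2, 7), (2, 9),
     (3, 1), (3, 5)],
)

# Keep only rows with a wanted value, then require that each ID has
# matched all three distinct wanted values.
ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT ID
        FROM your_table
        WHERE Val IN (5, 7, 9)
        GROUP BY ID
        HAVING COUNT(DISTINCT Val) = 3
        ORDER BY ID
        """
    )
]

print(ids)
```

ID 3 is filtered out because only one of its rows (Val 5) survives the WHERE clause, so its count never reaches 3.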