Addition of columns after doing arithmetic operations in pyspark - scala

I am new to PySpark and am trying to do some data manipulation with it.
I have a DataFrame like the example below:
Trxn Cust_ID Group
3370 A 1
8809 C 2
3525 B 3
8260 A 3
6349 B 3
3359 C 3
3701 NULL 3
5572 NULL 2
2580 A 1
In this DataFrame, Trxn values are unique, Cust_ID values can repeat, and every Cust_ID belongs to one or more groups. I need a final DataFrame with new columns, one per group (Group_1, Group_2, and so on), where each column holds the number of rows that the row's Cust_ID has in that group. Below is the expected output:
Trxn Cust_ID Group Group_1 Group_2 Group_3
3370 A 1 2 0 1
8809 C 2 0 1 1
3525 B 3 0 0 2
8260 A 3 2 0 1
6349 B 3 0 0 2
3359 C 3 0 1 1
3701 NULL 3 0 1 1
5572 NULL 2 0 1 1
2580 A 1 2 0 1
Can someone let me know how to get this exact output in pyspark? Any help or hints would be greatly appreciated.

It looks like you are trying to do a pivot here:
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
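To make the pivot idea concrete, here is a minimal sketch in plain Python (not PySpark) of the counting logic behind the expected output: count rows per (Cust_ID, Group) pair, then attach each customer's counts to every one of its rows. The function name add_group_counts is made up for this illustration. In PySpark the same shape would presumably come from df.groupBy("Cust_ID").pivot("Group").count(), with missing counts filled with 0 and the result joined back to the original DataFrame. Note that an ordinary equality join does not match NULL keys, so the NULL customer rows would need a null-safe comparison such as Column.eqNullSafe.

```python
from collections import Counter

# Sample rows from the question: (Trxn, Cust_ID, Group); None stands in for NULL.
rows = [
    (3370, "A", 1), (8809, "C", 2), (3525, "B", 3),
    (8260, "A", 3), (6349, "B", 3), (3359, "C", 3),
    (3701, None, 3), (5572, None, 2), (2580, "A", 1),
]


def add_group_counts(rows):
    """Append one count column per group to every row (hypothetical helper)."""
    # Distinct group values become the pivoted column names.
    groups = sorted({group for _, _, group in rows})
    # Number of rows for each (customer, group) pair; missing pairs count as 0.
    counts = Counter((cust, group) for _, cust, group in rows)
    header = ["Trxn", "Cust_ID", "Group"] + [f"Group_{g}" for g in groups]
    body = [
        (trxn, cust, group, *(counts[(cust, g)] for g in groups))
        for trxn, cust, group in rows
    ]
    return header, body


header, result = add_group_counts(rows)
print(header)
for row in result:
    print(row)
```

With the sample data this reproduces the rows of the expected output, including the two NULL customer rows, which are counted together as one customer.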

Related

query customer retention over range

I am trying to find the best way to accomplish the following.
Get the beginning customer count, which carries from the previous day
Get New Customer count
Get the number of Customers who have not come in since the prior month
Get the number of Customers who have come back after lapsing
Get the number of total customers
Take the following example data:
Customer ID  Store ID  Date    Amount
1            1         1/2/22  1.00
2            2         1/2/22  2.00
1            1         2/2/22  1.00
3            2         3/2/22  1.00
2            2         3/2/22  1.00
1            1         3/2/22  1.00
1            1         4/2/22  1.00
4            1         4/2/22  1.00
2            2         4/2/22  1.00
The result would be
Date    Store  Beginning  New  Dropped  Returned  Total
1/2/22  1      0          1    0        0         1
1/2/22  2      0          1    0        0         1
2/2/22  1      1          0    0        0         1
2/2/22  2      1          0    1        0         0
3/2/22  1      1          0    0        0         1
3/2/22  2      0          1    0        1         2
4/2/22  1      1          1    0        0         2
4/2/22  2      2          0    1        0         1
I have a partial query, but it is not returning the right results:
WITH customerset AS (
    SELECT
        location_id,
        date,
        array_agg(DISTINCT customer_id ORDER BY customer_id ASC) AS customer_ids
    FROM customer_orders
    GROUP BY
        location_id,
        date
)
SELECT
    cset.location_id,
    cset.date,
    array_length(cset2.customer_ids, 1) AS beginning,
    array_length((past2.customer_ids - cset.customer_ids), 1) AS dropped,
    array_length((cset.customer_ids - past2.customer_ids), 1) AS returned
FROM
    (
        SELECT
            ords.location_id,
            ords.date,
            array_agg(DISTINCT ords.customer_id ORDER BY ords.customer_id ASC) AS customers_id
        FROM customer_orders ords
        GROUP BY
            ords.location_id,
            ords.date
    ) cset
JOIN
    customerset cset2 ON cset.date - '1 month'::interval = cset2.date
    AND cset2.location_id = cset.location_id
GROUP BY
    cset.location_id,
    cset.date,
    cset2.customer_ids,
    cset.customer_ids
ORDER BY
    cset.date ASC

Map the column value against same sequences in PostgreSQL

I want to write a query that turns the input below into the expected output.
Input:
Num Sr_no Exp_no
NULL 1 1
NULL 2 1
ABC_1 3 1
NULL 4 1
NULL 1 2
NULL 2 2
ABC_2 3 2
NULL 4 4
Expected output:
Num Sr_no Exp_no
ABC_1 1 1
ABC_1 2 1
ABC_1 3 1
ABC_1 4 1
ABC_2 1 2
ABC_2 2 2
ABC_2 3 2
ABC_2 4 4
Since the question gives few details, this answer makes the following assumptions:
You want to fill the num field based on exp_no grouping.
There is only one non-null num value in each exp_no group.
Try this:
with cte as
(
    -- one row per non-null num within each exp_no group
    select distinct on (num, exp_no) num, exp_no
    from test
    where num is not null
    order by 1
)
select
    coalesce(t1.num, cte.num) as num,
    t1.sr_no,
    t1.exp_no
from test t1
left join cte on t1.exp_no = cte.exp_no
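The fill logic of the query above can also be traced in a few lines of plain Python. This is only a sketch under the answer's stated assumptions (one non-null num per exp_no group); the function name fill_num_by_exp_no is invented for the illustration, and None stands in for NULL.

```python
# Rows from the question as (num, sr_no, exp_no); None stands in for NULL.
rows = [
    (None, 1, 1), (None, 2, 1), ("ABC_1", 3, 1), (None, 4, 1),
    (None, 1, 2), (None, 2, 2), ("ABC_2", 3, 2), (None, 4, 4),
]


def fill_num_by_exp_no(rows):
    """Fill missing num values from the non-null num of the same exp_no."""
    # Counterpart of the CTE: remember the non-null num for each exp_no.
    # If a group had several, this sketch keeps the first one it sees.
    lookup = {}
    for num, _, exp_no in rows:
        if num is not None:
            lookup.setdefault(exp_no, num)
    # Counterpart of LEFT JOIN plus COALESCE: keep num when present,
    # otherwise fall back to the group's value (None if the group has none).
    return [
        (num if num is not None else lookup.get(exp_no), sr_no, exp_no)
        for num, sr_no, exp_no in rows
    ]


filled = fill_num_by_exp_no(rows)
for row in filled:
    print(row)
```

One thing this makes visible: the last input row has exp_no 4, and that group contains no non-null num, so grouping by exp_no leaves it unfilled, whereas the expected output in the question shows ABC_2 there.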

A dictionary with a single value and multiple keys

What do dictionaries with a single value and multiple keys stand for?
What is their purpose?
I accidentally created one, but cannot do anything with it:
q)type (`a`b`c)!(`d)
99h
q)((`a`b`c)!(`d))[`a]
'par
That special form usually denotes the flip of a partitioned table, where the keys represent the column names and the value represents the table name:
q)/load a database with partitioned table part_tab
q)flip part_tab
`ncej`jogn`ciha`hkpb`aeaj`blmj`ooei`jgjm`cflm`bpmc!`part_tab
This dictionary is not intended to be looked up in the usual manner, so the lookup you attempted will not work.
It would be completely ill-advised, but it is possible to restrict the columns of a partitioned table by manipulating this dictionary:
q)select from part_tab where date=2020.01.02
date       ncej jogn ciha hkpb aeaj blmj ooei jgjm cflm bpmc
------------------------------------------------------------
2020.01.02 0    0    0    0    0    0    0    0    0    0
2020.01.02 1    1    1    1    1    1    1    1    1    1
2020.01.02 2    2    2    2    2    2    2    2    2    2
2020.01.02 3    3    3    3    3    3    3    3    3    3
...
q)part_tab:flip`ncej`jogn`ciha!`part_tab
q)select from part_tab where date=2020.01.02
date       ncej jogn ciha
-------------------------
2020.01.02 0    0    0
2020.01.02 1    1    1
2020.01.02 2    2    2
...
Again, don't try this on any large or production tables; it's an undocumented quirk.
Splayed tables have a similar dictionary when flipped:
q)flip splay
`ncej`jogn`ciha`hkpb`aeaj`blmj`ooei`jgjm`cflm`bpmc!`:splay/
The difference is that the table name has a "/" at the end and is hsym'd. This is how .Q.qp determines whether a table is partitioned or splayed.

select top n posts by score count

I am trying to get the top n users by post count using Hive. The table looks like this:
Score User
10 1
20 2
50 1
20 2
0 3
3 1
40 2
...
I want to generate output like this:
Rows Users
3 1
3 2
1 3
Here is my query:
SELECT * FROM (SELECT COUNT(score) as Score, UserID AS COUNT FROM A WHERE UserID IS NOT NULL GROUP BY UserID,score LIMIT 10) A;
The output I get looks something like this:
0 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
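Can someone point out where I am going wrong?
Group by UserID only. Grouping by score as well produces one row per (UserID, score) pair instead of one row per user:
SELECT COUNT(score) as Score, UserID FROM A WHERE UserID IS NOT NULL GROUP BY UserID LIMIT 10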
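The effect of the GROUP BY columns can be checked quickly with Python's built-in sqlite3 module. This is a sketch only: SQLite stands in for Hive here, the table is loaded with the seven sample rows from the question, and the alias post_count and the ORDER BY clause are additions for the illustration (a LIMIT without an ORDER BY does not guarantee which rows come back, so a "top n" query would normally sort by the count).

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table A.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (score INTEGER, UserID INTEGER)")
conn.executemany(
    "INSERT INTO A (score, UserID) VALUES (?, ?)",
    [(10, 1), (20, 2), (50, 1), (20, 2), (0, 3), (3, 1), (40, 2)],
)

# Grouping by UserID only: one row per user, sorted by post count.
per_user = conn.execute(
    "SELECT COUNT(score) AS post_count, UserID FROM A "
    "WHERE UserID IS NOT NULL "
    "GROUP BY UserID "
    "ORDER BY post_count DESC, UserID "
    "LIMIT 10"
).fetchall()

# Grouping by UserID and score, as in the question: one row per distinct pair.
per_user_and_score = conn.execute(
    "SELECT COUNT(score), UserID FROM A "
    "WHERE UserID IS NOT NULL "
    "GROUP BY UserID, score"
).fetchall()

print(per_user)
print(len(per_user_and_score))
conn.close()
```

The first query returns one (count, user) row per user, matching the Rows/Users table the question asks for, while the second returns a separate row for every distinct score a user has.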

Reorder Ranked rows

Recently I needed to implement a way to allow table records to be ranked.
Initially I deployed an UPDATE statement to seed the ranks:
;with cte as (
    select
        t.id,
        Rank() Over (
            Partition by t.field2
            Order by t.id
        ) as [Rank],
        t.[index],  -- INDEX is a reserved keyword, so the column name is bracketed
        t.field2,
        t.field3,
        t.field4
    from dbo.Table t
    where t.field2 = @fldValue
)
Update cte
set [index] = [Rank]
But now I need to let the end user reorder the ranks. Any suggestions on how to let an end user move the row at rank 92 to rank 15 and have everything else re-ranked appropriately?
I had thought about doing this with a cursor, but I am trying to do it as a set-based operation.
My first go-to was a procedural approach, but I need to get more in line with set-based operations.
Table Schema
Table:
id bigint
field2 int
field3 int ---> This field will be the key pivoting column for ranking
field4 int
Data:
id field2 field3 field4
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 2
5 0 1 2
6 0 1 1
7 0 1 1
8 0 1 1
9 0 1 1
10 0 1 2
11 0 1 2
12 0 1 1
13 0 1 1
14 0 1 1
15 0 1 2
16 0 1 1
17 0 1 2
18 0 1 2
19 0 1 1