How do I write this in PySpark in an AWS Glue job?

I have this data in DataFrame1:
cust_no  cust_name  FIRST  SECOND  THIRD  FOURTH  FIFTH  SIXTH
5432     Smith      0      2       3      6       3      0
3657     John       4      0       0      0       8      0
3562     Rebecca    7      0       9      2       0      1
9863     Sam        0      1       0      0       0      6
scorerules_df is:
score_rule   value
score_rule1  SECOND
score_rule2  FOURTH
score_rule3  FIRST
score_rule4  THIRD
score_rule5  SIXTH
output_df should be like below:
rule         col1    col2   col3     col4
cust_no      5432    3657   3562     9863
cust_name    Smith   John   Rebecca  Sam
score_rule1  SECOND  NULL   NULL     SECOND
score_rule2  FOURTH  NULL   FOURTH   NULL
score_rule3  NULL    FIRST  FIRST    NULL
score_rule4  THIRD   NULL   THIRD    NULL
score_rule5  NULL    NULL   SIXTH    SIXTH
Edit:
scorerules_df just says which column to consider for each score rule. score_rule1 is 'SECOND', so if a customer's value in the SECOND column is anything other than NULL (or 0, going by the sample data), the column name, i.e. SECOND, should be entered in output_df.
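
A minimal sketch of one way to do this, assuming DataFrame1 is bound to a variable named df, spark is the Glue job's SparkSession, and, per the sample data, 0 counts as empty. The final transpose collects everything to the driver, so it only suits small frames like this one:

from pyspark.sql import functions as F

# scorerules_df is tiny, so pull it to the driver as {rule: column} pairs.
rules = {r["score_rule"]: r["value"] for r in scorerules_df.collect()}

flagged = df
for rule, col_name in rules.items():
    # Emit the source column's name when a value is present (non-NULL and
    # non-zero, per the sample data); when() without otherwise() yields NULL.
    flagged = flagged.withColumn(
        rule,
        F.when(F.col(col_name).isNotNull() & (F.col(col_name) != 0),
               F.lit(col_name)),
    )

# Transpose: one output column per customer, one output row per field.
rows = flagged.select("cust_no", "cust_name", *rules).collect()
fields = ["cust_no", "cust_name"] + list(rules)
out_rows = [
    [f] + [None if r[f] is None else str(r[f]) for r in rows]
    for f in fields
]
output_df = spark.createDataFrame(
    out_rows, ["rule"] + [f"col{i + 1}" for i in range(len(rows))]
)
output_df.show()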

Related

Map the column value against same sequences in PostgreSQL

I want to write a query for the input and output below.
Input :-
Num Sr_no Exp_no
NULL 1 1
NULL 2 1
ABC_1 3 1
NULL 4 1
NULL 1 2
NULL 2 2
ABC_2 3 2
NULL 4 4
Expected Output:-
Num Sr_no Exp_no
ABC_1 1 1
ABC_1 2 1
ABC_1 3 1
ABC_1 4 1
ABC_2 1 2
ABC_2 2 2
ABC_2 3 2
ABC_2 4 4
As there are no details in the question, this answer is based on two assumptions:
you want to fill the num field based on exp_no grouping, and
there is only one non-null num value per exp_no group.
Try this:
with cte as (
  -- one (num, exp_no) pair per group, ignoring NULL nums
  select distinct on (num, exp_no) num, exp_no
  from test
  where num is not null
  order by 1
)
select
  coalesce(t1.num, cte.num) as num,
  t1.sr_no,
  t1.exp_no
from test t1
left join cte on t1.exp_no = cte.exp_no
DEMO

A dictionary with a single value and multiple keys

What do dictionaries with a single value and multiple keys stand for?
What is their purpose?
I've accidentally created one, but cannot do anything with it:
q)type (`a`b`c)!(`d)
99h
q)((`a`b`c)!(`d))[`a]
'par
That special form usually denotes the flip of a partitioned table, where the keys represent the column names and the value represents the table name:
q)/load a database with partitioned table part_tab
q)flip part_tab
`ncej`jogn`ciha`hkpb`aeaj`blmj`ooei`jgjm`cflm`bpmc!`part_tab
This dictionary is not intended to be looked up in the usual manner, and certainly not in the way that you've attempted.
It would be completely ill-advised, but it is possible to restrict the columns of a partitioned table by manipulating this dictionary:
q)select from part_tab where date=2020.01.02
date ncej jogn ciha hkpb aeaj blmj ooei jgjm cflm bpmc
------------------------------------------------------------
2020.01.02 0 0 0 0 0 0 0 0 0 0
2020.01.02 1 1 1 1 1 1 1 1 1 1
2020.01.02 2 2 2 2 2 2 2 2 2 2
2020.01.02 3 3 3 3 3 3 3 3 3 3
...
q)part_tab:flip`ncej`jogn`ciha!`part_tab
q)select from part_tab where date=2020.01.02
date ncej jogn ciha
-------------------------
2020.01.02 0 0 0
2020.01.02 1 1 1
2020.01.02 2 2 2
...
Again, don't try this on any large/production tables; it's an undocumented quirk.
Splayed tables have a similar dictionary when flipped:
q)flip splay
`ncej`jogn`ciha`hkpb`aeaj`blmj`ooei`jgjm`cflm`bpmc!`:splay/
The difference is that the table name has a "/" at the end and is hsym'd. This is how .Q.qp determines whether a table is partitioned or splayed.

T-SQL counting particular values in one row with multiple columns

I have a little problem with counting cells that hold a particular value in one row in SSMS.
Table looks like:
ID   | Month | 1    | 2    | 3 | 4 | 5    | 6 | 7 | 8    | 9    | 10 | 11 | 12 | 13 | 14 | 15   | 16   | 17 | 18 | 19 | 20 | 21 | 22   | ... | 31
5000 | 1     | null | null | 1 | 1 | null | 1 | 1 | null | null | 2  | 2  | 2  | 2  | 2  | null | null | 3  | 3  | 3  | 3  | 3  | null | ... | 1
I need to count how many cells in one row have a particular value, for example 1. In this case it would be 5.
The data represents worker shifts in a month. Be aware that there is a column named Month (an FK with values 1-12); I don't want that counted in the result.
The ID column is ALWAYS a 4-digit number.
One possibility is to use COUNT(CASE WHEN ...), but the examples out there only have two or three columns, not 31, so the statement would be very long. Is there any other option to count it?
Thanks for any advice.
I'm going to strongly suggest that you abandon your current table design and instead store one day per record, not per column. That is, use this design:
ID | Date | Value
5000 | 2021-01-01 | NULL
5000 | 2021-01-02 | NULL
5000 | 2021-01-03 | 1
5000 | 2021-01-04 | 1
5000 | 2021-01-05 | NULL
...
5000 | 2021-01-31 | 5
Then use this query:
SELECT
    ID,
    CONVERT(varchar(7), Date, 120) AS [Month],  -- style 120 trimmed to 7 chars gives 'yyyy-MM'
    COUNT(CASE WHEN Value = 1 THEN 1 END) AS one_cnt
FROM yourTable
GROUP BY
    ID,
    CONVERT(varchar(7), Date, 120);

Create Pivot Table using PostgreSQL

I have a table like this:
type code desc store Sales/Day Stock
-----------------------------------------------
1 AA1 abc 101 3 6
1 AA2 abd 101 4 0
1 AA3 abf 101 4 3
2 BA1 bba 101 5 1
2 BA2 bbc 101 2 1
1 AA1 abc 102 1 4
1 AA2 abd 102 2 0
2 BA1 bba 102 4 2
2 BA2 bbc 102 5 5
etc.
How can I show the result table like this:
type code desc Store 101 Store 102
Sales/Day | Stock Sales/Day | Stock
--------------------------------------------------------------
1 AA1 abc 3 6 1 4
1 AA2 abd 4 0 2 0
1 AA3 abf 4 3 0 0
2 BA1 bba 5 1 4 2
2 BA2 bbc 2 1 5 5
etc.
Note: the colspan is for display only.
demo:db<>fiddle
First way: FILTER
SELECT
type,
code,
"desc",
COALESCE(SUM(sales_day) FILTER (WHERE store = 101), 0) as sales_day_101,
COALESCE(SUM(stock) FILTER (WHERE store = 101), 0) as stock_101,
COALESCE(SUM(sales_day) FILTER (WHERE store = 102), 0) as sales_day_102,
COALESCE(SUM(stock) FILTER (WHERE store = 102), 0) as stock_102
FROM mytable
GROUP BY type, code, "desc"
ORDER BY type, code
This aggregates your values. I took SUM, but since your rows are distinct, many other aggregate functions would do. FILTER allows you to aggregate only the rows of one store.
The COALESCE is to avoid NULL values when no rows are present for an aggregation (like AA3 in store 102).
Second way, CASE WHEN
SELECT
type,
code,
"desc",
SUM(CASE WHEN store = 101 THEN sales_day ELSE 0 END) as sales_day_101,
SUM(CASE WHEN store = 101 THEN stock ELSE 0 END) as stock_101,
SUM(CASE WHEN store = 102 THEN sales_day ELSE 0 END) as sales_day_102,
SUM(CASE WHEN store = 102 THEN stock ELSE 0 END) as stock_102
FROM mytable
GROUP BY type, code, "desc"
ORDER BY type, code
The idea is the same, but the newer FILTER clause is replaced by the more common CASE expression.
Notice that "desc" is a reserved word in Postgres, so I strongly recommend renaming your column.

Addition of columns after doing arithmetic operations in pyspark

I am new to PySpark and I am trying to do some data manipulation with it.
I have a DataFrame like the example below:
Trxn Cust_ID Group
3370 A 1
8809 C 2
3525 B 3
8260 A 3
6349 B 3
3359 C 3
3701 NULL 3
5572 NULL 2
2580 A 1
In this DF, Trxn values are unique, Cust_ID values can repeat, and every Cust_ID belongs to some group. I need a final DataFrame with new group columns (Group_1, Group_2, and so on) that hold, for each Cust_ID, the count of its transactions in each group. Below is the output example:
Trxn Cust_ID Group Group_1 Group_2 Group_3
3370 A 1 2 0 1
8809 C 2 0 1 1
3525 B 3 0 0 2
8260 A 3 2 0 1
6349 B 3 0 0 2
3359 C 3 0 1 1
3701 NULL 3 0 1 1
5572 NULL 2 0 1 1
2580 A 1 2 0 1
Can someone let me know how to get this exact output in pyspark? Any help or hints would be greatly appreciated.
It seems like you are trying to do a pivot here.
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
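
As a rough sketch of that pivot idea, assuming the input frame is bound to a variable named df with the columns shown above:

from pyspark.sql import functions as F

# Count each customer's transactions per group: pivot("Group") creates one
# column per distinct Group value (here named 1, 2, 3).
counts = df.groupBy("Cust_ID").pivot("Group").agg(F.count("Trxn"))

# Rename the pivoted columns to Group_1, Group_2, ... and turn the NULLs
# that pivot produces for missing combinations into 0.
for c in counts.columns:
    if c != "Cust_ID":
        counts = counts.withColumnRenamed(c, f"Group_{c}")
counts = counts.na.fill(0)

# Join the counts back onto the original rows; eqNullSafe also matches the
# NULL Cust_ID rows, which a plain equality join would leave unmatched.
result = (
    df.join(counts.withColumnRenamed("Cust_ID", "cid"),
            df["Cust_ID"].eqNullSafe(F.col("cid")),
            "left")
      .drop("cid")
)
result.show()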