Flattening a Nested CSV in Hierarchical Order Using Scala Spark


I am trying to flatten a nested CSV in hierarchical order. A node labeled 1 is a root node, and root nodes can repeat. Every other node is a child of its immediate parent, and each node has a unique number (id).
For example, the two 2s are siblings, and 3 is a child of the last 2 before it. Similarly, node 3 (id: 1456) has child 4 (id: 2345), and the other 3 (id: 0987) has child 4 (id: 3450).
label | desc | id
1 | jhfdjhfd | 1234
2 | hdjhfdf | 3456
2 | mnvmcvn | 8765
3 | nvbvhv | 1456
4 | bvnvnv | 2345
3 | yrtuy | 0987
4 | uoio | 3450
The output should be in the following format:
label | desc | id | L0 | L1 | L2 | L3 | L4
1 | jhfdjhfd | 1234 | null | null | null | null | null
2 | hdjhfdf | 3456 | 1234 | null | null | null | null
2 | mnvmcvn | 8765 | 1234 | null | null | null | null
3 | nvbvhv | 1456 | 1234 | 8765 | null | null | null
4 | bvnvnv | 2345 | 1234 | 8765 | 1456 | null | null
3 | yrtuy | 0987 | 1234 | 8765 | null | null | null
4 | uoio | 3450 | 1234 | 8765 | 0987 | null | null
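One way to think about the required logic is a running stack of ancestor ids. The following is a minimal sketch in plain Python rather than Scala Spark, assuming that rows are processed in file order, that the label is the 1-based depth, and that the depth never increases by more than one from one row to the next. The names `flatten_hierarchy` and `max_depth` are hypothetical and chosen only for illustration.

```python
def flatten_hierarchy(rows, max_depth=5):
    """rows: (label, desc, id) tuples in file order; label is the 1-based depth."""
    ancestors = []  # ancestors[i] holds the id of the current ancestor at depth i + 1
    out = []
    for label, desc, node_id in rows:
        depth = int(label)
        # Forget any previously seen node at this depth or deeper.
        del ancestors[depth - 1:]
        # Pad the ancestor list with None up to the L0..L4 columns.
        levels = ancestors + [None] * (max_depth - len(ancestors))
        out.append((label, desc, node_id, *levels))
        # This node becomes the parent candidate for deeper rows that follow.
        ancestors.append(node_id)
    return out


# Sample data from the question; ids are kept as strings to preserve "0987".
rows = [
    ("1", "jhfdjhfd", "1234"),
    ("2", "hdjhfdf", "3456"),
    ("2", "mnvmcvn", "8765"),
    ("3", "nvbvhv", "1456"),
    ("4", "bvnvnv", "2345"),
    ("3", "yrtuy", "0987"),
    ("4", "uoio", "3450"),
]
result = flatten_hierarchy(rows)
for row in result:
    print(row)
```

Because each row's parent depends on the rows before it, a Spark version of this idea would first need an explicit row-order column; the sketch above only shows the sequential logic.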

Related

How do I write this in PySpark in an AWS Glue job?

I have this data in a DataFrame1:
cust_no | cust_name | FIRST | SECOND | THIRD | FOURTH | FIFTH | SIXTH
5432 | Smith | 0 | 2 | 3 | 6 | 3 | 0
3657 | John | 4 | 0 | 0 | 0 | 8 | 0
3562 | Rebecca | 7 | 0 | 9 | 2 | 0 | 1
9863 | Sam | 0 | 1 | 0 | 0 | 0 | 6
scorerules_df is:
score_rule | value
score_rule1 | SECOND
score_rule2 | FOURTH
score_rule3 | FIRST
score_rule4 | THIRD
score_rule5 | SIXTH
output_df should be like below:
rule | col1 | col2 | col3 | col4
cust_no | 5432 | 3657 | 3562 | 9863
cust_name | Smith | John | Rebecca | Sam
score_rule1 | SECOND | NULL | NULL | SECOND
score_rule2 | FOURTH | NULL | FOURTH | NULL
score_rule3 | NULL | FIRST | FIRST | NULL
score_rule4 | THIRD | NULL | THIRD | NULL
score_rule5 | NULL | NULL | SIXTH | SIXTH
Edit:
scorerules_df just says which column to consider for each score rule. For example, score_rule1 is 'SECOND', so if a customer's value in the SECOND column is anything other than NULL (the sample data appears to use 0 for this), the column name, i.e. SECOND, should be entered in output_df.
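The rule described above can be sketched in plain Python before translating it to PySpark. This is only an illustration of the logic, assuming that both NULL and 0 count as "no value" (which is what the sample output suggests); the function name `build_output` is hypothetical.

```python
# Sample data copied from DataFrame1 and scorerules_df in the question.
customers = [
    {"cust_no": 5432, "cust_name": "Smith", "FIRST": 0, "SECOND": 2, "THIRD": 3, "FOURTH": 6, "FIFTH": 3, "SIXTH": 0},
    {"cust_no": 3657, "cust_name": "John", "FIRST": 4, "SECOND": 0, "THIRD": 0, "FOURTH": 0, "FIFTH": 8, "SIXTH": 0},
    {"cust_no": 3562, "cust_name": "Rebecca", "FIRST": 7, "SECOND": 0, "THIRD": 9, "FOURTH": 2, "FIFTH": 0, "SIXTH": 1},
    {"cust_no": 9863, "cust_name": "Sam", "FIRST": 0, "SECOND": 1, "THIRD": 0, "FOURTH": 0, "FIFTH": 0, "SIXTH": 6},
]
score_rules = [
    ("score_rule1", "SECOND"),
    ("score_rule2", "FOURTH"),
    ("score_rule3", "FIRST"),
    ("score_rule4", "THIRD"),
    ("score_rule5", "SIXTH"),
]


def build_output(customers, score_rules):
    """Return one output row per rule, with one cell per customer."""
    rows = [
        ["cust_no"] + [c["cust_no"] for c in customers],
        ["cust_name"] + [c["cust_name"] for c in customers],
    ]
    for rule, column in score_rules:
        # Assumption: None and 0 both mean "no value", so the cell stays empty.
        rows.append([rule] + [column if c[column] else None for c in customers])
    return rows


output = build_output(customers, score_rules)
for row in output:
    print(row)
```

Each rule row carries the column name where the customer has a value and None elsewhere, which mirrors the layout of output_df.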

How can I get all rows from two tables in Postgres?

I have a problem with a JOIN of two tables.
CREATE TABLE appointment(
    idappointment serial primary key,
    idday int references days(idday),
    worktime text
);
CREATE TABLE booking(
    idbooking serial,
    idappointment int references appointment(idappointment),
    date date,
    primary key(idappointment)
);
appointment
idappointment | idday | worktime
1 | 1 | 07:00-08:00
2 | 1 | 08:00-09:00
3 | 1 | 09:00-10:00
4 | 2 | 09:00-10:00
booking
idbooking | idappointment | date
1 | 1 | 2021-08-22
1 | 2 | 2021-08-2
And I want this result:
idappointment | idday | worktime | idbooking | idappointment | date
1 | 1 | 07:00-08:00 | null | null | null
2 | 1 | 08:00-09:00 | null | null | null
3 | 1 | 09:00-10:00 | null | null | null
4 | 2 | 09:00-10:00 | null | null | null
null | null | null | 1 | 1 | 2021-08-22
null | null | null | 1 | 2 | 2021-08-2
1 | 1 | 07:00-08:00 | 1 | 1 | 2021-08-22
2 | 1 | 08:00-09:00 | 1 | 2 | 2021-08-2
How can I get this result?
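The desired result looks like three row sets stacked together: every appointment on its own, every booking on its own, and the matched pairs. A possible sketch of that reading uses UNION ALL plus an inner JOIN. It is run here through Python's built-in sqlite3 module as a stand-in for Postgres, so type handling may differ there, and the foreign key to days is left out because that table is not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE appointment (idappointment INTEGER, idday INTEGER, worktime TEXT);
CREATE TABLE booking (idbooking INTEGER, idappointment INTEGER, date TEXT);
INSERT INTO appointment VALUES
    (1, 1, '07:00-08:00'), (2, 1, '08:00-09:00'),
    (3, 1, '09:00-10:00'), (4, 2, '09:00-10:00');
-- Dates are copied verbatim from the question and stored as text.
INSERT INTO booking VALUES (1, 1, '2021-08-22'), (1, 2, '2021-08-2');
""")

QUERY = """
-- 1) every appointment, with the booking side empty
SELECT a.idappointment, a.idday, a.worktime, NULL, NULL, NULL
FROM appointment a
UNION ALL
-- 2) every booking, with the appointment side empty
SELECT NULL, NULL, NULL, b.idbooking, b.idappointment, b.date
FROM booking b
UNION ALL
-- 3) the matched appointment/booking pairs
SELECT a.idappointment, a.idday, a.worktime, b.idbooking, b.idappointment, b.date
FROM appointment a
JOIN booking b ON b.idappointment = a.idappointment
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The row order of a UNION ALL without ORDER BY is not guaranteed, so an explicit ORDER BY would be needed if the order shown in the question matters.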

Map the column value against same sequences in PostgreSQL

I want to write a query that turns the input below into the expected output.
Input:
Num Sr_no Exp_no
NULL 1 1
NULL 2 1
ABC_1 3 1
NULL 4 1
NULL 1 2
NULL 2 2
ABC_2 3 2
NULL 4 4
Expected output:
Num Sr_no Exp_no
ABC_1 1 1
ABC_1 2 1
ABC_1 3 1
ABC_1 4 1
ABC_2 1 2
ABC_2 2 2
ABC_2 3 2
ABC_2 4 4

Since the question gives no details, this answer rests on the following assumptions:
You want to fill the num field based on the exp_no grouping.
There is only one non-null num value in each exp_no group.
Try this:
with cte as (
    select distinct on (num, exp_no) num, exp_no
    from test
    where num is not null
    order by 1
)
select
    coalesce(t1.num, cte.num),
    t1.sr_no,
    t1.exp_no
from test t1
left join cte on t1.exp_no = cte.exp_no
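The same fill logic can be traced in plain Python to see what the query does with the sample input. This is a sketch under the answer's own assumptions (one non-null num per exp_no group); the function name `fill_num` is hypothetical.

```python
def fill_num(rows):
    """rows: (num, sr_no, exp_no) tuples; fill missing num from the same exp_no group."""
    known = {}
    for num, _, exp_no in rows:
        if num is not None:
            # Keep the first non-null num seen for each exp_no group.
            known.setdefault(exp_no, num)
    return [
        (num if num is not None else known.get(exp_no), sr_no, exp_no)
        for num, sr_no, exp_no in rows
    ]


# Input rows copied verbatim from the question.
rows = [
    (None, 1, 1), (None, 2, 1), ("ABC_1", 3, 1), (None, 4, 1),
    (None, 1, 2), (None, 2, 2), ("ABC_2", 3, 2), (None, 4, 4),
]
filled = fill_num(rows)
for row in filled:
    print(row)
```

Note that the last input row has exp_no 4, and no row in that group has a num, so under this assumption it stays empty rather than becoming ABC_2 as in the expected output.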

T-SQL: counting particular values in one row with multiple columns

I have a small problem with counting cells with a particular value in one row in SQL Server (SSMS).
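The table looks like this:
ID | Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | ... | 31
5000 | 1 | null | null | 1 | 1 | null | 1 | 1 | null | null | 2 | 2 | 2 | 2 | 2 | null | null | 3 | 3 | 3 | 3 | 3 | null | ... | 1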
I need to count how many cells in one row have a given value, for example 1. In this case the result would be 5.
The data represents worker shifts in a month. Note that there is a column named Month (an FK with values 1-12); I don't want to count it in the result.
The ID column is ALWAYS a 4-digit number.
One possibility is to use count(case when ...), but the examples I found use only two or three columns, not 31, so the statement would be very long. Is there any other way to count it?
Thanks for any advice.
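To make the requirement concrete, here is a small sketch in plain Python (not T-SQL) of the count being asked for. It assumes the day columns are named "1" to "31" and includes only the days whose values are shown in the question; the function name `count_in_row` is hypothetical.

```python
# Values for days 1-22 as listed in the question's sample row.
day_values = [None, None, 1, 1, None, 1, 1, None, None, 2, 2,
              2, 2, 2, None, None, 3, 3, 3, 3, 3, None]

row = {"ID": 5000, "Month": 1}
row.update({str(day): value for day, value in enumerate(day_values, start=1)})
row["31"] = 1  # days 23-30 are elided in the question and are left out here


def count_in_row(row, target, skip=("ID", "Month")):
    """Count the cells equal to target, ignoring the non-day columns."""
    return sum(1 for col, value in row.items() if col not in skip and value == target)


ones = count_in_row(row, 1)
print(ones)  # 5, matching the count stated in the question
```

The Month column also holds the value 1 in this row, which is why it has to be excluded explicitly.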

I strongly suggest that you abandon your current table design and instead store one day per record (row), not one day per column. That is, use this design:
ID | Date | Value
5000 | 2021-01-01 | NULL
5000 | 2021-01-02 | NULL
5000 | 2021-01-03 | 1
5000 | 2021-01-04 | 1
5000 | 2021-01-05 | NULL
...
5000 | 2021-01-31 | 5
Then use this query:
SELECT
    ID,
    CONVERT(varchar(7), Date, 120),
    COUNT(CASE WHEN Value = 1 THEN 1 END) AS one_cnt
FROM yourTable
GROUP BY
    ID,
    CONVERT(varchar(7), Date, 120);
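The grouped count in the query above can be mirrored in plain Python to show what it returns for the sample records. This is only an illustration of the grouping logic, not of SQL Server itself; the function name `count_value_per_month` is hypothetical, and the records are the rows listed in the suggested design (the elided rows are omitted).

```python
from collections import Counter

# (ID, Date, Value) records as listed in the suggested table design.
records = [
    (5000, "2021-01-01", None),
    (5000, "2021-01-02", None),
    (5000, "2021-01-03", 1),
    (5000, "2021-01-04", 1),
    (5000, "2021-01-05", None),
    (5000, "2021-01-31", 5),
]


def count_value_per_month(records, target):
    """Group by (ID, 'YYYY-MM') and count the records whose value equals target."""
    counts = Counter()
    for worker_id, date, value in records:
        key = (worker_id, date[:7])  # 'YYYY-MM', like CONVERT(varchar(7), Date, 120)
        # Add 0 for non-matching rows so every group still appears in the result.
        counts[key] += 1 if value == target else 0
    return dict(counts)


per_month = count_value_per_month(records, 1)
print(per_month)
```

With one row per day, counting a different shift value only means changing the target, rather than repeating a CASE expression for each of the 31 columns.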

Replace empty strings with NULL instead of empty strings when using JOIN

I have two tables:
table_a
id name
1 john
2 dave
3 tim
4 marta
5 jim
table_b
id sum random_metric
1 10.50 abc
3 11.5 efg
5 5.76 ghj
I have joined them on id
SELECT ...
FROM table_a
LEFT JOIN table_b ON table_a.id = table_b.id
and I get:
id name sum random_metric
1 john 10.5 abc
2 dave
3 tim 11.5 efg
4 marta
5 jim 5.76 ghj
Then I want to convert the sum column to double precision, but this fails because rows 2 and 4 contain empty strings.
How can I join the tables so that I get this instead:
id name sum random_metric
1 john 10.5 abc
2 dave NULL NULL
3 tim 11.5 efg
4 marta NULL NULL
5 jim 5.76 ghj
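A LEFT JOIN already yields NULL for rows with no match on the right side, so the blanks may simply be how the client displays NULL. If the column does contain real empty strings, NULLIF can turn them into NULL before the cast. The sketch below shows that idea using Python's built-in sqlite3 module as a stand-in for the actual database, assuming the sum column is stored as text; in Postgres the cast target would be double precision rather than REAL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER, name TEXT);
-- Assumption: "sum" is stored as text, since the question wants to cast it.
CREATE TABLE table_b (id INTEGER, "sum" TEXT, random_metric TEXT);
INSERT INTO table_a VALUES (1, 'john'), (2, 'dave'), (3, 'tim'), (4, 'marta'), (5, 'jim');
INSERT INTO table_b VALUES (1, '10.50', 'abc'), (3, '11.5', 'efg'), (5, '5.76', 'ghj');
""")

QUERY = """
SELECT a.id,
       a.name,
       -- NULLIF maps '' to NULL, so the cast never sees an empty string.
       CAST(NULLIF(b."sum", '') AS REAL) AS "sum",
       NULLIF(b.random_metric, '') AS random_metric
FROM table_a a
LEFT JOIN table_b b ON a.id = b.id
ORDER BY a.id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Rows 2 and 4 come back with None in both joined columns, which is Python's representation of SQL NULL.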
