I have the following table with the parent_path array:
Id | Account Name | parent_path
---|--------------|------------
1  | A            | {1}
2  | B            | {2,1}
3  | C            | {3,2,1}
4  | D            | {4,3,2,1}
What I'm looking to do is a recursive left join that creates one column per item in the parent_path array:
Id | Account Name | parent_path | parent_name1 | parent_name2 | parent_name3
---|--------------|-------------|--------------|--------------|-------------
1  | A            | NULL        | NULL         | NULL         | NULL
2  | B            | {1}         | A            | NULL         | NULL
3  | C            | {2,1}       | B            | A            | NULL
4  | D            | {3,2,1}     | C            | B            | A
Thanks!
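For reference, a minimal setup matching this data (the table name hier comes from the answer below):

CREATE TABLE hier (
    id           int PRIMARY KEY,
    account_name text NOT NULL,
    parent_path  int[] NOT NULL
);

INSERT INTO hier (id, account_name, parent_path) VALUES
    (1, 'A', '{1}'),
    (2, 'B', '{2,1}'),
    (3, 'C', '{3,2,1}'),
    (4, 'D', '{4,3,2,1}');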
This is an abuse of SQL, but here goes:
with get_names as (
    -- expand each row's parent_path, keeping the array position (rn),
    -- and collect the name for every id along the path
    select h.id, h.account_name, h.parent_path,
           array_agg(h2.account_name order by p.rn) as name_path
    from hier h
    cross join lateral unnest(h.parent_path) with ordinality as p(path_id, rn)
    join hier h2 on h2.id = p.path_id
    group by h.id, h.account_name, h.parent_path
)
select id, account_name, parent_path,
       -- name_path[1] is the row's own name, so the parents start at index 2
       name_path[2] as parent_name1,
       name_path[3] as parent_name2,
       name_path[4] as parent_name3,
       name_path[5] as parent_name4,
       name_path[6] as parent_name5,
       name_path[7] as parent_name6,
       name_path[8] as parent_name7,
       name_path[9] as parent_name8
from get_names;
id | account_name | parent_path | parent_name1 | parent_name2 | parent_name3 | parent_name4 | parent_name5 | parent_name6 | parent_name7 | parent_name8
----+--------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------
1 | A | {1} | | | | | | | |
2 | B | {2,1} | A | | | | | | |
3 | C | {3,2,1} | B | A | | | | | |
4 | D | {4,3,2,1} | C | B | A | | | | |
(4 rows)
There is a cleaner solution using the PostgreSQL intarray extension instead. It works best for small(ish) tables, since it's not optimised for performance:
CREATE EXTENSION intarray;
SELECT id,
name,
path - id as parents,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[1]) as parent_1,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[2]) as parent_2,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[3]) as parent_3,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[4]) as parent_4,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[5]) as parent_5,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[6]) as parent_6,
(SELECT name FROM hierarchy h2 WHERE h2.id = (h.path - h.id)[7]) as parent_7
FROM hierarchy h
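The query assumes a table hierarchy(id, name, path), where path is an int[] holding the root-first path including the row's own id, so path - id leaves just the parents. A setup consistent with the truncated output below might look like this (the rows are reconstructed from that output, not the original data):

CREATE TABLE hierarchy (
    id   int PRIMARY KEY,
    name text NOT NULL,
    path int[] NOT NULL
);

-- assumed sample rows; only these appear in the truncated output below
INSERT INTO hierarchy (id, name, path) VALUES
    (1,  'Europe',      '{1}'),
    (2,  'Germany',     '{1,2}'),
    (3,  'Berlin',      '{1,2,3}'),
    (4,  'Netherlands', '{1,4}'),
    (7,  'Africa',      '{7}'),
    (10, 'France',      '{1,10}'),
    (12, 'America',     '{12}'),
    (17, 'Finland',     '{1,17}');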
In my code, it produced the following (truncated) output:
+--+-----------+-------+--------+--------+--------+
|id|name |parents|parent_1|parent_2|parent_3|
+--+-----------+-------+--------+--------+--------+
|1 |Europe | |NULL |NULL |NULL |
|2 |Germany |{1} |Europe |NULL |NULL |
|4 |Netherlands|{1} |Europe |NULL |NULL |
|7 |Africa | |NULL |NULL |NULL |
|10|France |{1} |Europe |NULL |NULL |
|12|America | |NULL |NULL |NULL |
|17|Finland |{1} |Europe |NULL |NULL |
|3 |Berlin |{1,2} |Europe |Germany |NULL |
+--+-----------+-------+--------+--------+--------+
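The path - id trick works because intarray's - operator removes all occurrences of the right-hand value while preserving the order of the remaining elements:

SELECT '{1,2,3}'::int[] - 3;  -- {1,2}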
I have a PySpark dataframe with multiple map columns, and I want to flatten all map columns recursively. Here personal and financial are map-type columns, but there may be more map columns as well.
Input dataframe:
-------------------------------------------------------------------------------------------------------
| id | name | Gender | personal | financial |
-------------------------------------------------------------------------------------------------------
| 1 | A | M | {age:20,city:Dallas,State:Texas} | {salary:10000,bonus:2000,tax:1500}|
| 2 | B | F | {city:Houston,State:Texas,Zipcode:77001} | {salary:12000,tax:1800} |
| 3 | C | M | {age:22,city:San Jose,Zipcode:940088} | {salary:2000,bonus:500} |
-------------------------------------------------------------------------------------------------------
Output dataframe:
--------------------------------------------------------------------------------------------------------------
| id | name | Gender | age | city | State | Zipcode | salary | bonus | tax |
--------------------------------------------------------------------------------------------------------------
| 1 | A | M | 20 | Dallas | Texas | null | 10000 | 2000 | 1500 |
| 2 | B | F | null | Houston | Texas | 77001 | 12000 | null | 1800 |
| 3 | C | M | 22 | San Jose | null | 940088 | 2000 | 500 | null |
--------------------------------------------------------------------------------------------------------------
Use map_concat to merge the map fields and then explode the result. Exploding a map column creates two new columns, key and value. Pivot the key column with value as the values to get your desired output.
import pyspark.sql.functions as func

# keep the non-map columns, merge the two maps and explode into key/value rows,
# then pivot the keys back out as columns
data_sdf. \
    withColumn('personal_financial', func.map_concat('personal', 'financial')). \
    selectExpr(*[c for c in data_sdf.columns if c not in ['personal', 'financial']],
               'explode(personal_financial)'
               ). \
    groupBy([c for c in data_sdf.columns if c not in ['personal', 'financial']]). \
    pivot('key'). \
    agg(func.first('value')). \
    show(truncate=False)
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |id |name|gender|State|Zipcode|age |bonus|city |salary|tax |
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |1 |A |M |Texas|null |20 |2000 |Dallas |10000 |1500|
# |2 |B |F |Texas|77001 |null|null |Houston |12000 |1800|
# |3 |C |M |null |940088 |22 |500 |San Jose|2000 |null|
# +---+----+------+-----+-------+----+-----+--------+------+----+
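For reference, roughly the same thing can be written with Spark SQL's PIVOT clause (Spark 2.4+); a sketch, where the view name data_tbl and the explicit key list are assumptions:

SELECT * FROM (
    SELECT id, name, Gender, key, value
    FROM data_tbl
    LATERAL VIEW explode(map_concat(personal, financial)) kv AS key, value
)
PIVOT (
    first(value) FOR key IN ('age', 'city', 'State', 'Zipcode', 'salary', 'bonus', 'tax')
)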
I have a problem with pivot tables and I don't understand what to do. My table is as follows:
|CODART|MONTH|QT |
|------|-----|----|
|ART1 |1 |100 |
|ART2 |1 |30 |
|ART3 |1 |30 |
|ART1 |2 |10 |
|ART4 |2 |40 |
|ART3 |4 |50 |
|ART5 |4 |60 |
I would like to get a summary table by month:
|CODART|1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
|ART1 |100|10 | | | | | | | | | | |
|ART2 |30 | | | | | | | | | | | |
|ART3 |30 | | |50 | | | | | | | | |
|ART4  |   |40 |   |   |   |   |   |   |   |   |   |   |
|ART5 | | | |60 | | | | | | | | |
|TOTAL |160|50 |   |110|   |   |   |   |   |   |   |   |
Too many requests? :-)
Thanks for the support
WITH MYTAB (CODART, MONTH, QT) AS
(
VALUES
('ART1', 1, 100)
, ('ART2', 1, 30)
, ('ART3', 1, 30)
, ('ART1', 2, 10)
, ('ART4', 2, 40)
, ('ART3', 4, 50)
, ('ART5', 4, 60)
)
SELECT
CASE GROUPING (CODART) WHEN 0 THEN CODART ELSE 'TOTAL' END AS CODART
, SUM (CASE MONTH WHEN 1 THEN QT END) AS "1"
, SUM (CASE MONTH WHEN 2 THEN QT END) AS "2"
, SUM (CASE MONTH WHEN 3 THEN QT END) AS "3"
, SUM (CASE MONTH WHEN 4 THEN QT END) AS "4"
-- ... months 5 through 11 elided ...
, SUM (CASE MONTH WHEN 12 THEN QT END) AS "12"
FROM MYTAB T
GROUP BY ROLLUP (T.CODART)
ORDER BY GROUPING (T.CODART), T.CODART
Result (months 5-11 omitted, matching the query above):
|CODART|1  |2  |3  |4  |12 |
|------|---|---|---|---|---|
|ART1  |100|10 |   |   |   |
|ART2  |30 |   |   |   |   |
|ART3  |30 |   |   |50 |   |
|ART4  |   |40 |   |   |   |
|ART5  |   |   |   |60 |   |
|TOTAL |160|50 |   |110|   |
I have a task that involves working with hierarchical data, but the source data contains errors in the hierarchy: some parent-child links are broken. I have an algorithm for re-establishing such connections, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
As you can see from the data, the connections to C1 and D3 are lost here.
In order to restore connections, I need to apply the following algorithm for this table:
if for some NAME the ID does not appear in the PARENTID column (like ID = 18 or 10), then create a 'parent' row with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), taking the ID and NAME from each node on the level above whose ID is less than the current ID.
Result must be like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Rows marked with # are the newly created rows.
Are there any ideas on how to implement this algorithm in Spark/Scala? Thanks!
You can build a createdRows dataframe from your current dataframe and union it with your current dataframe to obtain the final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVEL) that do not appear in the PARENTID column. You can use a self left anti join to do that.
Then, you rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then, you take the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, you apply your condition ID < PARENTID.
You end up with the following code, dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col
val createdRows = dataframe
// if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
.select("LEVEL", "ID")
.filter(col("LEVEL") > 1) // Remove root node from created rows
.join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
// then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
.withColumnRenamed("ID", "PARENTID")
.withColumn("LEVEL", col("LEVEL") - 1)
// and take ID and NAME
.join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
// such that the current ID < ID of the node from the LEVEL above.
.filter(col("ID") < col("PARENTID"))
val result = dataframe
.unionByName(createdRows)
.orderBy("NAME", "PARENTID") // Optional, if you want an ordered result
And in the result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
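For comparison, the whole repair can also be written in Spark SQL; a sketch, assuming the input is registered as a temp view named nodes (the PARENTID IS NOT NULL guard in the NOT IN subquery is needed because of SQL NULL semantics):

SELECT NAME, ID, PARENTID, LEVEL FROM nodes
UNION ALL
-- create a parent row for every orphan (an ID never referenced as a PARENTID)
SELECT p.NAME, p.ID, o.ID AS PARENTID, o.LEVEL - 1 AS LEVEL
FROM nodes o
JOIN nodes p
  ON p.LEVEL = o.LEVEL - 1
 AND p.ID < o.ID
WHERE o.LEVEL > 1
  AND o.ID NOT IN (SELECT PARENTID FROM nodes WHERE PARENTID IS NOT NULL)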
Using Spark Scala, I want to create a new column end_date for each id, holding the start_date value of the updated record for the same id.
Consider the following Data frame:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | b | 1/1/2018 |
| 3 | c | 1/1/2018 |
| 4 | d | 1/1/2018 |
| 1 | e | 10/1/2018|
+---+-----+----------+
Here the initial start_date of id=1 is 1/1/2018 with value a, and on 10/1/2018 (start_date) the value of id=1 became e. So I have to populate a new end_date column: for the first record of id=1 it should be 10/1/2018, and NULL for all other records.
Result should be like below:
+---+-----+----------+---------+
| id|Value|start_date|end_date |
+---+-----+----------+---------+
| 1 | a | 1/1/2018 |10/1/2018|
| 2 | b | 1/1/2018 |NULL |
| 3 | c | 1/1/2018 |NULL |
| 4 | d | 1/1/2018 |NULL |
| 1 | e | 10/1/2018|NULL |
+---+-----+----------+---------+
I am using Spark 2.3.
Can anyone help me out here, please?
With the window function lead:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._

val df = List(
  (1, "a", "1/1/2018"),
  (2, "b", "1/1/2018"),
  (3, "c", "1/1/2018"),
  (4, "d", "1/1/2018"),
  (1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")

// lead looks at the next row within each id, ordered by start_date.
// Note: start_date is a string here; with real data, parse it to a date
// before ordering, since the string order is only coincidentally correct.
val idWindow = Window.partitionBy($"id")
  .orderBy($"start_date")

val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))
result.show(false)
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
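The same query in Spark SQL, as a sketch; the view name events is an assumption (register it with df.createOrReplaceTempView("events")):

SELECT id, Value, start_date,
       LEAD(start_date, 1) OVER (PARTITION BY id ORDER BY start_date) AS end_date
FROM events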
How can I sort a hierarchical table with a CTE query?
Sample table:
|ID|Name |ParentID|
| 0| |-1 |
| 1|1 |0 |
| 2|2 |0 |
| 3|1-1 |1 |
| 4|1-2 |1 |
| 5|2-1 |2 |
| 6|2-2 |2 |
| 7|2-1-1 |5 |
and my desired result is:
|ID|Name |ParentID|Level
| 0| |-1 |0
| 1|1 |0 |1
| 3|1-1 |1 |2
| 4|1-2 |1 |2
| 2|2 |0 |1
| 5|2-1 |2 |2
| 7|2-1-1 |5 |3
| 6|2-2 |2 |2
Another sample:
|ID|Name |ParentID|
| 0| |-1 |
| 1|Book |0 |
| 2|App |0 |
| 3|C# |1 |
| 4|VB.NET |1 |
| 5|Office |2 |
| 6|PhotoShop |2 |
| 7|Word |5 |
and my desired result is:
|ID|Name |ParentID|Level
| 0| |-1 |0
| 1|Book |0 |1
| 3|C# |1 |2
| 4|VB.NET |1 |2
| 2|App |0 |1
| 5|Office |2 |2
| 7|Word |5 |3
| 6|PhotoShop |2 |2
The hierarchyid datatype is able to represent hierarchical data, and already has the desired sorting order. If you can't replace your ParentID column, then you can convert to it on the fly:
(Most of this script is data setup, the actual answer is quite small)
declare @t table (ID int not null, Name varchar(10) not null, ParentID int not null)
insert into @t(ID,Name,ParentID)
select 0,'' ,-1 union all
select 1,'Book' ,0 union all
select 2,'App' ,0 union all
select 3,'C#' ,1 union all
select 4,'VB.NET' ,1 union all
select 5,'Office' ,2 union all
select 6,'PhotoShop' ,2 union all
select 7,'Word' ,5
;With Sensible as (
select ID,Name,NULLIF(ParentID,-1) as ParentID
from @t
), Paths as (
select ID,CONVERT(hierarchyid,'/' + CONVERT(varchar(10),ID) + '/') as Pth
from Sensible where ParentID is null
union all
select s.ID,CONVERT(hierarchyid,p.Pth.ToString() + CONVERT(varchar(10),s.ID) + '/')
from Sensible s inner join Paths p on s.ParentID = p.ID
)
select
*
from
Sensible s
inner join
Paths p
on
s.ID = p.ID
order by p.Pth
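If you also want the Level column from the desired output, hierarchyid can derive it directly: replace the final SELECT above with something like the following (GetLevel() counts the root '/0/' as depth 1, hence the - 1):

select s.ID, s.Name, s.ParentID, p.Pth.GetLevel() - 1 as Level
from Sensible s
inner join Paths p on s.ID = p.ID
order by p.Pth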
ORDER BY Name happens to work for your first sample, because there each Name encodes its own path (1, 1-1, 2-1-1, ...); for the second sample it would sort alphabetically (App before Book), not hierarchically (a path-based fix is sketched after the result below):
WITH CTE
AS(
SELECT parent.*, 0 AS Level
FROM @table parent
WHERE parent.ID = 0
UNION ALL
SELECT parent.*, Level+1
FROM @table parent
INNER JOIN CTE prev ON parent.ParentID = prev.ID
)
SELECT * FROM CTE
ORDER BY Name
Here's your sample data (add it yourself next time):
declare @table table(ID int, Name varchar(10), ParentID int);
insert into @table values(0,'',-1);
insert into @table values(1,'1',0);
insert into @table values(2,'2',0);
insert into @table values(3,'1-1',1);
insert into @table values(4,'1-2',1);
insert into @table values(5,'2-1',2);
insert into @table values(6,'2-2',2);
insert into @table values(7,'2-1-1',5);
Result:
|ID|Name |ParentID|Level|
|--|-----|--------|-----|
|0 |     |-1      |0    |
|1 |1    |0       |1    |
|3 |1-1  |1       |2    |
|4 |1-2  |1       |2    |
|2 |2    |0       |1    |
|5 |2-1  |2       |2    |
|7 |2-1-1|5       |3    |
|6 |2-2  |2       |2    |
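If Name does not encode the path (as in the second sample), you can build an explicit sort key in the CTE instead; a sketch using a zero-padded ID path, under the same @table setup (this mirrors what the hierarchyid answer does):

WITH CTE AS (
    SELECT parent.*, 0 AS Level,
           CAST(RIGHT('0000000000' + CAST(parent.ID AS varchar(10)), 10) AS varchar(4000)) AS SortPath
    FROM @table parent
    WHERE parent.ID = 0
    UNION ALL
    -- append each child's zero-padded ID so string order matches numeric order
    SELECT parent.*, prev.Level + 1,
           CAST(prev.SortPath + '/' + RIGHT('0000000000' + CAST(parent.ID AS varchar(10)), 10) AS varchar(4000))
    FROM @table parent
    INNER JOIN CTE prev ON parent.ParentID = prev.ID
)
SELECT ID, Name, ParentID, Level
FROM CTE
ORDER BY SortPath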