I'm looking for a way to pivot an input dataset with the structure below in Hive or PySpark. The input contains more than half a billion records, and for each emp_id there can be up to 8 rows and 5 columns, so I will end up with 40 columns. I did refer to this link, but there the pivoted output column already exists in the dataset, whereas in mine it does not. I also tried this link, but the SQL becomes very large (not that it matters). Is there a better way to do this, where the names of the resulting pivoted columns are concatenated with the rank?
input
emp_id, dept_id, dept_name, rank
1001, 101, sales, 1
1001, 102, marketing, 2
1002, 101, sales, 1
1002, 102, marketing, 2
expected output
emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001, 101, sales, 102, marketing
1002, 101, sales, 102, marketing
You can apply several aggregations after pivoting and give each one an alias; the output columns are then named <rank>_<alias>, like so:
import pyspark.sql.functions as F
(df
.groupBy('emp_id')
.pivot('rank')
.agg(
F.first('dept_id').alias('dept_id'),
F.first('dept_name').alias('dept_name')
)
.show()
)
# Output
# +------+---------+-----------+---------+-----------+
# |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
# +------+---------+-----------+---------+-----------+
# | 1002| 101| sales| 102| marketing|
# | 1001| 101| sales| 102| marketing|
# +------+---------+-----------+---------+-----------+
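The output above puts the rank in front (1_dept_id), while the question asks for it at the end (dept_id_1). A minimal sketch of the renaming step, in plain Python so it can be checked without a cluster; the helper name rank_suffix_names is made up for illustration, and it assumes the key column is emp_id and that rank values contain no underscore:

```python
def rank_suffix_names(columns, key="emp_id"):
    """Turn '<rank>_<field>' column names into '<field>_<rank>'.

    Hypothetical helper: the key column is left untouched, and the
    rank is assumed to be everything before the first underscore.
    """
    renamed = []
    for name in columns:
        if name == key:
            renamed.append(name)
            continue
        rank, _, field = name.partition("_")
        renamed.append(f"{field}_{rank}")
    return renamed


# Column names as produced by the pivot shown above
pivoted_columns = ["emp_id", "1_dept_id", "1_dept_name", "2_dept_id", "2_dept_name"]
new_columns = rank_suffix_names(pivoted_columns)
print(new_columns)
```

With a real DataFrame, the new names could then be applied with pivoted_df.toDF(*new_columns), where pivoted_df stands for the result of the pivot. If the rank values are known up front, passing them explicitly, e.g. .pivot('rank', list(range(1, 9))), spares Spark a separate pass to discover the distinct values, which may matter at this data size.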
I have a query which gives a result:
id | manager_id | level | star_level
----+------------+-------+------------
1 | NULL | 1 | 0
2 | 1 | 2 | 1
3 | 2 | 3 | 1
4 | 3 | 4 | 2
5 | 4 | 5 | 2
6 | 5 | 6 | 2
7 | 6 | 7 | 3
8 | 7 | 8 | 3
9 | 8 | 9 | 4
(9 rows)
Here is the query:
WITH RECURSIVE parents AS (
SELECT e.id
, e.manager_id
, 1 AS level
, CAST(s.is_star AS INTEGER) AS star_level
FROM employees AS e
INNER JOIN skills AS s
ON e.skill_id = s.id
WHERE manager_id IS NULL
UNION ALL
SELECT e.id
, e.manager_id
, p.level + 1 AS level
, p.star_level + CAST(s.is_star AS INTEGER) AS star_level
FROM employees AS e
INNER JOIN skills AS s
ON e.skill_id = s.id
INNER JOIN parents AS p
ON e.manager_id = p.id
WHERE e.manager_id = p.id
)
SELECT *
FROM parents
;
Can you please tell me how to change the query so that, in the same query, the level and star_level values are written to the corresponding columns of the Employees table?
Demo data:
create table Employees(
id INT,
name VARCHAR,
manager_id INT,
skill_id INT,
level INT,
star_level INT
);
create table Skills(
id INT,
name VARCHAR,
is_star BOOL
);
INSERT INTO Employees
(id, name, manager_id, skill_id)
VALUES
(1, 'Employee 1', NULL, 1),
(2, 'Employee 2', 1, 2),
(3, 'Employee 3', 2, 3),
(4, 'Employee 4', 3, 4),
(5, 'Employee 5', 4, 5),
(6, 'Employee 6', 5, 1),
(7, 'Employee 7', 6, 2),
(8, 'Employee 8', 7, 3),
(9, 'Employee 9', 8, 4)
;
INSERT INTO Skills
(id, name, is_star)
VALUES
(1, 'Skill 1', FALSE),
(2, 'Skill 2', TRUE),
(3, 'Skill 3', FALSE),
(4, 'Skill 4', TRUE),
(5, 'Skill 5', FALSE)
;
As a result, I need a single query that computes level and star_level for every row of the Employees table and writes those values back into the table.
You can use an UPDATE statement together with your CTE:
with recursive parents as (
... your original query goes here ...
)
update employees
set level = p.level,
star_level = p.star_level
from parents p
where employees.id = p.id;
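To check the values the UPDATE should write, here is a rough plain-Python equivalent of what the recursive CTE computes on the demo data. It is only a sketch for verification, not a replacement for the SQL; the function name compute_levels and the dictionary layout are assumptions made for this example.

```python
from collections import defaultdict, deque

# Demo data from the question: id -> (manager_id, skill_id)
employees = {
    1: (None, 1), 2: (1, 2), 3: (2, 3), 4: (3, 4), 5: (4, 5),
    6: (5, 1), 7: (6, 2), 8: (7, 3), 9: (8, 4),
}
# skill id -> is_star
skills = {1: False, 2: True, 3: False, 4: True, 5: False}


def compute_levels(employees, skills):
    """Return {id: (level, star_level)} by walking down from the roots."""
    children = defaultdict(list)
    for emp_id, (manager_id, _) in employees.items():
        children[manager_id].append(emp_id)

    result = {}
    # Anchor part of the CTE: employees without a manager start at level 1
    queue = deque(
        (emp_id, 1, int(skills[employees[emp_id][1]]))
        for emp_id in children[None]
    )
    # Recursive part: each child adds 1 to the level and its own star flag
    while queue:
        emp_id, level, star_level = queue.popleft()
        result[emp_id] = (level, star_level)
        for child in children[emp_id]:
            child_star = int(skills[employees[child][1]])
            queue.append((child, level + 1, star_level + child_star))
    return result


levels = compute_levels(employees, skills)
for emp_id in sorted(levels):
    print(emp_id, *levels[emp_id])
```

The printed values should line up with the level and star_level columns of the query result shown in the question.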
Each element of my PySpark RDD is a list of four items:
[id1, 'aaa',12,87]
[id2, 'acx',1,90]
[id3, 'bbb',77,10]
[id2, 'bbb',77,10]
.....
I want to group by the id in the first column and collect the other three columns for each id, for example: [id2, [['acx',1,90], ['bbb',77,10], ...]]
How can I achieve this?
spark.version
# u'2.2.0'
rdd = sc.parallelize((['id1', 'aaa',12,87],
['id2', 'acx',1,90],
['id3', 'bbb',77,10],
['id2', 'bbb',77,10]))
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).collect()
# result:
[('id2', [['acx', 1, 90], ['bbb', 77, 10]]),
('id3', [['bbb', 77, 10]]),
('id1', [['aaa', 12, 87]])]
Or, if you strictly need lists rather than tuples, you can add one more map operation after mapValues:
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).map(lambda x: list(x)).collect()
# result:
[['id2', [['acx', 1, 90], ['bbb', 77, 10]]],
['id3', [['bbb', 77, 10]]],
['id1', [['aaa', 12, 87]]]]
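For a quick sanity check of the expected shape without a Spark session, the same grouping can be sketched in plain Python. This is only a local illustration of what the map + groupByKey pipeline produces, not something to run on the real data; group_by_first is a made-up helper name.

```python
from collections import defaultdict

rows = [
    ['id1', 'aaa', 12, 87],
    ['id2', 'acx', 1, 90],
    ['id3', 'bbb', 77, 10],
    ['id2', 'bbb', 77, 10],
]


def group_by_first(rows):
    """Group rows by their first item, keeping the remaining items as lists."""
    grouped = defaultdict(list)
    for key, *rest in rows:
        grouped[key].append(rest)
    # Keys come out in first-seen order here; Spark gives no ordering guarantee
    return [[key, values] for key, values in grouped.items()]


grouped = group_by_first(rows)
print(grouped)
```

Unlike the RDD version, the order of keys here follows the input, so compare the results as sets of groups rather than position by position.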
Thank you in advance for any assistance you can provide. I am looking to get the date difference between events that are stored in the same column. Referring to the sample data, I am looking for the difference between each "Partial Submission" event and its respective subsequent "Junior Reviewed" event.
Referring to the same data again, I need the DATEDIFF from:
1st "Partial Submission" to 1st "Junior Reviewed"
2nd "Partial Submission" to 2nd "Junior Reviewed"
6th "Partial Submission" to 3rd "Junior Reviewed"
I am not sure where to start; all I have done so far is add the row numbers, which are partitioned by "Descrip" and ordered by "Date" ascending. Any guidance or method of accomplishing this (a recursive CTE?) would be greatly appreciated.
Start End - 2 records
DECLARE @Tbl TABLE (RowNumber INT, RecordNumber INT, IDX INT, DESCRIP NVARCHAR(50), DATES DATETIME, EVENTNUM INT)
INSERT INTO @Tbl
VALUES
(1, 11515, 13, 'Partial Submission', '8/12/16 00:21', 3078),
(1, 11515, 14, 'Junior Reviewed', '8/12/16 15:52', 3089),
(2, 11515, 26, 'Partial Submission', '8/18/16 15:24', 3078),
(3, 11515, 33, 'Partial Submission', '9/6/16 9:47', 3078),
(4, 11515, 34, 'Partial Submission', '9/6/16 9:47', 3078),
(5, 11515, 39, 'Partial Submission', '9/9/16 13:19', 3078),
(2, 11515, 40, 'Junior Reviewed', '9/11/16 8:30', 3089),
(6, 11515, 46, 'Partial Submission', '9/15/16 12:30', 3078),
(3, 11515, 54, 'Junior Reviewed', '9/17/16 10:01', 3089),
(7, 11515, 57, 'Full! Submission', '9/19/16 9:16', 3079),
(1, 11520, 19, 'Partial Submission', '8/20/16 00:42', 3078),
(1, 11520, 22, 'Junior Reviewed', '8/22/16 9:06', 3089),
(2, 11520, 28, 'Partial Submission', '8/29/16 20:12', 3078),
(2, 11520, 34, 'Junior Reviewed', '9/1/16 8:20', 3089),
(3, 11520, 38, 'Partial Submission', '9/8/16 15:03', 3078),
(4, 11520, 39, 'Partial Submission', '9/8/16 15:03', 3078),
(3, 11520, 47, 'Junior Reviewed', '9/14/16 13:53', 3089),
(5, 11520, 48, 'Full! Submission', '9/16/16 13:19', 3079),
(4, 11520, 52, 'Junior Reviewed', '9/17/16 10:51', 3089),
(6, 11520, 53, 'Full! Submission', '9/19/16 16:21', 3079)
;WITH CTE
AS
(
    -- Number every row per record, and separately per record and event type
    SELECT
        *,
        RowId = ROW_NUMBER() OVER (PARTITION BY RecordNumber ORDER BY RecordNumber, IDX),
        RowIdByDescrip = ROW_NUMBER() OVER (PARTITION BY RecordNumber, DESCRIP ORDER BY RecordNumber, IDX)
    FROM @Tbl
)
, Test AS
(
    SELECT
        A.RecordNumber,
        A.DESCRIP,
        A.EVENTNUM,
        A.IDX,
        A.DATES StartDate,
        LEAD(A.DATES) OVER (PARTITION BY A.RecordNumber ORDER BY A.IDX) EndDate,
        DATEDIFF(HOUR, A.DATES, LEAD(A.DATES) OVER (PARTITION BY A.RecordNumber ORDER BY A.IDX)) AS DateDifff
    FROM @Tbl A INNER JOIN
    (
        -- First row of each run of consecutive rows with the same DESCRIP
        SELECT
            C.RecordNumber,
            MIN(C.IDX) AS IDX
        FROM
            CTE C
        GROUP BY
            C.RowId - C.RowIdByDescrip,
            C.DESCRIP,
            C.RecordNumber
    ) B ON A.IDX = B.IDX AND A.RecordNumber = B.RecordNumber
)
SELECT
    *
FROM Test
WHERE EVENTNUM = 3078
ORDER BY RecordNumber, IDX
Try the following:
DECLARE @Tbl TABLE (RowNumber INT, RecordNumber INT, IDX INT, DESCRIP NVARCHAR(50), DATES DATETIME, EVENTNUM INT)
INSERT INTO @Tbl
VALUES
(1, 11515, 13, 'Partial Submission', '8/12/16 00:21', 3078),
(1, 11515, 14, 'Junior Reviewed', '8/12/16 15:52', 3089),
(2, 11515, 26, 'Partial Submission', '8/18/16 15:24', 3078),
(3, 11515, 33, 'Partial Submission', '9/6/16 9:47', 3078),
(4, 11515, 34, 'Partial Submission', '9/6/16 9:47', 3078),
(5, 11515, 39, 'Partial Submission', '9/9/16 13:19', 3078),
(2, 11515, 40, 'Junior Reviewed', '9/11/16 8:30', 3089),
(6, 11515, 46, 'Partial Submission', '9/15/16 12:30', 3078),
(3, 11515, 54, 'Junior Reviewed', '9/17/16 10:01', 3089),
(7, 11515, 57, 'Full! Submission', '9/19/16 9:16', 3079),
(1, 11520, 19, 'Partial Submission', '8/20/16 00:42', 3078),
(1, 11520, 22, 'Junior Reviewed', '8/22/16 9:06', 3089),
(2, 11520, 28, 'Partial Submission', '8/29/16 20:12', 3078),
(2, 11520, 34, 'Junior Reviewed', '9/1/16 8:20', 3089),
(3, 11520, 38, 'Partial Submission', '9/8/16 15:03', 3078),
(4, 11520, 39, 'Partial Submission', '9/8/16 15:03', 3078),
(3, 11520, 47, 'Junior Reviewed', '9/14/16 13:53', 3089),
(5, 11520, 48, 'Full! Submission', '9/16/16 13:19', 3079),
(4, 11520, 52, 'Junior Reviewed', '9/17/16 10:51', 3089),
(6, 11520, 53, 'Full! Submission', '9/19/16 16:21', 3079)
;WITH CTE
AS
(
    -- Keep only the two events of interest and number the rows by IDX
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY IDX) RowId,
        ROW_NUMBER() OVER (PARTITION BY DESCRIP ORDER BY IDX) RowIdByDescrip
    FROM @Tbl
    WHERE
        EVENTNUM IN
        (
            3078, -- Partial Submission
            3089  -- Junior Reviewed
        )
), CTE2
AS
(
    -- First row of each run of consecutive rows with the same DESCRIP
    SELECT
        MIN(C.IDX) AS IDX
    FROM
        CTE C
    GROUP BY
        C.RowId - C.RowIdByDescrip,
        C.DESCRIP
)
SELECT
    R.RecordNumber,
    R.IDX,
    R.StartDate,
    R.EndDate,
    R.DateDifff
FROM
(
    SELECT
        A.EVENTNUM,
        A.RecordNumber,
        A.DESCRIP,
        A.IDX,
        A.DATES StartDate,
        LEAD(A.DATES) OVER (ORDER BY A.IDX) EndDate,
        DATEDIFF(HOUR, A.DATES, LEAD(A.DATES) OVER (ORDER BY A.IDX)) AS DateDifff
    FROM
        @Tbl A INNER JOIN
        CTE2 B ON A.IDX = B.IDX
) R
WHERE
    R.EVENTNUM = 3078 -- Partial Submission
ORDER BY R.RecordNumber
Result:
RecordNumber IDX StartDate EndDate DateDifff
------------ ----------- ---------------- ---------------- -----------
11515 13 2016-08-12 00:21 2016-08-12 15:52 15
11515 26 2016-08-18 15:24 2016-09-06 09:47 450
11515 34 2016-09-06 09:47 2016-09-01 08:20 -121
11515 46 2016-09-15 12:30 2016-09-14 13:53 -23
11520 38 2016-09-08 15:03 2016-09-11 08:30 65
11520 19 2016-08-20 00:42 2016-08-22 09:06 57
This is not a complete answer; it is just too long for a comment, and I may not have understood the question exactly. Here is what I did:
First, I sort the rows by IDX.
I only work with the two events, Partial Submission and Junior Reviewed.
Result table (Partial Submission = Start, Junior Reviewed = End):
RowNumber RecordNumber IDX DESCRIP DATES EVENTNUM RowId RowIdByDescrip
----------- ------------ ----------- -------------------------------------------------- ----------------------- ----------- -------------------- --------------------
1 11515 13 Partial Submission Start 2016-08-12 00:21:00.000 3078 1 1
1 11515 14 Junior Reviewed END 2016-08-12 15:52:00.000 3089 2 1
1 11520 19 Partial Submission Start 2016-08-20 00:42:00.000 3078 3 2
1 11520 22 Junior Reviewed END 2016-08-22 09:06:00.000 3089 4 2
2 11515 26 Partial Submission Start 2016-08-18 15:24:00.000 3078 5 3
2 11520 28 Partial Submission 2016-08-29 20:12:00.000 3078 6 4
3 11515 33 Partial Submission 2016-09-06 09:47:00.000 3078 7 5
4 11515 34 Partial Submission 2016-09-06 09:47:00.000 3078 8 6
2 11520 34 Junior Reviewed End 2016-09-01 08:20:00.000 3089 9 3
3 11520 38 Partial Submission Start 2016-09-08 15:03:00.000 3078 10 7
4 11520 39 Partial Submission 2016-09-08 15:03:00.000 3078 11 8
5 11515 39 Partial Submission 2016-09-09 13:19:00.000 3078 12 9
2 11515 40 Junior Reviewed End 2016-09-11 08:30:00.000 3089 13 4
6 11515 46 Partial Submission Start 2016-09-15 12:30:00.000 3078 14 10
3 11520 47 Junior Reviewed End 2016-09-14 13:53:00.000 3089 15 5
4 11520 52 Junior Reviewed 2016-09-17 10:51:00.000 3089 16 6
3 11515 54 Junior Reviewed 2016-09-17 10:01:00.000 3089 17 7
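The result above mixes the two record numbers, because the rows are ordered by IDX alone. As a cross-check of the pairing the question describes (the first Partial Submission of each run matched to the next Junior Reviewed within the same RecordNumber), here is a small plain-Python sketch over the sample data. The function name pair_events is made up, the rule that other events such as Full! Submission are simply skipped is an assumption, and the difference is real elapsed time, whereas DATEDIFF(HOUR, ...) counts hour boundaries crossed.

```python
from datetime import datetime
from itertools import groupby

# (RecordNumber, IDX, DESCRIP, DATES) from the sample data
events = [
    (11515, 13, "Partial Submission", "8/12/16 00:21"),
    (11515, 14, "Junior Reviewed", "8/12/16 15:52"),
    (11515, 26, "Partial Submission", "8/18/16 15:24"),
    (11515, 33, "Partial Submission", "9/6/16 9:47"),
    (11515, 34, "Partial Submission", "9/6/16 9:47"),
    (11515, 39, "Partial Submission", "9/9/16 13:19"),
    (11515, 40, "Junior Reviewed", "9/11/16 8:30"),
    (11515, 46, "Partial Submission", "9/15/16 12:30"),
    (11515, 54, "Junior Reviewed", "9/17/16 10:01"),
    (11515, 57, "Full! Submission", "9/19/16 9:16"),
    (11520, 19, "Partial Submission", "8/20/16 00:42"),
    (11520, 22, "Junior Reviewed", "8/22/16 9:06"),
    (11520, 28, "Partial Submission", "8/29/16 20:12"),
    (11520, 34, "Junior Reviewed", "9/1/16 8:20"),
    (11520, 38, "Partial Submission", "9/8/16 15:03"),
    (11520, 39, "Partial Submission", "9/8/16 15:03"),
    (11520, 47, "Junior Reviewed", "9/14/16 13:53"),
    (11520, 48, "Full! Submission", "9/16/16 13:19"),
    (11520, 52, "Junior Reviewed", "9/17/16 10:51"),
    (11520, 53, "Full! Submission", "9/19/16 16:21"),
]


def pair_events(rows, start="Partial Submission", end="Junior Reviewed"):
    """Return (record, start_idx, end_idx, elapsed) for each start/end pair."""
    parsed = sorted(
        (rec, idx, descrip, datetime.strptime(date, "%m/%d/%y %H:%M"))
        for rec, idx, descrip, date in rows
    )
    pairs = []
    # Handle each RecordNumber on its own, in IDX order
    for rec, group in groupby(parsed, key=lambda row: row[0]):
        open_start = None  # first unmatched start event, if any
        for _, idx, descrip, when in group:
            if descrip == start and open_start is None:
                open_start = (idx, when)
            elif descrip == end and open_start is not None:
                start_idx, started = open_start
                pairs.append((rec, start_idx, idx, when - started))
                open_start = None
    return pairs


pairs = pair_events(events)
for rec, start_idx, end_idx, elapsed in pairs:
    print(rec, start_idx, end_idx, round(elapsed.total_seconds() / 3600, 2))
```

For record 11515 this pairs IDX 13 with 14, 26 with 40 and 46 with 54, which corresponds to the 1st, 2nd and 6th Partial Submission listed in the question.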