HiveQL - posexplode with more than 2 columns

I have 3 arrays:
[21,31,41], [121,131,141], [1021,1031,1041]
I want to explode them as:
21, 121, 1021
31, 131, 1031
41, 141, 1041
I have written the query like this:
select key1, key2, key3 from
lateral view posexplode(col_name_1) key1 as q1, key1
lateral view posexplode(col_name_2) key2 as q2, key2
lateral view posexplode(col_name_3) key3 as q3, key3
where q1=q2 and q1=q3;
It fails with this exception:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
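The intent of the query can be modelled in plain Python (not Hive) as a rough sketch. It assumes that each posexplode behaves like enumerate() and that stacked lateral views combine like a cross join, so the WHERE clause has to discard every row whose positions differ. The variable names mirror the column names in the query. The sketch says nothing about the cause of the MapRedTask failure, which the error message alone does not reveal.

```python
from itertools import product

col_name_1 = [21, 31, 41]
col_name_2 = [121, 131, 141]
col_name_3 = [1021, 1031, 1041]

# posexplode emits (position, value) pairs, which enumerate() mimics.
exploded_1 = list(enumerate(col_name_1))
exploded_2 = list(enumerate(col_name_2))
exploded_3 = list(enumerate(col_name_3))

# Stacked lateral views combine like a cross join: 3 * 3 * 3 = 27 candidate rows.
candidates = list(product(exploded_1, exploded_2, exploded_3))

# WHERE q1 = q2 AND q1 = q3 keeps only the rows whose positions line up.
aligned = [
    (key1, key2, key3)
    for (q1, key1), (q2, key2), (q3, key3) in candidates
    if q1 == q2 and q1 == q3
]

for row in aligned:
    print(*row, sep=", ")
```

The surviving rows are the same as zipping the three arrays by index, which is the output the question asks for. The number of candidate rows grows with the cube of the array length before the filter is applied.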

Related

Hive/pyspark: pivot non numeric data for huge dataset

I'm looking for a way to pivot an input dataset with the structure below, in Hive or PySpark. The input contains more than half a billion records; for each emp_id there can be up to 8 rows and 5 columns, so I will end up with 40 columns. I did refer to this link, but there the pivoted output column is already in the dataset, whereas in mine it is not. I also tried this link, but the SQL becomes very large (not that it matters). Is there a better way to do this, where the names of the resulting pivoted columns are concatenated with the rank?
input
emp_id, dept_id, dept_name, rank
1001, 101, sales, 1
1001, 102, marketing, 2
1002, 101, sales, 1
1002, 102, marketing, 2
expected output
emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001, 101, sales, 102, marketing
1002, 101, sales, 102, marketing
You can aggregate after pivoting, which also lets you alias the resulting columns, like so:
import pyspark.sql.functions as F
(df
    .groupBy('emp_id')
    .pivot('rank')
    .agg(
        F.first('dept_id').alias('dept_id'),
        F.first('dept_name').alias('dept_name')
    )
    .show()
)
# Output
# +------+---------+-----------+---------+-----------+
# |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
# +------+---------+-----------+---------+-----------+
# |  1002|      101|      sales|      102|  marketing|
# |  1001|      101|      sales|      102|  marketing|
# +------+---------+-----------+---------+-----------+
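The pivoted columns come out as 1_dept_id, 1_dept_name and so on, while the question asks for dept_id_1, dept_name_1. A minimal sketch of the renaming step in plain Python follows. The helper name move_rank_to_suffix is hypothetical, and the column list is copied from the output above rather than read from a real DataFrame.

```python
def move_rank_to_suffix(name):
    """Turn '1_dept_id' into 'dept_id_1'; leave other names untouched."""
    rank, _, field = name.partition("_")
    if rank.isdigit() and field:
        return f"{field}_{rank}"
    return name


# Column names as shown in the pivot output above.
pivoted_columns = ["emp_id", "1_dept_id", "1_dept_name", "2_dept_id", "2_dept_name"]
renamed = [move_rank_to_suffix(c) for c in pivoted_columns]
print(renamed)

# Assumption: with a real pivoted DataFrame (called `pivoted` here, a
# hypothetical name), the list could be applied as pivoted.toDF(*renamed).
```

Keeping the rename as a pure function on names means it can be checked without a Spark session and then applied once to the full 40-column result.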

How to set values from recursive query in PostgreSQL?

I have a query which gives a result:
 id | manager_id | level | star_level
----+------------+-------+------------
  1 |       NULL |     1 |          0
  2 |          1 |     2 |          1
  3 |          2 |     3 |          1
  4 |          3 |     4 |          2
  5 |          4 |     5 |          2
  6 |          5 |     6 |          2
  7 |          6 |     7 |          3
  8 |          7 |     8 |          3
  9 |          8 |     9 |          4
(9 rows)
Here is the query:
WITH RECURSIVE parents AS (
    SELECT e.id
         , e.manager_id
         , 1 AS level
         , CAST(s.is_star AS INTEGER) AS star_level
    FROM employees AS e
    INNER JOIN skills AS s
        ON e.skill_id = s.id
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id
         , e.manager_id
         , p.level + 1 AS level
         , p.star_level + CAST(s.is_star AS INTEGER) AS star_level
    FROM employees AS e
    INNER JOIN skills AS s
        ON e.skill_id = s.id
    INNER JOIN parents AS p
        ON e.manager_id = p.id
    WHERE e.manager_id = p.id
)
SELECT *
FROM parents
;
Can you please tell me how to change the query so that the same query also writes the level and star_level values to the corresponding columns?
Demo data:
create table Employees(
id INT,
name VARCHAR,
manager_id INT,
skill_id INT,
level INT,
star_level INT
);
create table Skills(
id INT,
name VARCHAR,
is_star BOOL
);
INSERT INTO Employees
(id, name, manager_id, skill_id)
VALUES
(1, 'Employee 1', NULL, 1),
(2, 'Employee 2', 1, 2),
(3, 'Employee 3', 2, 3),
(4, 'Employee 4', 3, 4),
(5, 'Employee 5', 4, 5),
(6, 'Employee 6', 5, 1),
(7, 'Employee 7', 6, 2),
(8, 'Employee 8', 7, 3),
(9, 'Employee 9', 8, 4)
;
INSERT INTO Skills
(id, name, is_star)
VALUES
(1, 'Skill 1', FALSE),
(2, 'Skill 2', TRUE),
(3, 'Skill 3', FALSE),
(4, 'Skill 4', TRUE),
(5, 'Skill 5', FALSE)
;
As a result, I need a single query that computes the level and star_level values and writes them to the corresponding columns of the Employees table.
You can use an UPDATE statement together with your CTE:
with recursive parents as (
    ... your original query goes here ...
)
update employees
    set level = p.level,
        star_level = p.star_level
from parents p
where employees.id = p.id;
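To try the CTE-plus-UPDATE idea without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module. It is a stand-in under stated assumptions, not the PostgreSQL statement itself: is_star is stored as 0/1, and the update uses correlated subqueries instead of UPDATE ... FROM, because older SQLite builds may not support that form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT, skill_id INT,
                        level INT, star_level INT);
CREATE TABLE skills (id INT, name TEXT, is_star INT);
INSERT INTO employees (id, name, manager_id, skill_id) VALUES
    (1, 'Employee 1', NULL, 1), (2, 'Employee 2', 1, 2), (3, 'Employee 3', 2, 3),
    (4, 'Employee 4', 3, 4),    (5, 'Employee 5', 4, 5), (6, 'Employee 6', 5, 1),
    (7, 'Employee 7', 6, 2),    (8, 'Employee 8', 7, 3), (9, 'Employee 9', 8, 4);
INSERT INTO skills (id, name, is_star) VALUES
    (1, 'Skill 1', 0), (2, 'Skill 2', 1), (3, 'Skill 3', 0),
    (4, 'Skill 4', 1), (5, 'Skill 5', 0);
""")

# The recursive CTE walks the hierarchy from the root; the UPDATE then
# copies each computed pair of values onto the matching employee row.
conn.execute("""
WITH RECURSIVE parents(id, level, star_level) AS (
    SELECT e.id, 1, s.is_star
    FROM employees AS e
    JOIN skills AS s ON e.skill_id = s.id
    WHERE e.manager_id IS NULL
    UNION ALL
    SELECT e.id, p.level + 1, p.star_level + s.is_star
    FROM employees AS e
    JOIN skills AS s ON e.skill_id = s.id
    JOIN parents AS p ON e.manager_id = p.id
)
UPDATE employees
SET level = (SELECT p.level FROM parents AS p WHERE p.id = employees.id),
    star_level = (SELECT p.star_level FROM parents AS p WHERE p.id = employees.id)
""")

rows = conn.execute(
    "SELECT id, level, star_level FROM employees ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

The stored values match the level and star_level columns of the result table in the question.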

Spark DataFrame: finding employees whose salary is above the average salary of the organization

I am trying to run some test Spark/Scala code to find the employees whose salary is above the average salary, using the test DataFrame below. But it fails during execution:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What is the correct syntax to achieve this?
val dataDF20 = spark.createDataFrame(Seq(
(11, "emp1", 2, 45, 1000.0),
(12, "emp2", 1, 34, 2000.0),
(13, "emp3", 1, 33, 3245.0),
(14, "emp4", 1, 54, 4356.0),
(15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age|    sal|
+-----+----+------+---+-------+
|   15|emp5|     2| 76|56789.0|
+-----+----+------+---+-------+
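The reason for the two steps is that an average is a property of the whole column, so it has to be reduced to a single number before any row can be compared against it. A plain-Python sketch of the same two steps on the question's test data (no Spark involved; the variable names are illustrative):

```python
from statistics import mean

# (empid, name, deptid, age, sal), copied from the question's test data.
employees = [
    (11, "emp1", 2, 45, 1000.0),
    (12, "emp2", 1, 34, 2000.0),
    (13, "emp3", 1, 33, 3245.0),
    (14, "emp4", 1, 54, 4356.0),
    (15, "emp5", 2, 76, 56789.0),
]

# Step 1: reduce the whole salary column to one scalar.
avg_sal = mean(row[4] for row in employees)

# Step 2: filter row by row against that scalar.
above_average = [row for row in employees if row[4] > avg_sal]

print(avg_sal)
print(above_average)
```

With this data the average is 13478.0, so only emp5 passes the filter, which agrees with the Spark output above.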

How to group by one column in rdd in pyspark?

Each element of my PySpark RDD is a list of four items:
[id1, 'aaa',12,87]
[id2, 'acx',1,90]
[id3, 'bbb',77,10]
[id2, 'bbb',77,10]
.....
I want to group by the ids in the first columns, and get the aggregate result of the other three columns: for example => [id2,[['acx',1,90], ['bbb',77,10]...]]
How can I achieve this?
spark.version
# u'2.2.0'
rdd = sc.parallelize((['id1', 'aaa',12,87],
['id2', 'acx',1,90],
['id3', 'bbb',77,10],
['id2', 'bbb',77,10]))
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).collect()
# result:
[('id2', [['acx', 1, 90], ['bbb', 77, 10]]),
('id3', [['bbb', 77, 10]]),
('id1', [['aaa', 12, 87]])]
or, if you prefer lists strictly, you can add one more map operation after mapValues:
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).map(lambda x: list(x)).collect()
# result:
[['id2', [['acx', 1, 90], ['bbb', 77, 10]]],
['id3', [['bbb', 77, 10]]],
['id1', [['aaa', 12, 87]]]]
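For readers who want to see what the map / groupByKey / mapValues chain computes, here is a plain-Python sketch of the same grouping with a dictionary. It runs on an ordinary list, not an RDD, and unlike Spark it keeps the ids in first-seen order.

```python
from collections import defaultdict

rows = [
    ['id1', 'aaa', 12, 87],
    ['id2', 'acx', 1, 90],
    ['id3', 'bbb', 77, 10],
    ['id2', 'bbb', 77, 10],
]

grouped = defaultdict(list)
for key, *rest in rows:
    # Same split as lambda x: (x[0], x[1:]) in the RDD version.
    grouped[key].append(rest)

result = [[key, values] for key, values in grouped.items()]
print(result)
```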

DateDiff with multiple events Same Column SQL Server

Thank you in advance for any assistance you can provide. I am looking to get the date difference between events that are stored in the same column. Referring to the sample data, I am looking for the difference between each "Partial Submission" and its respective subsequent "Junior Reviewed" event.
Referring to the same data again, I need the DATEDIFF from:
1st "Partial Submission" to 1st "Junior Reviewed"
2nd "Partial Submission" to 2nd "Junior Reviewed"
6th "Partial Submission" to 3rd "Junior Reviewed"
I am not sure where to start; all I have done is add the row numbers, which are partitioned by "Descrip" and ordered by "Date" ascending. Any guidance or method of accomplishing this (recursive CTE?) would be greatly appreciated.
Start End - 2 records
DECLARE @Tbl TABLE (RowNumber INT, RecordNumber INT, IDX INT, DESCRIP NVARCHAR(50), DATES DATETIME, EVENTNUM INT)
INSERT INTO @Tbl
VALUES
(1, 11515, 13, 'Partial Submission', '8/12/16 00:21', 3078),
(1, 11515, 14, 'Junior Reviewed', '8/12/16 15:52', 3089),
(2, 11515, 26, 'Partial Submission', '8/18/16 15:24', 3078),
(3, 11515, 33, 'Partial Submission', '9/6/16 9:47', 3078),
(4, 11515, 34, 'Partial Submission', '9/6/16 9:47', 3078),
(5, 11515, 39, 'Partial Submission', '9/9/16 13:19', 3078),
(2, 11515, 40, 'Junior Reviewed', '9/11/16 8:30', 3089),
(6, 11515, 46, 'Partial Submission', '9/15/16 12:30', 3078),
(3, 11515, 54, 'Junior Reviewed', '9/17/16 10:01', 3089),
(7, 11515, 57, 'Full! Submission', '9/19/16 9:16', 3079),
(1, 11520, 19, 'Partial Submission', '8/20/16 00:42', 3078),
(1, 11520, 22, 'Junior Reviewed', '8/22/16 9:06', 3089),
(2, 11520, 28, 'Partial Submission', '8/29/16 20:12', 3078),
(2, 11520, 34, 'Junior Reviewed', '9/1/16 8:20', 3089),
(3, 11520, 38, 'Partial Submission', '9/8/16 15:03', 3078),
(4, 11520, 39, 'Partial Submission', '9/8/16 15:03', 3078),
(3, 11520, 47, 'Junior Reviewed', '9/14/16 13:53', 3089),
(5, 11520, 48, 'Full! Submission', '9/16/16 13:19', 3079),
(4, 11520, 52, 'Junior Reviewed', '9/17/16 10:51', 3089),
(6, 11520, 53, 'Full! Submission', '9/19/16 16:21', 3079)
;WITH CTE
AS
(
    SELECT
        *,
        RowId = ROW_NUMBER() OVER (PARTITION BY RecordNumber ORDER BY RecordNumber, IDX),
        RowIdByDescrip = ROW_NUMBER() OVER (PARTITION BY RecordNumber, DESCRIP ORDER BY RecordNumber, IDX)
    FROM @Tbl
)
, Test AS
(
    SELECT
        A.RecordNumber,
        A.DESCRIP,
        A.EVENTNUM,
        A.IDX,
        A.DATES StartDate,
        LEAD(A.DATES) OVER (PARTITION BY A.RecordNumber ORDER BY A.IDX) EndDate,
        DATEDIFF(HOUR, A.DATES, LEAD(A.DATES) OVER (PARTITION BY A.RecordNumber ORDER BY A.IDX)) AS DateDifff
    FROM @Tbl A INNER JOIN
    (
        SELECT
            C.RecordNumber,
            MIN(C.IDX) AS IDX
        FROM
            CTE C
        GROUP BY
            C.RowId - C.RowIdByDescrip,
            C.DESCRIP,
            C.RecordNumber
    ) B ON A.IDX = B.IDX AND A.RecordNumber = B.RecordNumber
)
SELECT
    *
FROM Test
WHERE EVENTNUM IN ('3078')
ORDER BY RecordNumber, IDX
Try the following:
DECLARE @Tbl TABLE (RowNumber INT, RecordNumber INT, IDX INT, DESCRIP NVARCHAR(50), DATES DATETIME, EVENTNUM INT)
INSERT INTO @Tbl
VALUES
(1, 11515, 13, 'Partial Submission', '8/12/16 00:21', 3078),
(1, 11515, 14, 'Junior Reviewed', '8/12/16 15:52', 3089),
(2, 11515, 26, 'Partial Submission', '8/18/16 15:24', 3078),
(3, 11515, 33, 'Partial Submission', '9/6/16 9:47', 3078),
(4, 11515, 34, 'Partial Submission', '9/6/16 9:47', 3078),
(5, 11515, 39, 'Partial Submission', '9/9/16 13:19', 3078),
(2, 11515, 40, 'Junior Reviewed', '9/11/16 8:30', 3089),
(6, 11515, 46, 'Partial Submission', '9/15/16 12:30', 3078),
(3, 11515, 54, 'Junior Reviewed', '9/17/16 10:01', 3089),
(7, 11515, 57, 'Full! Submission', '9/19/16 9:16', 3079),
(1, 11520, 19, 'Partial Submission', '8/20/16 00:42', 3078),
(1, 11520, 22, 'Junior Reviewed', '8/22/16 9:06', 3089),
(2, 11520, 28, 'Partial Submission', '8/29/16 20:12', 3078),
(2, 11520, 34, 'Junior Reviewed', '9/1/16 8:20', 3089),
(3, 11520, 38, 'Partial Submission', '9/8/16 15:03', 3078),
(4, 11520, 39, 'Partial Submission', '9/8/16 15:03', 3078),
(3, 11520, 47, 'Junior Reviewed', '9/14/16 13:53', 3089),
(5, 11520, 48, 'Full! Submission', '9/16/16 13:19', 3079),
(4, 11520, 52, 'Junior Reviewed', '9/17/16 10:51', 3089),
(6, 11520, 53, 'Full! Submission', '9/19/16 16:21', 3079)
;WITH CTE
AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY IDX) RowId,
        ROW_NUMBER() OVER (PARTITION BY DESCRIP ORDER BY IDX) RowIdByDescrip
    FROM @Tbl
    WHERE
        EVENTNUM IN
        (
            3078, -- Partial Submission
            3089  -- Junior Reviewed
        )
), CTE2
AS
(
    SELECT
        MIN(C.IDX) AS IDX
    FROM
        CTE C
    GROUP BY
        C.RowId - C.RowIdByDescrip,
        C.DESCRIP
)
SELECT
    R.RecordNumber,
    R.IDX,
    R.StartDate,
    R.EndDate,
    R.DateDifff
FROM
(
    SELECT
        A.EVENTNUM,
        A.RecordNumber,
        A.DESCRIP,
        A.IDX,
        A.DATES StartDate,
        LEAD(A.DATES) OVER (ORDER BY A.IDX) EndDate,
        DATEDIFF(HOUR, A.DATES, LEAD(A.DATES) OVER (ORDER BY A.IDX)) AS DateDifff
    FROM
        @Tbl A INNER JOIN
        CTE2 B ON A.IDX = B.IDX
) R
WHERE
    R.EVENTNUM = 3078 -- Partial Submission
ORDER BY R.RecordNumber
This is not really an answer; it is just too long for a comment.
I may not have understood the question exactly, so let me explain what I did.
First, I sort by IDX.
I only work with two events, Partial Submission and Junior Reviewed.
Intermediate table (Partial Submission marks a Start, Junior Reviewed marks an End):
RowNumber RecordNumber IDX DESCRIP                  DATES                   EVENTNUM RowId RowIdByDescrip
--------- ------------ --- ------------------------ ----------------------- -------- ----- --------------
1         11515        13  Partial Submission Start 2016-08-12 00:21:00.000 3078     1     1
1         11515        14  Junior Reviewed END      2016-08-12 15:52:00.000 3089     2     1
1         11520        19  Partial Submission Start 2016-08-20 00:42:00.000 3078     3     2
1         11520        22  Junior Reviewed END      2016-08-22 09:06:00.000 3089     4     2
2         11515        26  Partial Submission Start 2016-08-18 15:24:00.000 3078     5     3
2         11520        28  Partial Submission       2016-08-29 20:12:00.000 3078     6     4
3         11515        33  Partial Submission       2016-09-06 09:47:00.000 3078     7     5
4         11515        34  Partial Submission       2016-09-06 09:47:00.000 3078     8     6
2         11520        34  Junior Reviewed End      2016-09-01 08:20:00.000 3089     9     3
3         11520        38  Partial Submission Start 2016-09-08 15:03:00.000 3078     10    7
4         11520        39  Partial Submission       2016-09-08 15:03:00.000 3078     11    8
5         11515        39  Partial Submission       2016-09-09 13:19:00.000 3078     12    9
2         11515        40  Junior Reviewed End      2016-09-11 08:30:00.000 3089     13    4
6         11515        46  Partial Submission Start 2016-09-15 12:30:00.000 3078     14    10
3         11520        47  Junior Reviewed End      2016-09-14 13:53:00.000 3089     15    5
4         11520        52  Junior Reviewed          2016-09-17 10:51:00.000 3089     16    6
3         11515        54  Junior Reviewed          2016-09-17 10:01:00.000 3089     17    7
Result:
RecordNumber IDX StartDate        EndDate          DateDifff
------------ --- ---------------- ---------------- ---------
11515        13  2016-08-12 00:21 2016-08-12 15:52 15
11515        26  2016-08-18 15:24 2016-09-06 09:47 450
11515        34  2016-09-06 09:47 2016-09-01 08:20 -121
11515        46  2016-09-15 12:30 2016-09-14 13:53 -23
11520        38  2016-09-08 15:03 2016-09-11 08:30 65
11520        19  2016-08-20 00:42 2016-08-22 09:06 57
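The pairing the question describes can be prototyped outside SQL Server to pin down the expected rows. The plain-Python sketch below assumes one reading of the rule: within each RecordNumber, ordered by IDX, the first "Partial Submission" after the previous review opens an interval, and the next "Junior Reviewed" closes it. The function name pair_submissions is hypothetical, and the hours are whole elapsed hours, which is not necessarily the same number that DATEDIFF(HOUR, ...) returns.

```python
from datetime import datetime

# (RecordNumber, IDX, DESCRIP, DATES), copied from the sample data.
events = [
    (11515, 13, 'Partial Submission', '8/12/16 00:21'),
    (11515, 14, 'Junior Reviewed', '8/12/16 15:52'),
    (11515, 26, 'Partial Submission', '8/18/16 15:24'),
    (11515, 33, 'Partial Submission', '9/6/16 9:47'),
    (11515, 34, 'Partial Submission', '9/6/16 9:47'),
    (11515, 39, 'Partial Submission', '9/9/16 13:19'),
    (11515, 40, 'Junior Reviewed', '9/11/16 8:30'),
    (11515, 46, 'Partial Submission', '9/15/16 12:30'),
    (11515, 54, 'Junior Reviewed', '9/17/16 10:01'),
    (11515, 57, 'Full! Submission', '9/19/16 9:16'),
    (11520, 19, 'Partial Submission', '8/20/16 00:42'),
    (11520, 22, 'Junior Reviewed', '8/22/16 9:06'),
    (11520, 28, 'Partial Submission', '8/29/16 20:12'),
    (11520, 34, 'Junior Reviewed', '9/1/16 8:20'),
    (11520, 38, 'Partial Submission', '9/8/16 15:03'),
    (11520, 39, 'Partial Submission', '9/8/16 15:03'),
    (11520, 47, 'Junior Reviewed', '9/14/16 13:53'),
    (11520, 48, 'Full! Submission', '9/16/16 13:19'),
    (11520, 52, 'Junior Reviewed', '9/17/16 10:51'),
    (11520, 53, 'Full! Submission', '9/19/16 16:21'),
]


def pair_submissions(rows):
    """Return (RecordNumber, start IDX, end IDX, whole elapsed hours) tuples."""
    pairs = []
    open_start = {}  # RecordNumber -> (IDX, timestamp) of the pending submission
    for record, idx, descrip, stamp in sorted(rows, key=lambda r: (r[0], r[1])):
        when = datetime.strptime(stamp, '%m/%d/%y %H:%M')
        if descrip == 'Partial Submission':
            # Only the first submission since the last review opens an interval.
            open_start.setdefault(record, (idx, when))
        elif descrip == 'Junior Reviewed' and record in open_start:
            start_idx, start = open_start.pop(record)
            hours = int((when - start).total_seconds() // 3600)
            pairs.append((record, start_idx, idx, hours))
    return pairs


pairs = pair_submissions(events)
for pair in pairs:
    print(pair)
```

Under this reading, record 11515 pairs IDX 13 with 14, 26 with 40 and 46 with 54, which corresponds to the 1st, 2nd and 6th submissions named in the question. Every end date falls after its start date because events are never matched across RecordNumbers.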