Spark scala window count max

I have the following df:
+------+-----+--------+
|result|state|clubName|
+------+-----+--------+
|   win|  XYZ|   club1|
|   win|  XYZ|   club2|
|   win|  XYZ|   club1|
|   win|  PQR|   club3|
+------+-----+--------+
I need, for each state, the clubName with the maximum number of wins.
val byState = Window.partitionBy("state").orderBy('state)
I tried creating a window, but it does not help.
Expected result: something like this in SQL,
select temp.res
from (select count(result) as res
      from table
      group by clubName) temp
group by state
e.g.
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
|  XYZ|                2|   club1|
+-----+-----------------+--------+

You can get the win count for each club, then assign a rank for each club ordered by wins, and filter those rows with rank = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, when, row_number, desc}
val df2 = df.withColumn(
  "wins",
  count(when(col("result") === "win", 1))
    .over(Window.partitionBy("state", "clubName"))
).withColumn(
  "rn",
  row_number().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")
df2.show
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
| PQR| 1| club3|
| XYZ| 2| club1|
+-----+-----------------+--------+

You can also use the SQL dialect with Spark SQL (find the doc here); register the DataFrame as a temporary view first:
df.createOrReplaceTempView("Table1")
spark.sql("""
SELECT tt.name, tt.state, MAX(tt.nWins) as max_count_of_wins
FROM (
SELECT t1.clubName as name, t1.state as state, COUNT(1) as nWins
FROM Table1 t1
WHERE t1.result = 'win'
GROUP BY state, name
) as tt
GROUP BY tt.state;
""")
where your DataFrame is df and it has been registered as the temporary view Table1.
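Note that Spark SQL will reject the non-aggregated tt.name in the outer GROUP BY tt.state (and even on engines that accept it, such as MySQL on sqlfiddle, the returned name is not guaranteed to belong to the club with the maximum count). A window-function variant of the same idea avoids this; here is a sketch, assuming the same Table1 temporary view:
SELECT state, wins AS max_count_of_wins, clubName
FROM (
  SELECT state, clubName, wins,
         ROW_NUMBER() OVER (PARTITION BY state ORDER BY wins DESC) AS rn
  FROM (
    -- win count per (state, clubName)
    SELECT state, clubName,
           COUNT(CASE WHEN result = 'win' THEN 1 END) AS wins
    FROM Table1
    GROUP BY state, clubName
  ) w
) t
WHERE rn = 1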
P.S. If you want to try it yourself, use the initialization
CREATE TABLE Table1
(`result` varchar(3), `state` varchar(3), `clubName` varchar(5))
;
INSERT INTO Table1
(`result`, `state`, `clubName`)
VALUES
('win', 'XYZ', 'club1'),
('win', 'XYZ', 'club2'),
('win', 'XYZ', 'club1'),
('win', 'PQR', 'club3')
;
on http://sqlfiddle.com.

Related

How to reference a jsonb column value from a value map in postgres

I want to reference the error table for msg and description based on the error id stored in the err_map jsonb column of the result table. I'd also like to relate which error occurred against which column, whether that is one of the independent columns (c1, c2) or a key of the val_map jsonb column (c3, c4).
The only reason val_map stores its keys with a dot, as in "val_map.c3": 3, is so that we can identify that these columns came from val_map when mapping errors to columns.
I have a result table;
here the err_map column's values 1 and 3 reference the error table below:
id | c1 | c2 | val_map | err_map
----------------------------------------------------------------
1 | chk1 | chk2 | {"c3":3, "c4":4} | {"c1": 1, "val_map.c3": 3}
Error Table
id | msg | description
----------------------------------------------------------------
1 | msg1 | an error1 occurred
----------------------------------------------------------------
3 | msg3 | an error3 occurred
I looked at jsonb_each and jsonb_object_keys but can't really figure out how to use them to join these tables. Any help/hints will be appreciated.
Pardon me if something is unclear; please ask and I'll provide more detail.
[Edit 1]: removed foreign key reference as it was misleading
[Edit 2]: I've got it working but it's quite inefficient
select
e.error_key,
e.error_message,
T2.key as key
from result.error e
inner join (
select
substring(T1.key, 11) as key,
T1.value
from (
select em.key, em.value
from result rd, jsonb_each(rd.error_map) as em
) as T1
where T1.key like '%value_map%'
union all
select T1.key , T1.value
from (
select em.key, em.value
from result rd, jsonb_each(rd.error_map) as em
) as T1
where T1.key not like '%value_map%'
) as T2 on T2.value::bigint = e.id;
You can simplify that UNION ALL to just
select
e.error_key,
e.error_message,
T2.key as key
from result.error e
inner join (
select
case when T1.key like 'val_map.%'
then substring(T1.key, 9)
else T1.key
end as key,
T1.value
from result rd, jsonb_each(rd.error_map) as T1
) as T2 on T2.value::bigint = e.id;
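For reference, jsonb_each is what both queries rely on: it expands a jsonb object into one row per key/value pair, which is what makes the join on value::bigint possible. A minimal illustration:
-- jsonb_each returns one (key, value) row per entry of the object
select key, value
from jsonb_each('{"c1": 1, "val_map.c3": 3}'::jsonb);

--     key     | value
-- ------------+-------
--  c1         | 1
--  val_map.c3 | 3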

Postgres group by similarity result

I have some tables, each with Name and Result columns.
I want to join the tables by name similarity, but limit the result coming from each group of similar names to at most 2 rows.
CREATE TABLE data_a(
Id serial primary key,
Name VARCHAR(70) NOT NULL,
Result INT4 NOT NULL);
CREATE TABLE data_b(
Id serial primary key,
Name VARCHAR(70) NOT NULL,
Result INT4 NOT NULL);
INSERT INTO data_a
(Name, Result)
VALUES
('Todd', 2),
('John', 5);
INSERT INTO data_b
(Name, Result)
VALUES
('Johns', 5),
('Todi', 3),
('Tod', 4),
('Todd', 5),
('John', 1),
('Jon', 1),
('Johny', 1),
('Johnny', 1),
('Johni', 1);
I would like to run a query that joins both tables by name similarity and limits the result to at most 2 rows per group:
SELECT da.Name as Name_a,db.Name , similarity(da.Name,da.Name) > 0.5
FROM data_a da
JOIN data_b db
ON da.Name % db.Name
GROUP BY da.Name,db.Name
ORDER BY similarity
LIMIT 2
and receive
|Name_a|Name_b|similarity|
|------|------|----------|
|Todd  |Todd  |1         |
|Todd  |Tod   |0.8       |
|John  |John  |1         |
|John  |Johny |0.76      |
Currently I get:
|Name_a|Name_b|similarity|
|------|------|----------|
|Todd  |Todd  |1         |
|John  |John  |1         |
It seems that I'm not using GROUP BY correctly. How can I group this properly?
If you want to apply the LIMIT separately per da.Name, you would have to do it in a LATERAL subquery:
SELECT da.Name as Name_a,db.Name , similarity(da.Name,db.Name)
FROM data_a da CROSS JOIN LATERAL
(
SELECT db.Name
FROM data_b db
WHERE da.Name % db.Name
ORDER BY similarity(da.Name,db.Name) DESC
LIMIT 2
) db;
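As an aside, the similarity() function and the % operator used here come from the pg_trgm extension, so it has to be enabled; a trigram index is optional but speeds up % matching on larger tables:
-- pg_trgm provides similarity() and the % operator
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- optional: trigram index to accelerate % lookups on data_b.Name
CREATE INDEX data_b_name_trgm_idx ON data_b USING gin (Name gin_trgm_ops);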
You can use a lateral join to map each name in the first table to the matching names from the second table, then truncate the result to 2 values for each name.
with data AS (
select
da.name da_name,
db.name db_name,
similarity(da.name::text, db.name::text) similarity
from
data_a da
left join lateral (
select
data_b.Name
from
data_b
) db ON da.Name like '%' || db.Name || '%'
or db.Name like '%' || da.Name || '%'
)
select
*
from
(
select
row_number() over (
partition by da_name
order by
similarity desc
) r,
t.*
FROM
data t
) x
WHERE
x.r <= 2;
Demo in sql<>daddy.io

Getting earliest date by matching two columns, and returning array

I have a query I'm trying to write, but I cannot get the syntax quite right. In the table below I have a set of dates with an id and a parent_id; if a parent does not exist for an id, parent_id is NULL.
I'm trying to get, for each parent, all of its children that have the same date as the parent. As shown in the expected output below, [D#P, Z#Z] would be assigned to A because they have the same date and their parent_id is A; however, Q#L would not be assigned to A because its date is not 1/1/2019. Nothing is assigned to B or D because they have no children on their created dates.
I've found some posts on how to do this in Postgres, however because I'm using Redshift some of the operations don't work.
Any help would be appreciated.
|date |id |parent_id |
-------------------------
1/1/2019|A |NULL
1/1/2019|B |NULL
1/1/2019|C |NULL
1/1/2019|D#P |A
1/1/2019|Z#Z |A
1/1/2019|K#H |C
1/2/2019|Q#L |A
1/3/2019|D |NULL
1/4/2019|H#Q |C
Expected Output:
date |id |children
-----------------------
1/1/2019 |A |[D#P, Z#Z]
1/1/2019 |C |[K#H]
Current Work:
SELECT
first_value(case
when parent_id
then date
end)
over (
partition by parent_id
order by date
rows between unbounded preceding and unbounded following)
as first_date)
id,
list_agg(parent_id)
FROM foo
I don't know why I am getting an error when using the LISTAGG aggregate function (see the second approach below), so I decided to use SELECT DISTINCT with the LISTAGG window function:
WITH input as (
SELECT '1/1/2019' as date, 'A' as id, NULL as parent_id UNION ALL
SELECT '1/1/2019', 'B', NULL UNION ALL
SELECT '1/1/2019', 'C', NULL UNION ALL
SELECT '1/1/2019', 'D#P', 'A' UNION ALL
SELECT '1/1/2019', 'Z#Z', 'A' UNION ALL
SELECT '1/1/2019', 'K#H', 'C' UNION ALL
SELECT '1/2/2019', 'Q#L', 'A' UNION ALL
SELECT '1/3/2019', 'D', NULL UNION ALL
SELECT '1/4/2019', 'H#Q', 'C'
), parents as (
SELECT *
FROM input
WHERE parent_id IS NULL
), children as (
SELECT *
FROM input
WHERE parent_id IS NOT NULL
)
SELECT DISTINCT
parents.date,
parents.id,
listagg(children.id, ',') WITHIN GROUP ( ORDER BY children.id )OVER (PARTITION BY parents.id, parents.date) as children
FROM parents JOIN children
ON parents.id = children.parent_id
AND parents.date = children.date
Outputs:
date id children
1/1/2019 A D#P,Z#Z
1/1/2019 C K#H
A solution with GROUP BY and the LISTAGG aggregate function would, to me, be the more natural way of solving your problem:
WITH input as (
[...]
SELECT
parents.date,
parents.id,
listagg(children.id, ',') WITHIN GROUP ( ORDER BY children.id )
FROM parents JOIN children
ON parents.id = children.parent_id
AND parents.date = children.date
group by parents.id, parents.date
Sadly it returns an error which I don't really understand:
[XX000][500310] Amazon Invalid operation: One or more of the used functions must be applied on at least one user created tables. Examples of user table only functions are LISTAGG, MEDIAN, PERCENTILE_CONT, etc; java.lang.RuntimeException: com.amazon.support.exceptions.ErrorException: Amazon Invalid operation: One or more of the used functions must be applied on at least one user created tables. Examples of user table only functions are LISTAGG, MEDIAN, PERCENTILE_CONT, etc;
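The error means that LISTAGG (like MEDIAN, PERCENTILE_CONT, etc.) cannot run in a leader-node-only query, which is what a query over a constants-only CTE is. A possible workaround, sketched here on the assumption that materializing the sample rows into a user-created temp table is acceptable:
-- Materialize the sample data in a user-created (temp) table so the
-- LISTAGG aggregate runs on the compute nodes instead of the leader node.
CREATE TEMP TABLE input AS
SELECT '1/1/2019' AS date, 'A' AS id, CAST(NULL AS VARCHAR(8)) AS parent_id UNION ALL
SELECT '1/1/2019', 'C', CAST(NULL AS VARCHAR(8)) UNION ALL
SELECT '1/1/2019', 'D#P', 'A' UNION ALL
SELECT '1/1/2019', 'Z#Z', 'A' UNION ALL
SELECT '1/1/2019', 'K#H', 'C';  -- ...plus the remaining sample rows

SELECT p.date,
       p.id,
       LISTAGG(c.id, ',') WITHIN GROUP (ORDER BY c.id) AS children
FROM input p
JOIN input c
  ON c.parent_id = p.id
 AND c.date = p.date
WHERE p.parent_id IS NULL
GROUP BY p.date, p.id;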

Maintaining order in DB2 "IN" query

This question is based on this one. I'm looking for a solution to that question that works in DB2. Here is the original question:
I have the following table
DROP TABLE IF EXISTS `test`.`foo`;
CREATE TABLE `test`.`foo` (
`id` int(10) unsigned NOT NULL auto_increment,
`name` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Then I try to get records based on the primary key
SELECT * FROM foo f where f.id IN (2, 3, 1);
I then get the following result
+----+--------+
| id | name |
+----+--------+
| 1 | first |
| 2 | second |
| 3 | third |
+----+--------+
3 rows in set (0.00 sec)
As one can see, the result is ordered by id. What I'm trying to achieve is to get the results ordered in the sequence I'm providing in the query. Given this example it should return
+----+--------+
| id | name |
+----+--------+
| 2 | second |
| 3 | third |
| 1 | first |
+----+--------+
3 rows in set (0.00 sec)
You could use a derived table with the IDs you want, and the order you want, and then join the table in, something like...
SELECT ...
FROM mcscb.mcs_premise prem
JOIN mcscb.mcs_serv_deliv_id serv
ON prem.prem_nb = serv.prem_nb
AND prem.tech_col_user_id = serv.tech_col_user_id
AND prem.tech_col_version = serv.tech_col_version
JOIN (
SELECT 1, '9486154876' FROM SYSIBM.SYSDUMMY1 UNION ALL
SELECT 2, '9403149581' FROM SYSIBM.SYSDUMMY1 UNION ALL
SELECT 3, '9465828230' FROM SYSIBM.SYSDUMMY1
) B (ORD, ID)
ON serv.serv_deliv_id = B.ID
WHERE serv.tech_col_user_id = 'CRSSJEFF'
AND serv.tech_col_version = '00'
ORDER BY B.ORD
You can use a derived column to do the custom ordering.
select
case
when serv.SERV_DELIV_ID = '9486154876' then 1
when serv.SERV_DELIV_ID = '9403149581' then 2
else 3
end as custom_order,
...
...
ORDER BY custom_order
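The same custom ordering can also go directly into the ORDER BY clause, without a derived column; for example, against the foo table from the original question (a sketch):
SELECT f.id, f.name
FROM foo f
WHERE f.id IN (2, 3, 1)
ORDER BY CASE f.id
           WHEN 2 THEN 1
           WHEN 3 THEN 2
           ELSE 3
         END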
To make the logic a little bit more evident you might modify the solution provided by bhamby like so:
WITH ordered_in_list (ord, id) as (
VALUES (1, '9486154876'), (2, '9403149581'), (3, '9465828230')
)
SELECT ...
FROM mcscb.mcs_premise prem
JOIN mcscb.mcs_serv_deliv_id serv
ON prem.prem_nb = serv.prem_nb
AND prem.tech_col_user_id = serv.tech_col_user_id
AND prem.tech_col_version = serv.tech_col_version
JOIN ordered_in_list il
ON serv.serv_deliv_id = il.ID
WHERE serv.tech_col_user_id = 'CRSSJEFF'
AND serv.tech_col_version = '00'
ORDER BY il.ORD

HQL select rows with max

I got this table:
+---+----+----+----+
|ID |KEY1|KEY2|COL1|
+---+----+----+----+
|001|aaa1|bbb1|ccc1|
|101|aaa1|bbb1|ddd2|
|002|aaa2|bbb2|eee3|
|102|aaa2|bbb2|fff4|
|003|aaa3|bbb3|ggg5|
|103|aaa3|bbb3|hhh6|
+---+----+----+----+
The result must contain, for each group of rows where the columns KEY1 and KEY2 are equal, only the row with the highest ID.
+---+----+----+----+
|ID |KEY1|KEY2|COL1|
+---+----+----+----+
|101|aaa1|bbb1|ddd2|
|102|aaa2|bbb2|fff4|
|103|aaa3|bbb3|hhh6|
+---+----+----+----+
Since in HQL I can't do a subquery like:
select * from (select....)
How can I perform this query?
**SOLUTION**
Actually, the solution was a little more complex, so I want to share it: KEY1 and KEY2 were on another table, which joins to the first table on two keys.
+-----+-------+-------+-------+
|t1.ID|t2.KEY1|t2.KEY2|t1.COL1|
+-----+-------+-------+-------+
| 001| aaa1| bbb1| ccc1|
| 101| aaa1| bbb1| ddd2|
| 002| aaa2| bbb2| eee3|
| 102| aaa2| bbb2| fff4|
| 003| aaa3| bbb3| ggg5|
| 103| aaa3| bbb3| hhh6|
+-----+-------+-------+-------+
I used this CORRECT query:
SELECT t1.ID, t2.KEY1, t2.KEY2, t1.COL1
FROM yourTable1 t1, yourTable2 t2
WHERE
t1.JoinCol1 = t2.JoinCol1 and t1.JoinCol2=t2.JoinCol2 and
t1.ID = (SELECT MAX(s1.ID) FROM yourTable1 s1, yourTable2 s2
WHERE
s1.JoinCol1 = s2.JoinCol1 and s1.JoinCol2=s2.JoinCol2 and
s2.KEY1 = t2.KEY1 AND s2.KEY2 = t2.KEY2)
If we were writing this query to be run directly on a regular database, such as MySQL or SQL Server, we might be tempted to join to a subquery. However, from what I read here, subqueries in HQL can only appear in the SELECT or WHERE clauses. We can phrase your query as follows, using the WHERE clause to implement your logic.
The query will be:
SELECT t1.ID, t1.KEY1, t1.KEY2, t1.COL1
FROM yourTable t1
WHERE t1.ID = (SELECT MAX(t2.ID) FROM yourTable t2
WHERE t2.KEY1 = t1.KEY1 AND t2.KEY2 = t1.KEY2)
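For comparison, if this were plain SQL on a regular database rather than HQL, the join-to-a-subquery formulation mentioned above would look roughly like this sketch:
SELECT t1.ID, t1.KEY1, t1.KEY2, t1.COL1
FROM yourTable t1
JOIN (SELECT KEY1, KEY2, MAX(ID) AS max_id
      FROM yourTable
      GROUP BY KEY1, KEY2) m
  ON t1.KEY1 = m.KEY1
 AND t1.KEY2 = m.KEY2
 AND t1.ID = m.max_id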