MAX() usage in GROUP BY with non-numeric column - hiveql

I have a table similar to the following
UserId | ActionType
--------------------
1 | Create
2 | Read
1 | Edit
2 | Create
3 | Read
I want to find the "highest" action that a user has done, with the following hierarchy Create > Edit > Read. Running the desired query should return
UserId | ActionType
-------------------
1 | Create
2 | Create
3 | Read
Is there a way to leverage MAX() in HIVE to do this? My structure looks like the following very basic query but I'm unsure how to compute the above ActionType column.
SELECT UserId, ??? FROM UserActions GROUP BY UserId;
I think possible solutions are CASE statements in the GROUP BY or converting the values into numeric values, such as (Read => 0, Edit => 1, Create => 2) and then doing a GROUP BY, but I am hoping there is a more elegant solution.
Thanks!

I don't know whether HiveQL supports subqueries, but this is the idea in SQL: for each user, keep the action that ranks highest in the hierarchy.
SELECT DISTINCT
a.UserId,
a.ActionType
FROM
UserActions a
WHERE
a.ActionType = (
SELECT
c.ActionType
FROM
UserActions c
WHERE
c.UserId = a.UserId
ORDER BY
-- rank the hierarchy Create > Edit > Read
CASE c.ActionType WHEN 'Create' THEN 3 WHEN 'Edit' THEN 2 ELSE 1 END DESC
LIMIT 1
)

The query below works in Hive. It maps each action to a rank, takes the best (lowest) rank per user, and maps that rank back to the action name.
select
t1.userId,
case min(case t1.actionType
when 'Create' then 1
when 'Edit' then 2
when 'Read' then 3
else 100 end)
when 1 then 'Create'
when 2 then 'Edit'
when 3 then 'Read'
end as actionType
from mytable t1 group by t1.userId
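A window function is another way to express the same ranking. The sketch below assumes the UserActions table from the question and a Hive version with windowing support (0.11 or later); the rn and ranked aliases are made up for illustration.

```sql
-- Rank each user's rows by the hierarchy Create > Edit > Read,
-- then keep only the top-ranked row per user.
SELECT UserId, ActionType
FROM (
    SELECT
        UserId,
        ActionType,
        ROW_NUMBER() OVER (
            PARTITION BY UserId
            ORDER BY CASE ActionType
                         WHEN 'Create' THEN 1
                         WHEN 'Edit'   THEN 2
                         WHEN 'Read'   THEN 3
                         ELSE 100          -- unknown actions sort last
                     END
        ) AS rn
    FROM UserActions
) ranked
WHERE rn = 1;
```

Unlike the MIN() approach, this keeps the whole winning row, which helps if the table has other columns that should come along with the action.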

Related

Postgres, get two row values that are both linked to the same ID

I have a rather tricky database problem that has really stumped me, would appreciate any help.
I have a table which includes data from multiple different sources. This data from different sources can be ‘duplicated’ and we have ways of identifying if that is the case.
Each row in the table has an ‘id’, and if it is identified as a duplicate of another row then we merge it, and it is given a ‘merged_into_id’ which refers to another row in the same table.
I am trying to run a report which will return information about where we have identified duplicates from two of those different sources.
Let's say I have three sources: A, B and C. I want to identify all of the duplicate rows between source A and source B.
I have got the query working fine to do this if a row from source A is directly merged into source B. However, we also have instances in the DB where source A row AND source B row are merged into source C. I am struggling with these and was hoping someone could help with that.
An example:
Original DB:
id | source | merged_into_id
----------------------------
1 | A | 3
2 | B | 3
3 | C | NULL
What I would like to do is to be able to return id 1 and id 2 from that table, as they are both merged into the same ID e.g. like so:
source_a_id | source_b_id
-------------------------
1 | 2
But I'm really struggling to get to that - all I've managed to do is create a parent and child link like the following:
parent_id | child_id | child_source
-----------------------------------
3 | 1 | A
3 | 2 | B
I can also return just the IDs that I want, but they don't 'join' so to speak:
e.g.
SELECT
CASE WHEN child_source = 'A' THEN child_id END as source_a_id,
CASE WHEN child_source = 'B' THEN child_id END as source_b_id
But that just gives me a response with an empty row for the 'missing' data
---EDIT---
Using array_agg and array_to_string I've gotten a little closer to what I need:
SELECT
parent.id as parent_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'A' THEN child.id END)
, ','
) a_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'B' THEN child.id END)
, ','
) b_id
but it's not quite the right format, as I can occasionally have multiple rows from each source, so I get a table that looks like:
parent_id | a_id | b_id
-----------------------
3 | 1 | 2,4,5
In this case, I want to return a table that looks like:
parent_id | a_id | b_id
-----------------------
3 | 1 | 2
3 | 1 | 4
3 | 1 | 5
Does anyone have any advice on getting to my desired output? Many thanks
Suppose that we have this table
select * from t;
id | source | merged_into_id
----+--------+----------------
1 | A | 3
2 | B | 3
3 | C |
5 | B | 3
4 | B | 3
(5 rows)
This should do the work
WITH B_source as (select * from t where source = 'B'),
A_source as (select * from t where source = 'A')
SELECT merged_into_id,A_source.id as a_id,B_source.id as b_id
FROM A_source
INNER JOIN B_source using (merged_into_id);
Result
merged_into_id | a_id | b_id
----------------+------+------
3 | 1 | 2
3 | 1 | 5
3 | 1 | 4
(3 rows)
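The question also mentions rows from source A that are merged directly into a source B row. A sketch that returns both kinds of pairs in one result, assuming the same table t and that direct merges always point from the A row to the B row:

```sql
-- Case 1: an A row and a B row merged into the same third row.
SELECT a.id AS a_id, b.id AS b_id
FROM t a
JOIN t b ON b.merged_into_id = a.merged_into_id
WHERE a.source = 'A'
  AND b.source = 'B'

UNION ALL

-- Case 2: an A row merged directly into a B row.
SELECT a.id AS a_id, b.id AS b_id
FROM t a
JOIN t b ON b.id = a.merged_into_id
WHERE a.source = 'A'
  AND b.source = 'B';
```

Rows whose merged_into_id is NULL never satisfy either join condition, so unmerged rows drop out on their own.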

Does String Value Exists in a List of Strings | Redshift Query

I have some interesting data that I'm trying to query, but I cannot get the syntax right. I have a temporary table (temp_id), which I've filled with the id values I care about. In this example it holds only two ids.
CREATE TEMPORARY TABLE temp_id (id bigint PRIMARY KEY);
INSERT INTO temp_id (id) VALUES ( 1 ), ( 2 );
I have another table in production (let's call it foo) which holds several of those ids in a single cell. The ids column looks like this (below), with the ids stored as a single string separated by "|":
ids
-----------
1|9|3|4|5
6|5|6|9|7
NULL
2|5|6|9|7
9|11|12|99
I want to evaluate each cell in foo.ids and see whether any of its ids match the ones in my temp_id table.
Expected output
ids |does_match
-----------------------
1|9|3|4|5 |true
6|5|6|9|7 |false
NULL |false
2|5|6|9|7 |true
9|11|12|99 |false
So far I've come up with this, but I can't seem to return anything. Instead of trying to create a new column does_match I tried to filter within the WHERE statement. However, the issue is I cannot figure out how to evaluate all the id values in my temp table to the string blob full of the ids in foo.
SELECT
ids
FROM foo
WHERE ids = ANY(SELECT LISTAGG(id, ' | ') FROM temp_id)
Any suggestions would be helpful.
Cheers,
This would work, though I'm not sure about performance:
SELECT
ids
FROM foo
JOIN temp_id
ON '|'||foo.ids||'|' LIKE '%|'||temp_id.id::varchar||'|%'
You wrap the ids list in a pair of additional separators, so you can always search for |id|, including the first and the last number.
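To get the does_match column from the expected output rather than only the matching rows, the same delimiter-wrapping comparison can go into a LEFT JOIN. This is a sketch, assuming the foo and temp_id tables from the question. It groups by ids, so identical ids strings collapse into one output row.

```sql
-- LEFT JOIN keeps every foo row; COUNT(t.id) is 0 when no id matched.
-- A NULL ids value never satisfies LIKE, so it comes out as false.
SELECT
    f.ids,
    CASE WHEN COUNT(t.id) > 0 THEN true ELSE false END AS does_match
FROM foo f
LEFT JOIN temp_id t
    ON '|' || f.ids || '|' LIKE '%|' || CAST(t.id AS varchar) || '|%'
GROUP BY f.ids;
```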
The following SQL (I know it's a bit of a hack) returns exactly what you expect as output. I tested it with your sample data but don't know how it would behave on your real data, so try it and let me know.
with seq AS ( -- a sequence CTE to emulate Postgres' unnest
select 1 as i union all -- assumes at most 10 ids per ids field,
-- feel free to modify this part
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10)
select distinct k.ids,
case -- since I can't do a max on a boolean field, used two cases
-- for 1s and 0s and converted them to boolean
when max(case
when t.id::varchar in (
select split_part(f.ids,'|',seq.i)
from seq
join foo f on seq.i <= REGEXP_COUNT(f.ids, '[|]') + 1
where split_part(f.ids,'|',seq.i) != '' and k.ids = f.ids)
then 1
else 0
end) = 1
then true
else false
end as does_match
from temp_id t, foo k
group by 1
Please let me know if this works for you!

Update Count column in Postgresql

I have a single table laid out as such:
id | name | count
1 | John |
2 | Jim |
3 | John |
4 | Tim |
I need to fill out the count column such that the result is the number of times the specific name shows up in the column name.
The result should be:
id | name | count
1 | John | 2
2 | Jim | 1
3 | John | 2
4 | Tim | 1
I can get the count of occurrences of unique names easily using:
SELECT COUNT(name)
FROM table
GROUP BY name
But that doesn't fit into an UPDATE statement due to it returning multiple rows.
I can also get it narrowed down to a single row by doing this:
SELECT COUNT(name)
FROM table
WHERE name = 'John'
GROUP BY name
But that doesn't allow me to fill out the entire column, just the 'John' rows.
You can do that with a common table expression:
with counted as (
select name, count(*) as name_count
from the_table
group by name
)
update the_table
set "count" = c.name_count
from counted c
where c.name = the_table.name;
Another (slower) option would be to use a correlated subquery:
update the_table
set "count" = (select count(*)
from the_table t2
where t2.name = the_table.name);
But in general it is a bad idea to store values that can easily be calculated on the fly:
select id,
name,
count(*) over (partition by name) as name_count
from the_table;
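If the count is needed often, one way to avoid storing it is to wrap that query in a view. A minimal sketch, assuming PostgreSQL and the same the_table; the view name the_table_with_counts is made up for illustration.

```sql
-- The count is computed at query time, so it never goes stale
-- when rows are inserted, updated or deleted.
CREATE VIEW the_table_with_counts AS
SELECT id,
       name,
       count(*) OVER (PARTITION BY name) AS name_count
FROM the_table;

-- Usage: read from the view instead of the base table.
SELECT * FROM the_table_with_counts WHERE name = 'John';
```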
Another method: using a derived table
UPDATE tb
SET count = t.name_count
FROM (
SELECT NAME, count(NAME) AS name_count
FROM tb
GROUP BY NAME
) t
WHERE t.NAME = tb.NAME

SQL to remove rows with duplicated value while keeping one

Say I have this table
id | data | value
-----------------
1 | a | A
2 | a | A
3 | a | A
4 | a | B
5 | b | C
6 | c | A
7 | c | C
8 | c | C
I want to remove those rows with duplicated value for each data while keeping the one with the min id, e.g. the result will be
id | data | value
-----------------
1 | a | A
4 | a | B
5 | b | C
6 | c | A
7 | c | C
I know a way to do it is to do a union like:
SELECT 1 [id], 'a' [data], 'A' [value] INTO #test UNION SELECT 2, 'a', 'A'
UNION SELECT 3, 'a', 'A' UNION SELECT 4, 'a', 'B'
UNION SELECT 5, 'b', 'C' UNION SELECT 6, 'c', 'A'
UNION SELECT 7, 'c', 'C' UNION SELECT 8, 'c', 'C'
SELECT * FROM #test WHERE id NOT IN (
SELECT MIN(id) FROM #test
GROUP BY [data], [value]
HAVING COUNT(1) > 1
UNION
SELECT MIN(id) FROM #test
GROUP BY [data], [value]
HAVING COUNT(1) <= 1
)
but this solution has to repeat the same group by twice (consider the real case is a massive group by with > 20 columns)
I would prefer a simpler answer with less code as opposed to a complex one. Is there a more concise way to code this?
Thank you
You can use one of the methods below:
Using WITH CTE:
WITH CTE AS
(SELECT *,RN=ROW_NUMBER() OVER(PARTITION BY data,value ORDER BY id)
FROM TableName)
DELETE FROM CTE WHERE RN>1
Explanation:
This query will select the contents of the table along with a row number RN. And then delete the records with RN >1 (which would be the duplicates).
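To check which rows would go before deleting anything, the same CTE can feed a SELECT instead. A sketch, assuming SQL Server (as in the question) and the same TableName:

```sql
-- Same row numbering as the DELETE version; RN = 1 is the row with
-- the smallest id in each (data, value) group and is the one kept.
WITH CTE AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY data, value ORDER BY id) AS RN
    FROM TableName
)
SELECT id, data, value
FROM CTE
WHERE RN > 1;   -- for the sample data: ids 2, 3 and 8
```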
Using NOT IN:
DELETE FROM TableName
WHERE id NOT IN
(SELECT MIN(id) as id
FROM TableName
GROUP BY data,value)
Explanation:
With the given example, inner query will return ids (1,6,4,5,7). The outer query will delete records from table whose id NOT IN (1,6,4,5,7).
Suggestion: Use the first method since it is faster than the latter. Also, it manages to keep only one record if id field is also duplicated for the same data and value.
I want to add a MySQL solution for this question.
Suggestion 1 does not work on MySQL before version 8.0, which doesn't support the WITH clause.
Suggestion 2 throws this error: You can't specify target table 'TableName' for update in FROM clause.
So the solution is to read the table through a derived table:
DELETE FROM TableName WHERE id NOT IN
(SELECT MIN(id) as id
FROM (select * from TableName) as t1
GROUP BY data,value);

how to make array_agg() work like group_concat() from mySQL

So I have this table:
create table test (
id integer,
rank integer,
image varchar(30)
);
Then some values:
id | rank | image
---+------+-------
1 | 2 | bbb
1 | 3 | ccc
1 | 1 | aaa
2 | 3 | c
2 | 1 | a
2 | 2 | b
I want to group them by id and concatenate the image name in the order given by rank. In mySQL I can do this:
select id,
group_concat( image order by rank asc separator ',' )
from test
group by id;
And the output would be:
1 aaa,bbb,ccc
2 a,b,c
Is there a way I can have this in postgresql?
If I try to use array_agg() the names will not show in the correct order and apparently I was not able to find a way to sort them. (I was using postgres 8.4 )
In PostgreSQL 8.4 you cannot explicitly order array_agg, but you can work around it by ordering the rows passed to the aggregate with a subquery:
SELECT id, array_to_string(array_agg(image), ',')
FROM (SELECT * FROM test ORDER BY id, rank) x
GROUP BY id;
In PostgreSQL 9.0 and later, aggregate expressions can have an ORDER BY clause:
SELECT id, array_to_string(array_agg(image ORDER BY rank), ',')
FROM test
GROUP BY id;
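From PostgreSQL 9.0 on there is also string_agg, which does the concatenation directly and is the closest match to group_concat. A short sketch against the same test table (the images alias is made up for illustration):

```sql
-- string_agg(value, delimiter ORDER BY ...) builds the joined string
-- without going through an intermediate array.
SELECT id,
       string_agg(image, ',' ORDER BY rank) AS images
FROM test
GROUP BY id
ORDER BY id;
-- id | images
--  1 | aaa,bbb,ccc
--  2 | a,b,c
```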