hive 1.1 take out hierarchy value from three tables

hive 1.1 take out hierarchy value from three tables - scala

I have the following tables in HIVE (1.1 version) and need the output as shown in the Result. As UNION is not available in hive 1.1, need different approach to get the below result.
Table A:
id name
1 One
2 Two
4 Four
Table B:
id name
1 ONE
3 THREE
4 FOUR
Table C:
id name
1 one
2 two
3 three
5 five
Result
id name
1 One
2 Two
3 THREE
4 Four
5 five

select id, name from A
UNION ALL
SELECT sec.id, sec.name FROM B sec WHERE sec.id NOT IN (SELECT id FROM A)
UNION ALL
SELECT thr.id, thr.name FROM c thr WHERE thr.id NOT IN (SELECT id FROM A UNION ALL SELECT id FROM B)

Related

PostgreSQL: Merging sets of rows which text fields are contained in other sets of rows

Given the following table, I need to merge the fields in different "id" only if they are the same type (person or dog), and always as the value of every field of an "id" is contained in the values of other "ids".
id
being
feature
values
1
person
name
John;Paul
1
person
surname
Smith
2
dog
name
Ringo
3
dog
name
Snowy
4
person
name
John
4
person
surname
5
person
name
John;Ringo
5
person
surname
Smith
In this example, the merge results should be as follows:
1 and 4 (Since 4's name is present in 1's name and 4's surname is empty)
1 and 5 cannot be merged (the name field show different values)
4 and 5 can be merged
2 and 3 (dogs) cannot be merged. They have only the field "name" and they do not share values.
2 and 3 cannot be merged with 1, 4, 5 since they have different values in "being".
id
being
feature
values
1
person
name
John;Paul
1
person
surname
Smith
2
dog
name
Ringo
3
dog
name
Snowy
5
person
name
John;Ringo
5
person
surname
Smith
I have tried this:
UPDATE table a
SET values = (SELECT array_to_string(array_agg(distinct values),';') AS values FROM table b
WHERE a.being= b.being
AND a.feature= b.feature
AND a.id<> b.id
AND a.values LIKE '%'||a.values||'%'
)
WHERE (select count (*) FROM (SELECT DISTINCT c.being, c.id from table c where a.being=c.being) as temp) >1
;
This doesn't work well because it will merge, for example, 1 and 5. Besides, it duplicates values when merging that field.

One option is to aggregate names with surnames on "id" and "being". Once you get a single string per "id", a self join may find when a full name is completely included inside another (where the "being" is same for both "id"s), then you just select the smallest fullname, candidate for deletion:
WITH cte AS (
SELECT id,
being,
STRING_AGG(values, ';') AS fullname
FROM tab
GROUP BY id,
being
)
DELETE FROM tab
WHERE id IN (SELECT t2.id
FROM cte t1
INNER JOIN cte t2
ON t1.being = t2.being
AND t1.id > t2.id
AND t1.fullname LIKE CONCAT('%',t2.fullname,'%'));
Check the demo here.

PSQL filter each group of rows

Recently I've faced with pretty rare filtering case in PSQL.
My question is: How to filter redundant elements in each group of the grouped table?
For example: we have a nexp table:
id | group_idx | filter_idx
1 1 x
2 3 z
3 3 x
4 2 x
5 1 x
6 3 x
7 2 x
8 1 z
9 2 z
Firstly, to group rows:
SELECT group_idx FROM table
GROUP BY group_idx;
But how I can filter redundant fields (filter_idx = z) from each group after grouping?
P.S. I can't just write like that because I need to find groups firstly.
SELECT group_idx FROM table
where filter_idx <> z;
Thanks.

Assuming that you want to see all groups at all times, even when you filter out all records of some group:
drop table if exists test cascade;
create table test (id integer, group_idx integer, filter_idx character);
insert into test
(id,group_idx,filter_idx)
values
(1,1,'x'),
(2,3,'z'),
(3,3,'x'),
(4,2,'x'),
(5,1,'x'),
(6,3,'x'),
(7,2,'x'),
(8,1,'z'),
(9,2,'z'),
(0,4,'y');--added an example of a group that would be discarded using WHERE.
Get groups in one query, filter your rows in another, then left join the two.
select groups.group_idx,
string_agg(filtered_rows.filter_idx,',')
from
(select distinct group_idx from test) groups
left join
(select group_idx,filter_idx from test where filter_idx<>'y') filtered_rows
using (group_idx)
group by 1;
-- group_idx | string_agg
-------------+------------
-- 3 | z,x,x
-- 4 |
-- 2 | x,x,z
-- 1 | x,x,z
--(4 rows)

Postgres, get two row values that are both linked to the same ID

I have a rather tricky database problem that has really stumped me, would appreciate any help.
I have a table which includes data from multiple different sources. This data from different sources can be ‘duplicated’ and we have ways of identifying if that is the case.
Each row in the table has an ‘id’, and if it is identified as a duplicate of another row then we merge it, and it is given a ‘merged_into_id’ which refers to another row in the same table.
I am trying to run a report which will return information about where we have identified duplicates from two of those different sources.
Lets say I have three sources: A, B and C. I want to identify all of the duplicate rows between source A and source B.
I have got the query working fine to do this if a row from source A is directly merged into source B. However, we also have instances in the DB where source A row AND source B row are merged into source C. I am struggling with these and was hoping someone could help with that.
An example:
Original DB:
id
source
merged_into_id
1
A
3
2
B
3
3
C
NULL
What I would like to do is to be able to return id 1 and id 2 from that table, as they are both merged into the same ID e.g. like so:
source_a_id
source_b_id
1
2
But I'm really struggling to get to that - all I've managed to do is create a parent and child link like the following:
parent_id
child_id
child_source
3
1
A
3
2
B
I can also return just the IDs that I want, but they don't 'join' so to speak:
e.g.
SELECT
CASE WHEN child_source = 'A' then child_id as source_a_id,
CASE WHEN child_source = 'B' then child_id as source_b_id
But that just gives me a response with an empty row for the 'missing' data
---EDIT---
Using array_agg and array_to_string I've gotten a little closer to what I need:
SELECT
parent.id as parent_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'A' THEN child.id END)
, ','
) a_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'B' THEN child.id END)
, ','
) b_id
but its not quite the right format as I can occasionally have multiple versions from each source, so I get a table that looks like :
parent_id
a_id
b_id
3
1
2,4,5
In this case, I want to return a table that looks like:
parent_id
a_id
b_id
3
1
2
3
1
4
3
1
5
Does anyone have any advice on getting to my desired output? Many thanks

Suppose that we have this table
select * from t;
id | source | merged_into_id
----+--------+----------------
1 | A | 3
2 | B | 3
3 | C |
5 | B | 3
4 | B | 3
(5 rows)
This should do the work
WITH B_source as (select * from t where source = 'B'),
A_source as (select * from t where source = 'A')
SELECT merged_into_id,A_source.id as a_id,B_source.id as b_id
FROM A_source
INNER JOIN B_source using (merged_into_id);
Result
merged_into_id | a_id | b_id
----------------+------+------
3 | 1 | 2
3 | 1 | 5
3 | 1 | 4
(3 rows)

Subsetting records that contain multiple values in one column

In my postgres table, I have two columns of interest: id and name - my goal is to only keep records where id has more than one value in name. In other words, would like to keep all records of ids that have multiple values and where at least one of those values is B
UPDATE: I have tried adding WHERE EXISTS to the queries below but this does not work
The sample data would look like this:
> test
id name
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 2 B
9 1 B
10 2 B
and the output would look like this:
> output
id name
1 1 A
2 2 A
8 2 B
9 1 B
10 2 B
How would one write a query to select only these kinds records?

Based on your description you would seem to want:
select id, name
from (select t.*, min(name) over (partition by id) as min_name,
max(name) over (partition by id) as max_name
from t
) t
where min_name < max_name;

This can be done using EXISTS:
select id, name
from test t1
where exists (select *
from test t2
where t1.id = t2.id
and t1.name <> t2.name) -- this will select those with multiple names for the id
and exists (select *
from test t3
where t1.id = t3.id
and t3.name = 'B') -- this will select those with at least one b for that id

Those records where for their id more than one name shines up, right?
This could be formulated in "SQL" as follows:
select * from table t1
where id in (
select id
from table t2
group by id
having count(name) > 1)

How to determine whether a value exists in a junction table and return zero or one?

I am using SQL Server 2008 R2
I am trying to write a single query that will return only exactly what I need. I will drop in a MovieID and get back a list of ALL genres. If the movie represents a specific genre (has an associated record in the junction table), the Checked value will be 1. If not, then 0.
My result set should look like this:
GenreID Genre Checked
1 ABC 0
2 DEF 1
3 HIJ 0
4 KLM 1
My First table is named Genres. It looks like this:
GenreID Genre
1 ABC
2 DEF
3 HIJ
4 KLM
My second table is named Movies. It looks like this:
MovieID Title
1 Blah
2 Foo
3 Carpe
4 Diem
My third table is a junction table named Movies_Genres. It looks like this:
MovieID GenreID
1 2
1 1
1 4
2 1
2 3
3 4
4 1
I would normally, do a couple of queries and a couple of loops to handle this, but I want to really just make the database do the work here. How do I tweak my query so that I can get the resultset that I need with just a single query?
Here's the starting query:
SELECT GenreID,
Genre
FROM Genres
Thanks in advance for your help!!!

SELECT g.GenreID, g.Genre, Checked = CASE WHEN EXISTS
(SELECT 1 FROM dbo.Movies_Genres AS mg
INNER JOIN dbo.Movies AS m
ON mg.MovieID = m.MovieID
WHERE mg.GenreID = g.GenreID
AND m.MovieID = #MovieID) THEN 1 ELSE 0 END
FROM dbo.Genres AS g
ORDER BY g.GenreID;
If there is a unique constraint or primary key on dbo.Movies_Genres(MovieID, GenreID) then this can be simply:
SELECT g.GenreID, g.Genre, Checked = COUNT(mg.GenreID)
FROM dbo.Genres AS g
LEFT OUTER JOIN dbo.Movies_Genres AS mg
ON g.GenreID = mg.GenreID
AND mg.MovieID = #MovieID
GROUP BY g.GenreID, g.Genre;
...since the count for any genre can only be 0 or 1 given a single #MovieID.

Pretty straight forward using CASE;
SELECT DISTINCT g.GenreID, g.Genre,
CASE WHEN mg.MovieID IS NULL THEN 0 ELSE 1 END Checked
FROM Genres g
LEFT JOIN Movies_Genres mg
ON g.GenreID=mg.GenreID
AND mg.MovieId=#MovieID;
Demo here.
Edit: If entries are guaranteed to be unique in Movies_Genres, you could choose to drop the DISTINCT.

The #MovieID is the movie, you want to filter by.
SELECT Genres.GenreID,
Genres.Genre,
CASE WHEN (Movies_Genres.GenreID IS NULL)
THEN 0
ELSE 1
END AS Checked
FROM Genres LEFT JOIN
Movies_Genres ON Movies_Genres.GenreID = Genres.GenreID AND
MovieID = #MovieID

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

hive 1.1 take out hierarchy value from three tables - scala

select id, name from A UNION ALL SELECT sec.id, sec.name FROM B sec WHERE sec.id NOT IN (SELECT id FROM A) UNION ALL SELECT thr.id, thr.name FROM c thr WHERE thr.id NOT IN (SELECT id FROM A UNION ALL SELECT id FROM B)

Related

PostgreSQL: Merging sets of rows which text fields are contained in other sets of rows

PSQL filter each group of rows

Postgres, get two row values that are both linked to the same ID

Subsetting records that contain multiple values in one column

How to determine whether a value exists in a junction table and return zero or one?

Categories

Resources