Hive subquery and GROUP BY

I have two tables
table1:
id
1
2
3
table 2:
id date
1 x1
4 x2
1 x3
3 x4
3 x5
1 x6
3 x5
6 x6
6 x5
3 x6
I want the count of each id in table 2 that is present in table 1.
Result
id count
1 3
2 0
3 4
I am using this query, but it's giving me an error:
SELECT tab2.id, count(tab2.id)
FROM <mytable2> tab2
GROUP BY tab2.id
WHERE tab2.id IN (select id from <mytable1>)
;
Error is:
missing EOF at 'WHERE' near 'di_device_id'

There are two possible issues. Subqueries in the WHERE clause are only supported from Hive 0.13 onward. If you are on such a version, then your problem is just that you have WHERE and GROUP BY the wrong way round; WHERE must come before GROUP BY:
SELECT tab2.id, count(tab2.id)
FROM <mytable2> tab2
WHERE tab2.id IN (select id from <mytable1>)
GROUP BY tab2.id
;
If you are using an older version of Hive then you need to use a JOIN:
SELECT tab2.id, count(tab2.id)
FROM <mytable2> tab2 INNER JOIN <mytable1> tab1 ON (tab2.id = tab1.id)
GROUP BY tab2.id
;

You have two issues:
WHERE comes before GROUP BY. In SQL you use HAVING to filter after grouping.
Hive doesn't support all types of nested queries in the WHERE clause. See here: Hive Subqueries
However, your type of subquery will be OK. Try this:
SELECT tab2.id, count(tab2.id)
FROM <mytable2> tab2
WHERE tab2.id IN (select id from <mytable1>)
GROUP BY tab2.id;
It will do exactly what you meant.
Edit: I just checked #MattinBit's answer. I didn't intend to duplicate it. His answer is more complete!
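Note that both the WHERE ... IN form and the INNER JOIN form drop ids that never occur in table 2, so the id 2 row with a count of 0 from the expected result would not appear. One way to keep it is a LEFT JOIN that starts from table 1. The sketch below checks this idea with Python's built-in sqlite3 module rather than Hive, so it only illustrates the query shape; the table names table1 and table2 are stand-ins for <mytable1> and <mytable2>.

```python
import sqlite3

# Hypothetical stand-ins for <mytable1> and <mytable2>. SQLite is used only
# so the query can be run locally; it is not Hive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER, date TEXT);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES
        (1, 'x1'), (4, 'x2'), (1, 'x3'), (3, 'x4'), (3, 'x5'),
        (1, 'x6'), (3, 'x5'), (6, 'x6'), (6, 'x5'), (3, 'x6');
""")

# Start from table1 so every id in it survives the join; COUNT(tab2.id)
# ignores the NULL produced for ids that have no match in table2.
counts = conn.execute("""
    SELECT tab1.id, COUNT(tab2.id) AS cnt
    FROM table1 tab1
    LEFT JOIN table2 tab2 ON tab2.id = tab1.id
    GROUP BY tab1.id
    ORDER BY tab1.id
""").fetchall()

print(counts)  # [(1, 3), (2, 0), (3, 4)]
```

Counting tab2.id instead of * is what turns the unmatched row into 0 rather than 1.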

Related

join 2 tables with different dates into one date column

I have two tables: a_table and b_table. They contain closing records and checkout records, which for each customer can occur on different dates. I would like to combine these two tables so that there is only one date field, one customer field, one close field and one check field.
a_table
time_modified customer_name
2021-05-03 Ben
2021-05-08 Ben
2021-07-10 Jerry
b_table
time_modified account_id
2021-05-06 Ben
2021-07-08 Jerry
2021-07-12 Jerry
Expected result
date account_id_a close check
2021-05-03 Ben 1 0
2021-05-06 Ben 0 1
2021-05-08 Ben 1 0
2021-07-08 Jerry 0 1
2021-07-10 Jerry 1 1
2021-07-12 Jerry 0 1
The query so far:
with a_table as (
select rz.time_modified::date, rz.customer_name,
case when rz.time_modified::date is not null then 1 else 0 end as close
from schema.rz
),
b_table as (
select bo.time_modified::date, bo.customer_name,
case when bo.time_modified::date is not null then 1 else 0 end as check
from schema.bo
)
SELECT (CURRENT_DATE::TIMESTAMP - (i * interval '1 day'))::date as date,
a.*, b.*
FROM generate_series(1,2847) i
left join a_table a
on a.time_modified = i.date
left join b_table b
on b.time_modified = i.date
The query above returns:
SQL Error [500310] [0A000]: [Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
You just need to do a UNION rather than a JOIN.
A JOIN merges two tables side by side into wider rows, whereas a UNION appends the rows of the second table to the first.
First off, the error you are getting is due to the use of the generate_series() function in a query where its results need to be combined with table data. generate_series() is a leader-node-only function and its results cannot be used on compute nodes. You will need to generate the number series you want in another way. See How to Generate Date Series in Redshift for possible ways to do this.
I'm not sure I follow your query entirely, but it seems like you want to UNION the tables and not JOIN them. You haven't defined what rz and bo are, so it is a bit confusing. However, a UNION plus some calculation for close and check seems like the way to go.
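A minimal sketch of the UNION approach, run against the sample rows from the question with Python's built-in sqlite3 module (not Redshift, so the SQL may need small adjustments there). The output column names day, customer, closed and checked are illustrative choices, not names from the original schema; closed and checked are used because CHECK is a reserved word in many dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a_table (time_modified TEXT, customer_name TEXT);
    CREATE TABLE b_table (time_modified TEXT, account_id TEXT);
    INSERT INTO a_table VALUES
        ('2021-05-03', 'Ben'), ('2021-05-08', 'Ben'), ('2021-07-10', 'Jerry');
    INSERT INTO b_table VALUES
        ('2021-05-06', 'Ben'), ('2021-07-08', 'Jerry'), ('2021-07-12', 'Jerry');
""")

# Tag each source with its own flag, stack the rows with UNION ALL, then
# collapse to one row per (day, customer) so a day present in both tables
# would carry both flags.
combined = conn.execute("""
    SELECT day, customer, MAX(closed) AS closed, MAX(checked) AS checked
    FROM (
        SELECT time_modified AS day, customer_name AS customer,
               1 AS closed, 0 AS checked
        FROM a_table
        UNION ALL
        SELECT time_modified, account_id, 0, 1
        FROM b_table
    ) AS events
    GROUP BY day, customer
    ORDER BY day, customer
""").fetchall()

for row in combined:
    print(row)
```

No date series is needed for this shape of result, which also sidesteps generate_series() entirely. A row gets both flags set only when the same date and customer occur in both tables.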

I want to get unique rows on joining tables

I am running the following query while trying to join 3 tables :
select
a.project_id, a.acc_name, a.project_name, a.iot,a.acc_id, a.active,
b.app_fte, b.contact_person, c.cost_call_date
from
Account a, Application b, account_version c
where
a.acc_id in (Select acc_id from account where acc_name='GGG') and
EXTRACT(MONTH FROM c.cost_call_date) = 3;
Sample data from the tables are as follows :
Account :
acc_id acc_name iot acc_contact project_id project_name ilc_code license_no active
2 GGG NA YYY 7777 HHH TTR 766 false
Application :
app_id app_name app_fte contact_person acc_id
1 sfsf 4 sdsdff 2
Account_version :
line_id acc_id version_no chargable_fte cost_call_date is_approved
9 2 7 4 2018-03-20
Here acc_id is the primary key for the Account table and the foreign key for the Application and Account_version tables. When I run the above query I get 30 rows. I have also tried using the DISTINCT keyword, but I still get 10 rows. Please help me get unique rows.
Try something like this:
SELECT DISTINCT a.project_id, a.acc_name, a.project_name, a.iot,a.acc_id, a.active, b.app_fte, b.contact_person, c.cost_call_date
FROM Account a
INNER JOIN Application b
USING (acc_id)
INNER JOIN account_version c
USING (acc_id)
WHERE a.acc_name = 'GGG'
AND EXTRACT(MONTH FROM c.cost_call_date) = 3
For reference as to why your query was giving you more rows than expected, try running this:
SELECT a.*, b.*
FROM generate_series(1, 10) a, generate_series(1, 10) b
What you are doing by selecting from multiple tables as you did is a cross join. What you should actually be doing is an inner join to get the rows you want, and a DISTINCT if required to get only distinct rows from the results.
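The difference can be seen by counting rows on a small made-up data set. The sketch below uses Python's built-in sqlite3 module; the rows are invented for illustration (they are not the asker's data) and only the columns needed for the join are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (acc_id INTEGER PRIMARY KEY, acc_name TEXT);
    CREATE TABLE application (app_id INTEGER, acc_id INTEGER);
    CREATE TABLE account_version (line_id INTEGER, acc_id INTEGER);
    INSERT INTO account VALUES (1, 'AAA'), (2, 'GGG');
    INSERT INTO application VALUES (1, 2), (2, 1), (3, 1);
    INSERT INTO account_version VALUES (5, 1), (6, 1), (7, 1), (8, 1), (9, 2);
""")

# Comma-separated tables with no join condition: every row of each table is
# paired with every row of the others (2 * 3 * 5 combinations).
cross_rows = conn.execute("""
    SELECT COUNT(*) FROM account a, application b, account_version c
""").fetchone()[0]

# Joining on the shared key keeps only rows that belong to the same account.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM account a
    INNER JOIN application b ON b.acc_id = a.acc_id
    INNER JOIN account_version c ON c.acc_id = a.acc_id
    WHERE a.acc_name = 'GGG'
""").fetchone()[0]

print(cross_rows, joined_rows)  # 30 1
```

DISTINCT on the cross join only hides exact duplicates; it does not remove combinations of rows that belong to different accounts.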

Subsetting records that contain multiple values in one column

In my postgres table, I have two columns of interest: id and name. My goal is to keep only the records where id has more than one value in name. In other words, I would like to keep all records of ids that have multiple values and where at least one of those values is B.
UPDATE: I have tried adding WHERE EXISTS to the queries below but this does not work
The sample data would look like this:
> test
id name
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 2 B
9 1 B
10 2 B
and the output would look like this:
> output
id name
1 1 A
2 2 A
8 2 B
9 1 B
10 2 B
How would one write a query to select only these kinds of records?
Based on your description you would seem to want:
select id, name
from (select t.*, min(name) over (partition by id) as min_name,
max(name) over (partition by id) as max_name
from t
) t
where min_name < max_name;
This can be done using EXISTS:
select id, name
from test t1
where exists (select *
from test t2
where t1.id = t2.id
and t1.name <> t2.name) -- this will select those with multiple names for the id
and exists (select *
from test t3
where t1.id = t3.id
and t3.name = 'B') -- this will select those with at least one b for that id
You want the records whose id shows up with more than one name, right?
This could be formulated in SQL as follows (COUNT(DISTINCT name) is used so that an id repeated with the same name does not qualify):
select * from test t1
where id in (
select id
from test t2
group by id
having count(distinct name) > 1)
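The grouped form can also carry the "at least one B" requirement in the same HAVING clause. The following is a sketch under that assumption, run on the question's sample rows with Python's built-in sqlite3 module rather than Postgres; the conditional SUM is plain SQL and should carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "A"), (6, "A"), (7, "A"),
     (2, "B"), (1, "B"), (2, "B")],
)

# Keep ids that have more than one distinct name AND at least one 'B' row,
# then return every record belonging to those ids.
kept = conn.execute("""
    SELECT id, name
    FROM test
    WHERE id IN (
        SELECT id
        FROM test
        GROUP BY id
        HAVING COUNT(DISTINCT name) > 1
           AND SUM(CASE WHEN name = 'B' THEN 1 ELSE 0 END) > 0
    )
    ORDER BY id, name
""").fetchall()

print(kept)  # [(1, 'A'), (1, 'B'), (2, 'A'), (2, 'B'), (2, 'B')]
```

These are the same five records as in the expected output, only sorted by id and name.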

T-SQL End of Month sum

I have a table with some transaction fields. The primary id is a CUSTomer field plus a TXN_DATE, and for two of the fields, NOM_AMOUNT and GRS_AMOUNT, I need an end-of-month SUM (not rolling, just per month). How can I do it? I also need a 0 reported for months with no transactions.
Thank you!
If you group by the expression month(txn_date) you can calculate the sum. If you left join your data to a derived table of month numbers, the months with no records are kept as well, and you can report a 0 for them (or null if you don't use the coalesce function).
This will be your end result, I assume you are able to add the other column you need to sum and adapt for your schema.
select mnt as month
, sum(coalesce(NOM_AMOUNT ,0)) as NOM_AMOUNT_EOM
, sum(coalesce(GRS_AMOUNT ,0)) as GRS_AMOUNT_EOM
from (
select 1 as mnt
union all select 2
union all select 3
union all select 4
union all select 5
union all select 6
union all select 7
union all select 8
union all select 9
union all select 10
union all select 11
union all select 12) as m
left outer join Table1 as t
on m.mnt = month(txn_date)
group by mnt
Here is the initial working sqlfiddle
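The same month-table idea can be tried locally. This sketch uses Python's built-in sqlite3 module instead of SQL Server, so month(txn_date) becomes a strftime call and the twelve month numbers come from a recursive CTE; the table name txn and its rows are made up for illustration. Like the query above, it groups by month number only, so data spanning several years or customers would need those columns in the join and the GROUP BY as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (cust TEXT, txn_date TEXT,
                      nom_amount INTEGER, grs_amount INTEGER);
    INSERT INTO txn VALUES
        ('c1', '2023-01-05', 10, 12),
        ('c1', '2023-01-20', 20, 24),
        ('c1', '2023-03-15', 5, 6);
""")

# months(mnt) yields 1..12; the LEFT JOIN keeps months without transactions,
# and COALESCE turns their NULL sums into 0.
monthly = conn.execute("""
    WITH RECURSIVE months(mnt) AS (
        SELECT 1
        UNION ALL
        SELECT mnt + 1 FROM months WHERE mnt < 12
    )
    SELECT m.mnt,
           COALESCE(SUM(t.nom_amount), 0) AS nom_amount_eom,
           COALESCE(SUM(t.grs_amount), 0) AS grs_amount_eom
    FROM months m
    LEFT JOIN txn t
           ON CAST(strftime('%m', t.txn_date) AS INTEGER) = m.mnt
    GROUP BY m.mnt
    ORDER BY m.mnt
""").fetchall()

for row in monthly:
    print(row)
```

With these rows, January sums to 30 and 36, March to 5 and 6, and every other month reports 0 for both amounts.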

Simplified cross joins?

Let's say I have a table 'A' with rows:
A
B
C
D
Is there a simple way to do a cross join that creates
A 1
A 2
A 3
A 4
...
D 1
D 2
D 3
D 4
without creating a second table?
Something like:
SELECT *
FROM A
CROSS JOIN (1,2,3,4)
Something like this should work, I guess:
select * from A cross join (select 1 union all select 2 union all select 3 union all select 4) as tmp
You will create a second table, but you won't persist it.
The following would work for a table of any size (though I only tested it against 6 rows). It uses the ranking functions available in SQL Server 2005 and up, but the idea should be adaptable to any RDBMS.
SELECT ta.SomeColumn, cj.Ranking
from TableA ta
cross join (select row_number() over (order by SomeColumn) Ranking from TableA) cj
order by ta.SomeColumn, cj.Ranking
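You should be able to achieve this via the following (most databases require an alias on the derived table, and some also require a name for its column):
select * from A cross join
(select 1 as n
union all
select 2
union all
select 3
union all
select 4) as nums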
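As a quick check of the inline derived-table approach, here is a sketch using Python's built-in sqlite3 module. The column name letter and the alias nums are assumptions made for the example, since the question does not name the column of table A.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (letter TEXT)")
conn.executemany("INSERT INTO A VALUES (?)", [("A",), ("B",), ("C",), ("D",)])

# The numbers 1..4 exist only inside the query as a derived table, so no
# second table is created or persisted.
pairs = conn.execute("""
    SELECT A.letter, nums.n
    FROM A
    CROSS JOIN (
        SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
    ) AS nums
    ORDER BY A.letter, nums.n
""").fetchall()

print(len(pairs))           # 16
print(pairs[0], pairs[-1])  # ('A', 1) ('D', 4)
```

Each of the four rows in A is paired with each of the four numbers, giving 16 rows.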