Hive GROUP BY optimization based on cardinality - group-by

Logically, the cardinality of columns should matter when performing a GROUP BY. When we write Hive queries involving GROUP BY, we are familiar with the data being queried, so we have an idea of the cardinality of the individual columns involved. Hive, however, has no such knowledge. Say the Hive query in question is:
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col1,Col2,Col3,Col4,Col5
I know the cardinality of all 5 columns here, but Hive doesn't, so Hive will probably perform poorly.
Say the cardinality information I have about these columns is as follows, from lowest to highest, with an example of the values each contains:
Col5 = it contains country name
Col4 = it contains state name
Col3 = it contains city name
Col2 = it contains postal code
Col1 = it contains email address
Hive will treat all of these the same. Wouldn't it be beneficial if Hive knew the underlying cardinality so it could exploit it when calculating unique groups? If so, would explicitly arranging the columns in the GROUP BY clause in order of cardinality be more efficient, as in the following example?
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col5,Col4,Col3,Col2,Col1
Or will Hive ignore this order and treat all the columns equally regardless?
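As a minimal sketch of the logical side of this question (it says nothing about Hive's physical execution plan or its cost), the following Python snippet counts groups over a composite key built in two different column orders. The sample rows and the helper name group_count are made up for illustration. It only shows that the set of groups and their counts is the same whichever order the keys are listed in.

```python
from collections import Counter

# Hypothetical sample rows: (Col1 email, Col2 postal code, Col3 city, Col4 state, Col5 country)
rows = [
    ("a@x.com", "10001", "New York", "NY", "US"),
    ("a@x.com", "10001", "New York", "NY", "US"),
    ("b@y.com", "94105", "San Francisco", "CA", "US"),
    ("c@z.com", "M5V", "Toronto", "ON", "CA"),
]


def group_count(rows, key_order):
    """Count rows per group, building the composite key in the given column order."""
    counts = Counter(tuple(row[i] for i in key_order) for row in rows)
    # Re-project every key back to Col1..Col5 order so the results are comparable.
    result = {}
    for key, n in counts.items():
        by_col = dict(zip(key_order, key))
        result[tuple(by_col[i] for i in range(5))] = n
    return result


high_to_low = group_count(rows, [0, 1, 2, 3, 4])  # GROUP BY Col1,Col2,Col3,Col4,Col5
low_to_high = group_count(rows, [4, 3, 2, 1, 0])  # GROUP BY Col5,Col4,Col3,Col2,Col1

print(high_to_low == low_to_high)
```

Whether one ordering runs faster inside Hive is a separate matter that this sketch does not address.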

Related

sql restriction for join table with string similarity rule

My database is built from several tables that are similar to each other and share the same column names. The purpose is to compare data from each source.
table_A and table_B: id, product_id, capacitor_name, ressitance
It is easy to join the tables by product_id and see the comparison, but I need to compare rows by product_id when it exists in both tables and, when it does not, compare by name similarity, restricting the similarity matches to at most 3 results.
The names are usually not equal, which is why I'm using similarity.
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
similarity(ta.name,tb.name) > 0.8
It works fine, but sometimes I get more rows than I need. How can I restrict the result (and, moreover, order it by similarity so the most similar names come first)?
If you want to benefit from a trigram index, you need to use the operator form (%), not the function form. Note that % compares against the threshold set by pg_trgm.similarity_threshold (0.3 by default), not the 0.8 in your query, so adjust that setting if you need the stricter cutoff. Then order on two "columns": the first puts exact matches first, the second puts the most similar matches after them, in order. Use LIMIT to cap the number of rows. I've assumed you have some WHERE condition that restricts this to just one row of table_a. If not, your question is not well formed: to what is the limit supposed to apply? Each what should be limited to just 3?
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
ta.name % tb.name
WHERE ta.id=$1
ORDER BY ta.product_id = tb.product_id desc, similarity(ta.name,tb.name) desc
LIMIT 3
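The ranking logic of that query can be sketched outside the database. The Python below is only an illustration: trigrams and similarity are a simplified stand-in for pg_trgm's similarity() (the real extension pads and splits words differently, so scores will not match exactly), and the sample rows, the 0.3 threshold default and the helper name top_matches are assumptions. It keeps rows that match on product_id or exceed the threshold, sorts exact matches first and then by descending similarity, and keeps 3.

```python
def trigrams(text):
    """Return the set of 3-character substrings of a lowercased, padded string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, in the range 0..1."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def top_matches(row_a, table_b, threshold=0.3, limit=3):
    """Mimic: ON id match OR name % name, ORDER BY exact DESC, similarity DESC, LIMIT."""
    candidates = []
    for row_b in table_b:
        exact = row_a["product_id"] == row_b["product_id"]
        score = similarity(row_a["name"], row_b["name"])
        if exact or score > threshold:
            candidates.append((exact, score, row_b))
    # True sorts above False, so exact product_id matches come first.
    candidates.sort(key=lambda c: (c[0], c[1]), reverse=True)
    return [row_b for _, _, row_b in candidates[:limit]]


# Hypothetical data for one row of table_a and a few rows of table_b.
row_a = {"product_id": 1, "name": "ceramic capacitor 10uF"}
table_b = [
    {"id": 1, "product_id": 9, "name": "ceramic capacitor 10uF"},
    {"id": 2, "product_id": 1, "name": "tantalum cap"},
    {"id": 3, "product_id": 7, "name": "ceramic capacitor 22uF"},
    {"id": 4, "product_id": 8, "name": "ceramic capacitors 10uF"},
    {"id": 5, "product_id": 6, "name": "resistor"},
]

matches = top_matches(row_a, table_b)
print([m["id"] for m in matches])
```

The row that shares the product_id is ranked first even though its name is dissimilar, which is what ordering on the boolean expression achieves in the SQL version.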

SQL statement that returns exactly one row with columns

I'm having trouble creating a query for the following task: I want to return exactly one row with the columns region_id, region_name, province_name, province_code, country_name, country_code for any given regionid. The database has 3 tables: "countrylist", "provinces" and "regionlist".
The table countrylist has the following columns: countryid, language code, countryname, countrycode and continentid
provinces : country_code, country_name, province_code, province_name
regionlist: regionid, regiontype.
So I tried writing a query joining the tables, but I'm not sure if I'm doing it correctly.
exactly one row with columns: region_id, region_name, province_name, province_code, country_name, country_code for any given regionid.
I am not 100% aware of the differences between Postgres and MySQL, but I guess you will get the idea at the very least.
One way to do it is to select your id with WHERE regionlist.regionid = and join the other tables. On the regionlist query you can use LIMIT to restrict the number of rows.
Apparently neither provinces nor countrylist has a column in common with regionlist, so I cannot tell how they are linked. However, once you have 1 row of regionlist you should have no trouble joining it with the others (if the links are trivial).

Difference in partitioned and non-partitioned table in terms of vertical join in q kdb

I have two non-partitioned tables:
q)s:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.05); co:`a`b`f`b`c)
q)t:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.07); co:`a`b`e`b`d)
When I run the query below on the tables above, it works perfectly fine.
q)select distinct co from s,t where date within 2019.07.01 2019.07.02
co
--
a
b
f
e
I also have tables with the same names that are partitioned by date. When I run the same query on the partitioned tables, I get the error below:
ERROR: 'par
(trying to update a physically partitioned table)
Why do we get the above error with partitioned tables?
What is the optimal approach to get output similar to what we got from the non-partitioned tables?
One solution for question 2, which feels brute-force to me, is:
select distinct co from((select distinct co from s where date within 2019.07.01 2019.07.02),select distinct co from t where date within 2019.07.01 2019.07.02)
I'm assuming you are only including the date column in the source tables to assist in queries. A date-partitioned table generates the virtual date column from the hdb structure, so you shouldn't include it in the actual table being written.
Why do we get the above error with partitioned tables?
The data of a partitioned table can only be accessed through an initial select statement. In this case you are applying the , (join) operator directly to the s and t tables.
What is the optimal approach to get output similar to what we got from the non-partitioned tables?
In general, there may be a trade-off between the table size and the nature and frequency of the operations. Sometimes it may be worth bringing the table into memory for frequent joins, or creating a top-level flat table with the relevant subset of data.
If this is just a generalized test case for larger operations, then something along the following lines would be ideal:
distinct raze {select distinct co from x where date within 2019.07.01 2019.07.02} each `s`t
The performance is not very different from your own query, however; it's just a bit more succinct.
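The shape of that approach (filter each table on its own, concatenate the results, then de-duplicate) can be sketched in Python with the sample data from the question. This only illustrates the order of operations; it is not kdb+ code, and the helper name co_in_range is made up.

```python
from datetime import date

# The two sample tables from the question as (date, co) rows.
s = [(date(2019, 7, 1), "a"), (date(2019, 7, 1), "b"), (date(2019, 7, 2), "f"),
     (date(2019, 7, 1), "b"), (date(2019, 7, 5), "c")]
t = [(date(2019, 7, 1), "a"), (date(2019, 7, 1), "b"), (date(2019, 7, 2), "e"),
     (date(2019, 7, 1), "b"), (date(2019, 7, 7), "d")]


def co_in_range(table, start, end):
    """Distinct co values with start <= date <= end, in first-seen order."""
    return list(dict.fromkeys(co for d, co in table if start <= d <= end))


start, end = date(2019, 7, 1), date(2019, 7, 2)

# Select from each table separately, join the partial results, then de-duplicate.
per_table = [co_in_range(tbl, start, end) for tbl in (s, t)]
combined = list(dict.fromkeys(co for part in per_table for co in part))

print(combined)  # ['a', 'b', 'f', 'e']
```

Each table is filtered by date before anything is joined, which is the part that matters for a partitioned table.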

How to optimise tables in Netezza to complement a join with date conditions

I have two tables that I need to join in Netezza, and one of them is very large.
I have a dimension table, a customer table, with two fields, customer id and observation date, i.e.
cust_id, obs_date
'a','2015-01-05'
'b','2016-02-03'
'c','2014-05-21'
'd','2016-01-31'
I have a fact table that is transactional and very high in volume. It has a lot of transactions per customer per date i.e.
cust_id, tran_date, transaction_amt
'a','2015-01-01',1
'a','2015-01-01',2
'a','2015-01-01',5
'a','2015-01-02',7
'a','2015-01-02',2
'b','2016-01-02',12
Both tables are distributed on the same key, cust_id.
However, when I join the tables, I need to join on a date condition. The query is very fast when I just join them together, but when I add the date condition it does not seem optimised. Does anyone have tips on how to set up the underlying tables or write the join?
I.e. sum transaction_amt for each customer for all their transactions for the 3 months up to their obs_date
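SELECT CUSTOMER_TABLE.cust_id, SUM(TRANSACTION_TABLE.transaction_amt)
FROM CUSTOMER_TABLE
INNER JOIN TRANSACTION_TABLE
ON CUSTOMER_TABLE.cust_id = TRANSACTION_TABLE.cust_id
AND TRANSACTION_TABLE.TRAN_DATE BETWEEN CUSTOMER_TABLE.OBS_DATE - 30 AND CUSTOMER_TABLE.OBS_DATE
GROUP BY CUSTOMER_TABLE.cust_id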
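To make the intended result concrete, here is a small Python sketch of the same windowed sum over the sample rows above. It assumes the 30-day window written in the join condition and that the totals are grouped per customer. The variable names are illustrative, and this only shows what the join computes, not how Netezza executes it.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question.
customers = {
    "a": date(2015, 1, 5),
    "b": date(2016, 2, 3),
    "c": date(2014, 5, 21),
    "d": date(2016, 1, 31),
}
transactions = [
    ("a", date(2015, 1, 1), 1),
    ("a", date(2015, 1, 1), 2),
    ("a", date(2015, 1, 1), 5),
    ("a", date(2015, 1, 2), 7),
    ("a", date(2015, 1, 2), 2),
    ("b", date(2016, 1, 2), 12),
]

WINDOW = timedelta(days=30)  # assumption: matches OBS_DATE - 30 in the query

# Bucket transactions by cust_id first, mirroring the shared distribution key.
by_customer = defaultdict(list)
for cust_id, tran_date, amount in transactions:
    by_customer[cust_id].append((tran_date, amount))

# Inner-join semantics: only customers with at least one in-window transaction appear.
totals = {}
for cust_id, obs_date in customers.items():
    in_window = [amt for d, amt in by_customer[cust_id]
                 if obs_date - WINDOW <= d <= obs_date]
    if in_window:
        totals[cust_id] = sum(in_window)

print(totals)  # {'a': 17}
```

Customer b drops out because its only transaction (2016-01-02) falls before the start of the 30-day window (2016-01-04).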
If your transaction table is sufficiently large, it may benefit from using clustered base tables (CBTs).
If you can, create a copy of the table that uses TRAN_DATE to organize (I'm guessing at your ddl here):
create table transaction_table (
cust_id varchar(20)
,tran_date date
,transaction_amt numeric(10,0)
) distribute on (cust_id)
organize on (tran_date);
Join to that and see if performance is improved. You could also use a materialized view for just those columns, but I think a CBT would be more useful here.
As Scott mentions in the comments below, you should either sort by the date on insert or groom the records after to make sure that they are sorted appropriately.

Hive: How to do a SELECT query to output a unique primary key using HiveQL?

I have the following dataset, which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, the DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id. Something along the lines of
SELECT DISTINCT(call_id), stat2,stat3 from intable;
However, this is not valid in Hive (I am not well-versed in SQL either).
The only legal query seems to be
SELECT DISTINCT call_id, stat2,stat3 from intable;
But this returns multiple rows with the same call_id, because the other columns differ and each row as a whole is distinct.
NOTE: There is no arithmetic relation between a,b,c,x,y,z, etc., so any trick of averaging or summing is not viable.
Any ideas how I can do this?
One quick idea, not the best one, but it will do the job:
hive>create table temp1(a int,b string);
hive>insert overwrite table temp1
select call_id,max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
hive>insert overwrite table intable
select a,split(b,'\\|')[0],split(b,'\\|')[1],split(b,'\\|')[2] from temp1;
"I want to apply the DISTINCT operation only to the call_id"
But how will Hive then know which row to eliminate?
Without knowing the amount of data or the size of the stat fields you have, the following query can do the job:
select distinct i1.call_id, i1.stat2, i1.stat3 from (
select call_id, MIN(concat(stat1, stat2, stat3)) as smin
from intable group by call_id
) i2 join intable i1 on i1.call_id = i2.call_id
AND concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;
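Both answers rely on the same idea: pick one deterministic representative row per call_id (here, the smallest by its stat values) and discard the rest. A minimal Python sketch of that idea on the sample input follows. It compares the stat fields as a tuple rather than as a concatenated string, a substitution made here so that values such as ('ab','c') and ('a','bc') cannot collide. The variable names are illustrative.

```python
# Sample input rows from the question: (call_id, stat1, stat2, stat3)
rows = [
    (1, "a", "b", "c"),
    (2, "x", "y", "z"),
    (3, "d", "e", "f"),
    (1, "j", "k", "l"),
]

# Keep, per call_id, the row whose (stat1, stat2, stat3) tuple is smallest.
chosen = {}
for call_id, *stats in rows:
    stats = tuple(stats)
    if call_id not in chosen or stats < chosen[call_id]:
        chosen[call_id] = stats

# Project to the requested output schema: call_id, stat2, stat3.
output = [(call_id, stats[1], stats[2]) for call_id, stats in sorted(chosen.items())]

print(output)  # [(1, 'b', 'c'), (2, 'y', 'z'), (3, 'e', 'f')]
```

Choosing the minimum is arbitrary; any rule works as long as it selects exactly one row per call_id.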