I am currently working on updating a table based on whether its rows exist in another table:
Ex:
Dataset A (relatively small, ~300k rows): DepartmentId, EmployeeId, Salary, Error
Dataset B (relatively huge, millions of rows): DepartmentId, EmployeeId, Salary
The logic is:
1. If A's (DepartmentId, EmployeeId) pair exists in B, then update A's salary with B's salary
2. Otherwise, write a message to A's error field
The solution I have now does a left outer join of A with B. Are there any better practices for this type of problem?
Thank you in advance!
For better performance, you can use a broadcast hash join, as mentioned here by @Ram Ghadiyaram.
The broadcast DataFrame is replicated to every executor, which avoids shuffling the other side and speeds up the join.
DataFrame join optimization - Broadcast Hash Join
Hope this helps!
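A minimal sketch of that approach (dfA, dfB, and the error message are placeholder names, not from the original post; note that for a left outer join Spark can only broadcast the right side, so this pays off only if the trimmed-down B fits in executor memory):

import org.apache.spark.sql.functions.{broadcast, coalesce, col, lit, when}

// Keep only the lookup columns of B so the broadcast payload stays small.
val bSlim = dfB.select(col("DepartmentId"), col("EmployeeId"),
                       col("Salary").as("b_salary"))

// The left outer join preserves every row of A; broadcast() is only a hint,
// and Spark falls back to a shuffle join if bSlim exceeds the threshold.
val result = dfA
  .join(broadcast(bSlim), Seq("DepartmentId", "EmployeeId"), "left_outer")
  .withColumn("Salary", coalesce(col("b_salary"), col("Salary")))
  .withColumn("Error", when(col("b_salary").isNull, lit("no match in B")))
  .drop("b_salary")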
My DB is built from several tables that are similar to each other and share the same column names. The reason is to compare data from each source.
table_A and table_B: id, product_id, capacitor_name, resistance
It is easy to join the tables by product_id and see the comparison,
but I need to compare rows by product_id when it exists in both tables; when it doesn't, I want to compare by name similarity, and in that case restrict the result to at most 3 matches.
The names are usually not identical, which is why I'm using a similarity.
SELECT *
FROM table_a ta
JOIN table_b tb
  ON ta.product_id = tb.product_id
  OR similarity(ta.capacitor_name, tb.capacitor_name) > 0.8
It works fine, but the problem is that sometimes I get more data than I need. How can I restrict it (and, moreover, order it by similarity so the most similar names come first)?
If you want to benefit from a trigram index, you need to use the operator form (%), not the function form. Then you would order on two "columns": the first puts exact matches first, the second puts the most similar matches after them, in order. And use LIMIT to apply the limit. I've assumed you have some WHERE condition to restrict this to just one row of table_a. If not, then your question is not very well formed: to what is this limit supposed to apply? Each what should be limited to just 3?
SELECT *
FROM table_a ta
JOIN table_b tb
  ON ta.product_id = tb.product_id
  OR ta.capacitor_name % tb.capacitor_name
WHERE ta.id = $1
ORDER BY ta.product_id = tb.product_id DESC,
         similarity(ta.capacitor_name, tb.capacitor_name) DESC
LIMIT 3
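For % to be fast, table_b needs a trigram index on the compared column. A minimal setup sketch (assumes the pg_trgm extension is available; also note that % uses the pg_trgm.similarity_threshold setting, 0.3 by default, rather than your 0.8):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index lets the % operator use an index scan:
CREATE INDEX table_b_name_trgm_idx
    ON table_b USING gin (capacitor_name gin_trgm_ops);

-- Optionally align the operator's threshold with the 0.8 used above:
SET pg_trgm.similarity_threshold = 0.8;

And if the intent was instead "up to 3 matches for every row of table_a", a LATERAL join applies the LIMIT per row (a sketch, not part of the original answer):

SELECT ta.*, tb.*
FROM table_a ta
CROSS JOIN LATERAL (
    SELECT *
    FROM table_b b
    WHERE ta.product_id = b.product_id
       OR ta.capacitor_name % b.capacitor_name
    ORDER BY ta.product_id = b.product_id DESC,
             similarity(ta.capacitor_name, b.capacitor_name) DESC
    LIMIT 3
) tb;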
Logically, the cardinality of columns should matter in a GROUP BY operation. When we write Hive queries involving GROUP BY, since we are familiar with the data being queried, we have an idea of the cardinality of the individual columns involved in the GROUP BY. But Hive has no idea about this. So let's say the Hive query in question is:
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col1,Col2,Col3,Col4,Col5
I know the cardinality of all 5 columns here. But Hive doesn't, so it will probably plan for the worst case.
So let's say the cardinality information I have about these columns is as follows, from lowest to highest, with examples of the values contained:
Col5 = it contains country name
Col4 = it contains state name
Col3 = it contains city name
Col2 = it contains postal code
Col1 = it contains email address
Now Hive will treat all of these the same. Wouldn't it be beneficial if Hive knew the underlying cardinality information, so it could exploit it when computing the unique groups? In that case, if I explicitly arrange the columns in the GROUP BY clause in order of cardinality, will it be more efficient, as in the following example?
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col5,Col4,Col3,Col2,Col1
Or will Hive ignore this order and treat all the columns equally, regardless of the order?
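One way to find out for your own Hive version, rather than guessing, is to compare the plans for both orderings (EXPLAIN output varies across Hive versions; if the two plans are identical, the order is being ignored):

EXPLAIN
SELECT Col1, Col2, Col3, Col4, Col5, COUNT(*)
FROM MyTable
GROUP BY Col1, Col2, Col3, Col4, Col5;

EXPLAIN
SELECT Col1, Col2, Col3, Col4, Col5, COUNT(*)
FROM MyTable
GROUP BY Col5, Col4, Col3, Col2, Col1;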
My table contains 1 billion records and is partitioned by month. (Id, datetime) is the primary key for the table. When I run:
select col1, col2, ..., col8
from mytable t
inner join cte on t.Id = cte.id
    and dtime > '2020-01-01' and dtime < '2020-10-01'
it uses an index scan, but still takes more than 5 minutes.
Please suggest what I can do to speed it up.
Note: I have set work_mem to 1GB. The cte results come back within 3 seconds.
Well, that's the nature of a join; it is usually known as a time-consuming operation.
First of all, I recommend using IN rather than JOIN. Of course they have different meanings, but in some cases you can technically use them interchangeably. Check this question out.
Secondly, in relational-algebra terms, whenever you use a join, each row of mytable is combined with each row of the second table; the DBMS conceptually builds a huge intermediate result and finally discards the unsuitable rows. All of these steps take time. Before joining, it's better to filter your tables (for example, mytable by date) to make them smaller, and only then join.
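A minimal sketch combining both suggestions (assuming cte is defined as in the question; the column list is abbreviated as in the original):

-- Shrink mytable with the date filter first, then apply the IN test,
-- so far fewer rows take part in the match:
SELECT t.col1, t.col2, ..., t.col8
FROM (
    SELECT *
    FROM mytable
    WHERE dtime > '2020-01-01'
      AND dtime < '2020-10-01'
) t
WHERE t.Id IN (SELECT id FROM cte);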
I created a bucketed table using the command below in Spark:
df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test")
Size of table: 50 GB
Then I joined another table (say t2, size: 70 GB, bucketed the same way) with the above table on the UserID column. I found in the execution plan that topn_bucket_test was being sorted (but not shuffled) before the join, and I expected it to be neither shuffled nor sorted before the join, since it was bucketed and sorted on the join key. What can be the reason? And how can I remove the sort phase for topn_bucket_test?
As far as I am aware, it is not possible to avoid the sort phase here. Even when using the same bucketBy call, it is unlikely that the physical layout will be identical in both tables. Imagine the first table having UserIDs ranging from 1 to 1000 and the second from 1 to 2000: different sets of UserIDs end up in the 200 buckets, and each bucket may be made up of several files, each sorted on its own, so the rows of a bucket as a whole are not in sorted order. Spark therefore still has to sort each side before the sort-merge join.
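That said, one thing worth trying (a sketch, not a guarantee: whether the sort is actually skipped depends on the Spark version and its bucketed-scan settings) is to make each bucket consist of a single, fully sorted file by repartitioning on the bucket column before writing:

import org.apache.spark.sql.functions.col

// One shuffle partition per bucket, so each bucket is written as one file
// whose rows are globally sorted on UserID:
df.repartition(200, col("UserID"))
  .write
  .bucketBy(200, "UserID")
  .sortBy("UserID")
  .saveAsTable("topn_bucket_test")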
I have two tables that I need to join in Netezza, and one of them is very large.
I have a dimension table, a customer table, which has two fields, customer id and an observation date, i.e.:
cust_id, obs_date
'a','2015-01-05'
'b','2016-02-03'
'c','2014-05-21'
'd','2016-01-31'
I have a fact table that is transactional and very high in volume. It has a lot of transactions per customer per date i.e.
cust_id, tran_date, transaction_amt
'a','2015-01-01',1
'a','2015-01-01',2
'a','2015-01-01',5
'a','2015-01-02',7
'a','2015-01-02',2
'b','2016-01-02',12
Both tables are distributed by the same key - cust_id
However, when I join the tables, I need to apply the date condition. The query is very fast when I just join them on cust_id, but when I add the date condition it does not seem optimized. Does anyone have tips on how to set up the underlying tables or write the join?
I.e., sum transaction_amt for each customer over all their transactions in the 3 months up to their obs_date:
SELECT CUSTOMER_TABLE.cust_id,
       SUM(TRANSACTION_TABLE.transaction_amt) AS total_amt
FROM CUSTOMER_TABLE
INNER JOIN TRANSACTION_TABLE
    ON CUSTOMER_TABLE.cust_id = TRANSACTION_TABLE.cust_id
   AND TRANSACTION_TABLE.TRAN_DATE BETWEEN CUSTOMER_TABLE.OBS_DATE - 30
                                       AND CUSTOMER_TABLE.OBS_DATE
GROUP BY CUSTOMER_TABLE.cust_id
If your transaction table is sufficiently large, it may benefit from using clustered base tables (CBTs).
If you can, create a copy of the table that uses TRAN_DATE to organize on (I'm guessing at your DDL here):
create table transaction_table (
    cust_id varchar(20)
    ,tran_date date
    ,transaction_amt numeric(10,0)
) distribute on (cust_id)
organize on (tran_date);
Join to that and see if performance is improved. You could also use a materialized view for just those columns, but I think a CBT would be more useful here.
As Scott mentions in the comments below, you should either sort by the date on insert or groom the records afterwards to make sure that they are sorted appropriately.
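For reference, a sketch of that groom step (GROOM TABLE is Netezza's command for reorganizing a table; on a CBT it re-sorts rows on the ORGANIZE ON key, so run it after bulk loads):

-- Reorganize the clustered base table on its organizing key:
GROOM TABLE transaction_table RECORDS ALL;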