Can I use a SAS hash merge to merge three datasets by two keys?
For example, the lookup dataset DATA1 has columns A and B. I would like to merge DATA1 and DATA2 by key A and merge DATA1 and DATA3 by key B. Can I do that in a one-step hash merge?
In a hash merge, you define the keys and relationships for each table independently: each lookup table gets its own hash object with its own key definition, so one can be keyed on A and another on B. You're free to use whatever key relationships you like, and you can certainly use more than one set of keys in a single step.
What are the benefits of keeping the lookup table keyed for a left join in kdb? Does it provide performance or memory benefits, and how?
I have the tables below:
t:([]sym:6?`GOOG`AMZN`IBM`AAPL; px:6?10.) /- source table
kt:([sym:`IBM`AAPL`GOOG`AMZN]; vol:4?10000) /- lookup table
t lj kt
Are the same benefits applicable to an asof join as well? I have read in Q for Mortals that "There is no requirement for any of the join columns to be keys but the join will be faster on keys."
A keyed table is a dictionary that maps a table of key records to a table of value records. Joining with a keyed table is therefore a dictionary lookup, which is inherently fast in kdb: the matching columns are simply used to index into the second table.
In a Redshift database, I want to choose a sort key for a dimension table: the surrogate key or the natural primary key. The guidance says "Sort keys should be selected based on the most commonly used columns when filtering, ordering or grouping the data".
My question is this:
I have an Employee table with (Emp_key, Emp_id, Emp_name), and it is joined to the fact table on Emp_key. Here "Emp_key" is the surrogate key and "Emp_id" is the natural primary key. I filter queries on Emp_id, but "Emp_key" in the fact table is defined as the dist key, and I have read that for a large dimension, defining the sort and dist keys on the join keys gives better performance. So which one should I choose between Emp_key and Emp_id as the sort key for the dimension table?
Also, another point of confusion is the sort key for the "date" dimension table: should it be "date_key", or should I not define a sort key at all?
I would appreciate your suggestions in this regard.
Thank you!
Your employee table likely doesn't contain many rows, so you can choose the ALL distribution style, which puts a copy of the table on every node of your cluster. This way you'll avoid the dilemma at a very low cost.
UPD: with this design, I would have emp_key as the dist key (so that the rows being joined sit on the same nodes) and emp_id as the sort key (to filter efficiently). I'm pretty sure the query planner would prioritize filtering over joining, so it would first filter the rows from the dimension table and only then join the corresponding rows from the fact table. But it's better to try all the options and benchmark a few queries to see what works best.
If you can change the design, I would just add emp_id to the fact table as part of the ELT (since the keys seem to map one to one) and avoid the dilemma altogether.
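A minimal DDL sketch of the two options, with assumed column types (the table and column names follow the question); create only one of the two variants:

-- Option 1: small dimension, replicated to every node.
CREATE TABLE employee (
    emp_key  INT,
    emp_id   INT,
    emp_name VARCHAR(100)
)
DISTSTYLE ALL
SORTKEY (emp_id);

-- Option 2: keep KEY distribution for the join, sort on the filter column.
CREATE TABLE employee (
    emp_key  INT,
    emp_id   INT,
    emp_name VARCHAR(100)
)
DISTKEY (emp_key)
SORTKEY (emp_id);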
I have 12 different tables with primary keys. I want to merge the values corresponding to those primary keys from all 12 tables and remove the duplicates, since some primary keys repeat across tables.
In the end I want to append all 12 tables into one.
I am working in SQL (Microsoft SQL Server).
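A minimal sketch of one way to express this, with hypothetical table and column names and assuming all 12 tables share the same layout; UNION (unlike UNION ALL) drops rows that are exact duplicates:

-- Hypothetical names; assumes Table01 through Table12 have identical columns.
SELECT pk_id, col1, col2
INTO dbo.Combined
FROM dbo.Table01
UNION
SELECT pk_id, col1, col2 FROM dbo.Table02
-- ... UNION the remaining tables here ...
UNION
SELECT pk_id, col1, col2 FROM dbo.Table12;

If the same primary key can appear with different values in different tables, UNION alone won't collapse those rows; a ROW_NUMBER() OVER (PARTITION BY pk_id ...) pass over the combined result would be needed to keep one row per key.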
Let's say we have a table table1 with field1 INT ENCODE ZSTD, and we added an interleaved sort key on field1.
But when I run the query select * from table1 where field1=123;, I still see a sequential scan over the whole table, whereas I would have expected a scan restricted to a subset of the table.
Do I have some misunderstanding about sort key?
1) From what you describe, you don't need an interleaved sort key, because there's just one column you're interested in. You need an interleaved sort key when you want multiple columns to be equally important, i.e. when you run both where col1=123 and where col2=123 kinds of queries. This provides a benefit for large tables.
2) Compressing your sort key column is considered bad practice. Proof from Amazon: the first column in a compound sort key should not be encoded (a single-column key is the same case). The entire article is actually useful; read it and you won't regret it.
3) Once the sort key is configured and the data is populated, it's better to run the vacuum and analyze commands to make sure the rows are sorted according to the sort key and the table statistics are updated.
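A minimal sketch of the compound alternative along with the maintenance commands; the second column is hypothetical, added only to show the encoding contrast:

-- Single-column compound sort key; the sort key column stays uncompressed.
CREATE TABLE table1 (
    field1  INT ENCODE RAW,
    payload VARCHAR(50) ENCODE ZSTD   -- hypothetical extra column
)
COMPOUND SORTKEY (field1);

-- After loading data: re-sort the rows and refresh table statistics.
VACUUM table1;
ANALYZE table1;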
I'm looking at the performance of some queries that I'm doing in Redshift and noticed something that I can't quite find in the documentation.
I created two tables that have a join key between them (about 10K rows in the child table).
For the parent table, let's call it A, I have a primary key that I've declared to be the distkey and sort key for the table. Let's call this id.
For the child table B, I've made a foreign key field, parent_id, that references A.id. parent_id has been declared as the distkey for table B. Table B also has a primary key, id, that I've defined. I've created an interleaved sort key on table B on (parent_id, id).
When I try to do an explain joining the two tables, I will always get a Hash Join. If I recreate table B with a normal compound sort key, I will always get a Merge Join.
When I look at the stats of the tables, I don't see any skews that are out of line.
My question is, will Redshift always use Hash Joins with interleaved sort keys or is there something I'm doing wrong?
EDIT - The order of the interleaved sort keys in Table B is actually (parent_id, id). I wrote it above incorrectly. I've updated the above to be clear now.
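A DDL sketch of the setup as described, with assumed column types (primary and foreign key constraints are informational only in Redshift):

-- Parent table A: distributed and sorted on its primary key.
CREATE TABLE a (
    id INT PRIMARY KEY
)
DISTKEY (id)
SORTKEY (id);

-- Child table B: distributed on the foreign key, interleaved sort key on (parent_id, id).
CREATE TABLE b (
    id        INT PRIMARY KEY,
    parent_id INT REFERENCES a (id)
)
DISTKEY (parent_id)
INTERLEAVED SORTKEY (parent_id, id);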
From my understanding:
A merge join can be used when both tables are sorted on the join column, which is very efficient -- a bit like closing a zipper, where both sides "fit into" each other.
A hash join is less efficient because it needs to do lookups via hashes to find matching values.
As you pointed out, if the tables are sorted using a normal compound key, then both tables are sorted by the join column.
With an interleaved sort key, however, values are not guaranteed to be fully sorted within each column.
The documentation for Interleaved Keys says:
An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.
However, it does not mean that every column is fully sorted (the way the sort columns are with a compound sort). Rather, it gives a generally good mix of sorting, so that filters on any of the sort key columns work reasonably well. Because each column is therefore not necessarily fully sorted, a hash join is needed.
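A minimal sketch of the compound-sort variant that produced the merge join, using a hypothetical table name b_compound so it can sit alongside the original B:

-- Same distribution as B, but a compound sort key leading on the join column.
CREATE TABLE b_compound (
    id        INT PRIMARY KEY,
    parent_id INT REFERENCES a (id)
)
DISTKEY (parent_id)
COMPOUND SORTKEY (parent_id, id);

-- Inspect which join strategy the planner chooses.
EXPLAIN
SELECT *
FROM a
JOIN b_compound
  ON b_compound.parent_id = a.id;

Note that the planner only picks a merge join when both tables are distributed and physically sorted on the join columns, so keeping the tables vacuumed matters here as well.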
The blog post Quickly Filter Data in Amazon Redshift Using Interleaved Sorting tries to explain how the data is stored when using interleaved sorting.