Hive, bucket map join doesn't work, but sort merge bucket join works fine. Why?

I created 2 tables using the SQL below and tested the bucket map join; it fails (the job has 3 map tasks and 1 reduce task).
create table tmp(name string, id int)
clustered by(name) into 4 buckets
stored as textfile;
Then I added "sorted by" to the above SQL and tested the sort merge bucket join; it works (the job has 3 map tasks and 0 reduce tasks).
create table tmp(name string, id int)
clustered by(name) sorted by(name ASC) into 4 buckets
stored as textfile;
Does that mean "sorted by" is a must for a bucketed join?
settings:
set hive.execution.engine=mr;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.auto.convert.join=true;
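For reference, my test join has this shape (tmp2 here stands in for the second table, created with the same DDL as tmp):
create table tmp2(name string, id int)
clustered by(name) sorted by(name ASC) into 4 buckets
stored as textfile;

-- with both tables bucketed and sorted on name, this join can run as a
-- sort merge bucket map join: map-only, no reduce phase
explain
select a.id, b.id
from tmp a join tmp2 b on a.name = b.name;
In the EXPLAIN output I look for "Sorted Merge Bucket Map Join Operator"; a plain Join Operator plus a reduce stage means the conversion did not happen.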

Related

Using MERGE command in Upsolver

I would like to use the Upsolver MERGE command in my new transformations to populate S3/Athena and Snowflake tables. Since Snowflake supports an upsert command, when defining my transformation job, should I rely on the Snowflake functionality and use the Upsolver INSERT statement, or define an Upsolver MERGE transformation the same way I do for Athena, i.e.
CREATE JOB my_job_upsert
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = TRUE
RUN_INTERVAL = 1 MINUTE
AS MERGE INTO default_glue_catalog.upsolver_samples.test_upsert_with_merge AS target
/*
Use the SELECT statement below to choose your columns and perform the desired transformations.
In this example, we aggregate the sample orders data by customer and filter it to only include repeat purchasers.
*/
USING (SELECT field1 AS email,
COUNT(DISTINCT field2) AS count,
MIN(field3) AS min_number,
MAX(date) AS last_date
FROM default_glue_catalog.upsolver_samples.test_raw_data
WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
GROUP BY 1
HAVING COUNT(DISTINCT field2) > 1) source
ON (target.email = source.email)--primary key
WHEN MATCHED THEN REPLACE -- Update if primary keys match
WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME; -- Insert if primary key is unique (new record)
It would be nice to know, in general, whether the MERGE command syntax is consistent across various target platforms.
I already built the Athena transformation and it works as expected.
You can use the same approach you did for Athena. The Upsolver INSERT command will insert new keys (append), and if the table has a primary key defined, INSERT will update the existing keys (upsert) as its default behavior.
MERGE is by definition for upserts and can handle deletes as well. The syntax is consistent across all database/data warehouse/catalog targets.
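For what it's worth, a Snowflake-targeted job can mirror the Athena one almost line for line; in the sketch below only the target changes, and my_snowflake_connection.my_schema is a placeholder for your own Snowflake connection and schema:
CREATE JOB my_job_upsert_snowflake
START_FROM = BEGINNING
RUN_INTERVAL = 1 MINUTE
AS MERGE INTO my_snowflake_connection.my_schema.test_upsert_with_merge AS target
USING (SELECT field1 AS email,
COUNT(DISTINCT field2) AS count,
MIN(field3) AS min_number,
MAX(date) AS last_date
FROM default_glue_catalog.upsolver_samples.test_raw_data
WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
GROUP BY 1
HAVING COUNT(DISTINCT field2) > 1) source
ON (target.email = source.email) -- primary key
WHEN MATCHED THEN REPLACE -- update if primary keys match
WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME; -- insert new records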

How to remove sort phase in spark dataframe join?

I created a bucketed table using the command below in Spark:
df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test")
Size of table: 50 GB
Then I joined another table (say t2, size: 70 GB, bucketed the same way) with the above table on the UserID column. I found that in the execution plan topn_bucket_test was being sorted (but not shuffled) before the join, and I expected it to be neither shuffled nor sorted, since it was bucketed. What can be the reason? And how do I remove the sort phase for topn_bucket_test?
As far as I know, it is not possible to avoid the sort phase here. Even when both tables use the same bucketBy call, the physical layout of the buckets is unlikely to be identical. Imagine the first table having UserID values ranging from 1 to 1000 and the second from 1 to 2000: different UserIDs end up in each of the 200 buckets, and Spark may also write several files per bucket (one per writing task), so while each file is sorted internally, the bucket as a whole is not. Spark therefore still has to sort each side before the sort merge join.
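To see where the sort comes from, inspect the physical plan; a minimal Spark SQL sketch against the question's tables (t2 stands in for the second table's real name):
-- look for Exchange (shuffle) and Sort nodes above each table scan;
-- with compatible bucketing the Exchange disappears, but a Sort usually
-- remains because a bucket can span several files, each sorted only internally
EXPLAIN
SELECT *
FROM topn_bucket_test a
JOIN t2 b
ON a.UserID = b.UserID;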

SQL Server: Optimizer intelligence with multiple indexes on the same table in a UNION ALL query

I'm trying to write a query for a rather large table (10 million+ rows would be a typical size), the result of which needs to be filtered on various predicates/conditions based on some business logic. My question: does the query optimizer (in SQL Server 2008+) attempt to use a single index for the whole query, or does it attempt to use different indexes on a per-subquery basis?
Consider the following:
--Use Index A
SELECT Set1
FROM ATable
WHERE AColumn = sarg-able value
UNION ALL
--Are we stuck with Index A?
SELECT Set2
FROM ATable
WHERE BColumn = sarg-able value
If we choose Index A for Set1, are we stuck with Index A for the entire query, or is the optimizer smart enough to use a different index for Set2 (assuming one exists)?
Everything @andreyNikolov said is 100% correct. This is the kind of thing you can easily figure out on your own by reviewing the Actual Execution Plan (not the Estimated Execution Plan). Note the following sample data, table, and index structure:
USE tempdb -- safe place in Dev to test this kind of thing...
GO
-- sample data and indexes
IF OBJECT_ID('dbo.ATable','U') IS NOT NULL DROP TABLE dbo.ATable
CREATE TABLE dbo.ATable
(
Set1 INT NOT NULL,
Set2 INT NOT NULL,
AColumn INT NOT NULL,
BColumn INT NOT NULL
);
INSERT dbo.ATable (Set1, Set2, AColumn, BColumn)
VALUES (1,2,3,3),(1,2,4,4),(5,5,6,6),(11,22,40,40),(11,20,40,44),(11,22,14,4),(1,2,3,3);
CREATE NONCLUSTERED INDEX indexA ON dbo.ATable(AColumn) INCLUDE(Set1);
CREATE NONCLUSTERED INDEX indexB ON dbo.ATable(BColumn) INCLUDE(Set2);
Now run the following with "Include Actual Execution Plan" turned on.
SELECT Set1 --Use Index A
FROM dbo.ATable
WHERE AColumn = 3
UNION ALL
SELECT Set2 --Use Index B
FROM dbo.ATable
WHERE BColumn = 4;
... and the execution plan shows the following:
The query above the UNION ALL performs a nonclustered seek against indexA's key column (AColumn). Because Set1 is an INCLUDE column on indexA, the index can satisfy the query without a RID or Key Lookup against the underlying table. This is how indexes should be designed. The same is true of the query below the UNION ALL, except that it uses indexB.
Again, this is the kind of thing that is easy to figure out on your own once you have a full understanding of how to read the execution plans.
Yes, the optimizer is smart enough. These are two separate operations, each of which can be performed as a table/index scan or a seek. The decision is made independently for each of them, and it is perfectly normal for different indexes to be used. The results of both operations are then combined.
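If you want to see the per-branch choice explicitly, you can pin each branch to an index with a table hint and compare the plans; a quick sketch reusing the sample objects from the other answer (hints here are for experimentation, not production code):
-- force each branch onto its own index, then compare against the unhinted plan
SELECT Set1
FROM dbo.ATable WITH (INDEX(indexA))
WHERE AColumn = 3
UNION ALL
SELECT Set2
FROM dbo.ATable WITH (INDEX(indexB))
WHERE BColumn = 4;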

Unable to optimise Redshift query

I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
Creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY (customer_id)
Everything I have read suggests this would be the optimal way to structure my tables/queries, but the performance is absolutely awful. Building the subtables takes over a minute even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?
If you do not have a better key you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
To see this in action look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of your query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
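If the report shows most of the bytes landing on a single slice, the DISTSTYLE EVEN change suggested above would look something like this (a sketch; the table name and column list are stand-ins for your real schema):
-- rows are distributed round-robin across slices instead of by customer_id,
-- so a single-customer query can use all slices; the sort key stays the same
CREATE TABLE events_even (
customer_id VARCHAR(64),
event_time TIMESTAMP
-- ...remaining columns from the source table...
)
DISTSTYLE EVEN
SORTKEY (customer_id, event_time);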
For a very detailed guide on designing the best table structure have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook"

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo
As far as I understand, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync: writing to foo_rows for every write to foo (and a performance penalty).
So is there any way to query for distinct row (partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct, since it shows one entry for every combination of row and column. To narrow the result to a single row value, you can filter on a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning, though. If you only have a small number of rows, say 10,000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read: full table scan) and then filter on the unique values for the result set. In some cases, this query will simply time out. Given that, it's probably not what you were looking for.
You mentioned that you were worried about a performance hit from inserting into multiple tables. Multiple-table inserts are a perfectly valid data modeling technique, and Cassandra can handle an enormous volume of writes. As for keeping them in sync being a pain, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think about your partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application's queries. If I were using IP addresses as the row key, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
According to the documentation, as of CQL version 3.1.1 Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
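Given the sample rows inserted earlier in this thread, this returns each partition key exactly once:
 row
-----
   1
   2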
@edofic
Partition (row) keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+----------+-------------------+------------------+
| row key  |   column1/value   |  column2/value   |
+----------+-------------------+------------------+
|    1     |     1/'1-1'       |     2/'1-2'      |
|    2     |     1/'2-1'       |                  |
+----------+-------------------+------------------+