Can the Cassandra SELECT DISTINCT operation be used to find all the unique values of a column if that column has an index on it?
My question is not the same as simply asking how to find the distinct values of a non-primary-key column. I realize that Cassandra does not allow queries that would require a table scan, because they would be inefficient; here the presence of an index eliminates the need for a table scan.
If I have a table thus:
CREATE TABLE thing (
id uuid,
version bigint,
name text,
... data columns ...
PRIMARY KEY ((id),version)
);
CREATE INDEX ON thing(name);
I can SELECT DISTINCT id FROM thing; to get all the thing IDs. That requires one response from each node in my cluster, with each response returning the keys for its node.
But can I SELECT DISTINCT name FROM thing; to get all the thing names? That should also require only one response from each node in my cluster, with each response constructed only by examining the portion of the index on its node. And if name is a good column on which to have an index, each response would be smaller than the response for the primary-key query (there should be fewer names than partition keys).
At least to me, the documentation suggests that I should be able to select the distinct values of any column:
DISTINCT selection_list
selection_list is one of:
A list of partition keys (used with DISTINCT)
selector AS alias, selector AS alias, ...| *
Where selector is a column name. The documentation places no restriction on what the column name may be.
As a matter of fact, you can only use DISTINCT with partition key columns (C* 2.2.4). Using it on anything else will yield an error:
cqlsh:stresscql> SELECT distinct name FROM thing ;
InvalidRequest: code=2200 [Invalid query] message="SELECT DISTINCT queries must only request partition key columns and/or static columns (not name)"
I don't have any in-depth understanding of the workings of secondary indexes, but I also have the feeling that allowing a DISTINCT query on an indexed column should not be worse, in terms of reads incurred, than querying the index for a particular value.
But as indexed values repeat across nodes, it would be worse in terms of memory and network overhead relative to the result size, since the coordinator would have to condense the nodes' responses down to only the unique values.
Though, for replication factors > 1 this is also the case for partition key values.
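For completeness, here is a minimal sketch of what is allowed against the thing table above: DISTINCT works on the partition key, and the secondary index can only be used for equality lookups in the WHERE clause.

-- allowed: DISTINCT on the partition key
SELECT DISTINCT id FROM thing;

-- allowed: equality lookup through the secondary index on name
SELECT id, version FROM thing WHERE name = 'some-name';

-- not allowed, as shown above:
-- SELECT DISTINCT name FROM thing;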
Related
The order that the records come in is not always guaranteed unless I use an order by clause.
If I throw a clustered index on a table and then do a select top 100, for example, would the 100 rows returned always be the same?
I am asking this because a clustered index sorts the data physically on the key value.
I am led to believe so from my observations, but wanted to see what others thought.
No. The rule is simple: SQL tables and result sets represent unordered sets. The only exception is a result set associated with a query that has an ORDER BY in the outermost SELECT.
A clustered index affects how data is stored on each page. However, it does not guarantee that a result set built on that table will even use the clustered index.
Consider a table that has a clustered primary key on id and a query that returns:
select top (100) othercol
from t;
This query could use an index on othercol -- avoiding the clustered index altogether.
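If you need the same 100 rows every time, the only reliable fix is an explicit ORDER BY in the outermost SELECT. A sketch, assuming the table t above:

select top (100) othercol
from t
order by othercol;  -- deterministic only if othercol is unique; add a tiebreaker column otherwise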
In a Redshift database, I want to choose the sort key for a dimension table between the surrogate key and the natural primary key. The documentation says "Sort keys should be selected based on the most commonly used columns when filtering, ordering or grouping the data".
My Question is -
I have an Employee table with (Emp_key, Emp_id, Emp_name), and this table is joined to the fact table on Emp_key. Here "Emp_key" is the surrogate key and "Emp_id" is the natural primary key. I filter queries on Emp_id, but "Emp_key" in the fact table is defined as the dist key, and I have read that for a large dimension, defining the sort and dist keys on the join keys results in better performance. So which one should I choose between Emp_key and Emp_id as the sort key in the dimension table?
Another point of confusion is choosing the sort key for the "date" dimension table: should it be "date_key", or should I not define a sort key at all?
I would appreciate your suggestions in this regard.
Thank you!
Your employee table likely doesn't contain too many rows, so you can choose the ALL distribution style, which puts a copy of the table on every node of your cluster. This way you avoid the dilemma at a very low cost.
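A minimal sketch of that option (the column types here are assumptions):

CREATE TABLE employee (
    emp_key  INT,
    emp_id   VARCHAR(20),
    emp_name VARCHAR(100)
)
DISTSTYLE ALL        -- full copy on every node, so no dist-key dilemma
SORTKEY (emp_id);    -- still useful for filtering on emp_id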
UPD: with this design, I would have emp_key as the dist key (so that the rows being joined sit on the same nodes) and emp_id as the sort key (to filter efficiently). I'm pretty sure the query planner would prioritize filtering over joining, so it will first filter the rows from the dimension table and only then join the corresponding rows from the fact table. But it's better to try all options and benchmark a few queries to see what works best.
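A sketch of that key-distributed variant (again, column types are assumptions):

CREATE TABLE employee (
    emp_key  INT,
    emp_id   VARCHAR(20),
    emp_name VARCHAR(100)
)
DISTKEY (emp_key)    -- co-locate rows with the fact table's emp_key
SORTKEY (emp_id);    -- the column you filter on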
If you can change the design, I would just add emp_id to the fact table as part of the ELT process (it seems like the keys map 1 to 1) and avoid the dilemma altogether.
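A sketch of that backfill, with hypothetical table names:

-- add the natural key to the fact table and backfill it from the dimension
ALTER TABLE fact ADD COLUMN emp_id VARCHAR(20);
UPDATE fact
SET emp_id = e.emp_id
FROM employee e
WHERE fact.emp_key = e.emp_key;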
I have a PostgreSQL table with two unique indexes on different fields (nickname and email). I'm trying to insert a record that violates both constraints, and I want nickname to be checked and reported first.
I observed that the index that was created first is checked first, so I create the index on nickname first, and that kind of works. But is that behavior specified, and can I rely on it, or does it work only by chance?
PostgreSQL uses RelationGetIndexList to get the list of indexes, and the comment says:
/*
* RelationGetIndexList -- get a list of OIDs of indexes on this relation
[...]
*
* The returned list is guaranteed to be sorted in order by OID. This is
* needed by the executor, since for index types that we obtain exclusive
* locks on when updating the index, all backends must lock the indexes in
* the same order or we will get deadlocks (see ExecOpenIndices()). Any
* consistent ordering would do, but ordering by OID is easy.
[...]
*/
OIDs (object identifiers) are 4-byte unsigned integers allocated from an ascending counter, so normally later objects will get higher OIDs. This corresponds to what you observe.
However, once the range of OIDs is exhausted, they wrap around and start again at FirstNormalObjectId (16384), so there is no guarantee that the index that was created first has the lower OID.
You could use a query like:
SELECT 'indexname'::regclass::oid;
to find the OID of each index.
To get the OIDs of all indexes on a table, use
SELECT indexrelid AS oid, indexrelid::regclass
FROM pg_index
WHERE indrelid = 'tablename'::regclass;
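So, to see the order in which the executor will take the indexes (and hence which constraint is likely to be reported first), you can sort by OID. A sketch using the tablename placeholder from above:

SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE indrelid = 'tablename'::regclass
ORDER BY indexrelid;  -- the OID order the executor uses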
Let's assume I have a column family with the following schema:
CREATE TABLE users (
user_id timeuuid,
name varchar,
last_name varchar,
children list<varchar>,
phone_numbers map<varchar, varchar>,
PRIMARY KEY(user_id)
);
Then I insert a row into this CF USING TTL 60000. When I want to verify whether any of these columns still has a TTL set, I get the error: "Cannot use selection function ttl on collections".
My question is: how do I get the TTL on elements of a column that is defined as a collection?
Cheers!
I reproduced your problem -- and naturally got the very same result. The problem is that (1) in collections, TTLs are element-wise (one TTL per entry in the collection) and (2) I found no way of selecting individual entries from maps or lists.
Of course I can delete one element -- but selecting it or its TTL was not possible. Even version 2 of the DataStax CQL driver does not provide the metadata for that.
So you may need to change your data structure. If this was 'just' for testing purposes, you will have to trust Cassandra to handle this well enough.
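One possible restructuring, sketched with hypothetical table and column names: promote each collection entry to its own clustered row, so the TTL becomes queryable per element.

CREATE TABLE user_phone_numbers (
    user_id timeuuid,
    label   text,          -- e.g. 'home', 'mobile'
    number  text,
    PRIMARY KEY (user_id, label)
);

-- TTL works on regular columns: one row per former map entry
SELECT label, number, TTL(number)
FROM user_phone_numbers
WHERE user_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;  -- hypothetical id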
I'm trying to write a select statement using CQL3 against Cassandra that will return multiple rows. Under normal circumstances, I could specify the partition key using an IN clause, but the wrinkle in my use case is that I'm dealing with a composite partition key.
For example:
create table MyTable (
my_key_1 text,
my_key_2 text,
my_value text,
primary key ((my_key_1, my_key_2))
);
With this table structure, I want to select the following rows by key value in a single query.
Row 1: my_key_1 = 'key1value1', my_key_2 = 'key2value1'
Row 2: my_key_1 = 'key1value2', my_key_2 = 'key2value2'
Is this possible? For the sake of argument, I know that I could make my_key_1 the partition key and my_key_2 a clustering key, but my_key_1 on its own would be extremely dense, and is better combined with my_key_2 in the partition key.
I've tried using select * from MyTable where my_key_1 in ('key1value1', 'key1value2') and my_key_2 in ('key2value1', 'key2value2'), but I receive an error that only the last part of the partition key can be contained in an IN list in a query like this. So, is there any way to do this, or am I going to have to "materialize" the my_key_1 + my_key_2 value into the primary key and query by that?
Unfortunately, CQL does not support this as of 2.0.5. If you look at the original work item implementing composite keys, querying over multiple combinations was specifically left out due to implementation complexities.
For now your only recourse would be to combine the keys on the client side and treat them as one column in Cassandra (or you can break up your multiple-row query into several single-row queries).
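A sketch of the latter workaround against the MyTable definition above: issue one fully specified query per key pair (a client can run these concurrently):

SELECT * FROM MyTable WHERE my_key_1 = 'key1value1' AND my_key_2 = 'key2value1';
SELECT * FROM MyTable WHERE my_key_1 = 'key1value2' AND my_key_2 = 'key2value2';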