Selecting Multiple Cassandra Rows in one Select, With Composite Primary Key

I'm trying to write a select statement using CQL3 against Cassandra that will return multiple rows. Under normal circumstances, I could specify the partition key using an IN statement, but the wrinkle for my use case is that I'm dealing with a composite partition key.
For example:
create table MyTable (
my_key_1 text,
my_key_2 text,
my_value text,
primary key ((my_key_1, my_key_2))
);
With this table structure, I want to select the following rows by key value in a single query.
Row 1: my_key_1 = 'key1value1', my_key_2 = 'key2value1'
Row 2: my_key_1 = 'key1value2', my_key_2 = 'key2value2'
Is this possible? For the sake of argument, I know that I could make my_key_1 the partition key and my_key_2 a clustering key, but my_key_1 will be extremely dense, so it is better combined with my_key_2 in the partition key.
I've tried using select * from MyTable where my_key_1 in ('key1value1', 'key1value2') and my_key_2 in ('key2value1', 'key2value2'), but I receive an error saying that only the last part of the partition key can be restricted by an IN list in a query like this. So, is there any way to do this, or am I going to have to "materialize" the my_key_1 + my_key_2 value into the primary key and query by that?

Unfortunately, CQL does not support this as of 2.0.5. If you look at the original work item implementing composite keys, querying over multiple combinations was specifically left out due to implementation complexities.
For now, your only recourse is to combine the keys on the client side and treat them as one column in Cassandra, or to break up your multiple-row query into several single-row queries.
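Both workarounds can be sketched on the client side. The following is a minimal illustration in Python, not tied to any particular driver: the function names, the `:` separator, and the statement text are assumptions made for the example, and `?` is the CQL bind marker used in prepared statements.

```python
from typing import Iterable, List, Tuple

# Hypothetical statement text for the MyTable schema from the question.
SELECT_ONE = "SELECT * FROM MyTable WHERE my_key_1 = ? AND my_key_2 = ?"


def single_partition_queries(
    pairs: Iterable[Tuple[str, str]]
) -> List[Tuple[str, Tuple[str, str]]]:
    """Workaround 1: one (statement, params) tuple per composite partition key."""
    return [(SELECT_ONE, (k1, k2)) for k1, k2 in pairs]


def combined_key(k1: str, k2: str, sep: str = ":") -> str:
    """Workaround 2: join both parts into one value for a single-column key.

    Assumes the separator never occurs in either part; otherwise two
    different pairs could produce the same combined value.
    """
    if sep in k1 or sep in k2:
        raise ValueError("separator must not occur in a key part")
    return f"{k1}{sep}{k2}"
```

With the second approach, the combined values can be used in an ordinary IN list on the single partition key column. With the first, each statement targets exactly one partition and the results are merged by the client.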

Related

Index required for basic joins on foreign key that references a primary key

I have a question about a fundamental aspect of PostgreSQL.
Suppose I have two tables along the lines of the following:
create table source_data_property (
source_data_property_id integer primary key generated by default as identity,
property_name text not null
);
create table source_data_value (
source_data_value_id integer primary key generated by default as identity,
source_data_property_id integer not null references source_data_property,
data_value numeric not null
);
Suppose I write a very simple query that just performs a basic join:
select
sdp.source_data_property_id,
sdp.property_name,
sdv.source_data_value_id,
sdv.data_value
from source_data_property as sdp
join source_data_value as sdv using (source_data_property_id)
;
For optimal query performance, is it necessary to add an index on the source_data_property_id column in the source_data_value table? My original thought was no, because the source_data_property_id is already indexed in the source_data_property table, but after thinking about it a bit I'm not so sure.
"For optimal query performance, is it necessary to add an index on the source_data_property_id column in the source_data_value table?"
In general yes, make indexes for your foreign keys. Postgres creates an index automatically for a primary key, but not for the referencing column of a foreign key. However...
A very small table won't get any advantage from an index, and Postgres will do a seq scan instead.
Similarly, it depends on what sort of queries you're doing. Your example fetches every row in source_data_property, which also means fetching every row in source_data_value. For that, using an index is slower than reading the whole table, so Postgres will most likely do a seq scan instead.

Is there a way to reserve a range in a postgres sequence?

I'm writing a program that generates large numbers of rows to be inserted into a PostgreSQL database. Due to the presence of multiple indices, this process is getting slower over time. That's why I want to move to using COPY which seems to be significantly faster. The problem is that one of the tables has a foreign key into another, and I do not have the IDs for the foreign key column.
I was thinking that maybe if I could reserve a range in the sequence used for the primary key of the first table, I could do the ID assignment manually but I don't think Postgres natively supports such an operation. Is there a way to achieve this another way?
First, identify the business key for the parent and child tables from your source data. Create those tables with a unique constraint on the business key. This is not the surrogate (auto-generated) PK. Now create a staging table with all the necessary columns (except the FK). Since you will most likely use it across sessions, make it a permanent table, even though the intent is one-time use. From the staging table, insert into the parent table, generating the PK from the sequence. Then insert into the child table, selecting the FK by joining to the parent on the business key.
insert into parent ( <columns> )
select column_list
from stage
on conflict ( business_key ) do nothing;
insert into child ( <columns>, parent_id )
select s.<columns>, a.id
from stage s
join parent a on s.business_key = a.business_key
on conflict ( parent_id, child_bk ) do nothing;
Since the above is rather obscure in the abstract, see a concrete example here. There is no need to attempt to "reserve a range"; just let the PK/FK values develop naturally.
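The staging pattern can be tried end to end without a Postgres server. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it is an illustration under assumptions rather than the answer's actual example: the table and column names (parent_bk, child_bk, payload) are hypothetical, and SQLite's INSERT OR IGNORE plays the role of ON CONFLICT ... DO NOTHING. In PostgreSQL the staging table would be filled with COPY instead of executemany.

```python
import sqlite3


def load_via_staging(rows):
    """Load (parent_bk, child_bk, payload) tuples through a staging table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE parent (
            id INTEGER PRIMARY KEY,            -- surrogate key, generated
            parent_bk TEXT NOT NULL UNIQUE     -- business key
        );
        CREATE TABLE child (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL REFERENCES parent(id),
            child_bk TEXT NOT NULL,
            payload TEXT,
            UNIQUE (parent_id, child_bk)
        );
        -- Staging table: every source column, but no FK.
        CREATE TABLE stage (parent_bk TEXT, child_bk TEXT, payload TEXT);
    """)
    con.executemany("INSERT INTO stage VALUES (?, ?, ?)", rows)

    # Step 1: insert parents; rows whose business key already exists are skipped.
    con.execute(
        "INSERT OR IGNORE INTO parent (parent_bk) "
        "SELECT DISTINCT parent_bk FROM stage"
    )
    # Step 2: insert children, resolving the FK by joining on the business key.
    con.execute("""
        INSERT OR IGNORE INTO child (parent_id, child_bk, payload)
        SELECT p.id, s.child_bk, s.payload
        FROM stage s
        JOIN parent p ON p.parent_bk = s.parent_bk
    """)
    con.commit()
    return con
```

The generated parent ids never have to be known to the loading program; the join on the business key supplies them for the child rows.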

Does SELECT DISTINCT on a Cassandra index column work

Can the Cassandra SELECT DISTINCT operation be used to find all the unique values of a column if that column has an index on it?
My question is not the same as simply asking how to find the distinct values of a non-primary-key column. I realize that Cassandra does not allow queries that would require a table scan, because they would be inefficient; here the presence of an index eliminates the need for a table scan.
If I have a table thus:
CREATE TABLE thing (
id uuid,
version bigint,
name text,
... data columns ...
PRIMARY KEY ((id),version)
);
CREATE INDEX ON thing(name);
I can SELECT DISTINCT id FROM thing; to get all the thing IDs. That requires one response from each node in my cluster, with each response returning the keys for its node.
But can I SELECT DISTINCT name FROM thing; to get all the thing names? That should also require only one response from each node in my cluster, with each response constructed only by examining the portion of the index on its node. And if name is a good column on which to have an index, each response would be smaller than the response to the query for the primary keys (there should be fewer names than partition keys).
At least to me the documentation suggests that I should be able to select distinct values of any column:
DISTINCT selection_list
selection_list is one of:
A list of partition keys (used with DISTINCT)
selector AS alias, selector AS alias, ...| *
Where selector is column name. The documentation makes no restriction on what column name could be.
As a matter of fact, you can only use DISTINCT with partition key columns and static columns (as of C* 2.2.4). Using it on anything else will yield an error:
cqlsh:stresscql> SELECT distinct name FROM thing ;
InvalidRequest: code=2200 [Invalid query] message="SELECT DISTINCT queries must only request partition key columns and/or static columns (not name)"
I don't have any in-depth understanding of the workings of secondary indexes, but I also have the feeling that allowing DISTINCT on an indexed column should not be worse in terms of reads incurred than querying the index for a particular value.
However, because indexed values repeat across nodes, it would be worse in terms of memory and network overhead relative to the result size: the coordinator would have to condense the nodes' responses down to only the unique values.
Though, for replication factors > 1 this is also the case for partition key values.

Is it possible to obtain the actual values of a primary composite key with a SQL statement?

I have a table named F0911 (JD Edwards ERP system) that is in DB2 on an AS400. This table has a primary key, F0911_PK, which is defined as a composite of seven columns: GLDCT, GLDGJ, GLDOC, GLEXTL, GLJELN, GLKCO and GLLT.
I am trying to replicate this table into a BI application, and it would be easier if I could obtain the actual values of the primary key, ideally with a statement like:
select F0911_PK, [other columns] from F0911 Where ...
Is such a thing possible? I am guessing that the index values have already been calculated and are likely integers. Is it possible to get at the raw values using a SQL statement?
The primary key is a logical construct; there are no "actual values of the primary key", apart from the values in the columns it comprises. If you mean the key values of the index that backs the primary key constraint, they may or may not be a simple concatenation of the binary representation of each column value; in any case these values have no meaning or use outside the physical structure of the index file.

Compute shared hstore key names in Postgresql

If I have a table with an HSTORE column:
CREATE TABLE thing (properties hstore);
How could I query that table to find the hstore key names that exist in every row?
For example, if the table above had the following data:
properties
-------------------------------------------------
"width"=>"b", "height"=>"a"
"width"=>"b", "height"=>"a", "surface"=>"black"
"width"=>"c"
How would I write a query that returned 'width', as that is the only key that occurs in each row?
skeys() will give me all the property keys, but I'm not sure how to aggregate them so I only have the ones that occur in each row.
The manual gets us most of the way there, but not all the way... way down at the bottom of http://www.postgresql.org/docs/8.3/static/hstore.html under the heading "Statistics", they describe a way to count keys in an hstore.
If we adapt that to your sample table above, you can compare each key's count to the number of rows in the table.
SELECT key
FROM (SELECT (each(properties)).key FROM thing) AS stat
GROUP BY key
HAVING count(*) = (select count(*) from thing)
ORDER BY key;
If you want to find the opposite (all those keys that are not in every row of your table), just change the = to < and you're in business!
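The counting logic of the query can be checked outside the database. Here is a minimal sketch in Python that models each hstore value as a dict (an assumption for illustration; the function name is hypothetical): count how many rows contain each key, then keep the keys whose count equals the number of rows.

```python
from collections import Counter


def keys_in_every_row(rows):
    """Return, sorted, the keys that occur in every dict of `rows` (a list)."""
    # Each dict contributes a key at most once, like each(properties) per row.
    counts = Counter(key for row in rows for key in row)
    return sorted(key for key, n in counts.items() if n == len(rows))


# Sample data from the question.
rows = [
    {"width": "b", "height": "a"},
    {"width": "b", "height": "a", "surface": "black"},
    {"width": "c"},
]
print(keys_in_every_row(rows))  # ['width']
```

Changing the comparison from `n == len(rows)` to `n < len(rows)` gives the opposite set, the same way swapping = for < does in the HAVING clause.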