Finding distinct values of non Primary Key column in CQL Cassandra - select

I use the following code for creating table:
CREATE KEYSPACE mykeyspace
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE mykeyspace;
CREATE TABLE users (
user_id int PRIMARY KEY,
fname text,
lname text
);
INSERT INTO users (user_id, fname, lname)
VALUES (1745, 'john', 'smith');
INSERT INTO users (user_id, fname, lname)
VALUES (1744, 'john', 'doe');
INSERT INTO users (user_id, fname, lname)
VALUES (1746, 'john', 'smith');
I would like to find the distinct value of lname column (that is not a PRIMARY KEY). I would like to get the following result:
lname
-------
smith
I tried to get it with SELECT DISTINCT lname FROM users;
However, since lname is not part of the PRIMARY KEY, I get the following error:
InvalidRequest: code=2200 [Invalid query] message="SELECT DISTINCT queries must
only request partition key columns and/or static columns (not lname)"
cqlsh:mykeyspace> SELECT DISTINCT lname FROM users;
How can I get the distinct values from lname?

User - Undefined_variable - makes two good points:
In Cassandra, you need to build your data model to match your query patterns. This sometimes means duplicating your data into additional tables, to attain the desired level of query flexibility.
DISTINCT only works on partition keys.
So, one way to get this to work, would be to build a specific table to support that query:
CREATE TABLE users_by_lname (
lname text,
fname text,
user_id int,
PRIMARY KEY (lname, fname, user_id)
);
Now, after I run your INSERTs against this new query table, this works:
aploetz#cqlsh:stackoverflow> SELECT DISTINCT lname FROM users_by_lname ;
lname
-------
smith
doe
(2 rows)
Notes: In this table, all rows with the same partition key (lname) will be sorted by fname, as fname is a clustering key. I added user_id as an additional clustering key, just to ensure uniqueness.

Cassandra has no such functionality; DISTINCT is only possible on the partition key.
You should design your data model around your query requirements.
Otherwise, you have to process the data in application logic (Spark may be useful).
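If restructuring the table is not an option, the application-side route mentioned above could look roughly like this. It is only a sketch: `rows` stands in for whatever the driver returns for `SELECT lname FROM users` (a plain list of dicts here, which is an assumption), and `distinct_values` is a hypothetical helper name. It reads every row, so it is only reasonable for small tables or batch jobs.

```python
def distinct_values(rows, column):
    """Return the distinct values of `column`, in first-seen order."""
    seen = set()
    result = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


# Stand-in for the rows a driver would return for: SELECT lname FROM users;
rows = [
    {"user_id": 1745, "lname": "smith"},
    {"user_id": 1744, "lname": "doe"},
    {"user_id": 1746, "lname": "smith"},
]

print(distinct_values(rows, "lname"))  # ['smith', 'doe']
```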

Related

PostgreSQL ForeignKeyViolation

I am attempting to insert some data into my database via a lambda function. I am getting the following error: ForeignKeyViolation: insert or update on table "address" violates foreign key constraint "address_id_fkey"
I understand that this is because my address table has a foreign key linking it to the clients table, and the keys are not matching.
Is there a way to format my tables so that I can input my client data and address data together? Or will I need to input the client data first, then retrieve the id and use it to input the address data?
Currently I am running the following two functions.
postgres_insert_query = "INSERT INTO clients (name, phone, contact) VALUES ('{0}','{1}','{2}')".format(data['name'], data['phone'], data['contact'])
postgres_insert_query = "INSERT INTO address (line1, city, state, zip) VALUES ('{0}','{1}','{2}', {3})".format(address['line1'], address['city'], address['state'], address['zip'])
Even if no address data is present I would still like to create a row for it (with the correct foreign key).
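Use a DEFERRABLE foreign key constraint, then wrap your inserts in a transaction. With INITIALLY DEFERRED, the constraint is checked at COMMIT rather than after each statement, so the order of the two inserts no longer matters: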
CREATE temp TABLE pktable (
id INT4 PRIMARY KEY,
other INT4
);
CREATE temp TABLE fktable (
id INT4 PRIMARY KEY,
fk INT4 REFERENCES pktable DEFERRABLE INITIALLY DEFERRED
);
BEGIN;
INSERT INTO fktable VALUES (100, 200);
INSERT INTO pktable VALUES (200, 500);
COMMIT;
Postgres allows DML operations within a CTE. Doing so will allow you to insert into both tables in a single statement while allowing auto-generation of both ids. The following is a Postgres implementation. See demo.
with thedata(name, phone, contact, line1, city, state, zip) as
( values ('client 1', 'ev4 4213', 'andy','614 a', 'some city;','that state','11111'))
, theinsert (cli_id) as
( insert into clients(name, phone, contact)
select name, phone, contact
from thedata
returning cli_id
)
insert into addresses(cli_id, line1, city, state, zip)
select cli_id, line1, city, state, zip
from theinsert
cross join thedata;
Unfortunately I do not know your obfuscation (ORM) language, but perhaps something like:
pg_query = """with thedata(name, phone, contact, line1, city, state, zip) as
( values ('{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}'))
, theinsert (cli_id) as
( insert into clients(name, phone, contact)
select name, phone, contact
from thedata
returning cli_id
)
insert into addresses(cli_id, line1, city, state, zip)
select cli_id, line1, city, state, zip
from theinsert
cross join thedata""".format(data['name'], data['phone'], data['contact']
, address['line1'], address['city'], address['state'], address['zip'])
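The other route the question mentions, inserting the client first and then reusing its id for the address, can be sketched as follows. This is an illustration under assumptions, not the asker's actual setup: in-memory SQLite stands in for Postgres so the example runs anywhere, `client_id` is a hypothetical name for the foreign key column, and `insert_client_with_address` is a made-up helper. With psycopg2 the placeholders would be `%s` instead of `?`, and the new id would come from `INSERT ... RETURNING id` rather than `lastrowid`. Passing values as parameters also avoids building SQL with string formatting.

```python
import sqlite3

# In-memory SQLite is only a stand-in so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, contact TEXT)"
)
# client_id is a hypothetical name for the foreign key column.
conn.execute(
    "CREATE TABLE address (id INTEGER PRIMARY KEY,"
    " client_id INTEGER NOT NULL REFERENCES clients (id),"
    " line1 TEXT, city TEXT, state TEXT, zip TEXT)"
)


def insert_client_with_address(conn, data, address=None):
    """Insert a client and its (possibly empty) address in one transaction."""
    address = address or {}
    with conn:  # commits on success, rolls back if either insert fails
        cur = conn.execute(
            "INSERT INTO clients (name, phone, contact) VALUES (?, ?, ?)",
            (data["name"], data["phone"], data["contact"]),
        )
        client_id = cur.lastrowid  # Postgres: use RETURNING id and fetch it
        conn.execute(
            "INSERT INTO address (client_id, line1, city, state, zip)"
            " VALUES (?, ?, ?, ?, ?)",
            (client_id, address.get("line1"), address.get("city"),
             address.get("state"), address.get("zip")),
        )
    return client_id


first = insert_client_with_address(
    conn,
    {"name": "client 1", "phone": "ev4 4213", "contact": "andy"},
    {"line1": "614 a", "city": "some city", "state": "that state", "zip": "11111"},
)
# No address data: a row is still created, with the correct foreign key.
second = insert_client_with_address(
    conn, {"name": "client 2", "phone": "ev4 0000", "contact": "bo"}
)
print(conn.execute("SELECT client_id, city FROM address ORDER BY client_id").fetchall())
```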

Insert data into strongly normalized DB and maintain the integrity (Postgres)

I'm trying to develop a simple database for the phonebook. This is what I wrote:
CREATE TABLE phone
(
phone_id SERIAL PRIMARY KEY,
phone CHAR(15),
sub_id INT, -- subscriber id --
cat_id INT -- category id --
);
CREATE TABLE category
(
cat_id SERIAL PRIMARY KEY, -- category id --
cat_name CHAR(15) -- category name --
);
CREATE TABLE subscriber
(
sub_id SERIAL PRIMARY KEY,
name CHAR(20),
fname CHAR(20), -- first name --
lname CHAR(20), -- last name --
);
CREATE TABLE address
(
addr_id SERIAL PRIMARY KEY,
country CHAR(20),
city CHAR(20),
street CHAR(20),
house_num INT,
apartment_num INT
);
-- many-to-many relation --
CREATE TABLE sub_link
(
sub_id INT REFERENCES subscriber(sub_id),
addr_id INT
);
I created a link table for the many-to-many relation because several people can live at the same address and one person can live in different locations at different times.
But I cannot figure out how to add data to a strongly normalized DB like this and maintain the integrity of the data.
The first improvement was adding a unique key on the address table, because this table should not contain duplicated data:
CREATE TABLE address
(
addr_id SERIAL PRIMARY KEY,
country CHAR(20),
city CHAR(20),
street CHAR(20),
house_num INT,
apartment_num INT,
UNIQUE (country, city, street, house_num, apartment_num)
);
Now the problem is how to add a new record about some person to the DB. I think I should use the following order of actions:
1) Insert a record into the subscriber table, because the sub_link and phone tables must use the id of the new subscriber.
2) Insert a record into the address table, because addr_id must exist before adding a record to sub_link.
3) Link the last records from subscriber and address in the sub_link table. But at this step I have a new problem: how can I get sub_id and addr_id from steps 1) and 2) in PostgreSQL effectively?
4) Insert a record into the phone table. As in step 3), I don't know how to get sub_id from the previous queries effectively.
I read about the WITH block in Postgres but I cannot figure out how to use it in my case.
UPDATE
I did what ASL suggested:
-- First record --
WITH t0 AS (
WITH t1 AS (
INSERT INTO subscriber
VALUES(DEFAULT, 'Twilight Sparkle', NULL, NULL)
RETURNING sub_id
),
t2 AS (
INSERT INTO address
VALUES(DEFAULT, 'Equestria', 'Ponyville', NULL, NULL, NULL)
RETURNING addr_id
)
INSERT INTO sub_link
VALUES((SELECT sub_id FROM t1), (SELECT addr_id FROM t2))
)
INSERT INTO phone
VALUES (DEFAULT, '000000', (SELECT sub_id FROM t1), 1);
But I get an error: WITH clause containing a data-modifying statement must be at the top level
LINE 2: WITH t1 AS (INSERT INTO subscriber VALUES(DEFAULT,
You can do it all in one query using a WITH block with a RETURNING clause. See PostgreSQL docs on INSERT. For example:
WITH t1 AS (INSERT INTO subscriber VALUES ... RETURNING sub_id),
t2 AS (INSERT INTO address VALUES ... RETURNING addr_id)
INSERT INTO sub_link VALUES ((SELECT sub_id FROM t1), (SELECT addr_id FROM t2))
Note that this simple form will only work when inserting a single row into each table.
This is somewhat off the topic of your question, but I suggest you also consider making sub_id and cat_id columns in the phone table foreign keys (use REFERENCES).
You got the idea. Insert data from topmost tables so that you have their IDs before inserting references to them.
In PostgreSQL you can use the INSERT/UPDATE ... RETURNING id construct. This may be useful if you are not using an ORM that does it automatically.
The only thing here is that in step 2 you probably want to check if the address already exists before inserting:
SELECT addr_id FROM address WHERE country = ? AND city = ? ...
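The "check whether the address already exists" step could be sketched like this. It is a rough illustration only: in-memory SQLite stands in for Postgres, and `get_or_create_address` and `COLUMNS` are hypothetical names. One detail worth noting is that a UNIQUE constraint treats NULLs as distinct, so rows with NULL parts (like the Ponyville example) are not caught by the constraint alone. The lookup therefore uses a NULL-safe comparison (`IS` in SQLite; Postgres spells it `IS NOT DISTINCT FROM`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address (addr_id INTEGER PRIMARY KEY, country TEXT, city TEXT,"
    " street TEXT, house_num INTEGER, apartment_num INTEGER,"
    " UNIQUE (country, city, street, house_num, apartment_num))"
)

COLUMNS = ("country", "city", "street", "house_num", "apartment_num")


def get_or_create_address(conn, address):
    """Return the addr_id for `address`, inserting a row only if none matches."""
    values = tuple(address[c] for c in COLUMNS)
    # IS compares NULLs as equal in SQLite (Postgres: IS NOT DISTINCT FROM).
    where = " AND ".join(f"{c} IS ?" for c in COLUMNS)
    row = conn.execute(f"SELECT addr_id FROM address WHERE {where}", values).fetchone()
    if row is not None:
        return row[0]
    with conn:  # commit the new row, or roll back on error
        cur = conn.execute(
            "INSERT INTO address (country, city, street, house_num, apartment_num)"
            " VALUES (?, ?, ?, ?, ?)",
            values,
        )
    return cur.lastrowid


home = {"country": "Equestria", "city": "Ponyville", "street": None,
        "house_num": None, "apartment_num": None}
first = get_or_create_address(conn, home)
second = get_or_create_address(conn, home)  # finds the existing row
print(first == second)
```

With concurrent writers a check-then-insert like this can still race, so the UNIQUE constraint remains useful as a backstop for fully specified addresses.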

Looking up values from many tables based on value in each column

I have several tables containing key-value pairs for different fields in my database. I also have a table that contains keys into these tables, and each key should be replaced by the value stored for it. However, I can't figure out how to select these values from the multiple tables.
The tables
CREATE TABLE CHARACTERS(
ID INTEGER PRIMARY KEY,
NAME VARCHAR(64)
);
CREATE TABLE MEDIA(
ID INTEGER PRIMARY KEY,
NAME VARCHAR(64)
);
CREATE TABLE EPISODES(
ID INTEGER PRIMARY KEY,
MEDIAID INTEGER,
NAME VARCHAR(64)
);
-- Selecting from this table
CREATE TABLE APPS(
ID INTEGER PRIMARY KEY,
CHARID INTEGER,
EPISODEID INTEGER,
MEDIAID INTEGER
);
I am selecting from the APPS table, and I want to replace the value of each *ID column with the value of the NAME column in the accompanying table. I want this done for each row in the APPS table. Like so...
CHARID -> CHARACTERS.NAME
EPISODEID -> EPISODES.NAME
MEDIAID -> MEDIA.NAME
I have tried to use joins, but they don't return one row per row in the APPS table. I have 18 rows in the APPS table, but I get back either far fewer or far more rows than that. So how can I get exactly one result row for each row in the APPS table?
You do this by JOINing the tables together and selecting the desired columns from the individual tables:
SELECT c.name AS character_name, e.name AS episode, m.name AS media
FROM apps a
LEFT JOIN episodes e ON e.id = a.episodeid
LEFT JOIN media m ON m.id = a.mediaid
LEFT JOIN characters c ON c.id = a.charid;
If you want to present the rows in a specific order, you can specify that too as a final clause in the SELECT statement. You can use any field from the included tables; that field is not necessarily part of the columns selected:
ORDER BY a.id -- order by apps.id
or
ORDER BY e.id, c.name -- order first by episode id, then by character name
etc
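To see why LEFT JOIN keeps one row per APPS row, here is a small runnable comparison. It is a sketch with made-up sample data, using in-memory SQLite only as a stand-in for the real database; the `QUERY` template and the sample names are assumptions. An inner join drops every app whose id has no match in a lookup table, which would explain getting fewer rows. Getting more rows usually points at a join condition that matches several lookup rows per app, for example joining EPISODES on MEDIAID instead of ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER PRIMARY KEY, name VARCHAR(64));
    CREATE TABLE media      (id INTEGER PRIMARY KEY, name VARCHAR(64));
    CREATE TABLE episodes   (id INTEGER PRIMARY KEY, mediaid INTEGER, name VARCHAR(64));
    CREATE TABLE apps       (id INTEGER PRIMARY KEY, charid INTEGER,
                             episodeid INTEGER, mediaid INTEGER);
    -- Made-up sample data; app 3 has no episode.
    INSERT INTO characters VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO media      VALUES (1, 'Show A');
    INSERT INTO episodes   VALUES (1, 1, 'Pilot');
    INSERT INTO apps       VALUES (1, 1, 1, 1), (2, 2, 1, 1), (3, 1, NULL, 1);
""")

QUERY = """
    SELECT a.id, c.name, e.name, m.name
    FROM apps a
    {join} episodes e   ON e.id = a.episodeid
    {join} media m      ON m.id = a.mediaid
    {join} characters c ON c.id = a.charid
    ORDER BY a.id
"""

left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(QUERY.format(join="INNER JOIN")).fetchall()

print(len(left_rows))   # 3: every app row survives, missing names come back as None
print(len(inner_rows))  # 2: app 3 is dropped because it has no episode
```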

How can I generate a big data sample for PostgreSQL using generate_series and random?

I want to generate a big data sample (almost 1 million records) for studying the polyphase merge in tuplesort.c in PostgreSQL, and I would like the schema to be as follows:
CREATE TABLE Departments (code VARCHAR(4), UNIQUE (code));
CREATE TABLE Towns (
id SERIAL UNIQUE NOT NULL,
code VARCHAR(10) NOT NULL, -- not unique
article TEXT,
name TEXT NOT NULL, -- not unique
department VARCHAR(4) NOT NULL REFERENCES Departments (code),
UNIQUE (code, department)
);
How can I use generate_series and random to do this? Thanks a lot!
To insert one million rows into Towns:
insert into towns (
code, article, name, department
)
select
left(md5(i::text), 10),
md5(random()::text),
md5(random()::text),
left(md5(random()::text), 4)
from generate_series(1, 1000000) s(i)
Since id is a serial it is not necessary to include it.
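One thing the query above leaves open, as far as I can tell, is the foreign key: the department values are random md5 prefixes, so the insert would violate REFERENCES Departments (code) unless matching rows already exist in Departments. A client-side generator that fills both tables consistently could look like the sketch below. The function name `make_sample`, the seed, and the value formats are all assumptions. For simplicity it makes code unique (the schema only requires (code, department) to be unique) and expects at most 10000 departments so the codes fit in VARCHAR(4).

```python
import random
import string


def make_sample(n_towns, n_departments, seed=0):
    """Build (departments, towns) rows that satisfy the schema's constraints."""
    rng = random.Random(seed)
    # Zero-padded counters give unique codes of at most 4 characters.
    departments = [f"{i:04d}" for i in range(n_departments)]
    towns = []
    for i in range(n_towns):
        code = f"{i:010d}"                    # unique, so (code, department) is too
        article = rng.choice(["le", "la", "les", None])
        name = "".join(rng.choices(string.ascii_lowercase, k=12))
        department = rng.choice(departments)  # always an existing department
        towns.append((code, article, name, department))
    return departments, towns


departments, towns = make_sample(n_towns=1000, n_departments=50)
print(len(departments), len(towns))  # 50 1000
```

The rows can then be loaded with the driver's bulk insert, Departments first and Towns second, leaving id to the SERIAL default.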

How to implicitly insert SERIAL ID via view over more than one table

I have two tables, connected in the E/R model by an is-a relation. One represents the "mother table":
CREATE TABLE PERSONS(
id SERIAL NOT NULL,
name character varying NOT NULL,
address character varying NOT NULL,
day_of_creation timestamp NOT NULL DEFAULT current_timestamp,
PRIMARY KEY (id)
)
and the other represents the "child table":
CREATE TABLE EMPLOYEES (
id integer NOT NULL,
store character varying NOT NULL,
paychecksize integer NOT NULL,
FOREIGN KEY (id)
REFERENCES PERSONS(id),
PRIMARY KEY (id)
)
Now those two tables are joined in a view
CREATE VIEW EMPLOYEES_VIEW AS
SELECT
P.id,name,address,store,paychecksize,day_of_creation
FROM
PERSONS AS P
JOIN
EMPLOYEES AS E ON P.id = E.id
I want to write either a rule or a trigger that lets a db user insert into that view, sparing them the nasty details of the columns being split across different tables.
But I also want to make it convenient: since the id is a SERIAL and day_of_creation has a default value, there is no actual need for the user to provide them. Therefore a statement like
INSERT INTO EMPLOYEES_VIEW (name, address, store, paychecksize)
VALUES ('bob', 'top secret', 'drugstore', 42)
should be enough to result in
PERSONS
id|name|address |day_of_creation
-------------------------------
1 |bob |top secret| 2013-08-13 15:32:42
EMPLOYEES
id| store |paychecksize
---------------------
1 |drugstore|42
A basic rule such as
CREATE RULE EMPLOYEE_VIEW_INSERT AS ON INSERT TO EMPLOYEES_VIEW
DO INSTEAD (
INSERT INTO PERSONS
VALUES (NEW.id,NEW.name,NEW.address,NEW.day_of_creation);
INSERT INTO EMPLOYEES
VALUES (NEW.id,NEW.store,NEW.paychecksize)
)
would be sufficient. But this is not convenient, as the user has to provide the id and timestamp even though that is not actually necessary.
How can I rewrite/extend that code base to match my criteria of convenience?
Something like:
CREATE RULE EMPLOYEE_VIEW_INSERT AS ON INSERT TO EMPLOYEES_VIEW
DO INSTEAD
(
INSERT INTO PERSONS (id, name, address, day_of_creation)
VALUES (default,NEW.name,NEW.address,default);
INSERT INTO EMPLOYEES (id, store, paychecksize)
VALUES (currval('persons_id_seq'),NEW.store,NEW.paychecksize)
);
That way persons.id and persons.day_of_creation are filled with their default values. Another option would be to simply leave those columns out of the insert:
INSERT INTO PERSONS (name, address)
VALUES (NEW.name,NEW.address);
Once the rule is defined, the following insert should work:
insert into employees_view (name, address, store, paychecksize)
values ('Arthur Dent', 'Some Street', 'Some Store', 42);
Btw: with a current Postgres version, an INSTEAD OF trigger is the preferred way to make a view updatable.
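To give a feel for the trigger-based variant, here is a runnable sketch of the same idea. It comes with a loud caveat: it uses in-memory SQLite as a stand-in, and SQLite's trigger syntax differs from Postgres. In Postgres the INSTEAD OF trigger would call a trigger function (typically written in PL/pgSQL) and would pick up the new id with RETURNING or currval rather than `last_insert_rowid()`. The trigger name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        address TEXT NOT NULL,
        day_of_creation TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY REFERENCES persons (id),
        store TEXT NOT NULL,
        paychecksize INTEGER NOT NULL
    );
    CREATE VIEW employees_view AS
        SELECT p.id, name, address, store, paychecksize, day_of_creation
        FROM persons AS p JOIN employees AS e ON p.id = e.id;

    -- SQLite syntax; Postgres would attach a trigger function instead.
    CREATE TRIGGER employees_view_insert
    INSTEAD OF INSERT ON employees_view
    BEGIN
        INSERT INTO persons (name, address) VALUES (NEW.name, NEW.address);
        INSERT INTO employees (id, store, paychecksize)
        VALUES (last_insert_rowid(), NEW.store, NEW.paychecksize);
    END;
""")

# The user supplies neither id nor day_of_creation.
conn.execute(
    "INSERT INTO employees_view (name, address, store, paychecksize)"
    " VALUES (?, ?, ?, ?)",
    ("Arthur Dent", "Some Street", "Some Store", 42),
)
conn.commit()

row = conn.execute(
    "SELECT id, name, store, paychecksize FROM employees_view"
).fetchone()
print(row)  # (1, 'Arthur Dent', 'Some Store', 42)
```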