Using composite and secondary indexes together in Cassandra

We want to use Cassandra to store complex data, but we can't figure out how to organize the indexes.
Our table (column family) looks like this:
Users =
{
RandomId int,
Firstname varchar,
Lastname varchar,
Age int,
Country int,
ChildCount int
}
We have queries with mandatory fields (Firstname, Lastname, Age) and extra search options (Country, ChildCount).
How should we organize the indexes to make these kinds of queries faster?
At first I thought it would be natural to create a composite index on (Firstname, Lastname, Age) and add separate secondary indexes on the remaining fields (Country and ChildCount).
But after creating the secondary indexes I can't insert rows into the table, and I can't query it either.
We are using Cassandra 1.1.0 and cqlsh with the --cql3 option.
Any other suggestions to solve our problem (complex queries with mandatory and additional options) are welcome.

This is my idea. You could simply create a column family with your RandomId as the row key and all the remaining fields as regular columns (e.g. column name 'firstname', column value 'john'). After this you have to create a secondary index on each of these columns. The cardinality of your values seems to be low, so the indexes should be reasonably efficient.
The cassandra-cli code should be something like:
create column family users with comparator=UTF8Type and column_metadata=[
{column_name: firstname, validation_class: UTF8Type, index_type: KEYS},
{column_name: lastname, validation_class: UTF8Type, index_type: KEYS},
{column_name: country, validation_class: IntegerType, index_type: KEYS},
{column_name: age, validation_class: IntegerType, index_type: KEYS},
{column_name: ChildCount, validation_class: IntegerType, index_type: KEYS}];
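Once those KEYS indexes are in place, equality queries on the indexed columns become possible; a minimal sketch (values are hypothetical):
SELECT * FROM users WHERE firstname = 'john' AND age = 30;
Note that at least one column in the WHERE clause must carry a secondary index for such a query to be served.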
A good reference for it could be http://www.datastax.com/docs/0.7/data_model/secondary_indexes
Let me know if I'm wrong.

For queries involving a large number of partitions, indexes are not very efficient.
I think it is better to design the tables based on the queries you want to make: you want a table for queries based on user name, and that seems like the right place to store all the info concerning the user. On the other hand, you want to be able to search based on country, I assume to provide a list of users: for that you don't really need all the info, maybe just the first and last names, or just the email, etc. Another table can do that.
This involves some data duplication, but that better fits Cassandra's data modelling ideas.
This would give:
CREATE TABLE users(
id UUID,
lastname TEXT,
firstname TEXT,
age INT,
country TEXT,
childcount INT,
PRIMARY KEY(id)
);
CREATE TABLE users_by_country(
country TEXT,
firstname TEXT,
lastname TEXT,
user_uuid UUID,
PRIMARY KEY((country), firstname, lastname)
);
CREATE TABLE users_by_age(
age INT,
firstname TEXT,
lastname TEXT,
user_uuid UUID,
PRIMARY KEY((age), firstname, lastname)
);
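With this layout each query reads from a single partition; for example (values are hypothetical):
SELECT firstname, lastname, user_uuid FROM users_by_country WHERE country = 'FR';
SELECT firstname, lastname, user_uuid FROM users_by_age WHERE age = 30;
SELECT * FROM users WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;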

Related

'There is a column named "question_id" in table "subjects", but it cannot be referenced from this part of the query.'

I am using node-postgres to create and put data into my ElephantSQL database. I was able to create the tables and seed one of them using a JSON file, but the second table is giving me an error.
I have created two tables and linked them as below:
await pool.query("CREATE TABLE questions (id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY, question text NOT NULL, answer text NOT NULL)")
await pool.query("CREATE TABLE subjects ( id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY, question_id INT REFERENCES questions(id), subject TEXT NOT NULL)")
Then I am using a JSON file to populate the data in these tables as below:
return await pool.query("INSERT INTO questions (question, answer) (SELECT question, answer FROM json_populate_recordset(NULL::questions, $1::JSON));", [JSON.stringify(questions)])
return await pool.query("INSERT INTO subjects (question_id, subject) (SELECT question_id, subject FROM json_populate_recordset(NULL::questions, $1::JSON));", [JSON.stringify(subject)])
I am able to insert the data into the questions table, but when I do the same for the subjects table it gives me the error below:
There is a column named "question_id" in table "subjects", but it cannot be referenced from this part of the query.
Can someone please let me know why I am getting this error and what I am doing wrong?
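For reference, json_populate_recordset expands the JSON array using the row type named in its first argument, so the SELECT can only reference columns of that type. A call shaped for the subjects table would use NULL::subjects (a sketch against the schema above):
INSERT INTO subjects (question_id, subject)
SELECT question_id, subject
FROM json_populate_recordset(NULL::subjects, $1::JSON);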

Multiple form fields with the same path inserting to DB in Spring

I have to insert data into the DB from form fields that have the same path, and I want to save the values under different ids, but instead they get concatenated with ','. How could I possibly do it?
I tried to make an alias in SQL, but it still saves into the same DB field, concatenated with ','.
This is what I expect in the DB when I insert:
EX.
DB field name = description
input 1 value = "john"
input 2 value = "doe"
id description
1 john
2 doe
Above is my expected result, but in my case when I insert it shows this:
id description
1 john,doe
Can someone help me achieve that result? THANK YOU!
Let me present a similar situation. You have a database of people and you are concerned that each person might have multiple phone numbers.
CREATE TABLE Persons (
person_id INT UNSIGNED AUTO_INCREMENT,
...
PRIMARY KEY(person_id) );
CREATE TABLE PhoneNumbers (
person_id INT UNSIGNED,
phone VARCHAR(20) CHARACTER SET ascii,
type ENUM('unknown', 'cell', 'home', 'work'),
PRIMARY KEY(person_id, phone) );
The table PhoneNumbers has a "many-to-1" relationship between phone numbers and persons. (It does not care if two persons share the same number.)
SELECT ...
GROUP_CONCAT(pn.phone) AS phone_numbers,
...
FROM Persons AS p
LEFT JOIN PhoneNumbers AS pn USING(person_id)
...;
will deliver a comma-separated list of phone numbers (e.g. 123-456-7890,333-444-5555) for each person selected. Because of the LEFT JOIN, it will deliver NULL in case a person has no associated phones.
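A complete version of that query, grouped per person, might look like this (a sketch against the tables above):
SELECT p.person_id,
GROUP_CONCAT(pn.phone) AS phone_numbers
FROM Persons AS p
LEFT JOIN PhoneNumbers AS pn USING(person_id)
GROUP BY p.person_id;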
To address your other question: it is not practical to split a comma-separated list back into its components.

SQL Server 2012: autofill one column from another

I have a table where a user inputs name, dob, etc., and I have a User_Name column that I want automatically populated from the other columns.
For example, the input is: Name - John Doe, DOB - 01/01/1900
I want the User_Name column to be automatically populated with johndoe01011900 (I already have the algorithm to concatenate the column parts to achieve the desired result).
I just need to know how (SQL, trigger) to have the User_Name column filled once the user completes inputting ALL target columns. What if the user skips around and does not input the data in order? Of course, the columns that are needed are (not null).
This should do it: you can use a calculated field with the following calculation:
LOWER(REPLACE(Name, ' ', '')) + CONVERT(VARCHAR(10), DateOfBirth, 112)
In the below sample I have used a temp table but this is the same for regular tables as well.
SAMPLE:
CREATE TABLE #temp(Name VARCHAR(100)
, DateOfBirth DATE
, CalcField AS LOWER(REPLACE(Name, ' ', ''))+CONVERT( VARCHAR(10), DateOfBirth, 112));
INSERT INTO #temp(Name
, DateOfBirth)
VALUES
('John Doe'
, '01/01/1900');
SELECT *
FROM #temp;
RESULT:
Name     | DateOfBirth | CalcField
---------+-------------+----------------
John Doe | 1900-01-01  | johndoe19000101
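For an existing regular table, the same calculated column can be added after the fact; a sketch, assuming a table named Users with the same Name and DateOfBirth columns:
ALTER TABLE Users
ADD User_Name AS LOWER(REPLACE(Name, ' ', '')) + CONVERT(VARCHAR(10), DateOfBirth, 112);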

Why does SELECT with a WHERE clause return 0 rows on a Cassandra table? (should return 2 rows)

I created a minimal example of a users TABLE on a Cassandra 2.0.9 database. I can use SELECT to select all its rows, but I do not understand why adding my WHERE clause (on an indexed column) returns 0 rows.
(I also do not get why the 'CONTAINS' statement causes an error here, as presented below, but let's assume this is not my primary concern.)
DROP TABLE IF EXISTS users;
CREATE TABLE users
(
KEY varchar PRIMARY KEY,
password varchar,
gender varchar,
session_token varchar,
state varchar,
birth_year bigint
);
INSERT INTO users (KEY, gender, password) VALUES ('jessie', 'f', 'avlrenfls');
INSERT INTO users (KEY, gender, password) VALUES ('kate', 'f', '897q7rggg');
INSERT INTO users (KEY, gender, password) VALUES ('mike', 'm', 'mike123');
CREATE INDEX ON users (gender);
DESCRIBE TABLE users;
Output:
CREATE TABLE users (
key text,
birth_year bigint,
gender text,
password text,
session_token text,
state text,
PRIMARY KEY ((key))
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX users_gender_idx ON users (gender);
This SELECT works OK
SELECT * FROM users;
key | birth_year | gender | password | session_token | state
--------+------------+--------+-----------+---------------+-------
kate | null | f | 897q7rggg | null | null
jessie | null | f | avlrenfls | null | null
mike | null | m | mike123 | null | null
And this does not:
SELECT * FROM users WHERE gender = 'f';
(0 rows)
This also fails:
SELECT * FROM users WHERE gender CONTAINS 'f';
Bad Request: line 1:33 no viable alternative at input 'CONTAINS'
It sounds like your index may have become corrupt. Try rebuilding it. Run this from a command prompt:
nodetool rebuild_index yourKeyspaceName users users_gender_idx
However, the larger issue here is that secondary indexes are known to perform poorly. Some have even identified their use as an anti-pattern. DataStax has a document designed to guide you in appropriate use of secondary indexes. And this is definitely not one of them.
creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
While gender may not be a boolean column, it has the same cardinality. A secondary index on this column is a terrible idea.
If querying by gender is something you really need to do, then you may need to find a different way to model or partition your data. For instance, PRIMARY KEY (state, gender, key) will allow you to query gender by state.
SELECT * FROM users WHERE state='WI' and gender='f';
That would return all female users from the state of Wisconsin. Of course, that would mean you would also have to query each state individually. But the bottom line is that Cassandra does not handle queries for low-cardinality keys/indexes well, so you have to be creative in how you solve these types of problems.
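A sketch of that alternative table, reusing the columns of the original (the table name is assumed):
CREATE TABLE users_by_state (
state varchar,
gender varchar,
key varchar,
password varchar,
session_token varchar,
birth_year bigint,
PRIMARY KEY (state, gender, key)
);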

One to Many equivalent in Cassandra and data model optimization

I am modeling my database in Cassandra, coming from an RDBMS. I want to know how I can create a one-to-many relationship embedded in the same column family, and how to model my table to fit the following query needs.
For example:
Boxes:{
23442:{
belongs_to_user: user1,
box_title: 'the box title',
items:{
1: {
name: 'itemname1',
size: 44
},
2: {
name: 'itemname2',
size: 24
}
}
},
{ ... }
}
I read that it's preferable to use composite columns instead of super columns, so I need an example of the best way to implement this. My queries are:
Get items for a box by ID
Get the top 20 boxes with their items (for displaying a range of boxes with their items on the page)
Update item size by item ID (increment size by a number)
Get all boxes by user ID (all boxes that belong to a specific user)
I am expecting lots of writes to change the size of each item in a box. I want to know the best way to implement this without the need to use super columns. Furthermore, I don't mind a solution that takes Cassandra 1.2's new features into account, because I will use that version in production.
Thanks
This particular model is somewhat challenging, for a number of reasons.
For example, with the box ID as a row key, querying for a range of boxes will require a range query in Cassandra (as opposed to a column slice), which means the use of an ordered partitioner. An ordered partitioner is almost always a Bad Idea.
Another challenge comes from the need to increment the item size, as this calls for the use of a counter column family. Counter column families store counter values only.
Setting aside the need for a range of box IDs for a moment, you could model this using multiple tables in CQL3 as follows:
CREATE TABLE boxes (
id int PRIMARY KEY,
belongs_to_user text,
box_title text
);
CREATE INDEX useridx on boxes (belongs_to_user);
CREATE TABLE box_items (
id int,
item int,
size counter,
PRIMARY KEY(id, item)
);
CREATE TABLE box_item_names (
id int,
item int,
name text,
PRIMARY KEY(id, item)
);
BEGIN BATCH
INSERT INTO boxes (id, belongs_to_user, box_title) VALUES (23442, 'user1', 'the box title');
INSERT INTO box_item_names (id, item, name) VALUES (23442, 1, 'itemname1');
INSERT INTO box_item_names (id, item, name) VALUES (23442, 2, 'itemname2');
APPLY BATCH;
-- Counter mutations cannot be mixed with regular mutations in one batch, so they run separately:
UPDATE box_items SET size = size + 44 WHERE id = 23442 AND item = 1;
UPDATE box_items SET size = size + 24 WHERE id = 23442 AND item = 2;
-- Get the size of one item in a box
SELECT size FROM box_items WHERE id = 23442 AND item = 1;
-- Boxes by user ID
SELECT * FROM boxes WHERE belongs_to_user = 'user1';
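Similarly, the item names for a box come from the box_item_names table:
-- Item names for a box by ID
SELECT item, name FROM box_item_names WHERE id = 23442;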
It's important to note that the BATCH mutation above is atomic, though isolation is only guaranteed within a single partition.
Technically speaking, you could also denormalize all of this into a single table. For example:
CREATE TABLE boxes (
id int,
belongs_to_user text,
box_title text,
item int,
name text,
size counter,
PRIMARY KEY(id, item, belongs_to_user, box_title, name)
);
UPDATE boxes SET size = size + 44 WHERE id = 23442 AND belongs_to_user = 'user1'
AND box_title = 'the box title' AND name = 'itemname1' AND item = 1;
SELECT item, name, size FROM boxes WHERE id = 23442;
However, this provides no guarantees of correctness. For example, this model makes it possible for items of the same box to have different users, or titles. And, since this makes boxes a counter column family, it limits how you can evolve the schema in the future.
Let me think in PlayOrm's objects first, and then show the column model below.
Box {
@NoSqlId
String id;
@NoSqlEmbedded
List<Item> items;
}
User {
@NoSqlId
TimeUUID uuid;
@OneToMany
List<Box> boxes;
}
The User then is a row like so
rowkey = uuid=<someuuid>, boxes.fkToBox35=null, boxes.fkToBox37=null, boxes.fkToBox38=null
Note, the form of the above is columnname=value, where some of the column names are composite and some are not.
The box is more interesting: say Item has fields name and idnumber; then a box row would be
rowkey = id=myid, items.item23.name=playdo, items.item23.idnumber=5634, items.item56.name=pencil, items.item56.idnumber=7894
I am not sure what you meant, though, by "get the top 20 boxes". Top boxes meaning by the number of items in them?
Dean
You can use a query-driven methodology for data modeling. You have three broad access paths:
1) partition per query
2) partition+ per query (one or more partitions)
3) table or table+ per query
The most efficient option is "partition per query". This article can help you in that case, step by step; its example is exactly a one-to-many relation.
Accordingly, you will have several tables with some similar columns. You can manage this with materialized views, or with a batch log as an alternative approach.
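As a concrete sketch of "partition per query" applied to the boxes example above (table and column names are illustrative):
-- All boxes of a user live in one partition, answering "get all boxes by user id" with one read
CREATE TABLE boxes_by_user (
belongs_to_user text,
box_id int,
box_title text,
PRIMARY KEY (belongs_to_user, box_id)
);
SELECT box_id, box_title FROM boxes_by_user WHERE belongs_to_user = 'user1';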