Store ordered items in Cassandra - nosql

I'm working on a web application that should use Apache Cassandra to store its data.
I need to store a rating for each item and then get a list of items with the highest ratings.
So the task is to store some additional info for items in sorted order, to avoid client-side ordering or ordering with ORDER BY.
One of the possible options is to create an index Column Family:
userId {
100_ItemId1 : null,
90__ItemId2 : null,
80__ItemId3 : null,
80__ItemId4 : null
}
Note: userId is the row key; 100, 90, 80 are rating values.
But there is an issue with deletion: we need to know the previous rating value to remove the index entry, and that can require storing reversed info in another Column Family:
reversed_userId{
ItemId1 : 100_ItemId1,
ItemId2 : 90__ItemId2,
...
}
Could you please tell me whether there are patterns for storing ordered items effectively?
P.S.: I'm not going to use OrderPreservingPartitioner, since it is applied to the whole KeySpace and can damage load balancing and performance.

I'm hoping you'll be happy to know that in CQL 3 you can now use a composite key structure to sort.
http://www.datastax.com/dev/blog/whats-new-in-cql-3-0
So for example:
CREATE TABLE SortedPosts (
post_id int,
sort_order int,
post_title text,
PRIMARY KEY(post_id, sort_order)
);
sort_order will sort the rows within each post_id partition, and you can run:
SELECT * FROM SortedPosts WHERE post_id = 1 ORDER BY sort_order ASC
SELECT * FROM SortedPosts WHERE post_id = 1 ORDER BY sort_order DESC
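Applied to the original rating use case, a minimal sketch could look like the following (the table and column names here are made up, not from the question): the partition key is the user and the clustering columns sort the items by rating, highest first.
CREATE TABLE user_item_ratings (
    user_id int,
    rating int,
    item_id text,
    PRIMARY KEY (user_id, rating, item_id)
) WITH CLUSTERING ORDER BY (rating DESC, item_id ASC);
-- Highest-rated items for a user come back already sorted by the clustering order:
SELECT item_id, rating FROM user_item_ratings WHERE user_id = 1 LIMIT 10;
Note that changing an item's rating still means deleting the old (rating, item_id) row and inserting a new one, so the reverse-lookup concern from the question still applies.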

Related

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
Also, I have a table that uses this composite type as an array of mydata_t.
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t element in data_list where mydata_t.user_id is 100000.
But I don't know which array element's user_id is equal to 100000.
So I first have to search for the element whose user_id equals 100000; that's my problem, I don't know how to write that query. In fact, I want to update the value of the array element whose user_id equals 100000 (and where the id of tbl is, for example, 1). What would my query be?
Something like this (I know it's wrong !!!)
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of one of the tbl.data elements to 'XXXX', where the user_id of the element is equal to 11.
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current value of value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current value of value, then the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish, and you cannot use an index on it.
Furthermore, updates actually insert a new row in the underlying file on disk, and if your array has more than a few entries this involves substantial work. This gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. It all happens behind the scenes, so it will work, but at a performance penalty.
Even though array_replace sounds like changes are made in place (and they indeed are, in memory), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements, at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
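As a rough sketch of that normalized design (the table layout, foreign key, and view name below are assumptions, not part of the original answer):
CREATE TABLE data_list (
    id      integer NOT NULL REFERENCES tbl (id),
    user_id integer NOT NULL,
    value   character(4),
    PRIMARY KEY (id, user_id)
);
-- Optionally, a view can still expose the data in the original array shape:
CREATE VIEW tbl_with_data AS
SELECT id, array_agg((user_id, value)::mydata_t) AS data_list
FROM data_list
GROUP BY id;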

How to work with data values formatted [{}, {}, {}]

I apologize if this is a simple question - I had some trouble even formatting the question when I was trying to Google for help!
In one of the tables I am working with, there's a data value that looks like the one below:
Invoice ID | Status | Product List
1234 | Processed | [{"product_id":463153},{"product_id":463165},{"product_id":463177},{"pid":463218}]
I want to count how many products each order has purchased. What is the proper syntax to count the values under the "Product List" column? I'm aware that count() is wrong, and that I may need to extract the data from the string value.
select invoice_id, count(Product_list)
from quote_table
where status = 'processed'
group by invoice_id
You can use the JSON function json_array_length and cast this column to the JSON data type (as long as the values are valid JSON), for example:
select invoice_id, json_array_length(Product_list::json) as count
from quote_table
where status = 'processed'
group by invoice_id;
invoice_id | count
------------+-------
1234 | 4
(1 row)
If you need to count a specific property of the JSON column, you can use the query below.
This query solves the problem by using CREATE TYPE, json_populate_recordset and a subquery to count product_id inside the JSON data.
drop type if exists count_product;
create type count_product as (product_id int);
select
    t.invoice_id,
    t.status,
    (
        select count(*)
        from json_populate_recordset(null::count_product, t.Product_list)
        where product_id is not null
    ) as count_product_id
from (
    -- symbolic data to use in the query
    select
        1234 as invoice_id,
        'processed' as status,
        '[{"product_id":463153},{"product_id":463165},{"product_id":463177},{"pid":463218}]'::json as Product_list
) as t;
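As an alternative sketch without the helper type (this assumes Product_list holds valid JSON text), you can unnest the array with jsonb_array_elements and count only the elements that actually have a product_id key:
select invoice_id,
       count(*) filter (where elem ? 'product_id') as product_id_count
from quote_table,
     lateral jsonb_array_elements(Product_list::jsonb) as elem
where status = 'processed'
group by invoice_id;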

SQL query to filter where all array items in JSONB array meet condition

I made a similar post before, but deleted it as it had contextual errors.
One of the tables in my database includes a JSONB column which includes an array of JSON objects. It's not dissimilar to this example of a session table which I've mocked up below.
id | user_id | snapshot | inserted_at
1 | 37 | {cart: [{product_id: 1, price_in_cents: 3000, name: "product A"}, {product_id: 2, price_in_cents: 2500, name: "product B"}]} | 2022-01-01 20:00:00.000000
2 | 24 | {cart: [{product_id: 1, price_in_cents: 3000, name: "product A"}, {product_id: 3, price_in_cents: 5500, name: "product C"}]} | 2022-01-02 20:00:00.000000
3 | 88 | {cart: [{product_id: 4, price_in_cents: 1500, name: "product D"}, {product_id: 2, price_in_cents: 2500, name: "product B"}]} | 2022-01-03 20:00:00.000000
The query I've worked with to retrieve records from this table is as follows.
SELECT sessions.*
FROM sessions
INNER JOIN LATERAL (
SELECT *
FROM jsonb_to_recordset(sessions.snapshot->'cart')
AS product(
"product_id" integer,
"name" varchar,
"price_in_cents" integer
)
) AS cart ON true;
I've been trying to update the query above to retrieve only the records in the sessions table for which ALL of the products in the cart have a price_in_cents value of greater than 2000.
To this point, I've not had any success on forming this query but I'd be grateful if anyone here can point me in the right direction.
You can use a JSON path expression:
select *
from sessions
...
where not sessions.snapshot @@ '$.cart[*].price_in_cents <= 2000'
There is no JSON path expression that would check that all array elements are greater than 2000. So this returns those rows where no element is less than or equal to 2000, because that can be expressed with a JSON path expression.
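As a complete form of that idea, a sketch against the mocked-up sessions table above (JSON path operators require PostgreSQL 12 or later):
SELECT *
FROM sessions
WHERE NOT (snapshot @@ '$.cart[*].price_in_cents <= 2000');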
Here is one possible solution based on the idea of your original query.
Each element of the cart JSON array is joined to its parent sessions row. You're left to add the WHERE clause conditions now that the wanted JSON array elements are exposed (one possible way to finish it is sketched after the explanation below).
SELECT *
FROM (
SELECT
sess.id,
sess.user_id,
sess.inserted_at,
cart_items.cart_name,
cart_items.cart_product_id,
cart_items.cart_price_in_cents
FROM sessions sess,
LATERAL (SELECT (snapshot -> 'cart') snapshot_cart FROM sessions WHERE id = sess.id) snap_arr,
LATERAL (SELECT
(value::jsonb ->> 'name')::text cart_name,
(value::jsonb -> 'product_id')::int cart_product_id,
(value::jsonb -> 'price_in_cents')::int cart_price_in_cents
FROM JSONB_ARRAY_ELEMENTS(snap_arr.snapshot_cart)) cart_items
) session_snapshot_cart_product;
Explanation:
From the sessions table, the cart array is extracted and joined to its sessions row.
The necessary items of the cart JSON array are then unnested by the second join, using the JSONB_ARRAY_ELEMENTS(jsonb) function.
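As a sketch of one possible way to finish it (not part of the original answer; it re-does the unnesting in a CTE with jsonb_array_elements and keeps only the sessions whose cheapest cart item is above 2000):
WITH exploded_cart AS (
    SELECT
        sess.id,
        sess.user_id,
        sess.inserted_at,
        (cart.elem ->> 'price_in_cents')::int AS cart_price_in_cents
    FROM sessions sess,
         LATERAL jsonb_array_elements(sess.snapshot -> 'cart') AS cart(elem)
)
SELECT id, user_id, inserted_at
FROM exploded_cart
GROUP BY id, user_id, inserted_at
HAVING min(cart_price_in_cents) > 2000;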
The following worked well for me and gave me the flexibility to use comparison operators other than just ones such as == or <=.
In one of the scenarios I needed to construct, the WHERE in the subquery also had to compare against a list of values using the IN operator, which was not viable with some of the other solutions I looked at (a sketch of that IN-based variant follows the query below).
Leaving this here in case others run into the same issue as I did, or in case others find better solutions or want to propose suggestions to build on this one.
SELECT *
FROM sessions
WHERE NOT EXISTS (
    SELECT 1
    FROM jsonb_to_recordset(sessions.snapshot->'cart')
        AS product(
            "product_id" integer,
            "name" varchar,
            "price_in_cents" integer
        )
    WHERE name ILIKE 'Product%'
);
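For example, a sketch of the IN-based variant mentioned above (the product_id values here are made up): it keeps only the sessions whose cart contains none of the listed products.
SELECT *
FROM sessions
WHERE NOT EXISTS (
    SELECT 1
    FROM jsonb_to_recordset(sessions.snapshot->'cart')
        AS product(
            "product_id" integer,
            "name" varchar,
            "price_in_cents" integer
        )
    WHERE product_id IN (1, 4)
);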

KDB/q: insertions of list into table

I am looking to insert a list of data into tables. I have tried both upsert and insert; neither updated the trade table.
// simTrade is a function that takes in
// number of orders and generate a table with random data
simTrade:{[nOrders]
seed:-314159;
openTime:`time$09:30;
closeTime:`time$16:00;
listCustomers:`XNYS`ARCX`XCHI`XASE`XCIS`XNAS`XBOS`XPHL`BATS`BATY`EDGA`EDGX`IEXG;
listProducts: `Derivative`Futures`Indicies;
system "S ",string seed;
times: asc closeTime&openTime+nOrders?390*60*1000;
dates: asc 2015.03.01&2015.01.01+nOrders?30;
customers: nOrders?listCustomers;
products: nOrders?listProducts;
orderIds: 1+til nOrders;
versions: nOrders?5;
sizes: 100*nOrders?10;
trade:([]
time:`time$();
date:`date$();
customer:`symbol$(); / Customer name {xyz fund, asd,fund}
product:`symbol$(); / product name {Derivative, Equities}
orderId:`long$(); / order id {1-10}
version:`long$(); / version {1-10}
size:`long$()
);
insert[`trade; (times;dates;customers;products;orderIds;versions;sizes)];
show trade
}
lob:simTrade[100]
I have tried checking the data type error but couldn't find any issues with it.
May I also ask why, when I change insert to upsert, an error is returned:
evaluation error: length
Thanks for any help.
insert and upsert can be applied to global variables only. The trade table is a local variable, hence insert/upsert throws an error. For more details see insert.
I would suggest filling the table values in place:
...
trade: ([]
time:times;
date:dates;
customer:customers; / Customer name {xyz fund, asd,fund}
product:products; / product name {Derivative, Equities}
orderId:orderIds; / order id {1-10}
version:versions; / version {1-10}
size:sizes
);
...

One to Many equivalent in Cassandra and data model optimization

I am modeling my database in Cassandra, coming from an RDBMS. I want to know how I can create a one-to-many relationship that is embedded in the same column name, and how to model my table to fit the following query needs.
For example:
Boxes:{
23442:{
belongs_to_user: user1,
box_title: 'the box title',
items:{
1: {
name: 'itemname1',
size: 44
},
2: {
name: 'itemname2',
size: 24
}
}
},
{ ... }
}
I read that it's preferable to use composite columns instead of super columns, so I need an example of the best way to implement this. My queries are:
Get items for a box by its ID
Get the top 20 boxes with their items (for displaying a range of boxes with their items on the page)
Update an item's size by item ID (increment the size by a number)
Get all boxes by user ID (all boxes that belong to a specific user)
I am expecting lots of writes to change the size of each item in a box. I want to know the best way to implement this without the need for super columns. Furthermore, I don't mind a solution that takes Cassandra 1.2's new features into account, because I will use that version in production.
Thanks
This particular model is somewhat challenging, for a number of reasons.
For example, with the box ID as a row key, querying for a range of boxes will require a range query in Cassandra (as opposed to a column slice), which means the use of an ordered partitioner. An ordered partitioner is almost always a Bad Idea.
Another challenge comes from the need to increment the item size, as this calls for the use of a counter column family. Counter column families store counter values only.
Setting aside the need for a range of box IDs for a moment, you could model this using multiple tables in CQL3 as follows:
CREATE TABLE boxes (
id int PRIMARY KEY,
belongs_to_user text,
box_title text
);
CREATE INDEX useridx on boxes (belongs_to_user);
CREATE TABLE box_items (
id int,
item int,
size counter,
PRIMARY KEY(id, item)
);
CREATE TABLE box_item_names (
id int,
item int,
name text,
PRIMARY KEY(id, item)
);
BEGIN BATCH
INSERT INTO boxes (id, belongs_to_user, box_title) VALUES (23442, 'user1', 'the box title');
INSERT INTO box_item_names (id, item, name) VALUES (23442, 1, 'itemname1');
INSERT INTO box_item_names (id, item, name) VALUES (23442, 2, 'itemname2');
UPDATE box_items SET size = size + 44 WHERE id = 23442 AND item = 1;
UPDATE box_items SET size = size + 24 WHERE id = 23442 AND item = 2;
APPLY BATCH;
-- Get items for box by ID
SELECT size FROM box_items WHERE id = 23442 AND item = 1;
-- Boxes by user ID
SELECT * FROM boxes WHERE belongs_to_user = 'user1';
It's important to note that the BATCH mutation above is both atomic and isolated.
Technically speaking, you could also denormalize all of this into a single table. For example:
CREATE TABLE boxes (
id int,
belongs_to_user text,
box_title text,
item int,
name text,
size counter,
PRIMARY KEY(id, item, belongs_to_user, box_title, name)
);
UPDATE boxes set size = size + 44 WHERE id = 23442 AND belongs_to_user = 'user1'
AND box_title = 'the box title' AND name = 'itemname1' AND item = 1;
SELECT item, name, size FROM boxes WHERE id = 23442;
However, this provides no guarantees of correctness. For example, this model makes it possible for items of the same box to have different users, or titles. And, since this makes boxes a counter column family, it limits how you can evolve the schema in the future.
I think in terms of PlayOrm's objects first, and then show the column model below.
Box {
  @NoSqlId
  String id;
  @NoSqlEmbedded
  List<Item> items;
}
User {
  @NoSqlId
  TimeUUID uuid;
  @OneToMany
  List<Box> boxes;
}
The User is then a row like so
rowkey = uuid=<someuuid> boxes.fkToBox35=null, boxes.fkToBox37=null, boxes.fkToBox38=null
Note, the form of the above is columnname=value, where some of the column names are composite and some are not.
The box is more interesting; say Item has the fields name and idnumber, then the box row would be
rowkey = id=myid, items.item23.name=playdo, items.item23.idnumber=5634, items.item56.name=pencil, items.item56.idnumber=7894
I am not sure what you meant, though, by getting the top 20 boxes? Top boxes meaning by the number of items in them?
Dean
You can use a query-driven methodology for data modeling. You have three broad access paths:
1) partition per query
2) partition+ per query (one or more partitions)
3) table or table+ per query
The most efficient option is “partition per query”. This article can help you in this case, step by step; its sample is exactly a one-to-many relation.
And accordingly, you will have several tables with some similar columns. You can manage this with materialized views or a batch log (as an alternative approach). A sketch of the "partition per query" idea for the boxes-by-user query follows below.
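As a minimal sketch of the "partition per query" idea for the "all boxes by user ID" query (the table and column names below are assumptions, not taken from the article):
CREATE TABLE boxes_by_user (
    belongs_to_user text,
    box_id int,
    box_title text,
    PRIMARY KEY (belongs_to_user, box_id)
);
-- A single partition read answers the whole query:
SELECT box_id, box_title FROM boxes_by_user WHERE belongs_to_user = 'user1';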