Cassandra: how to search key columns with DECIMAL?

I understand that Cassandra is designed around String-based key/value pairs.
I need a Cassandra table with decimal keys. Is there any way to search the keys by a range of numeric values, like keys between 3 and 6 (inclusive)?
Sample Key Column
1
3.3
6.345
9
10
2.5

Let's try this out. Assume a simple table with a decimal key, and a text value.
CREATE TABLE decimalRangePK (dec decimal, value text, PRIMARY KEY (dec));
In this case, dec is my partition key. And it is my only key, as there is no clustering key present. After INSERTing some data, here is what I have:
aploetz@cqlsh:stackoverflow> SELECT * FROM decimalrangepk ;

 dec  | value
------+-------
  2.5 |   ghi
 6.35 |   abc
    9 |   def
  3.2 |   3.2
    1 |     1
  3.3 |   3.3
   10 |   ten

(7 rows)
So I assume that you are trying a range query on your partition key, like this:
aploetz@cqlsh:stackoverflow> SELECT * FROM decimalrangepk WHERE dec>=3.3 AND dec<=9;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
As you can see, this doesn't work: Cassandra cannot execute a range query on a partition key. However, because clustering keys are used to enforce on-disk sort order (within a partition key), you can execute a range query on a clustering key.
In this next example, I'll try this again. But this time I will partition my data by date, like this:
CREATE TABLE decimalRangeCK (dateBucket text, dec decimal, value text,
PRIMARY KEY (dateBucket,dec));
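For illustration, rows matching the output below could be loaded like this (a sketch; the literal values are read off the result set that follows):
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151108', 1, '1');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151108', 3.2, '3.2');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151109', 1, '1');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151109', 3.3, '3.3');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151109', 6.35, 'abc');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151109', 9, 'def');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151110', 2.5, 'ghi');
INSERT INTO decimalRangeCK (dateBucket, dec, value) VALUES ('20151110', 10, 'ten');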
After inserting some rows, I'll query the table and it will look slightly different:
aploetz@cqlsh:stackoverflow> SELECT * FROM decimalrangeck ;

 datebucket | dec  | value
------------+------+-------
   20151108 |    1 |     1
   20151108 |  3.2 |   3.2
   20151110 |  2.5 |   ghi
   20151110 |   10 |   ten
   20151109 |    1 |     1
   20151109 |  3.3 |   3.3
   20151109 | 6.35 |   abc
   20151109 |    9 |   def

(8 rows)
Now I can run a range query on dec, as long as I also provide a partition key:
aploetz@cqlsh:stackoverflow> SELECT * FROM decimalrangeck WHERE datebucket='20151109'
    AND dec>=3.3 AND dec<=9;

 datebucket | dec  | value
------------+------+-------
   20151109 |  3.3 |   3.3
   20151109 | 6.35 |   abc
   20151109 |    9 |   def

(3 rows)
As you can see, picking a good partition key is very important. High-cardinality, unique partition keys are great for data distribution, but they don't give you much query flexibility.
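Worth noting: if your range spans a handful of known buckets, you can still query them together (a sketch; an IN on the partition key combined with a clustering-key range is valid CQL, but multi-partition queries put extra load on the coordinator node):
SELECT * FROM decimalrangeck WHERE datebucket IN ('20151108','20151109') AND dec>=3.3 AND dec<=9;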

Related

Querying data with additional column that creates a number for ordering purposes

I am trying to create a "queue" system by adding an arbitrary column that assigns a number to each row, based on a condition and a date, to rank the rows by importance.
For example, below is the query result I pulled in Postgres:
Table: task
Result:
 description | status/condition | task_created
-------------+------------------+---------------------
 bla         | A                | 2019-12-01 07:00:00
 pikachu     | A                | 2019-12-01 16:32:10
 abcdef      | B                | 2019-12-02 18:34:22
 doremi      | B                | 2019-12-02 15:09:43
 lalala      | A                | 2019-12-03 22:10:59
In the above, each task has a date/timestamp and status/condition applied to them. I would like to create another column that gives a number to a row where it prioritises the older tasks first, BUT if the condition is B, then we take the older task of those in B as first priority.
The expected end result (based on the example) should be:
Table1: task
 description | status/condition | task_created        | priority index
-------------+------------------+---------------------+----------------
 bla         | A                | 2019-12-01 07:00:00 | 3
 pikachu     | A                | 2019-12-01 16:32:10 | 4
 abcdef      | B                | 2019-12-02 18:34:22 | 2
 doremi      | B                | 2019-12-02 15:09:43 | 1
 lalala      | A                | 2019-12-03 22:10:59 | 5
For the priority number, 1 is the most urgent to do/resolve, while 5 is the least urgent.
How would I go about adding this additional column to the existing query, especially since there's another condition besides the task_created date/time?
Any help is appreciated. Many thanks!
You probably want the rank() or dense_rank() window function, depending on your needs.
If you don't need a conditional order on the status, you can use this one:
SELECT *,
       rank() OVER (
           ORDER BY status DESC, task_created
       ) AS priority_index
FROM task
If you need a custom order based on the value of the status:
SELECT *,
       rank() OVER (
           ORDER BY
               CASE status
                   WHEN 'B' THEN 1
                   WHEN 'A' THEN 2
                   WHEN 'C' THEN 3
                   ELSE 4
               END,
               task_created
       ) AS priority_index
FROM task
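Applied to your sample data, both queries produce the expected priority_index: doremi → 1, abcdef → 2, bla → 3, pikachu → 4, lalala → 5 (B rows first, oldest first within each status).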
If you have only a few values, this is good enough, because you can simply spell out the custom order. But if you have many values and the ordering information is fixed, it should live in its own table.
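For example, the custom order could be joined in from its own table (a sketch; the status_priority table and its contents are hypothetical):
CREATE TABLE status_priority (
    status   text PRIMARY KEY,
    ordering integer NOT NULL
);

INSERT INTO status_priority (status, ordering) VALUES ('B', 1), ('A', 2);

SELECT t.*,
       rank() OVER (
           ORDER BY sp.ordering, t.task_created
       ) AS priority_index
FROM task t
JOIN status_priority sp ON sp.status = t.status;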

Stored procedure (or better way) to add a new row to an existing table every day at 22:00

I will be very grateful for your advice regarding the following issue.
Given:
PostgreSQL database
Initial (basic) query
select day, Value_1, Value_2, Value_3
from table
where day=current_date
which returns a row with the following columns:
    Day     | Value_1 (int) | Value_2 (int) | Value_3 (int)
------------+---------------+---------------+---------------
 2019-11-14 |            10 |            10 |            14
I need to create a view with this starting information and add a new row every day, based on the outcome of the initial query executed at 22:00.
The expected outcome tomorrow at 22:01 will be
    Day     | Value_1 | Value_2 | Value_3
------------+---------+---------+---------
 2019-11-14 |      10 |      10 |      14
 2019-11-15 |       N |       M |       P
Many thanks in advance for your time and support.
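One way to do this (a sketch, assuming the pg_cron extension is available; the daily_history and daily_values names are hypothetical) is to materialize the query's result into a history table on a schedule, and define the view over that table:
CREATE TABLE daily_history (
    day     date PRIMARY KEY,
    value_1 integer,
    value_2 integer,
    value_3 integer
);

-- run the initial query every day at 22:00 and append its row
SELECT cron.schedule('0 22 * * *', $$
    INSERT INTO daily_history (day, value_1, value_2, value_3)
    SELECT day, Value_1, Value_2, Value_3
    FROM "table"              -- the question's placeholder table name
    WHERE day = current_date
$$);

-- the "view" is then simply:
CREATE VIEW daily_values AS SELECT * FROM daily_history;
Note that pg_cron evaluates schedules in the timezone configured for the extension (GMT by default), so 22:00 may need adjusting.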

Database design with multiple units and multiple attributes

Supposing I have this database design which I have researched.
Table: Products
 ProductId | Name     | BaseUnitId
-----------+----------+------------
 1         | Lab gown | 1
 2         | Gloves   | 1
FK: BaseUnitId references Units.UnitId
Table: Units
 UnitId | Name
--------+---------------
 1      | Each / Pieces
 2      | Dozen
 3      | Box
Table: Unit Conversion
 ProdID | BaseUnitID | Factor | ConvertToUnitID
--------+------------+--------+-----------------
 1      | 1          | 12     | 2
 2      | 1          | 100    | 3
FK: BaseUnitId references Units.UnitID
FK: ConvertToUnitId references Units.UnitID
Table: Product Attribute
 AttribId | Prod_ID | Attribute | Value
----------+---------+-----------+--------
 1        | 1       | Color     | Blue
 2        | 1       | Size      | Large
 3        | 2       | Color     | Violet
 4        | 2       | Size      | Small
 5        | 2       | Size      | Medium
 6        | 2       | Size      | Large
 7        | 2       | Color     | White
FK: Prod_ID references Product.ProductID
Table: Inventory
 Prod_ID | Base Unit Qty | Expiry
---------+---------------+------------
 1       | 12            | n/a
 2       | 100           | 2020-01-01
 2       | 100           | 2021-12-31
FK: Prod_ID references Product.ProductID
How can I break down the inventory per unit, per attribute?
E.g. how can I get the inventory of SMALL VIOLET GLOVES, or of LARGE WHITE GLOVES?
Any suggestions? My idea is to create another table which links product unit, product attribute and quantity.
But I don't know how to link the size attribute and the color attribute to a unit.
Lastly, is there something wrong with this design?
I think it is quite wrong to split off the attributes of a product into a different table. I understand the desire to normalize, but it should be done differently.
I'd handle a product and its attributes like this:
CREATE TABLE product (
    id          bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name        text NOT NULL,
    baseunit_id bigint NOT NULL REFERENCES unit  -- base unit for inventory counts
);

CREATE TABLE inventory (
    id               bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    product_id       bigint NOT NULL REFERENCES product,
    color            integer REFERENCES product_color,  -- common attribute, lookup table
    size             integer REFERENCES product_size,   -- common attribute, lookup table
    other_attributes jsonb                              -- rare attributes, see below
);
That also makes sense if you think about it in natural-language terms: “How many dozens of large blue gloves do we have in store?”
Attributes that do not apply to a certain product can be left NULL.
I make a distinction between common and rare attributes. Common attributes get their own column. Rare attributes are bunched together in a jsonb column. I know that the latter is neither normalized nor pretty, but highly varying attributes are not well suited to a relational model. A GIN index on the column will allow searches to be efficient.
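For example (a sketch; the attribute key inside other_attributes is hypothetical):
-- index the rare attributes for fast containment searches
CREATE INDEX inventory_other_attributes_idx
    ON inventory USING gin (other_attributes);

-- containment queries like this one can use the index
SELECT *
FROM inventory
WHERE other_attributes @> '{"cuff_style": "long"}';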

String splitting and operations on only some results

I have strings that look like this:
 schedulestart | event_labels
---------------+------------------------------------------
 2018-04-04    | 9=TTR&11=DNV&14=SWW&26=DNV&2=QQQ&43=FTW
That's how it looks when I inspect it in the database. I have code that relies on this string being in this format to display a schedule with events with those labels on those days.
Now I find myself needing to break down the string in Postgres for reporting/analysis, and I can't really pull out the string and parse it in another language, so I have to stick to Postgres.
I've figured out a way to unpack the string so my results look like this:
 User ID | Schedule Start | Unpacked String
---------+----------------+-----------------
 2       | 2018-04-04     | TTR
 2       | 2018-04-04     | 9
 2       | 2018-04-04     | DNV
 2       | 2018-04-04     | 11
 2       | 2018-04-04     | SWW
 2       | 2018-04-04     | 14
 2       | 2018-04-04     | DNV
 2       | 2018-04-04     | 26
select schedulestart, unnest(string_to_array(unnest(string_to_array(event_labels, '&')), '=')) from table;
Now what I need is a way to actually perform an interval calculation (so 2018-04-04+11 days::interval), and I can if I only get a numbers list, but I need to also bind that result to each string. So the goal is an output like this:
 eventdate  | event_label
------------+-------------
 2018-04-12 | TTR
 2018-04-20 | DNV
Where eventdate is the schedule start + which day of the schedule the event is on. I'm not sure how to take the unpacked string I created and use it to perform date calculations, and tie it to the string.
I've considered doing only one unnest, so that it's 11=TTR and 14=DNV, but I'm not sure how to take that to my desired result either. Is there a way to read a string until you reach a certain character, and then use that in calculations, and then read every character past a certain character in a string into a new column?
I'm aware that completely rewriting how this is handled would be ideal, but I did not initially write it, and I don't have the time or means to rewrite the ~20 locations where it is used.
Here is your table (I added userid column):
CREATE TABLE test(userid INTEGER, schedulestart DATE, event_labels VARCHAR);
And input data:
INSERT INTO test(userid,schedulestart , event_labels) VALUES
(2,DATE '2018-04-04', '9=TTR&11=DNV&14=SWW&26=DNV&2=QQQ&43=FTW');
And finally the solution:
SELECT
    userid,
    (schedulestart + (SPLIT_PART(kv, '=', 1) || ' days')::INTERVAL)::DATE AS eventdate,
    SPLIT_PART(kv, '=', 2) AS event_label
FROM (
    SELECT
        userid, schedulestart,
        REGEXP_SPLIT_TO_TABLE(event_labels, '&') AS kv
    FROM test
    WHERE userid = 2
) a
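With the sample row, this returns the following (here eventdate is schedulestart + N days; if day N of the schedule should map to schedulestart + N - 1, as the TTR row in your expected output suggests, subtract one more day in the interval expression):
 userid | eventdate  | event_label
--------+------------+-------------
 2      | 2018-04-13 | TTR
 2      | 2018-04-15 | DNV
 2      | 2018-04-18 | SWW
 2      | 2018-04-30 | DNV
 2      | 2018-04-06 | QQQ
 2      | 2018-05-17 | FTW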

PostgreSQL - Pull earliest timestamp per user

I have a table which records each time the user performs a certain behavior, with timestamps for each iteration. I need to pull one row from each user with the earliest timestamp as part of a nested query.
As an example, the table looks like this:
 row | user_id | timestamp  | description
-----+---------+------------+-------------
 1   | 100     | 02-02-2010 | android
 2   | 100     | 02-03-2010 | ios
 3   | 100     | 02-05-2010 | windows
 4   | 111     | 02-01-2010 | ios
 5   | 112     | 02-03-2010 | android
 6   | 112     | 02-04-2010 | android
And my query should pull just rows 1, 4 and 5.
Thanks!
This should help, though I don't understand the nested-query part of your question.
SELECT user_id, MIN(timestamp) AS min_timestamp
FROM table1
GROUP BY user_id
ORDER BY user_id;
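If you need the whole earliest row per user (row, description, etc.) rather than just the timestamp, DISTINCT ON is a common Postgres idiom (a sketch, keeping the table1 name from above):
SELECT DISTINCT ON (user_id) *
FROM table1
ORDER BY user_id, timestamp;
This keeps only the first row per user_id in the given ordering, i.e. rows 1, 4 and 5 from your example.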