simplest example of a query by date range in cassandra 1.x - nosql

I want to store an ID and a date and I want to retrieve all entries from dateA up to dateB, what exactly do I need to be able to perform select from my_column_family where date >= dateA and date < dateB; ?

The guys at #cassandra (IRC) helped me find a way. There are many subtle details, so I'd like to document them here.
First you need to declare a column family similar to this (examples from cassandra-cli):
create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[
{column_name: id, validation_class: LongType},
{column_name: name, validation_class: UTF8Type, index_type: KEYS},
{column_name: age, validation_class: LongType}
];
A few important things about this declaration:
the comparator and key_validation_class are there so that strings can be used as column names and row keys, respectively
the "row key" (the first bracketed value in the set statements below) is special: it is used to address each row and therefore cannot contain duplicate values (the INSERT is really an UPSERT, so when there are duplicates the new values overwrite the old ones)
the second column declares a "secondary index" on its values (more on that below)
dates would be stored as Long datatypes (like age in this example); interpretation is up to the client
now let's add some values:
set users[1][name] = john;
set users[1][age] = 19;
set users[2][name] = jane;
set users[2][age] = 21;
set users[3][name] = john;
set users[3][age] = 32;
According to this: http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/ Cassandra does not support the < and > operators natively. What it does is manually exclude the rows that don't match, but it does that AFTER there's a result set, and it also refuses to do so unless actual filtering has taken place.
What that means is that a query like get users where age > 20; will return null, but if we add a predicate that includes = it'll magically work.
Here's where the secondary index is important: without it you can't use =, so in this example I can do get users where name = jane; but I cannot ask for get users where age = 21;
The funny thing is that, after using =, the range operators work, so having a secondary index allows you to ask for get users where name = john and age > 20; and it'll filter correctly.
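To make the rule above concrete, here is a toy in-memory sketch in Python. It is not Cassandra's API: the names rows, indexed_columns and get_where are made up for illustration. It only mimics the behaviour described, where a query needs at least one equality on an indexed column and range conditions are then applied as a filter on that result.

```python
# Toy model of the behaviour described above; NOT Cassandra's API.
# `rows`, `indexed_columns` and `get_where` are hypothetical names.
rows = {
    "1": {"name": "john", "age": 19},
    "2": {"name": "jane", "age": 21},
    "3": {"name": "john", "age": 32},
}
indexed_columns = {"name"}  # mirrors index_type: KEYS on `name`


def get_where(rows, indexed_columns, equals, greater_than=None):
    """Return row keys matching the equality clauses and optional > filters."""
    greater_than = greater_than or {}
    # At least one equality clause must target an indexed column.
    if not any(col in indexed_columns for col in equals):
        raise ValueError("no indexed column with an equality clause")
    matches = []
    for key, cols in sorted(rows.items()):
        equal_ok = all(cols.get(c) == v for c, v in equals.items())
        range_ok = all(c in cols and cols[c] > v for c, v in greater_than.items())
        if equal_ok and range_ok:
            matches.append(key)
    return matches


# name = john and age > 20 is accepted; age alone would be rejected.
print(get_where(rows, indexed_columns, {"name": "john"}, {"age": 20}))
```

In this sketch an unindexed query raises an error instead of returning an empty result, which is a choice made here for clarity rather than a claim about what the CLI prints.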

There are a few ways to solve this. The simplest is probably the secondary index solution with the equality limitation mentioned in your own answer. I've used this method, adding an additional column called 'valid', setting the value to 1. Then the queries can become where valid=1 and date>nnnn
The other solutions require additional column families and additional queries.
When loading the data, create and add to a column family which contains the timestamps as keys, and each entry would list all the user ids as column names.
If the partitioning strategy is ordered, then a single RangeSliceQuery can specify the date range as a key range and get all the columns for each key. Then iterate through the result keys, reading the columns of each to obtain the user ids and, if needed, query the original column family for the data associated with each id. Cassandra always stores the column names sorted, and the order can be reversed when reading.
But, as documented, the ordered partitioner is not ideal, leading to hot spots and difficulty in load balancing the nodes.
Without the ordered partitioner, still keeping the timestamp column family, you would have to create another column family while loading data where you can store all the timestamps as the columns under one or more known keys (e.g. 'created' or 'updated'). The first query would be a SliceQuery for a known key, and then the column names (as timestamps) would provide the keys for the MultigetSliceQuery to the timestamp column family.
I've used variations on this, usually adding Composite keys or columns for additional flexibility.
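The two-step lookup without an ordered partitioner can be sketched roughly like this in Python. The dictionaries stand in for the two extra column families, the integer timestamps and user ids are invented sample data, and users_in_range is a hypothetical helper, not a client-library call.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-ins for the two extra column families.
# timestamp_index: one known row key -> sorted timestamps (the column names)
# users_by_timestamp: timestamp (row key) -> user ids (the column names)
timestamp_index = {"created": [1000, 1005, 1010, 1020]}
users_by_timestamp = {
    1000: ["u1"],
    1005: ["u2", "u3"],
    1010: ["u4"],
    1020: ["u5"],
}


def users_in_range(start, end):
    """Return user ids with start <= timestamp < end."""
    columns = timestamp_index["created"]
    # Step 1: the "SliceQuery" over the sorted column names of the known key.
    lo = bisect_left(columns, start)
    hi = bisect_left(columns, end)
    # Step 2: the "MultigetSliceQuery", using those timestamps as row keys.
    found = []
    for ts in columns[lo:hi]:
        found.extend(users_by_timestamp[ts])
    return found


print(users_in_range(1005, 1020))
```

The point of the design is that the range scan happens over sorted column names inside one row, which works with any partitioner, while the row keys themselves are only ever fetched by exact value.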

Related

How to efficiently index fields with an identical (and long) prefix in PostgreSQL?

I’m working with identifiers in a rather unusual format: every single ID has the same prefix and the prefix consists of as many as 25 characters. The only thing that is unique is the last part of the ID string and it has a variable length of up to ten characters:
ID
----------------------------------
lorem:ipsum:dolor:sit:amet:12345
lorem:ipsum:dolor:sit:amet:abcd123
lorem:ipsum:dolor:sit:amet:efg1
I’m looking for advice on the best strategy around indexing and matching this kind of ID string in PostgreSQL.
One approach I have considered is basically cutting these long prefixes out and only storing the unique suffix in the table column.
Another option that comes to mind is only indexing the suffix:
CREATE INDEX ON books (substring(book_id FROM 26));
I don’t think this is the best idea though as you would need to remember to always strip out the prefix when querying the table. If you forgot to do it and had a WHERE book_id = '<full ID here>' filter, the index would basically be ignored by the planner.
Most of the time I create an integer ID for my tables, even when I already have a unique string field. To give you a solid recommendation I would need to see all the queries you run against the database. If you are already using substring(book_id FROM 26) in your WHERE clauses, then an expression index (function-based index) is the best way to go. Basically, you need to check your join conditions, i.e. which fields are used when joining tables, and which fields appear in the WHERE clauses of your queries. After that you can work out the best plan for creating indexes. If your joins use only the unique trailing characters of the ID field, then the best approach is to extract those characters and either store them in an additional column or create an expression index using the function that extracts them.
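If you go with storing or indexing only the suffix, one way to avoid forgetting to strip the prefix is to do it in a single place in the application. A minimal Python sketch follows; PREFIX, to_suffix and to_full_id are hypothetical names, and the prefix value is simply taken from the sample IDs in the question.

```python
# Assumed fixed prefix, copied from the sample IDs in the question.
PREFIX = "lorem:ipsum:dolor:sit:amet:"


def to_suffix(book_id):
    """Strip the shared prefix so that only the unique part is stored or queried."""
    if not book_id.startswith(PREFIX):
        raise ValueError(f"unexpected ID format: {book_id!r}")
    return book_id[len(PREFIX):]


def to_full_id(suffix):
    """Rebuild the full ID when it has to be shown or sent to another system."""
    return PREFIX + suffix


print(to_suffix("lorem:ipsum:dolor:sit:amet:abcd123"))
```

Using len(PREFIX) rather than a hard-coded offset keeps the helper correct if the real prefix has a different length than the sample one.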

DynamoDB column with tilde and query using JPA

I have a table column whose values contain a tilde, like below:
vendorAndDate - Column name
Chipotle~08-26-2020 - column value
I want to query by month ("vendorAndPurchaseDate like '%~08%2020'") and by year, i.e. values ending with 2020 ("vendorAndPurchaseDate like '%2020'"). I am using Spring Data JPA to query the values. I have not worked with tilde-separated column values before. Please point me in the right direction or to some examples.
You cannot.
If vendorAndPurchaseDate is your partition key, you need to pass the whole value.
If vendorAndPurchaseDate is the range key, you can only perform the
=, <, <=, >, >=, between and begins_with operations, along with a partition key
reference : https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
DynamoDB does not support this type of wildcard query.
Let's consider a more DynamoDB way of handling this type of query. It sounds like you want to support 2 access patterns:
Get Item by month
Get Item by year
You don't describe your Primary Keys (Partition Key/Sort Key), so I'm going to make some assumptions to illustrate one way to address these access patterns.
Your attribute appears to be a composite key, consisting of <vendor>~<date>, where the date is expressed by MM-DD-YYYY. I would recommend storing your date fields in YYYY-MM-DD format, which would allow you to exploit the sort-ability of the date field. An example will make this much clearer. Imagine your table looked like this:
I'm calling your vendorAndDate attribute SK, since I'm using it as a Sort Key in this example. This table structure allows me to implement your two access patterns by executing the following queries (in pseudocode to remain language agnostic):
Access Pattern 1: Fetch all Chipotle records for August 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-08-00 and Chipotle~2020-08-31
Access Pattern 2: Fetch all Chipotle records for 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-01-01 and Chipotle~2020-12-31
Because dates stored in ISO8601 format (e.g. YYYY-MM-DD...) are lexicographically sortable, you can perform range queries in DynamoDB in this way.
Again, I've made some assumptions about your data and access patterns for the purpose of illustrating the technique of using lexicographically sortable timestamps to implement range queries.
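To show why the YYYY-MM-DD layout matters, here is a small Python sketch that converts the original MM-DD-YYYY values and then applies the same between bounds as the pseudocode above using plain string comparison. It does not call DynamoDB at all; to_sort_key and between are hypothetical helpers, and the extra sample values are invented.

```python
from datetime import datetime


def to_sort_key(value):
    """Convert '<vendor>~MM-DD-YYYY' into '<vendor>~YYYY-MM-DD'."""
    vendor, date_text = value.rsplit("~", 1)
    date = datetime.strptime(date_text, "%m-%d-%Y").date()
    return f"{vendor}~{date.isoformat()}"


def between(keys, low, high):
    """Inclusive string range check, mirroring a BETWEEN key condition."""
    return sorted(k for k in keys if low <= k <= high)


# Invented sample values in the original MM-DD-YYYY layout.
raw = [
    "Chipotle~08-26-2020",
    "Chipotle~01-15-2021",
    "Chipotle~08-03-2020",
    "Chipotle~12-31-2019",
    "Chipotle~11-02-2020",
]
keys = [to_sort_key(v) for v in raw]

august_2020 = between(keys, "Chipotle~2020-08-00", "Chipotle~2020-08-31")
year_2020 = between(keys, "Chipotle~2020-01-01", "Chipotle~2020-12-31")
print(august_2020)
print(year_2020)
```

With the original MM-DD-YYYY strings the same comparisons would group by month first and year last, so a year range could not be expressed as a single contiguous range.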

How to avoid column name conflicts in cassandra

I need to store a list of user names in a Cassandra column family(wide row/dynamic columns).
The columnname/comparator type will be integer, so as to sort the users based on a score.
The score ranges from 0 to 100. The problem is, if two or more users have the same score, how can I store them in different columns, as Cassandra would not allow that?
Is there any way to convert integer to timeuuids? Or any other solution for this problem?
This is a problem I have seen quite often (not scores, but preventing column name conflicts). In general the solution is one form or another of concatenating a UUID to the column name (since those are made to never conflict).
If you want to keep sorting by score then I advise you to use a CompositeType column name.
More specifically:
CompositeType(score: Integer | time: TimeUUID)
The comparator in Cassandra will then first sort by score and then by time (putting the most recent last I believe).
TimeUUID should also take care of "simultaneous" score postings, even though the probability of that happening with a Long timestamp would be ridiculously low.
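As a rough illustration of the composite column name idea, here is a Python sketch where a dict plays the role of one wide row and each column name is a (score, time-based UUID) tuple. add_user and users_by_score are hypothetical helpers; the sort key only approximates how a composite comparator would order the columns.

```python
import uuid


def add_user(row, score, user_name):
    """Store user_name under a composite column name (score, time-based UUID)."""
    column_name = (score, uuid.uuid1())  # uuid1 is Python's time-based UUID
    row[column_name] = user_name
    return column_name


def users_by_score(row):
    """Return user names ordered by score, then by the UUID's timestamp."""
    ordered = sorted(row, key=lambda name: (name[0], name[1].time, name[1].int))
    return [row[name] for name in ordered]


row = {}
add_user(row, 90, "alice")
add_user(row, 75, "bob")
add_user(row, 90, "carol")  # same score as alice, but a distinct column name

print(users_by_score(row))
```

Both users with a score of 90 survive because the UUID half of the column name differs, which is exactly what a bare integer column name could not provide.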
You can use the built-in list feature, see http://www.datastax.com/dev/blog/cql3_collections
Just have a column with a value and a list of users for that value.

What is the purpose of dividing rows into columnfamilies if they can have different number/types of columns anyway?

Given that a column family can have rows with arbitrary structure we could store all rows in a single "store" (avoiding the name 'columnfamily/table' on purpose).
What is the purpose of column families then?
The simplest of all reasons is evident in the name itself "Column Family". A Column Family groups a bunch of related columns together. You could consider it as a namespace containing related columns.
For example, the Column "Name" by itself lacks context, which can be provided by ColumnFamilies like "Employees" or "Cities". Otherwise each Column would need to carry all of its context by itself, with no concept of related Columns.
Atomicity
In Cassandra 1.1 and below, the only atomic guarantee you have is that writes to the same row (i.e. with the same key) will be atomic.
Thus, you must think very carefully about what you want in your columns, and what row those columns should be in, so that your application will behave appropriately if a write fails.
Reasons:
To have a different sort order for the columns within a row. The comparator is specified at column family creation time and can't be changed afterwards. So if you have rows whose columns must be sorted alphabetically and others whose columns must be sorted numerically, you have to create different column families.
Customize the storage options that can be set on a per column family basis, e.g. caching of rows, compaction, deletion of expired columns, etc. Per column family storage options can be found here
Can't mix counter and non-counter columns in the same column family
As mentioned in other answers, due to logical cohesion - columns represent attributes of some entity identified by the row id.

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys are important. The data stored is of no consequence, nor is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith to UTF8Type then the keys will be interpreted as UTF8Types. If you set CompareWith to TimeUUIDType then the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page: http://wiki.apache.org/cassandra/API This is a good place to start. Also, you might find this article useful: http://www.sodeso.nl/?p=80 In the third part or so he talks about slice ranging his queries and so on.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys however, since even the randomization of the default randompartitioner is limited and you could end up cluttering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row with a start value and an end value. This is used mostly for wide rows, as columns are ordered. However, known column names defined in the CF are indexed, so they can be retrieved by specifying their names.
A key slice, is a key associated with the sliced column range as returned by Cassandra
The equivalent of a where clause uses secondary indexes, you may use inequality operators there, however there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitioner, as the MD5 hash of your key doesn't preserve lexical ordering.
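The last point is easy to see with a few lines of Python. This is only a simplification of what a hash-based partitioner does: md5_token is a hypothetical helper that turns the MD5 digest of a key into an integer, and the date-like keys are invented.

```python
import hashlib


def md5_token(row_key):
    """Integer token derived from the MD5 digest of the row key (a simplification)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


# Invented date-like row keys, already in lexical (and chronological) order.
keys = [f"2012-01-{day:02d}" for day in range(1, 31)]

# Order in which the keys would sit on the ring if placed by token.
by_token = sorted(keys, key=md5_token)

print(keys[:3])
print(by_token[:3])
```

Since neighbouring dates end up at unrelated positions on the ring, a key range between two dates does not correspond to a contiguous range of tokens.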
What you want to use is a Column Family based index using a Wide Row:
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key ("shard key") that would split the data across nodes, such as the user type or the region.
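A rough Python sketch of that layout, under some loud assumptions: plain integers stand in for TimeUUIDs, a dict of sorted lists stands in for the index column family, and record and entries_between are hypothetical helpers rather than client calls. Each shard key owns one wide row whose columns are (timestamp, user_id) pairs kept in sorted order.

```python
from bisect import bisect_left, insort

# shard key (e.g. region) -> sorted list of (timestamp, user_id) columns
index = {}


def record(shard, timestamp, user_id):
    """Add one composite column to the wide row of the given shard."""
    insort(index.setdefault(shard, []), (timestamp, user_id))


def entries_between(start, end):
    """Return (timestamp, user_id) pairs with start <= timestamp < end, across shards."""
    found = []
    for columns in index.values():
        # A 1-tuple sorts before any 2-tuple that starts with the same timestamp.
        lo = bisect_left(columns, (start,))
        hi = bisect_left(columns, (end,))
        found.extend(columns[lo:hi])
    return sorted(found)


# Invented sample data: integer timestamps instead of TimeUUIDs.
record("eu", 100, "u1")
record("us", 105, "u2")
record("eu", 110, "u3")
record("us", 120, "u4")

print(entries_between(105, 120))
```

The reader pays for the sharding by issuing one slice per shard and merging the results, which is the trade-off for spreading writes over several rows.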
Having more data than necessary in Cassandra is not a problem, it's how it is designed, so what you must ask yourself is "what do I need to query" and then design a Column Family for it rather than trying to fit everything in one CF like you'd do in an RDBMS.