Is there any limit on the number of tuples in a CQL3 SELECT...IN clause?

Cassandra's CQL3 SELECT statement allows using IN with tuples, as in
SELECT * FROM posts
WHERE userid='john doe' AND (blog_title, posted_at)
IN (('John''s Blog', '2012-01-01'), ('Extreme Chess', '2014-06-01'))
as seen in the CQL3 spec: http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt
Is there a limit on the number of tuples that can be used in a SELECT's IN clause? What is the maximum?

Rebecca Mills of DataStax provides a definite limit on the number of keys allowed in an IN statement (Things you should be doing when using Cassandra drivers - point #22):
...specifically the limit on the number of keys in
an IN statement, the most you can have is 65535. But practically
speaking you should only be using small numbers of keys in INs, just
for performance reasons.
I assume that limit would also apply to the number of tuples you could specify. Honestly though, I wouldn't try to max that out. If you sent it a large number, it wouldn't perform well at all. The CQL documentation on the SELECT clause warns users about this:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
Suffice it to say that while the maximum number of tuples you could pass is a matter of mathematics, the number of tuples you should pass will depend on your cluster configuration, JVM implementation, and a little bit of common sense.
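If you do find yourself anywhere near that range, the usual alternative is to break the IN apart into several small queries that the client driver can run concurrently, rather than one giant statement. A rough sketch, reusing the posts table from the question (and assuming, as the query implies, that blog_title and posted_at are the clustering columns, so equality on both is allowed):
-- a handful of tuples in one IN is fine
SELECT * FROM posts
WHERE userid = 'john doe'
  AND (blog_title, posted_at) IN (('John''s Blog', '2012-01-01'), ('Extreme Chess', '2014-06-01'));
-- for large key lists, prefer many small queries issued in parallel by the client
SELECT * FROM posts
WHERE userid = 'john doe' AND blog_title = 'John''s Blog' AND posted_at = '2012-01-01';
SELECT * FROM posts
WHERE userid = 'john doe' AND blog_title = 'Extreme Chess' AND posted_at = '2014-06-01';
Splitting the work this way keeps each request small and lets the coordinator for each query talk only to the replicas it actually needs.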

Related

Does partitioning improve performance if all partitions are equally used?

Consider the following situation:
I have a large PostgreSQL table with a primary key of type UUID. The UUIDs are generated randomly and spread uniformly across the UUID space.
I partition the table on this UUID column into 256 ranges (e.g. based on the first 8 bits of the UUID).
All partitions are stored on the same physical disk.
Basically this means all 256 partitions will be equally used (unlike time-based partitioning, where the most recent partition would normally be hotter than the others).
Will I see any performance improvement at all by doing this type of partitioning:
For queries based on the UUID, returning a single row (WHERE uuid_key = :id)?
For other queries that must search all partitions?
Most queries will become slower. For example, if you search by uuid_key, the optimizer has to determine which partition to search, something that gets more expensive as the number of partitions grows. The index scan itself will not be notably faster on a small table than on a big table.
You could benefit if you have several tables partitioned alike and you join them on the partitioning key, so that you get a partitionwise join (but remember to set enable_partitionwise_join = on). There are similar speed gains for partitionwise aggregates.
Even though you cannot expect a performance gain for your query, partitioning may still have its use, for example if you need several autovacuum workers to process a single table.
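To make the partitionwise-join case concrete, here is a minimal sketch with two hypothetical tables (names invented for illustration) partitioned identically on the UUID key:
-- hypothetical tables, both hash-partitioned the same way on the UUID key
CREATE TABLE orders (
    uuid_key uuid NOT NULL,
    payload  text
) PARTITION BY HASH (uuid_key);
CREATE TABLE order_items (
    uuid_key uuid NOT NULL,
    item     text
) PARTITION BY HASH (uuid_key);
-- 256 partitions each, e.g.:
CREATE TABLE orders_p0      PARTITION OF orders      FOR VALUES WITH (MODULUS 256, REMAINDER 0);
CREATE TABLE order_items_p0 PARTITION OF order_items FOR VALUES WITH (MODULUS 256, REMAINDER 0);
-- ... and so on for the remaining remainders
-- let the planner join matching partition pairs instead of the whole tables
SET enable_partitionwise_join = on;
SELECT o.uuid_key, i.item
FROM orders o
JOIN order_items i USING (uuid_key);
With identical partitioning and the setting enabled, the join is performed partition by partition rather than across the full tables.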
Will I see any performance improvement at all by doing this type of partitioning:
For queries based on the UUID, returning a single row (WHERE uuid_key = :id)?
Yes: PostgreSQL will search only the relevant partition. You can also gain performance on inserts and updates by reducing page contention.
For other queries that must search all partitions?
Not really, but index design can minimize the problem.
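For the single-row case, EXPLAIN is the quickest way to confirm that only one partition is touched. A sketch against a hypothetical table named events, partitioned on uuid_key:
-- with the partition key in the WHERE clause, the plan lists a single partition
EXPLAIN (COSTS OFF)
SELECT * FROM events
WHERE uuid_key = '2f1d7e5a-9c3b-4e1d-8a6f-0123456789ab';
Without an equality condition on the partition key, the plan would list all 256 partitions, which is where the extra planning overhead mentioned in the other answer comes from.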

TimescaleDB upserts vs. InfluxDB writes: functionality and performance

TimescaleDB has some attractive and compelling benchmarks against InfluxDB, especially for high cardinality (which is our case). However, as I understand it, there is a big assumption about equivalence of functionality baked into the benchmarks:
This may be considered a limitation, but writes in InfluxDB are designed in such a way that there can be only a single timestamp + tag keys combination (a series, in Influx terminology) associated with a field. This means that when the same timestamp + tag keys combination is written twice with a different field value each time, InfluxDB only keeps the last one. So it overwrites by default, and one cannot have duplicate entries for a given timestamp + tag keys + field.
On the other hand, TimescaleDB allows multiple timestamp + tag keys + field value entries (although they would not be called that in TimescaleDB terminology), just like PostgreSQL would. If you want uniqueness you have to add a UNIQUE constraint on the combination of tags.
This is where the problems start: if you actually want that "multiple entries per tags combination" functionality, then TimescaleDB is the solution, because InfluxDB simply does not do it. But conversely, if you want to compare TimescaleDB with InfluxDB, you need to set up TimescaleDB to match the functionality of InfluxDB, which means using a UNIQUE constraint and UPSERT (PostgreSQL's ON CONFLICT DO UPDATE syntax). In my opinion, not doing so amounts to making one of the following assumptions:
The dataset will not have duplicated values; it is impossible for a timestamp/value pair to be updated.
InfluxDB writes are equivalent to TimescaleDB inserts, which is not true, if I understood correctly.
My problem is that our use case involves overwrites, and I would guess many other use cases do too. We have implemented a performance evaluation of our own, writing time-ordered batches of 10k rows for about 100k "tag" combinations (that is, 2 tags + timestamp under a UNIQUE constraint) using UPSERT, and performance drops off quickly, nowhere near that of InfluxDB.
Am I missing something here? Does anyone have experience using UPSERT with TimescaleDB at large scale? Is there a way to mitigate this issue? Or is TimescaleDB simply not a good solution for our use case?
Thanks!
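For reference, the kind of schema and UPSERT described in the question looks roughly like this sketch (column names are invented for illustration; the real table has two tag columns plus a timestamp under the UNIQUE constraint):
-- hypothetical schema: two tag columns plus a timestamp
CREATE TABLE metrics (
    ts    timestamptz NOT NULL,
    tag1  text        NOT NULL,
    tag2  text        NOT NULL,
    value double precision,
    UNIQUE (tag1, tag2, ts)   -- on a hypertable, the unique constraint must include the time column
);
SELECT create_hypertable('metrics', 'ts');
-- InfluxDB-style "last write wins" behaviour via UPSERT
INSERT INTO metrics (ts, tag1, tag2, value)
VALUES ('2019-06-01 00:00:00+00', 'host-1', 'eu-west', 42.0)
ON CONFLICT (tag1, tag2, ts) DO UPDATE SET value = EXCLUDED.value;
It is the unique-index maintenance plus the conflict check on every row that makes this path heavier than a plain append-only insert.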

Redshift Composite Sortkey - how many columns should we use?

I'm building several very large data tables on Amazon Redshift that should hold data covering several frequently queried properties with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It definitely should be compound - every query will first filter on date and network - but after that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sort key that includes all the key fields in the table? What are the downsides of having a long sort key?
VACUUM will also take longer if you have several sort key columns.
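For illustration, the kind of DDL being discussed (column names invented here), with the two columns every query filters on leading the sort key:
-- hypothetical column set; only event_date and network are given in the question
CREATE TABLE network_metrics (
    event_date  date        NOT NULL,
    network     varchar(64) NOT NULL,
    campaign    varchar(64),
    placement   varchar(64),
    device      varchar(32),
    clicks      bigint,
    impressions bigint
)
DISTSTYLE EVEN
COMPOUND SORTKEY (event_date, network, campaign, placement);
A compound sort key only helps block skipping for filters on its leading columns, so columns past the first few rarely earn back the extra VACUUM and load-time sorting cost.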

Postgresql: queries with ordering by partitioning key

I have created a table in PostgreSQL partitioned (see here) by the received column. Let's use a toy example:
CREATE TABLE measurement (
received timestamp without time zone PRIMARY KEY,
city_id int not null,
peaktemp int,
unitsales int
);
I have created one partition for each month for several years (measurement_y2012m01 ... measurement_y2016m03).
I have noticed that PostgreSQL is not aware of the order of the partitions, so for a query like the one below:
select * from measurement where ... order by received desc limit 1000;
PostgreSQL performs an index scan over all partitions, even though it is very likely that the first 1000 results are located in the latest partition (or the latest two or three).
Do you have an idea how to take advantage of the partitions for such a query? I want to emphasize that the WHERE clause may vary; I don't want to hardcode it.
The first idea is to iterate over the partitions in the proper order until 1000 records are fetched or all partitions have been visited. But how can that be implemented in a flexible way? I want to avoid implementing the aforementioned iteration in the application, but I don't mind if the app needs to call a stored procedure.
Thanks in advance for your help!
Grzegorz
If you really don't know how many partitions you have to scan to get your desired 1000 rows, you could build up your result set in a stored procedure and fetch results by iterating over the partitions until your limit condition is satisfied.
Starting with the most recent partition would be a wise thing to do.
select * from measurement_y2016m03 where ... order by received desc limit 1000;
You could store the intermediate result set in a record, run a count over it, and change the limit dynamically for the next scanned partition: if you fetch, say, 870 rows from the first partition, you build a second query with LIMIT 130, count again after that, and keep going until the 1000-row condition is satisfied.
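A rough sketch of that idea as a set-returning PL/pgSQL function, assuming the inheritance children keep the measurement_yYYYYmMM naming from the question (so ordering the child names descending walks them newest-first); your variable WHERE conditions would have to be folded into the dynamic query:
-- sketch only: walks the measurement_* children newest-first
CREATE OR REPLACE FUNCTION recent_measurements(p_limit int)
RETURNS SETOF measurement
LANGUAGE plpgsql AS $$
DECLARE
    part      text;
    r         measurement%ROWTYPE;
    remaining int := p_limit;
BEGIN
    FOR part IN
        SELECT c.relname
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        WHERE i.inhparent = 'measurement'::regclass
        ORDER BY c.relname DESC            -- measurement_y2016m03, measurement_y2016m02, ...
    LOOP
        FOR r IN EXECUTE format(
            'SELECT * FROM %I ORDER BY received DESC LIMIT %s', part, remaining)
        LOOP
            RETURN NEXT r;
            remaining := remaining - 1;
        END LOOP;
        EXIT WHEN remaining <= 0;          -- stop as soon as p_limit rows have been returned
    END LOOP;
END;
$$;
-- usage: SELECT * FROM recent_measurements(1000);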
Why doesn't Postgres know when to stop during planning?
The planner is unaware of how many partitions are needed to satisfy your LIMIT clause, so it has to order the entire set by appending results from each partition and then apply the limit (unless the condition is already satisfied at run time). The only way to do this in a single SQL statement would be to restrict the lookup to just a few partitions, but that may not be an option for you. Also, increasing the work_mem setting may speed things up if you're hitting disk during the lookups.
Key note
Also, remember when you set up your partitioning that the most frequently accessed partitions should be checked first. This speeds up your inserts, because Postgres checks the conditions one by one and stops at the first one that is satisfied.
Instead of iterating over the partitions, you could guess at the range of received that will satisfy your query and widen it until you get the desired number of rows. Adding the range to the WHERE clause will exclude the unnecessary partitions (assuming constraint exclusion is enabled and the partitions carry the matching CHECK constraints).
Edit
Correct, that's what I meant (could've phrased it better).
Simplicity seems like a pretty reasonable advantage. I don't see the performance being different, either way. This might actually be a little more efficient if you guess reasonably close to the desired range most of the time, but probably won't make a significant difference.
It's also a little more flexible, since you're not relying on the particular partitioning scheme in your query code.
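In SQL, the range-guessing approach above would look something like this sketch (the starting window and the growth step are arbitrary; the question's variable conditions go alongside the range predicate):
-- first guess: the newest partition(s) are probably enough
select * from measurement
where received >= now() - interval '1 month' and ...
order by received desc limit 1000;
-- got fewer than 1000 rows back? widen the window and retry
select * from measurement
where received >= now() - interval '3 months' and ...
order by received desc limit 1000;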

complex queries with DynamoDB

I want to store the visit logs of 10 websites, receiving a total of 10M visits a month, in DynamoDB. After that I want to create a back-end to monitor and detect fraudulent operations.
Can DynamoDB handle complex queries like:
- List of visitors with an X bounce rate in a specified interval
- Popular Destination URI between date/time & date/time
- Ordering & Grouping By
List of visitors with an X bounce rate in a specified interval
DynamoDB can handle a lot of queries, but it requires a bit of planning around your access patterns. You must query by hash key and filter by either the range key or a local secondary index. The query may contain a single comparison operator using the familiar >=, BETWEEN, IN, etc., and will sort results as well. If you need a query like SELECT col1 FROM table1 WHERE condition1 > x AND condition2 > y AND condition3 > z, you're not necessarily stuck, but you do need to plan. Such queries can be made, but they might require querying sequentially across multiple tables, or embedding part of the query logic in the hash key (e.g. hashkey = BOUNCE_RATE:1 or BOUNCE_RATE:2, ...), where :N would be some sort of meaningful tier for bounce rates. In my own experience this is not unusual at all.
A caveat in this example is that you could get poor distribution of the hash key across nodes (i.e. hot keys that degrade performance and could defeat the scalability advantage of DynamoDB), but my explanation should simply serve to give you ways of thinking about access patterns.
Assuming bounce rate is some precise decimal value, you could put it in either the range key or a local secondary index, which would then require including the time interval in the hash key. This requires the time intervals to be pre-determined (just as the first example requires the bounce-rate "tiers" to be pre-determined). You'll need to weigh these kinds of trade-offs.
Finally, you could create multiple tables, each holding the data of a single bounce-rate tier or time interval. There are other basic approaches as well, but this should give you some food for thought.
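To make the hash-key-embedding idea a bit more concrete, here is a purely hypothetical item layout (all names invented for illustration):
hash key    BOUNCE_RATE:2#2014-06            (pre-determined bounce-rate tier + month bucket)
range key   2014-06-01T12:00:00Z#visitor123  (timestamp first, so results come back time-ordered)
attributes  uri, referrer, pages_viewed, ...
Listing visitors in a given tier for a given month then becomes a single Query on that one hash key, with the range key (or a local secondary index on, say, uri) handling the ordering; the price, as noted above, is that popular tiers can turn into hot keys.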