Druid Default Distinct Approximation Algorithm

Is there a way to replace the default HLL approximation algorithm with ThetaSketch in Druid, so that when querying for a distinct count, Druid uses ThetaSketch instead of HLL by default?

I believe you need to be explicit at query time, e.g. using APPROX_COUNT_DISTINCT_DS_THETA instead of APPROX_COUNT_DISTINCT_DS_HLL:
https://druid.apache.org/docs/latest/querying/sql.html#aggregation-functions
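For illustration, here is a minimal sketch of issuing such a query against Druid's SQL HTTP API from Python. The endpoint URL, datasource name (events), and column name (user_id) are assumptions, and the DS_THETA / DS_HLL functions require the druid-datasketches extension to be loaded on the cluster:

import requests

# Assumed Druid router/broker address; adjust for your deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

# Explicitly pick the Theta sketch aggregator instead of the HLL one.
query = """
SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id) AS unique_users
FROM events
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()
print(resp.json())  # e.g. [{"unique_users": 12345}]

Swapping the function name to APPROX_COUNT_DISTINCT_DS_HLL is the only change needed to go back to the HLL-based estimate.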

Related

TimescaleDB upserts vs. InfluxDB writes: functionality and performance

TimescaleDB has some attractive and compelling benchmarks against InfluxDB, especially for high cardinality (which is our case). However, as I understand it, there's a big assumption about equivalence of functionality baked into the benchmarks:
This may be considered a limitation, but writes in InfluxDB are designed so that there can be only a single timestamp + tag keys combination (a series, in Influx terminology) associated with a field. This means that when the same timestamp + tag keys combination is written twice with a different field value each time, InfluxDB will keep only the last one. So it overwrites by default, and one cannot have duplicate entries for a given timestamp + tag keys + field.
On the other hand, TimescaleDB allows multiple timestamp + tag keys + field value entries (although they would not be called that in TimescaleDB terminology), just like PostgreSQL would. If you want uniqueness, you have to add a UNIQUE constraint on the combination of tags.
This is where the problems start: if you actually want that "multiple entries for a tag combination" functionality, then TimescaleDB is the solution, because InfluxDB simply does not do it. But conversely, if you want to compare TimescaleDB with InfluxDB, you need to set up TimescaleDB to match the functionality of InfluxDB, which means using a UNIQUE constraint and UPSERT (PostgreSQL's ON CONFLICT DO UPDATE syntax). In my opinion, not doing so means making one of the following assumptions:
The dataset will not have duplicate values, i.e., it is impossible for a timestamp/value pair to be updated
InfluxDB writes are equivalent to TimescaleDB inserts, which is not true, if I understood correctly.
My problem is that our use case involves overwrites, and I would guess many other use cases do too. We have implemented a performance evaluation of our own, writing time-ordered batches of 10k rows for about 100k "tag" combinations (that is, 2 tags + timestamp as a UNIQUE constraint) using UPSERT, and the performance drops pretty fast and is nowhere near that of InfluxDB.
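For reference, this is roughly the upsert pattern in question, sketched with psycopg2 against a hypothetical table (readings, with tag_a, tag_b and a value column); the names and connection string are placeholders, and the hypertable setup is omitted:

import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            time  TIMESTAMPTZ NOT NULL,
            tag_a TEXT        NOT NULL,
            tag_b TEXT        NOT NULL,
            value DOUBLE PRECISION,
            UNIQUE (time, tag_a, tag_b)  -- enforce InfluxDB-like uniqueness
        );
    """)
    # Overwrite on conflict, mimicking InfluxDB's last-write-wins behaviour.
    cur.execute("""
        INSERT INTO readings (time, tag_a, tag_b, value)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (time, tag_a, tag_b) DO UPDATE SET value = EXCLUDED.value;
    """, ("2020-01-01T00:00:00Z", "sensor-1", "site-3", 42.0))

In our actual evaluation, statements of this shape are batched 10k rows at a time.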
Am I missing something here? Does anyone have experience with using UPSERT with TimescaleDB at a large scale? Is there a way to mitigate this issue? Or is TimescaleDB simply not a good solution for our use case?
Thanks!

How accurate is MongoDB's estimated count query?

The official MongoDB driver offers a 'count' and an 'estimated document count' API; as far as I know, the former command is highly memory intensive, so it's recommended to use the latter where it suffices.
But how accurate is this estimated document count? Can the count be trusted in a Production environment, or is using the count API recommended when absolute accuracy is needed?
Comparing the two, to me it's very difficult to conjure up a scenario in which you'd want to use countDocuments() when estimatedDocumentCount() was an option.
That is, the equivalent form of estimatedDocumentCount() is countDocuments({}), i.e., an empty query filter. The cost of the first function is O(1); the second is O(N), and if N is very large, the cost will be prohibitive.
Both return a count, which, in a scenario in which Mongo has been deployed, is likely to be quite ephemeral, i.e., it's inaccurate the moment you have it, as the collection changes.
Please review the MongoDB documentation for estimatedDocumentCount(). Specifically, they note that "After an unclean shutdown of a mongod using the Wired Tiger storage engine, count statistics reported by db.collection.estimatedDocumentCount() may be inaccurate." This is due to metadata being used for the count and checkpoint drift, which will typically be resolved after 60 seconds or so.
In contrast, the MongoDB documentation for countDocuments() states that this method is a wrapper that performs a $group aggregation stage to $sum the result set, ensuring absolute accuracy of the count.
Thus, if absolute accuracy is essential, use countDocuments(). If all you need is a rough estimate, use estimatedDocumentCount(). The names are accurate to their purpose and should be used accordingly.
The main difference is filtering.
count_documents can be filtered on like a normal query whereas estimated_document_count cannot be.
If filtering is not part of your use case then I would use estimated_document_count since it is much faster.
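To make the trade-off concrete, here is a quick PyMongo sketch; the database, collection, and field names are hypothetical:

from pymongo import MongoClient

coll = MongoClient()["shop"]["orders"]

# O(1): reads collection metadata, accepts no filter, and can drift after an
# unclean shutdown until checkpointing catches up.
print(coll.estimated_document_count())

# O(N): runs an aggregation over matching documents and accepts a filter.
print(coll.count_documents({}))                     # exact total
print(coll.count_documents({"status": "shipped"}))  # exact, filtered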

MongoDB find median

Upon user request, I would like to graph the median values of many documents. I'd prefer not to transfer entire documents from the database to my application solely for the purpose of determining median values.
I understand that a median aggregator is still planned for MongoDB; however, I see that the following operations are currently supported:
sort
count
limit
Short of editing the mongo source code, is there any reasonable way I can combine these operations to obtain median values; for example, to sort values, count them, and limit to return median values?
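For concreteness, here is one way the combination described above could look in PyMongo, with a hypothetical collection and numeric field: count the documents, then sort and skip to the middle one, so only a single document crosses the wire (for an even count this returns the lower of the two middle values, and it is racy if the collection changes between the two calls):

from pymongo import MongoClient

coll = MongoClient()["metrics"]["samples"]

n = coll.count_documents({})
cursor = (coll.find({}, {"value": 1})
              .sort("value", 1)
              .skip(max((n - 1) // 2, 0))
              .limit(1))
median_doc = next(iter(cursor), None)
print(median_doc)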
It appears that editing Mongo source code is the only solution.

Is there any limitation for amount of tuples in CQL3 SELECT...IN clause?

The Cassandra CQL3 SELECT statement allows using tuples in an IN clause, as in:
SELECT * FROM posts
WHERE userid='john doe' AND (blog_title, posted_at)
IN (('John''s Blog', '2012-01-01'), ('Extreme Chess', '2014-06-01'))
as seen in the CQL3 spec: http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt
Is there a limit on the number of tuples that can be used in a SELECT's IN clause? What is the maximum?
Rebecca Mills of DataStax provides a definite limit on the number of keys allowed in an IN statement (Things you should be doing when using Cassandra drivers - point #22):
...specifically the limit on the number of keys in
an IN statement, the most you can have is 65535. But practically
speaking you should only be using small numbers of keys in INs, just
for performance reasons.
I assume that limit would also apply to the number of tuples you could specify. Honestly though, I wouldn't try to max that out. If you sent it a large number, it wouldn't perform well at all. The CQL documentation on the SELECT clause warns users about this:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
Suffice it to say that while the maximum number of tuples you could pass is a matter of mathematics, the number of tuples you should pass will depend on your cluster configuration, JVM implementation, and a little bit of common sense.
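As a rough sketch with the DataStax Python driver and the question's schema, one way to avoid a very large IN is to issue one prepared query per (blog_title, posted_at) pair asynchronously; the contact point, keyspace name, and the assumption that posted_at is a timestamp column are all placeholders:

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("myblog")  # keyspace name assumed

stmt = session.prepare(
    "SELECT * FROM posts WHERE userid = ? AND blog_title = ? AND posted_at = ?"
)

pairs = [("John's Blog", datetime(2012, 1, 1)),
         ("Extreme Chess", datetime(2014, 6, 1))]

# Fire all queries concurrently, then gather the results.
futures = [session.execute_async(stmt, ("john doe", title, posted))
           for title, posted in pairs]
for future in futures:
    for row in future.result():
        print(row)

Whether a small bounded IN or several asynchronous single-row queries works better will depend on your cluster, as noted above.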

Selecting top N records per group in DynamoDB

Is NoSQL in general, and DynamoDB in particular, well suited to performing greatest-n-per-group type queries, as compared to MySQL?
DynamoDB supports only two key attributes and can only be queried efficiently on them:
hash key
range key (optional)
Using DynamoDB to find the biggest values in a random "row" is not a good idea at all. Querying on a random row implies scanning the whole dataset, which will cost you a lot of money.
Nonetheless, if your data is properly modeled, the query method may be used to find the biggest range_key for a given hash_key.
Here is how to proceed (a sketch follows the steps):
Set the hash_key
Set no filter for the range_key
Limit the result count to 1
Scan the index backward
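A minimal boto3 sketch of those steps, using a hypothetical scores table whose hash key is game_id and whose range key is score (credentials and region are assumed to come from the environment):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("scores")

resp = table.query(
    KeyConditionExpression=Key("game_id").eq("tetris"),  # 1. set the hash key
    # 2. no condition on the range key
    Limit=1,                  # 3. return a single item
    ScanIndexForward=False,   # 4. read the index backward (descending range key)
)
print(resp["Items"])  # the item with the largest score for this hash key

Raising Limit to N gives the top N items per hash key, which is the greatest-n-per-group shape the question asks about.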