Hyperunique Aggregations in Calcite-Druid Adapter

In my Druid data source, I have a hyperUnique aggregation (applied at ingestion time) on one of the fields.
I am trying to do the equivalent of COUNT(DISTINCT(<hyperunique_field>)) on this aggregated field.
Is it supported in the Calcite Druid adapter? If so, what is the correct way to go about it?
In Plywood, I can do COUNT_DISTINCT. Running the SQL below returns counts of 0.
SQL:
select floor("__time" to HOUR) time_bucket, "field_1", count(distinct("ingestion_time_aggregated_field")) as uniq
from "datasource"
where "__time" between '2017-01-01 00:00:00' and '2017-01-02 00:00:00'
  and "field_1" in ('value_1') and "field_2" = 'value_2' and "field_3" = 'value_3' and "field_4" = 'value_4'
group by floor("__time" to HOUR), "field_1"
order by floor("__time" to HOUR);
ingestion_time_aggregated_field:
{"name": "ingestion_time_aggregated_field", "type": "hyperUnique","fieldName": “field” }

Complex aggregators are not supported by the calcite-druid adapter. The reason is that HLL is approximate rather than exact, so it cannot actually answer a COUNT(DISTINCT) query, which is defined to return an exact unique count.
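If an approximate distinct count is acceptable, one workaround is to bypass the adapter and issue a native Druid query that references the hyperUnique metric directly. A minimal timeseries sketch (the datasource and field names are taken from the question; the filters and the field_1 grouping from the SQL are omitted for brevity):

{
  "queryType": "timeseries",
  "dataSource": "datasource",
  "granularity": "hour",
  "intervals": ["2017-01-01/2017-01-02"],
  "aggregations": [
    { "type": "hyperUnique", "name": "uniq", "fieldName": "ingestion_time_aggregated_field" }
  ]
}

Druid folds the pre-aggregated HLL sketches together, so the result is the approximate unique count per hour, not an exact one.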

Related

How to sort data based on one of the timestamp field in Druid Scan Query

I'm using a Druid scan query with the ordering param set to "ascending". It returns data ordered by the configured timestamp field, serverReceiveTime. I want to sort the data by one of the other timestamp fields (streamingSegmentStartTime). As per the Scan query documentation, there is no such sort argument we can pass.
ScanDruidQuery.builder()
.dataSource(route.getDataSource())
.intervals(IntervalParser.getIntervals(getSessionsQuery.getStartTime(), getSessionsQuery.getEndTime()))
.filter(filterTranslator.translate(getSessionsQuery.getFilter()))
.order(DRUID_DATA_SORT_ORDER)
.columns(columnList)
.context(new DruidQueryContext(genericQuery.getRequestId()))
.limit(getSessionsQuery.getResultSize())
.offset(NumberUtils.toInt(getSessionsQuery.getNextToken(), 0))
.build();
Please let me know if there is any way to sort this data by streamingSegmentStartTime on the Druid side.
Not sure what your query is doing, so this might not help, but you can sort by other columns if you use a groupBy query.
Take a look at the sortByDimsFirst query context property of the groupBy query here: https://druid.apache.org/docs/latest/querying/groupbyquery.html#groupby-v2-configurations
If you set the first dimension of the DimensionSpec to streamingSegmentStartTime and set sortByDimsFirst to true, I think you can achieve what you want.
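A minimal native groupBy sketch of that idea (untested; the datasource name, second dimension, and interval are placeholders, not values from your system):

{
  "queryType": "groupBy",
  "dataSource": "sessions",
  "granularity": "all",
  "intervals": ["2019-01-01/2019-01-02"],
  "dimensions": ["streamingSegmentStartTime", "sessionId"],
  "context": { "sortByDimsFirst": true }
}

With sortByDimsFirst set to true, rows are ordered by the dimension values, first dimension first, instead of by the query timestamp.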

How to filter dates in Couchbase and Scala

I have a simple json:
{
  "id": 1,
  "name": "John",
  "login": "2019-02-13"
}
Documents of this kind are stored in Couchbase, and I would like to create an index (or a list built in some other efficient way) that filters all documents where login is older than 30 days. How should I create it in Couchbase and get the results in Scala?
For now I get all documents from the database and filter them in the API, but I think that is not a very good way. I would like to filter on the database side and retrieve only the documents whose login is older than 30 days.
Right now, the only method I have in Scala gets docs by id:
bucket.get(id, classOf[RawJsonDocument])
I would recommend taking a look at N1QL (which is just SQL for JSON). Here's an example:
SELECT u.*
FROM mybucket u
WHERE DATE_DIFF_STR(NOW_STR(), u.login, 'day') > 30;
You'll also need an index, something like:
CREATE INDEX ix_login_date ON mybucket (login);
Though I can't promise that's the best index, it will at least get you started.
I used DATE_DIFF_STR and NOW_STR, but there are other ways to manipulate dates. Check out Date Functions in the documentation. And since you are new to N1QL, I'd recommend checking out the interactive N1QL tutorial.
The following query is more efficient because the predicate can be pushed down to the IndexScan when the bare index key appears on one side of the relational operator. If you instead apply an expression to the index key, the engine has to fetch all the values and filter them in the query engine.
CREATE INDEX ix_login_date ON mybucket (login);
SELECT u.*
FROM mybucket AS u
WHERE u.login < DATE_ADD_STR(NOW_STR(), 'day', -30);
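To run either statement from Scala with the same SDK generation your bucket.get(id, classOf[RawJsonDocument]) call suggests (Java SDK 2.x), a sketch like this should work (the bucket name and the absence of error handling are assumptions):

import com.couchbase.client.java.query.N1qlQuery
import scala.collection.JavaConverters._

// issue the N1QL statement through the existing Bucket reference
val query = N1qlQuery.simple(
  "SELECT u.* FROM mybucket AS u WHERE u.login < DATE_ADD_STR(NOW_STR(), 'day', -30)")
val result = bucket.query(query)
// each row's value() is the JsonObject of one matching document
val staleLogins = result.allRows().asScala.map(_.value())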

Query one document per association from MongoDB

I'm investigating how MongoDB would work for us. One of our most-used queries gets the latest measurements (or those up to a given time) for each station. There are thousands of stations and each station has tens of thousands of measurements.
So we plan to have one collection for stations and another for measurements.
In SQL we would do the query with
SELECT * FROM measurements
INNER JOIN (
SELECT station_id, max(meas_time) AS meas_time
FROM measurements
WHERE meas_time <= 'time_to_query'
GROUP BY station_id
) t2 ON t2.station_id = measurements.station_id
AND t2.meas_time = measurements.meas_time
This returns one measurement for each station, and the measurement is the newest one before the 'time_to_query'.
What query should be used in MongoDB to produce the same result? We are actually using Rails and Mongoid, but that should not matter.
update:
This question is not about how to perform a JOIN in MongoDB. The fact that in SQL getting the right data out of the table requires a join doesn't necessarily mean that in MongoDB we would also need a join. There is only one table used in the query.
We came up with this query
db.measurements.aggregate([
  { $group: { _id: { station_id: "$station_id" }, time: { $max: "$meas_time" } } }
]);
with indexes
db.measurements.createIndex({ station_id: 1, meas_time: -1 });
Even though it seems to give the right data, it is really slow. It takes roughly a minute to get a bit over 3000 documents from a collection of 65 million.
I just found that MongoDB is not using the index for this query, even though we are on version 3.2.
I guess a worst-case solution would be something like this (off the top of my head):
measures = []
Station.all.each do |station|
  measurement = Measurement.where(station_id: station.id, :meas_time.lte => time_to_query)
                           .order_by(meas_time: -1).first
  measures << [station.name, measurement.measure, ....]
end
It depends on how much time the query is allowed to take. The data should in any case be indexed by station_id and meas_time.
How much time does the SQL query take?
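A set-based alternative that also returns the whole document per station is a $match/$sort/$group pipeline. This is a sketch (it assumes MongoDB 2.6+ for $$ROOT, and time_to_query stands for your cutoff value); the leading $match and $sort stages can use the { station_id: 1, meas_time: -1 } index:

db.measurements.aggregate([
  // keep only measurements at or before the cutoff
  { $match: { meas_time: { $lte: time_to_query } } },
  // walk stations in index order, newest measurement first
  { $sort: { station_id: 1, meas_time: -1 } },
  // the first document in each station group is its latest measurement
  { $group: { _id: "$station_id", doc: { $first: "$$ROOT" } } }
]);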

MongoDB selecting documents where the maximum time is less than a constant

I'm trying to select a set of documents from a MongoDB where the maximum time is less than a timeout.
So, something like this in SQL:
select item_id from comms group by item_id having max(time) < 140034857
Basically it's a collection of successful comms operations for devices and I'm looking for the ones which have had none in the last hour.
You can use the $lt operator to find documents where the value of the time key is less than a specified value.
db.Collection.find({"time" : {$lt: 140034857}})
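That filters individual documents. To get the distinct item_ids whose latest time is below the cutoff, which is what the grouped SQL above asks for, an aggregation along these lines should work (the collection name comms is an assumption taken from the question):

db.comms.aggregate([
  // compute each device's most recent comms time
  { $group: { _id: "$item_id", lastTime: { $max: "$time" } } },
  // keep only the devices with nothing since the cutoff
  { $match: { lastTime: { $lt: 140034857 } } }
]);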

MongoDB query with an 'or' condition

So I have an embedded document that tracks group memberships. Each embedded document has an ID pointing to the group in another collection, a start date, and an optional expire date.
I want to query for current members of a group. "Current" means the start time is less than the current time, and the expire time is greater than the current time OR null.
This conditional query is totally blocking me up. I could do it by running two queries and merging the results, but that seems ugly and requires loading in all results at once. Or I could default the expire time to some arbitrary date in the far future, but that seems even uglier and potentially brittle. In SQL I'd just express it with "(expires >= Now()) OR (expires IS NULL)" -- but I don't know how to do that in Mongo.
Any ideas? Thanks very much in advance.
Just thought I'd update in case anyone stumbles across this page in the future. As of 1.5.3, Mongo now supports a real $or operator: http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24or
Your query of "(expires >= Now()) OR (expires IS NULL)" can now be rendered as:
{$or: [{expires: {$gte: new Date()}}, {expires: null}]}
In case anyone finds it useful, www.querymongo.com does translation between SQL and MongoDB, including OR clauses. It can be really helpful for figuring out syntax when you know the SQL equivalent.
In the case of OR statements, it looks like this
SQL:
SELECT * FROM collection WHERE columnA = 3 OR columnB = 'string';
MongoDB:
db.collection.find({
  "$or": [
    { "columnA": 3 },
    { "columnB": "string" }
  ]
});
MongoDB query with an 'or' condition
db.getCollection('movie').find({$or:[{"type":"smartreply"},{"category":"small_talk"}]})
MongoDB query with 'or' and 'and' conditions combined:
db.getCollection('movie').find({"applicationId":"2b5958d9629026491c30b42f2d5256fa8",$or:[{"type":"smartreply"},{"category":"small_talk"}]})
Query objects in Mongo by default AND expressions together. Mongo currently does not include an OR operator for such queries; however, there are ways to express them.
Use "$in" or "$where".
It's going to be something like this:
db.mycollection.find({ $where: function() {
  var now = new Date();
  // current members: already started, and either not yet expired or never expiring
  return this.startTime < now && (this.expireTime > now || this.expireTime == null);
} });
db.Lead.find({
  // both conditions must live in a single filter document;
  // a second argument to find() would be treated as a projection
  "name": { "$regex": ".*" + "Ravi" + ".*" },
  "$or": [
    { "added_by": "arunkrishna#aarux.com" },
    { "added_by": "aruna#aarux.com" }
  ]
});
Using a $where query will be slow, in part because it can't use indexes. For this sort of problem, I think it would be better to store a high value for the "expires" field that will naturally always be greater than Now(). You can either store a very high date millions of years in the future, or use a separate type to indicate "never". The cross-type sort order is defined here.
An empty Regex or MaxKey (if your language supports it) are both good choices.
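A sketch of the sentinel-date variant (the collection, the field names, and the far-future date are all assumptions, not values from the question):

// a date far enough in the future to mean "never expires"
var NEVER = ISODate("9999-01-01T00:00:00Z");
db.memberships.insert({ group_id: someGroupId, starts: new Date(), expires: NEVER });
// one indexable range query now covers both "expires later" and "never expires"
db.memberships.find({ starts: { $lte: new Date() }, expires: { $gte: new Date() } });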