DataSketches Theta Sketch not working properly - druid

I am having problems getting correct distinct count numbers from the DataSketches Theta Sketch module.
The ingestion spec I am using looks something like this:
"granularitySpec" :
{
"type" : "uniform",
"segmentGranularity" : "HOUR",
"queryGranularity" : "HOUR",
"intervals": ["${hourToProcess.intervalFormat}"]
}
..........
"dimensionsSpec" :
{
"dimensions" : [
"dimension1",
"dimension2",
......
"dimensionN"
]
}
..........
"timestampSpec" :
{
"format" : "${hourToProcess.ingestionDateFormat}",
"column" : "eventTimestamp"
}
..........
"metricsSpecs" :
[
.........,
{"type": "thetaSketch", "name": "uniqueUsers", "fieldName": "uniqueUsers"}
........
]
The field uniqueUsers is a String.
If I query Druid in the following way, without any filtering or grouping operation:
{
  "type" : "thetaSketch",
  "fieldName" : "uniqueUsers",
  "isInputThetaSketch": true
}
the results are correct. But if I do any sort of filtering or grouping by dimensions, such as
"filter": {
"type": "selector",
"dimension": "dimensionX",
"value": "1"
}
the results are much higher than the expected values.
Is there anything wrong internally with the Theta Sketch, or with my configuration?
I also want to add that if I use the DataSketches HLL sketch instead of the Theta Sketch, I get much better results.

Since the Theta Sketch supports set-based operations (union, intersection, difference), you'll need to specify the filter, aggregations, and postAggregations sections in your query. These sections are critical to computing the correct results. In my opinion, the filter section is the most critical, since it defines the dimension values that will be merged in the postAggregations section.
The following Druid doc is very helpful. However, its example is a groupBy query, and I believe it assumes the data is in raw form and a Theta Sketch needs to be computed at query time. In your case, your sketch is already pre-computed at ingest, so a timeseries query will be much faster.
https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html
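As a rough sketch of what such a timeseries query could look like (the dataSource name, interval, and dimension value here are placeholders, not taken from your spec):
{
  "queryType": "timeseries",
  "dataSource": "your_datasource",
  "granularity": "all",
  "intervals": ["2015-01-01T00:00:00/2015-01-02T00:00:00"],
  "filter": {
    "type": "selector",
    "dimension": "dimensionX",
    "value": "1"
  },
  "aggregations": [
    { "type": "thetaSketch", "name": "uniqueUsersSketch", "fieldName": "uniqueUsers" }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "uniqueUsersCount",
      "field": { "type": "fieldAccess", "fieldName": "uniqueUsersSketch" }
    }
  ]
}
The thetaSketch aggregation merges the pre-computed sketches within the filtered rows, and the thetaSketchEstimate post-aggregation turns the merged sketch into a distinct-count estimate.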
Hope that helps.

Related

dynamic size of subdocument mongodb

I'm using MongoDB and Mongoose for my web application. The web app is used for registration for swimming competitions, and each competition can have X number of races. My data structure as of now:
{
  "_id": "1",
  "name": "Utmanaren",
  "location": "town",
  "startdate": "20150627",
  "enddate": "20150627",
  "race" : {
    "gender" : "m",
    "style" : "freestyle",
    "length" : "100"
  }
}
Doing this I need to determine and define the number of races for every competition. A solution I tried is having a separate document with an id for the competition a race belongs to, like below.
{
  "belongsTOId" : "1",
  "gender" : "m",
  "style" : "freestyle",
  "length" : "100"
}
{
  "belongsTOId" : "1",
  "gender" : "f",
  "style" : "butterfly",
  "length" : "50"
}
Is there a way of creating and defining a dynamic number of races as subdocuments while using MongoDB?
Thanks!
You basically have two approaches to modelling your data structure; you can either design a schema where you reference or where you embed the race documents.
Let's consider the following example, which maps the relationship between a swimming competition and multiple races. This demonstrates the advantage of embedding over referencing if you need to view many data entities in the context of another. In this one-to-many relationship between competition and race data, the competition has multiple race entities:
// db.competition schema
{
  "_id": 1,
  "name": "Utmanaren",
  "location": "town",
  "startdate": "20150627",
  "enddate": "20150627",
  "races": [
    {
      "gender" : "m",
      "style" : "freestyle",
      "length" : "100"
    },
    {
      "gender" : "f",
      "style" : "butterfly",
      "length" : "50"
    }
  ]
}
With the embedded data model, your application can retrieve the complete swimming competition information with just one query. This design has other merits as well, one of them being data locality: since MongoDB stores data contiguously on disk, putting all the data you need in one document means spinning disks take less time to seek to a particular location. The other advantage of embedded documents is atomicity and isolation when writing data. To illustrate this, say you want to remove a competition which has a race whose "style" property is "butterfly"; this can be done with one single (atomic) operation:
db.competition.remove({"races.style": "butterfly"});
For more details on data modelling in MongoDB, please read the docs: Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents.
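Since you mentioned Mongoose, here is a minimal sketch of what the embedded approach could look like as a schema (the model and field names are only illustrative, taken from your example data):
var mongoose = require('mongoose');

// each race is an embedded subdocument; the array can hold any number of races
var raceSchema = new mongoose.Schema({
  gender: String,
  style: String,
  length: String
});

var competitionSchema = new mongoose.Schema({
  name: String,
  location: String,
  startdate: String,
  enddate: String,
  races: [raceSchema] // dynamic number of embedded races
});

var Competition = mongoose.model('Competition', competitionSchema);
Adding a race later is then just a $push onto the races array, as shown in the other answer below.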
The other design option is to reference documents, following a normalized schema where the race documents contain a reference to the competition document:
// db.race schema
{
  "_id": 1,
  "competition_id": 1,
  "gender": "m",
  "style": "freestyle",
  "length": "100"
},
{
  "_id": 2,
  "competition_id": 1,
  "gender": "f",
  "style": "butterfly",
  "length": "50"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child race documents whose parent competition has id 1 is straightforward; simply create a query against the race collection:
db.race.find({"competition_id": 1});
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of race documents per competition, embedding has many setbacks as far as space constraints are concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
The trade-off is that if your application frequently retrieves the race data together with the competition information, it needs to issue multiple queries to resolve the references.
The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model will be appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
You basically want to update the data, so you should upsert it, which is basically an update on the subdocument key.
Keep an array of keys in the main document.
Insert the sub-document and add its key to the list, or update the list.
To push a single item into the field:
db.yourcollection.update( { "_id": "1" }, { $push: { "races": { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"} } } );
To push multiple items into the field (this allows duplicates in the array):
db.yourcollection.update( { "_id": "1" }, { $push: { "races": { $each: [ { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"}, { "belongsTOId" : "2" , "gender" : "m" , "style" : "horse" , "length" : "70"} ] } } } );
To push multiple items without duplicates:
db.yourcollection.update( { "_id": "1" }, { $addToSet: { "races": { $each: [ { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"}, { "belongsTOId" : "2" , "gender" : "m" , "style" : "horse" , "length" : "70"} ] } } } );
$pushAll has been deprecated since version 2.4, so we use $each with $push instead of $pushAll.
While using $push you will also be able to sort and slice the array; check the MongoDB manual, for example:
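Here is a small sketch of that sort-and-slice behaviour (the filter and the new race values are only illustrative, and older MongoDB versions require a negative or zero $slice value, which is why -10 is used):
// push one race, sort the array by length ascending, and keep only the last 10 elements
// (i.e. the 10 longest races; note "length" is stored as a string here, so the sort is lexicographic)
db.yourcollection.update(
  { "_id": "1" },
  { $push: {
      "races": {
        $each:  [ { "gender": "m", "style": "backstroke", "length": "200" } ],
        $sort:  { "length": 1 },
        $slice: -10
      }
  } }
);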

Aggregating filter for Key

If I have a document as follows:
{
  "_id" : ObjectId("54986d5531a011bb5fb8e0ee"),
  "owner" : "54948a5d85f7a9527a002917",
  "type" : "group",
  "deleted" : false,
  "participants" : {
    "54948a5d85f7a9527a002917" : {
      "last_message_id" : null
    },
    "5491234568f7a9527a002917" : {
      "last_message_id" : null
    },
    "1234567aaaa7a9527a002917" : {
      "last_message_id" : null
    }
  }
}
How do I do a simple filter for all documents that have the participant "54948a5d85f7a9527a002917"?
Thanks
Trying to query structures like this does not work well. There is a whole host of problems with modelling like this, but the clearest problem is using "data" as the names of "keys".
Try to think a little RDBMS-like, at least in terms of what a database cannot or should not do. You wouldn't design a "table" in a schema that had something like "54948a5d85f7a9527a002917" as a "column" name, now would you? But this is essentially what you are doing here.
MongoDB can query this, but not in an efficient way:
db.collection.find({
  "participants.54948a5d85f7a9527a002917": { "$exists": true }
})
Naturally this looks for the "presence" of a key in the data. While that query form is available, it does not make efficient use of indexes, since indexes apply to the "data" and not to the "key" names.
A better structure and approach is this:
{
  "_id" : ObjectId("54986d5531a011bb5fb8e0ee"),
  "owner" : "54948a5d85f7a9527a002917",
  "type" : "group",
  "deleted" : false,
  "participants" : [
    { "_id": "54948a5d85f7a9527a002917" },
    { "_id": "5491234568f7a9527a002918" },
    { "_id": "1234567aaaa7a9527a002917" }
  ]
}
Now the "data" you are looking for is actual "data" associated with a "key" ( possibly ) and inside an array for binding to the parent object. This is much more efficient to query:
db.collection.find({
  "participants._id": "54948a5d85f7a9527a002917"
})
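And since the participant values are now "data", a regular index can cover them. A minimal sketch, using the same placeholder collection name as above (use ensureIndex on shells older than 3.0):
db.collection.createIndex({ "participants._id": 1 })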
It's much better to model it that way than what you are presently doing, and it makes more sense when consuming the objects.
BTW, it's probably just a cut-and-paste error in your question, but you cannot possibly have duplicate keys such as "54948a5d85f7a9527a002917" as shown. That breaks a basic rule of hashes.

Index strategy for queries with dynamic match criteria

I have a collection which is going to hold machine data as well as mobile data. The data is captured per channel and is maintained at a single level with no embedded objects. The structure is as follows:
{
  "Id": ObjectId("544e4b0ae4b039d388a2ae3a"),
  "DeviceTypeId": "DeviceType1",
  "DeviceTypeParentId": "Parent1",
  "DeviceId": "D1",
  "ChannelName": "Login",
  "Timestamp": ISODate("2013-07-23T19:44:09Z"),
  "Country": "India",
  "Region": "Maharashtra",
  "City": "Nasik",
  "Latitude": 13.22,
  "Longitude": 56.32,
  // and 10-15 more fields
}
Most of the queries are aggregation queries, used for an analytics dashboard and real-time analysis. The $match stage is as follows:
{$match:{"DeviceTypeId":{"$in":["DeviceType1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
or
{$match:{"DeviceTypeParentId":{"$in":["Parent1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
and many of my DAL-layer find and findOne queries mostly use DeviceType or DeviceTypeParentId as criteria.
The collection is huge and it is growing. I have used compound indexes to support these queries; the indexes are as follows:
[
  {
    "v" : 1,
    "key" : {
      "_id" : 1
    },
    "name" : "_id_",
    "ns" : "DB.channel_data"
  },
  {
    "v" : 1,
    "key" : {
      "DeviceType" : 1,
      "Timestamp" : 1
    },
    "name" : "DeviceType_1_Timestamp_1",
    "ns" : "DB.channel_data"
  },
  {
    "v" : 1,
    "key" : {
      "DeviceTypeParentId" : 1,
      "Timestamp" : 1
    },
    "name" : "DeviceTypeParentId_1_Timestamp_1",
    "ns" : "DB.channel_data"
  }
]
Now we are going to add support for match criteria on DeviceId, and following the same strategy I used for DeviceType and DeviceTypeParentId does not feel good: with my current approach I am creating many indexes that are almost all the same, and they are huge.
So is there any good way to do the indexing? I have read a bit about index intersection but am not sure how it would be helpful.
If I am following a wrong approach anywhere, please point it out, as this is my first project and the first time I am using MongoDB.
Those indexes all look appropriate for your queries, including the new one you're proposing. Three separate indexes supporting your three kinds of queries are the overall best option in terms of fast queries. You could put indexes on each field and let the planner use index intersection, but it won't be as good as the compound indexes. The indexes are not the same since they support different queries.
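For completeness, a sketch of that additional compound index (the collection name is taken from the ns values in your listing; adjust it to your actual deployment, and use ensureIndex on shells older than 3.0):
db.channel_data.createIndex({ "DeviceId": 1, "Timestamp": 1 })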
I think the real question is: is the (apparently) large memory footprint of the indexes actually a problem at this point? Do you see a lot of page faults from paging indexes and data in from disk?

MongoDB: How can I order by distance considering multiple fields?

I have a collection that stores information about doctors. Each doctor can work in private practices and/or in hospitals.
The collection has the following relevant fields (there are geospatial indexes on both privatePractices.address.loc and hospitals.address.loc):
{
  "name" : "myName",
  "privatePractices" : [{
      "_id": 1,
      "address" : {
        "loc" : {
          "lng" : 2.1608502864837646,
          "lat" : 41.3943977355957
        }
      }
    },
    ...
  ],
  "hospitals" : [{
      "_id": 5,
      "address" : {
        "loc" : {
          "lng" : 2.8192520141601562,
          "lat" : 41.97784423828125
        }
      }
    },
    ...
  ]
}
I am trying to query that collection to get a list of doctors ordered by distance from a given point. This is where I am stuck:
The following queries return a list of doctors ordered by distance to the point defined in $nearSphere, considering only one of the two location types:
{ "hospitals.address.loc" : { "$nearSphere" : [2.1933, 41.4008] } }
{ "privatePractices.address.loc" : { "$nearSphere" : [2.1933, 41.4008] } }
What I want is to get the doctors ordered by the nearest hospital OR private practice, whichever is nearest. Is it possible to do this in a single Mongo query?
Plan B is to use the queries above and then manually order the results outside Mongo (e.g. using LINQ). To do this, my two queries would need to return the distance of each hospital or private practice to the $nearSphere point. Is it possible to do that in Mongo?
EDIT - APPLIED SOLUTION (MongoDB 2.6):
I took my own approach inspired by what Neil Lunn suggests in his answer: I added a field in the Doctor document for sorting purposes, containing an array with all the locations of the doctor.
I tried this approach in MongoDB 2.4 and MongoDB 2.6, and the results are different.
Queries on 2.4 returned duplicate doctors for those that had more than one location, even if _id was included in the query filter. Queries on 2.6 returned valid results.
I would have hoped for a little more information here, but the basics still apply. The general problem you have stumbled on is trying to have "two" location fields on what appear to be your doctor documents.
There is another problem with the approach: you have the "locations" within arrays in your document. This would not give you an error when creating the index, but it is also not going to work like you expect. The big problem is that, being within an array, you might find the document that "contains" the nearest location, but then the question is "which one?", as nothing is done to affect the array content.
The core problem, though, is that you cannot use more than one geospatial index per query. To really get what you want, turn the problem on its head and essentially attach the doctors to the locations, which is the other way around.
For example here, a "practices" collection or such:
{
  "type": "Hospital",
  "address" : {
    "loc" : {
      "lng" : 2.8192520141601562,
      "lat" : 41.97784423828125
    }
  },
  "doctors": [
    { "_id": 1, "name": "doc1", "specialty": "bones" },
    { "_id": 2, "name": "doc2", "specialty": "heart" }
  ]
}
{
  "type": "Private",
  "address" : {
    "loc" : {
      "lng" : 2.1608502864837646,
      "lat" : 41.3943977355957
    }
  },
  "doctors": [
    { "_id": 1, "name": "doc1", "specialty": "bones" },
    { "_id": 3, "name": "doc3", "specialty": "brain" }
  ]
}
The advantage here is that, as a single collection with everything in the same index, you can simply get both "types", correctly ordered by distance, or within bounds, or whatever your geo-queries need to be. This avoids the problems with the other modelling form.
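A minimal sketch of what that could look like (the practices collection name is an assumption; a 2d index is used because the example keeps the legacy lng/lat point format from the question, and the query point is the one you already use):
// one geospatial index now covers both hospitals and private practices
db.practices.createIndex({ "address.loc": "2d" });

// nearest locations of either type, ordered by distance, each with its doctors embedded
db.practices.find({ "address.loc": { "$nearSphere": [2.1933, 41.4008] } });

// or restrict to a single type while still using the same index
db.practices.find({ "type": "Hospital", "address.loc": { "$nearSphere": [2.1933, 41.4008] } });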
As for the "doctors" information, of course you actually keep a separate collection for the full doctor information themselves, and possibly even keep an array of the _id values for the location documents there as well. But the main point here is that you can generally live with "embedding" some of the useful search information in a collection here that will help you.
That seems to be the better option here, and matching a doctor to criteria from inside the location is something that can be done, where as finding or sorting the nearest entry inside array is something that is not going to be supported by MongoDB itself, and would result in you applying the Math yourself in processing the results.

MongoDB: Speed up aggregate by Indexing or find a different solution?

Ok, MongoDB experts, please take a look at my collection:
[{
  "_id" : "item_0",
  "Name" : "Item 0",
  "Description" : "Some description for this item...",
  "Properties" : {
    "a" : 5.0,
    "b" : 0.0,
    "c" : 6.0,
    "d" : 6.0,
    "e" : 2.0,
    "f" : 0.0,
    "g" : 9.0,
    "h" : 3.0,
    "i" : 4.0,
    "j" : 5.0
  }
},
{ // 5,000-10,000 more items... }
]
I am using this aggregate to multiply a set of selected properties (in this case a, b, c and d) and then sort the items by their product:
{
  "aggregate": "item",
  "pipeline": [
    {
      "$project": {
        "_id": 1,
        "Name": 1,
        "s": {
          "$multiply": [
            "$Properties.a",
            "$Properties.b",
            "$Properties.c",
            "$Properties.d"
          ]
        }
      }
    },
    {
      "$sort": {
        "s": -1
      }
    },
    {
      "$limit": 100
    }
  ]
}
Now this works fine and all, but as the number of items and properties increases, the time to execute the aggregate grows a lot!
Is there a better (more efficient) way to achieve something like this? The search for the highest product (of a set of properties) must be snappy. Is there a way to index this, with all the different combinations of properties, and have them cached or something? It's OK if the indexing takes a while, as long as the querying is fast!
Thanks for any help in this matter, I appreciate it a lot!
Given your requirement for faster searching and efficiency, I think a better approach would be to use Map/Reduce with an output collection (at least until such time as the Aggregation Framework supports using a collection for output).
There are several advantages to using an output collection for your use case.
In particular:
you can have flexible indexing and sorting
the results do not have to be calculated in real-time for every query
you are not limited by the 16Mb BSON document size for inline results
You can use the merge output option for Map/Reduce to update calculations in your output collection (essentially, this would be your cache).
Depending on how often your various properties are updated, I would investigate an incremental approach based on a "last updated" timestamp or some other criteria that allows you to determine when values need to be recalculated. This would allow you to keep the batch sizes more manageable as your collection grows.
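A minimal sketch of that idea, assuming the collection is named item (as in your aggregate command) and using an output collection name, item_scores, that is purely illustrative:
// compute the product once per item and merge the results into a cached output collection
db.item.mapReduce(
  function () {
    var p = this.Properties;
    // one emit per item; change the multiplied fields to whatever set you need
    emit(this._id, { Name: this.Name, s: p.a * p.b * p.c * p.d });
  },
  function (key, values) {
    // only one value is emitted per _id, so just pass it through
    return values[0];
  },
  { out: { merge: "item_scores" } }
);

// the cached products can now be indexed and queried quickly
db.item_scores.createIndex({ "value.s": -1 });
db.item_scores.find().sort({ "value.s": -1 }).limit(100);
Note that this pre-computes one fixed combination of properties; each distinct combination you want to search on would need its own pre-computed field or output collection, so the approach works best when the useful combinations are known in advance.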