I need some help/advice on how to replicate some SQL behaviour in MongoDB.
Specifically, given this collection:
{
"_id" : ObjectId("577ebc0660084921141a7857"),
"tournament" : "Wimbledon",
"player1" : "Agassi",
"player2" : "Lendl",
"sets" : [{
"score1" : 6,
"score2" : 4,
"tiebreak" : false
}, {
"score1" : 7,
"score2" : 6,
"tiebreak" : true
}, {
"score1" : 7,
"score2" : 6,
"tiebreak" : true
}]
}
{
"_id" : ObjectId("577ebc3560084921141a7858"),
"tournament" : "Wimbledon",
"player1" : "Ivanisevic",
"player2" : "McEnroe",
"sets" : [{
"score1" : 4,
"score2" : 6,
"tiebreak" : false
}, {
"score1" : 3,
"score2" : 6,
"tiebreak" : false
}, {
"score1" : 6,
"score2" : 4,
"tiebreak" : false
}]
}
{
"_id" : ObjectId("577ebc7560084921141a7859"),
"tournament" : "Roland Garros",
"player1" : "Navratilova",
"player2" : "Graf",
"sets" : [{
"score1" : 5,
"score2" : 7,
"tiebreak" : false
}, {
"score1" : 6,
"score2" : 3,
"tiebreak" : false
}, {
"score1" : 7,
"score2" : 7,
"tiebreak" : true
}, {
"score1" : 7,
"score2" : 5,
"tiebreak" : false
}]
}
And these two distinct aggregations:
1) Aggregation ALFA: this aggregation is purposely strange, in the sense that it is designed to find all matches where at least 1 tiebreak is true but only show sets where tiebreak is false. Please don't consider the logic of it, it is crafted to allow full freedom to the user.
{
$match: {
"tournament": "Wimbledon",
"sets.tiebreak": true
}
},
{
$project: {
"tournament": 1,
"player1": 1,
"sets": {
$filter: {
input: "$sets",
as: "set",
cond: {
$eq: ["$$set.tiebreak", false]
}
}
}
}
}
2) Aggregation BETA: this aggregation is purposely strange, in the sense that it is designed to find all matches where at least 1 tiebreak is false but only show sets where tiebreak is true. Please don't consider the logic of it, it is crafted to allow full freedom to the user. Please note that player1 is hidden from the results.
{
$match: {
"tournament": "Roland Garros",
"sets.tiebreak": false
}
},
{
$project: {
"tournament": 1,
"sets": {
$filter: {
input: "$sets",
as: "set",
cond: {
$eq: ["$$set.tiebreak", true]
}
}
}
}
}
Now suppose that the purpose of these two aggregations is to delimit what part of the database a user can see, in the sense that the two queries together define all the documents (and details) that are visible to the user. This is similar to two SQL views that the user has rights to access.
I need/want to try to rewrite the previous distinct aggregations in only one. Can this be achieved?
It is mandatory to keep all the restrictions that were set in aggregations ALFA and BETA, without losing any control over the data and without leaking any data that was not available through ALFA or BETA.
Specifically, matches in Wimbledon can only be seen if they had at least one set which ended with a tiebreak. Player1 field CAN be seen. Single sets must be seen if they did not end with a tiebreak and hidden otherwise. If needed, it is acceptable, but not desirable, to not see player1 at all.
Conversely, matches in Roland Garros can be seen only if they had at least one set which ended without a tie break. Player1 field MUST be hidden. Single sets must be seen if they ended with a tiebreak and hidden otherwise.
Again, the purpose is to UNION the two aggregations while keeping the limits imposed by the two aggregations.
MongoDB is version 3.5, can be upgraded to unstable releases if needed.
Here are my two cents on the issue:
if you wish to avoid empty sets when
a "Wimbledon" doc has all true tiebreaks,
or a "Roland Garros" doc has all false tiebreaks,
you may reshape the query:
...
{
$and: [{
"sets.tiebreak": true,
}, {
"sets.tiebreak": false
}],
$or: [{
"tournament": "Wimbledon"
}, {
"tournament": "Roland Garros"
}]
}
...
and use it in:
aggregate pipeline http://pastebin.com/cM6mNsuC
mapReduce (if performance is not a big issue) http://pastebin.com/MShihSQL
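To check a combined pipeline against the intended rules, it can help to model the two "views" in plain JavaScript first. The sketch below is only a reference model, run on abbreviated copies of the three sample documents. The names `rules` and `visibleMatch` are hypothetical, and the numeric `_id` values stand in for the original ObjectIds.

```javascript
// Per-tournament rules taken from aggregations ALFA and BETA.
const rules = new Map([
  // ALFA: needs a tiebreak set, shows only non-tiebreak sets, player1 visible
  ["Wimbledon", { matchTiebreak: true, showTiebreak: false, showPlayer1: true }],
  // BETA: needs a non-tiebreak set, shows only tiebreak sets, player1 hidden
  ["Roland Garros", { matchTiebreak: false, showTiebreak: true, showPlayer1: false }],
]);

// Returns the visible shape of a match, or null if the match must stay hidden.
function visibleMatch(doc) {
  const rule = rules.get(doc.tournament);
  if (!rule) return null; // tournament covered by neither view
  // Equivalent of the $match stage
  if (!doc.sets.some((s) => s.tiebreak === rule.matchTiebreak)) return null;
  // Equivalent of the $project stage with $filter on sets
  const out = { _id: doc._id, tournament: doc.tournament };
  if (rule.showPlayer1) out.player1 = doc.player1;
  out.sets = doc.sets.filter((s) => s.tiebreak === rule.showTiebreak);
  return out;
}

// Abbreviated sample data (scores kept, ObjectIds replaced by 1, 2, 3).
const matches = [
  { _id: 1, tournament: "Wimbledon", player1: "Agassi", player2: "Lendl",
    sets: [
      { score1: 6, score2: 4, tiebreak: false },
      { score1: 7, score2: 6, tiebreak: true },
      { score1: 7, score2: 6, tiebreak: true },
    ] },
  { _id: 2, tournament: "Wimbledon", player1: "Ivanisevic", player2: "McEnroe",
    sets: [
      { score1: 4, score2: 6, tiebreak: false },
      { score1: 3, score2: 6, tiebreak: false },
      { score1: 6, score2: 4, tiebreak: false },
    ] },
  { _id: 3, tournament: "Roland Garros", player1: "Navratilova", player2: "Graf",
    sets: [
      { score1: 5, score2: 7, tiebreak: false },
      { score1: 6, score2: 3, tiebreak: false },
      { score1: 7, score2: 7, tiebreak: true },
      { score1: 7, score2: 5, tiebreak: false },
    ] },
];

const visible = matches.map(visibleMatch).filter((m) => m !== null);
console.log(JSON.stringify(visible, null, 2));
```

Any single aggregation that unions ALFA and BETA should return the same documents and fields as this model does for the same input.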
I need a collection with structure like this:
{
"_id" : ObjectId("5ffc3e2df14de59d7347564d"),
"name" : "MyName",
"pays" : "de",
"actif" : 1,
"details" : {
"pt" : {
"title" : "MongoTime PT",
"availability_message" : "In stock",
"price" : 23,
"stock" : 1,
"delivery_location" : "Portugal",
"price_shipping" : 0,
"updated_date" : ISODate("2022-03-01T20:07:20.119Z"),
"priority" : false,
"missing" : 1,
},
"fr" : {
"title" : "MongoTime FR",
"availability_message" : "En stock",
"price" : 33,
"stock" : 1,
"delivery_location" : "France",
"price_shipping" : 0,
"updated_date" : ISODate("2022-03-01T20:07:20.119Z"),
"priority" : false,
"missing" : 1,
}
}
}
How can I create an index for each subdocument in 'details'?
Or would it be better to use an array?
A query like this currently takes very long (1 hour). What can I do?
query = {"details.pt.missing": {"$in": [0, 1, 2, 3]}, "pays": 'de'}
db.find(query, {"_id": false, "name": true}, sort=[("details.pt.updated_date", 1)], limit=300)
An array type would be better, as there are advantages.
(1) You can include a new field which has values like pt, fr, xy, ab, etc. For example:
details: [
{ type: "pt", title : "MongoTime PT", missing: 1, other_fields: ... },
{ type: "fr", title : "MongoTime FR", missing: 1, other_fields: ... },
{ type: "xy", title : "MongoTime XY", missing: 2, other_fields: ... },
// ...
]
Note the introduction of the new field type (this can be any name representing the field data).
(2) You can also index on the array sub-document fields, which can improve query performance. Indexes on array fields are referred to as Multikey Indexes.
The index can be on a field used in a query filter. For example, "details.missing". This key can also be part of a Compound Index. This can help a query filter like below:
{ pays: "de", "details.type": "pt", "details.missing": { $in: [ 0, 1, 2, 3 ] } }
NOTE: You can verify the usage of an index in a query by generating a Query Plan, applying the explain method on the find.
(3) Also, see Embedded Document Pattern as explained in the Model One-to-Many Relationships with Embedded Documents.
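One caveat with the array layout, stated here as an assumption about how the filter is meant to behave: with dot notation, "details.type" and "details.missing" may be satisfied by different array elements, whereas $elemMatch requires one element to satisfy both. The plain JavaScript sketch below models the two behaviours on a made-up document; the variable names are illustrative only.

```javascript
// A made-up document: the "pt" entry has missing = 9, the "fr" entry has missing = 1.
const doc = {
  pays: "de",
  details: [
    { type: "pt", title: "MongoTime PT", missing: 9 },
    { type: "fr", title: "MongoTime FR", missing: 1 },
  ],
};
const wanted = [0, 1, 2, 3];

// Dot-notation semantics: each condition may be met by a different element.
const dotNotationMatches =
  doc.details.some((d) => d.type === "pt") &&
  doc.details.some((d) => wanted.includes(d.missing));

// $elemMatch semantics: one element must meet both conditions.
const elemMatchMatches = doc.details.some(
  (d) => d.type === "pt" && wanted.includes(d.missing)
);

// The corresponding MongoDB filter, built as plain data (not executed here).
const filter = {
  pays: "de",
  details: { $elemMatch: { type: "pt", missing: { $in: wanted } } },
};

console.log(dotNotationMatches, elemMatchMatches, JSON.stringify(filter));
```

If the intent is "the pt entry is missing 0 to 3", the $elemMatch form is the one that expresses it.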
I'm pretty new to Mongo and queries, so I'm trying to build a query that finds results matching any of these three dog breeds and, in addition, checks two more specs, and finally sorts everything by age. All the data comes from a CSV file (screenshot); there aren't any subcategories for any of the entries.
db.animals.find({
"animal_id" : 1,
"breed" : "Labrador Retriever Mix",
"breed" : "Chesapeake Bay Retriever",
"breed" : "Newfoundland",
$and : [ { "age_upon_outcome_in_weeks" :{"$lt" : 156, "$gte" : 26} ],
$and: {"sex_upon_outcome" : "Intact Female"}}).sort({"age_upon_outcome_in_weeks" : 1})
This is throwing a number of errors, such as :
Error: error: {
"ok" : 0,
"errmsg" : "$and must be an array",
"code" : 2,
"codeName" : "BadValue"
}
What am I messing up? Or is there a better way to do it?
As mentioned by takis in the comments, you cannot repeat a key in a mongo query: you have to imagine that your query document becomes a JSON object, and each time a key is repeated it replaces the previous one. To get around this problem, MongoDB supports the $or and $and operators. For complex queries like this one, I would recommend starting with a global $and, with each element containing a single constraint or an $or constraint. Your query becomes this:
db.coll.find({
"$and": [
{ "animal_id": 1 },
{ "age_upon_outcome_in_weeks": { "$lt": 156, "$gte": 26 } },
{ "sex_upon_outcome": "Intact Female" },
{ "$or": [
{ "breed": "Labrador Retriever Mix" },
{ "breed": "Chesapeake Bay Retriever" },
{ "breed": "Newfoundland" }
]
}
]
})
.sort({"age_upon_outcome_in_weeks" : 1})
--- edit
You can also consider using the $in instead of the $or:
db.coll.find({
"animal_id": 1,
"age_upon_outcome_in_weeks": { "$lt": 156, "$gte": 26 },
"sex_upon_outcome": "Intact Female",
"breed": { "$in": [
"Labrador Retriever Mix",
"Chesapeake Bay Retriever",
"Newfoundland"
] }
})
.sort({"age_upon_outcome_in_weeks" : 1})
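The key-overwriting behaviour described in this answer can be checked in plain JavaScript, since the shell builds the query document as an ordinary object. This is a small sketch; the variable names `repeated`, `breeds` and `filter` are made up for the illustration.

```javascript
// Repeating a key in an object literal keeps only the last value,
// so the first two breeds never reach the server.
const repeated = {
  breed: "Labrador Retriever Mix",
  breed: "Chesapeake Bay Retriever",
  breed: "Newfoundland",
};
console.log(Object.keys(repeated).length, repeated.breed);

// Building the filter from a list avoids repeated keys altogether.
const breeds = [
  "Labrador Retriever Mix",
  "Chesapeake Bay Retriever",
  "Newfoundland",
];
const filter = {
  animal_id: 1,
  age_upon_outcome_in_weeks: { $lt: 156, $gte: 26 },
  sex_upon_outcome: "Intact Female",
  breed: { $in: breeds },
};
console.log(JSON.stringify(filter));
```

The same applies to the two `$and` keys in the original query: only the second one survives, and because its value is an object rather than an array, the server rejects it.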
I have a records collection which has primary_id (unique), secondary_id, status fields among others. The ids are alphanumeric fields (ex. 'ABCD0000') and the status is a numeric (1 - 5).
One of the queries that would be frequently used is to filter by id (equality or range) and status.
examples:
records where primary_id between 'ABCD0000' - 'ABCN0000' and status is 2 or 3, sort by primary_id.
records where secondary_id between 'ABCD0000' - 'ABCD0000' and status is 2 or 3, sort by primary_id (or secondary_id if that would help).
The status in the filter will mostly be (status in (2,3)).
Initially we had a single index on each of the fields. But the query times out when the range is large. I have tried adding multiple indexes (single and compound) and different ways of writing the filter, but couldn't get decent performance. Now I have these indexes:
[
{primary_id: 1},
{secondary_id: 1},
{status: 1},
{primary_id: 1, status: 1},
{status: 1, primary_id: 1},
{status: 1, secondary_id: 1}
]
This query (with or without sort on primary_id)
{ $and: [
{ primary_id: { $gte: 'ABCD0000' } },
{ primary_id: { $lte: 'ABCN0000' } },
{status: { $in: [2,3] } }
] }
uses the following plan:
...
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"status" : {
"$in" : [
2,
3
]
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"primary_id" : 1
},
"indexName" : "primary_idx",
"isMultiKey" : false,
"multiKeyPaths" : {
"primary_id" : [ ]
},
"isUnique" : true,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"primary_id" : [
"[\"ABCD0000\", \"ABCN0000\"]"
]
}
}
},
So it seems that the FETCH step takes a long time if the number of returned rows is large. Surprisingly, while running initial tests, the {status, primary_id} compound index was sometimes picked as the winning plan, and that was super fast (a few seconds). But for some reason it is not being picked by Mongo anymore. I guess that when the query needs to sort by primary_id this compound index won't be picked, as I understood from the Mongo docs:
If the query does not specify an equality condition on an index prefix that precedes or overlaps with the sort specification, the operation will not efficiently use the index.
I tried to change the query as below but that is still not optimized
{$or: [
{ $and: [ { primary_id: { $gte: 'ABCD0000' } }, { primary_id: { $lte: 'ABCN0000' } }, { status: 2 } ]},
{ $and: [ { primary_id: { $gte: 'ABCD0000' } }, { primary_id: { $lte: 'ABCN0000' } }, { status: 3 } ]}
]}
Any suggestions on what would be a better indexing or query strategy?
I would try with 2 compound indexes:
{primary_id: 1, status: 1} and {secondary_id: 1, status: 1}.
If the timeout still occurs, can you increase the query timeout value, considering the large data set that you are trying to read from?
If those indexes don't help and a good response time is expected, then you should look at hardware constraints: is your hardware good enough (read about MongoDB's working set size)? Either scale up the server/hardware or look at sharding if performance is really a concern and your data size is going to grow.
Or store status 2 and 3 in separate collections to reduce the "working set size" while querying for those.
In my Meteor app, I have a collection of documents with an array of subdocuments that look like this:
/* 1 */
{
"_id" : "5xF9iDTj3reLDKNHh",
"name" : "Lorem ipsum",
"revisions" : [
{
"number" : 0,
"comment" : "Dolor sit amet",
"created" : ISODate("2016-02-11T01:22:45.588Z")
}
],
"number" : 1
}
/* 2 */
{
"_id" : "qTF8kEphNoB3eTNRA",
"name" : "Consecitur quinam",
"revisions" : [
{
"comment" : "Hoste ad poderiquem",
"number" : 1,
"created" : ISODate("2016-02-11T23:25:46.033Z")
},
{
"number" : 0,
"comment" : "Fagor questibilus",
"created" : ISODate("2016-02-11T01:22:45.588Z")
}
],
"number" : 2
}
What I want to do is query this collection and sort the result set by the maximum date in the created field of the revisions array. Something I haven't been able to pull off yet. Some constraints I have are:
Just sorting by revisions.created doesn't cut it, because the date used from the collection depends on the sort direction. I have to use the maximum date in the set regardless of sort order.
I cannot rely on post-query manipulation of an unsorted result set, so, this must be done by a proper query or aggregation by the database.
There's no guarantee that the revisions array will be pre-sorted.
There may be extra fields in some documents and those have to come along, so careful with $project.
Meteor is still using MongoDB 2.6, newer API features are no good :(
The basic problem with what you are asking here comes down to the fact that the data in question is within an "array", and therefore there are some basic assumptions made by MongoDB as to how this gets handled.
If you applied a sort in "descending order", then MongoDB will do exactly what you ask and sort the documents by the "largest" value of the specified field within the array:
.sort({ "revisions.created": -1 })
But if instead you sort in "ascending" order then of course the reverse is true and the "smallest" value is considered.
.sort({ "revisions.created": 1 })
So the only way of doing this means working out which is the maximum date from the data in the array, and then sorting on that result. This basically means applying .aggregate(), which for meteor is a server side operation, being unfortunately something like this:
Collection.aggregate([
{ "$unwind": "$revisions" },
{ "$group": {
"_id": "$_id",
"name": { "$first": "$name" },
"revisions": { "$push": "$revisions" },
"number": { "$first": "$number" },
"maxDate": { "$max": "$revisions.created" }
}},
{ "$sort": { "maxDate": 1 } }
])
Or at best with MongoDB 3.2, where $max can be applied directly to an array expression:
Collection.aggregate([
{ "$project": {
"name": 1,
"revisions": 1,
"number": 1,
"maxDate": {
"$max": {
"$map": {
"input": "$revisions",
"as": "el",
"in": "$$el.created"
}
}
}
}},
{ "$sort": { "maxDate": 1 } }
])
But really neither is that great. Even though the MongoDB 3.2 approach has far less overhead than what is available in prior versions, it is still not as good as you can get in terms of performance, because of the need to pass through the data and work out the value to sort on.
So for best performance, "always" keep such data you are going to need "outside" of the array. For this there is the $max "update" operator, which will only replace a value within the document "if" the provided value is "greater than" the existing value already there. i.e:
Collection.update(
{ "_id": "qTF8kEphNoB3eTNRA" },
{
"$push": {
"revisions": { "created": new Date("2016-02-01") }
},
"$max": { "maxDate": new Date("2016-02-01") }
}
)
This means that the value you want will "always" be already present within the document with the expected value, so it is just now a simple matter of sorting on that field:
.sort({ "maxDate": 1 })
So for my money, I would go through the existing data with either of the .aggregate() statements available, and use those results to update each document to contain a "maxDate" field. Then change the coding of all additions and revisions of array data to apply that $max "update" on every change.
Having a solid field rather than a calculation always makes much more sense if you are using it often enough. And the maintenance is quite simple.
In any case, with the example date applied above, which is "less than" the other maximum dates present, all forms return the following for me:
{
"_id" : "5xF9iDTj3reLDKNHh",
"name" : "Lorem ipsum",
"revisions" : [
{
"number" : 0,
"comment" : "Dolor sit amet",
"created" : ISODate("2016-02-11T01:22:45.588Z")
}
],
"number" : 1,
"maxDate" : ISODate("2016-02-11T01:22:45.588Z")
}
{
"_id" : "qTF8kEphNoB3eTNRA",
"name" : "Consecitur quinam",
"revisions" : [
{
"comment" : "Hoste ad poderiquem",
"number" : 1,
"created" : ISODate("2016-02-11T23:25:46.033Z")
},
{
"number" : 0,
"comment" : "Fagor questibilus",
"created" : ISODate("2016-02-11T01:22:45.588Z")
},
{
"created" : ISODate("2016-02-01T00:00:00Z")
}
],
"number" : 2,
"maxDate" : ISODate("2016-02-11T23:25:46.033Z")
}
Which correctly places the first document at the top of the sort order with consideration to the "maxDate".
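The maintenance described above (a one-off backfill, then $push together with $max on every change) can be modelled in plain JavaScript to see how "maxDate" behaves. This is only an in-memory sketch of the semantics, not Meteor or MongoDB code; `backfillMaxDate` and `addRevision` are hypothetical helper names.

```javascript
// One-off backfill: derive maxDate from the revisions already stored.
function backfillMaxDate(doc) {
  if (doc.revisions.length === 0) return doc; // nothing to derive from
  const times = doc.revisions.map((r) => r.created.getTime());
  doc.maxDate = new Date(Math.max(...times));
  return doc;
}

// Models { $push: { revisions: rev }, $max: { maxDate: rev.created } }.
function addRevision(doc, revision) {
  doc.revisions.push(revision);
  // $max only replaces the stored value when the new one is greater
  // (or when the field does not exist yet).
  if (doc.maxDate === undefined || revision.created > doc.maxDate) {
    doc.maxDate = revision.created;
  }
  return doc;
}

const doc = {
  _id: "qTF8kEphNoB3eTNRA",
  revisions: [
    { number: 1, created: new Date("2016-02-11T23:25:46.033Z") },
    { number: 0, created: new Date("2016-02-11T01:22:45.588Z") },
  ],
};

backfillMaxDate(doc);
// An older revision is appended, but maxDate keeps the later date.
addRevision(doc, { created: new Date("2016-02-01") });
console.log(doc.revisions.length, doc.maxDate.toISOString());
```

Because the comparison happens at write time, reads only need a plain sort on "maxDate", which can also be indexed.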
I am using the MongoDB aggregation framework, and this is what my normal object looks like:
{
"_id" : "6b109972c9bd9d16a09b70b96686f691bfe2f9b6",
"history" : [
{
"dtEntry" : 1428929906,
"type" : "I",
"refname" : "ref1"
},
{
"dtEntry" : 1429082064,
"type" : "U",
"refname" : "ref1"
}
],
"c" : "SomeVal",
"p" : "anotherVal"
}
Here history.dtEntry is an epoch value (please don't advise me to change this to an ISODate before it is stored; that is out of my scope).
I want to project c, p, history.type, and history.dtEntry as (day of month).
db.mydataset.aggregate({$project:{c:1,p:1,type:"$history.type",DtEntry:"$history.dtEntry",dater:{$dayOfMonth:new Date(DtEntry)}}})
If I use any epoch value directly, the day of month comes out just fine, but I have to pass the value of dtEntry, and none of the ways I tried seem to work for me.
I have tried
$dayOfMonth:new Date(DtEntry)
$dayOfMonth:new Date(history.DtEntry)
$dayOfMonth:new Date("history.DtEntry")
$dayOfMonth:new Date("$history.DtEntry")
I found another way that may help you; check the aggregation query below:
db.collectionName.aggregate({
"$unwind": "$history"
}, {
"$project": {
"c": 1,
"p": 1,
"type": "$history.type",
"dater": {
"$dayOfMonth": {
"$add": [new Date(0), {
"$multiply": ["$history.dtEntry", 1000]
}]
}
}
}
})
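The arithmetic in that $project can be reproduced in plain JavaScript to see what it computes: epoch seconds times 1000 gives milliseconds, and adding those to the zero date gives a real date. A `new Date(...)` written in the shell is evaluated on the client before the pipeline is sent, so it never sees the field value; this is why the conversion has to be expressed with $add and $multiply. The helper name `dayOfMonthUtc` below is made up, and UTC is assumed because $dayOfMonth works in UTC unless told otherwise.

```javascript
// Mirrors { $dayOfMonth: { $add: [new Date(0), { $multiply: [seconds, 1000] }] } }
function dayOfMonthUtc(epochSeconds) {
  const date = new Date(0 + epochSeconds * 1000); // new Date(0) is the epoch origin
  return date.getUTCDate();
}

// The two history entries from the sample document.
const history = [
  { dtEntry: 1428929906, type: "I", refname: "ref1" },
  { dtEntry: 1429082064, type: "U", refname: "ref1" },
];

// One output row per history entry, as after $unwind + $project.
const projected = history.map((h) => ({
  type: h.type,
  dater: dayOfMonthUtc(h.dtEntry),
}));

console.log(projected);
```

The two sample timestamps fall on 13 and 15 April 2015 (UTC), so the projected "dater" values are 13 and 15.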