Confusion with the order of logical operators in Mongo - mongodb

I am new to MongoDB, and while practising I came across a weird problem. The documents look like this:
{
    "_id" : ObjectId("5c8eccc1caa187d17ca6ed29"),
    "city" : "CLEVELAND",
    "zip" : "35049",
    "loc" : {
        "y" : 33.992106,
        "x" : 86.559355
    },
    "pop" : 2369,
    "state" : "AL"
}
...
I want to find the number of cities that have a population of more than 5000 but less than 1000000.
Both these queries, this:
db.zips.find({"$nor":[{"pop":{"$lt":5000}},{"pop":{"$gt":"1000000"}}]}).count()
and this:
db.zips.find({"$nor":[{"pop":{"$gt":1000000}},{"pop":{"$lt":"5000"}}]}).count()
give different results.
The first one gives 11193 and the second one gives 29470. Coming from a MySQL background, I see no difference between the two queries: to me both are the same and should return the number of zip codes with a population of less than 1000000 and more than 5000. Please help me understand.
Thanks in advance.

Comparison operators like $gt and $lt only match values of the same data type as the operand.
Your first query quoted "1000000" and your second query quoted "5000", so the two queries ended up not being the same: each one compares the numeric pop field against a number in one clause and against a string in the other, and the string clause never matches a numeric value.
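A corrected version of the query (a sketch; both bounds are written as numbers so they actually compare against the numeric pop field):

db.zips.find({ "pop" : { "$gt" : 5000, "$lt" : 1000000 } }).count()

The $nor form from the question works the same way once both bounds are numeric.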

Related

Mongo text search does not seem to use the initial condition to reduce the number of documents to search

I have a collection with text search enabled and an index on the FName field. If I perform a search like this:
db.items.find({ "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/ }).count()
Then this is very fast and returns a value of 300 as there are 300 matching documents.
If, however, I run this query:
db.items.find({ "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/, "$text" : { "$search" : "\"\\average rainfall\"" } }).count()
then it can take up to 20 seconds to complete. If I run the initial query, fetch all the documents and carry out the search client side, it is much faster: a second or two.
How can the client-side approach be faster, given that the second query is actually being run on the database? It almost looks like the FName filter is not being applied first, but I cannot prove this.
Any ideas?
Thanks
Ian
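One way to check what the combined query is doing is to inspect its plan with explain(). A sketch, reusing the query above; note that a $text predicate has to use the text index, so the FName condition is typically applied as a filter on the fetched documents rather than first:

db.items.find({
    "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/,
    "$text" : { "$search" : "\"\\average rainfall\"" }
}).explain()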

DocumentDb Compound Query really slow for Date

I am using a DocumentDB database (total size ~13 TB).
This is the schema of the indexes; every key has its own single-field index:
[
    { key : { "customerId" : 1 } },
    { key : { "typeOfProduct" : 1 } },
    { key : { "date" : 1 } }
]
1st query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>}).count()
returns 500 documents
2nd query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>, "typeOfProduct" : <TYPE>}).count()
returns 200 documents
3rd query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>, "date" : { "$gte" : NumberLong(1584055385383), "$lte" : NumberLong(1584141785383)}}).count()
the query just keeps running indefinitely
In the 2nd query, my understanding is that DocumentDB first gets all the documents matching the customerId (500 of them), and then iteratively checks each one for the matching typeOfProduct.
The 3rd query is supposed to work in a similar manner, but it just keeps running indefinitely.
Can someone explain why this is happening? Why is it so slow with the date? Is it because of the size of the database, or am I writing the query wrong?
It's not clear from the schema whether you have one three-field compound index or three single-field indexes.
Both the first and second queries use the {customerId: 1} index because they prefix-match it, and in general the third query could use it just as easily, skipping the second field.
It's not always obvious which index Mongo will pick, but knowing your data structure you can suggest one using hint(index) ({customerId: 1} in your case). To check which index is being used for a query by default, use explain().
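A sketch of both suggestions combined, with the placeholders from the question left in place:

db.collectionName.find({
    "customerId" : <SAMPLE-CUSTOMER>,
    "date" : { "$gte" : NumberLong(1584055385383), "$lte" : NumberLong(1584141785383) }
}).hint({ "customerId" : 1 }).explain("executionStats")

The executionStats output shows which index was chosen and how many index keys and documents were examined.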

Continuing a Query (paginating) on a compound index

I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
Using skip() anyway
Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : {$lt : lastDate}}, {"score" : {$lt : lastScore}}]}
Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
    "a" : {
        "score" : 9,
        "date" : ISODate("2014-02-22T00:00:00Z")
    },
    ...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore, date : lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
I would go with using $max to specify the upper bound, in conjunction with $hint to make sure the database uses the index you want. The reason $lt is in general preferred over $max is that $max selects the index using the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on the same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: $max is equivalent to $lte, not $lt, so when using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.
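A minimal sketch of that suggestion in the shell, assuming the {score: -1, date: -1} index from the question and a collection named comments (the name is a placeholder):

db.comments.find()
    .max({ score : lastScore, date : lastDate })
    .hint({ score : -1, date : -1 })
    .sort({ score : -1, date : -1 })
    .skip(1)   // per the note above: $max behaves like $lte, so drop the already-seen document
    .limit(n)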

MongoDB incremental mapReduce, select only new documents, added after last mapReduce

Let's say I have a collection with documents that look like this (just a simplified example, but it should show the shape):
> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }
Now I need to collect some statistics on that collection. For example:
db.data.mapReduce(function() {
    emit(this.type, this.value);
}, function(key, values) {
    var total = 0;
    for (var i in values) { total += values[i]; }
    return total;
}, { out : 'stat' })
will collect the totals in the 'stat' collection.
> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
At this point everything is perfect, but I'm stuck on the next step:
the 'data' collection is constantly updated with new data (old documents stay unchanged; there are only inserts, no updates)
I would like to periodically update the 'stat' collection, but I do not want to query the whole 'data' collection every time, so I chose to run an incremental mapReduce
It may seem better to just update the 'stat' collection on every insert into 'data' and not use mapReduce at all, but the real case is more complex than this example and I would like to get statistics only on demand
To do this I should be able to query only the documents that were added after my last mapReduce
As far as I understand, I cannot rely on the ObjectId property, i.e. store the last one and later select every document with ObjectId > stored, because ObjectId is not the equivalent of an autoincrement id in SQL databases (for example, different shards will produce different ObjectIds).
I can change the ObjectId generator, but I am not sure how to do that well in a sharded environment.
So the question is:
Is there any way to select only the documents added after the last mapReduce, so I can run an incremental mapReduce? Or is there another strategy to update statistics on a constantly growing collection?
You can cache the time and use it as a barrier for your next incremental map-reduce.
We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)
We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):
m = Code(<map function>)
r = Code(<reduce function>)
# pseudo code
end = last_time + 5 minutes
# Use time and optionally any other keys you need here
q = bson.SON([("date", {"$gte" : last_time, "$lt" : end})])
collection.map_reduce(m, r, out={"reduce": <output_collection>}, query=q)
Note that we used reduce and not merge as the output mode, because we don't want to overwrite what we had before; we want to combine the old results and the new results with the same reduce function.
You can get just the time portion of the ID using _id.getTime() (from: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.
EDIT: Sorry, that was the Java docs... The Ruby version appears to be _id.generation_time.in_time_zone(Time.zone), from http://mongotips.com/b/a-few-objectid-tricks/
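For reference, the mongo shell exposes the same information directly; getTimestamp() returns the 4-byte creation time embedded in the id (second precision):

ObjectId("4e9c1f27aa3dd60ee98282cf").getTimestamp()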
I wrote up a complete pymongo-based solution that uses incremental map-reduce and caches the time, and expects to run in a cron job. It locks itself so two can't run concurrently:
https://gist.github.com/2233072
""" This method performs an incremental map-reduce on any new data in 'source_table_name'
into 'target_table_name'. It can be run in a cron job, for instance, and on each execution will
process only the new, unprocessed records.
The set of data to be processed incrementally is determined non-invasively (meaning the source table is not
written to) by using the queued_date field 'source_queued_date_field_name'. When a record is ready to be processed,
simply set its queued_date (which should be indexed for efficiency). When incremental_map_reduce() is run, any documents
with queued_dates between the counter in 'counter_key' and 'max_datetime' will be map/reduced.
If reset is True, it will drop 'target_table_name' before starting.
If max_datetime is given, it will only process records up to that date.
If limit_items is given, it will only process (roughly) that many items. If multiple
items share the same date stamp (as specified in 'source_queued_date_field_name') then
it has to fetch all of those or it'll lose track, so it includes them all.
If unspecified/None, counter_key defaults to counter_table_name:LastMaxDatetime.
"""
We solved this issue using 'normalized' ObjectIds. The steps we take:
1. Normalize the id: take the timestamp from the current/stored/last processed id and set the other parts of the id to their minimum values. C# code: new ObjectId(objectId.Timestamp, 0, short.MinValue, 0)
2. Run map-reduce over all items that have an id greater than our normalized id, skipping already processed items.
3. Store the last processed id, and mark all processed items.
Note: some boundary items will be processed several times. To fix that, we set some sort of flag on the processed items.
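The same normalization can be sketched in the mongo shell. This assumes lastProcessedId holds the id stored by the previous run, and uses a hypothetical processed flag to mark items that were already handled:

// Keep the 4-byte timestamp prefix of the last processed id and
// zero out the remaining 8 bytes to build the barrier id.
var barrier = ObjectId(lastProcessedId.str.substr(0, 8) + "0000000000000000");
// Re-process everything at or after the barrier, skipping marked items.
db.data.find({ _id : { $gte : barrier }, processed : { $ne : true } })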

How to improve query performance with operators like $nin, $in for Mongodb

I have a reasonably large dataset of over 3 million documents that have tags, similar to the way StackOverflow uses tags for each question. The schema that I use for storing the tags is as follows:
{"id": 12345, "tags":["tag1", "tag2", "tag3"]}, {"id": 12346, "tags":["tag2", "tag3"]}
I have a multikey index created on the tags field. When I perform queries using the $in or $nin operators to find the intersection or union of tags, they take around 7 seconds on a server-class machine. Is there anything I can do to improve the query speed?
EDIT 1:
Here is the explain plan as requested. What I observed is that the queries returned much faster after I restarted my server and ran only the mongodb server: they completed in under 50 ms. I suspect the indexes were not cached in memory, although I had ample unused RAM available and my index (800 MB) could easily fit in memory.
db.tagsCollection.find( { "tags" : { $in : ['tag1', 'tag2'], $nin : ['tag4', 'tag5', 'tag6', 'tag7'] } } ).explain();
{
    "cursor" : "BtreeCursor tags_1 multi",
    "nscanned" : 6145193,
    "nscannedObjects" : 6145192,
    "n" : 969386,
    "millis" : 19640,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : {
        "tags" : [
            [
                "tag1",
                "tag1"
            ],
            [
                "tag2",
                "tag2"
            ]
        ]
    }
}
Note
This is what I thought of as an optimization (though you might need to test it):
instead of storing tags, store a small key which identifies all the tags a particular document has.
Say for post #125 the tags are: PHP, MongoDB, database.
a) Clean the tags (for example, convert all of them to lower case) and then sort them alphabetically.
The current tags will be: database, mongodb, php
b) Have a separate collection which stores an integer-to-tag mapping:
{ "_id" : 1 , "t" : "mongodb" }
{ "_id" : 2 , "t" : "php" }
and so on; store all the possible tags for your website.
c) To store a document, create the tag key using the tag-to-number map from the previous collection, so database, mongodb, php will become something like 12-1-2.
d) Store your document like:
{ "id" : 12345 , "tags" : [12, 1, 2] }
QUERYING:
Using integers instead of strings in an indexed field reduces the index size to a great extent, and also makes querying faster than with a string index.
I am not sure about the amount of performance gain, but it is still worth a try compared to your current implementation.
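A sketch of what querying looks like under this layout (the collection names tagMap and posts are placeholders):

// Look up the integer ids for the requested tags...
var ids = db.tagMap.find({ "t" : { $in : ["mongodb", "php"] } }).map(function(d) { return d._id; });
// ...then query the posts by integer id instead of by string.
db.posts.find({ "tags" : { $in : ids } })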
Check the size of your multikey tags index using db.col.stats(). If it doesn't fit in RAM then you might be disk-bound and incurring disk IO costs. If the index fits entirely in memory then I'm not sure what else you can do, apart from throwing more hardware at it, unless you can optimise the queries themselves.
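In the shell that check is a one-liner; indexSizes reports each index's size in bytes:

db.tagsCollection.stats().indexSizes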
Do you need to search through all the data, or can you query a subset that's filtered by another indexed field? You could also try eliminating the $nin clauses: they tend to be slower because they have to iterate over every tag, whereas $in only has to iterate until it finds a match.
If you want the queries to be super fast and don't have space constraints, I would suggest keeping a separate collection of tags, each with a video id array, and an index on the tag name.
Here is another suggestion, although I haven't had a chance to test it:
{
    tags : {
        items : [ 'a', 'b', 'c' ],
        mixed : {
            a : 1, // hash value for tag a
            b : 2, // hash value for tag b
            c : 3  // hash value for tag c
        }
    }
}
and the search query is:
db.demo.find({ 'tags.mixed.a' : 1, 'tags.mixed.b' : 2 })
If possible, create a compound index on the tags.mixed fields.
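A sketch of that index, matching the two fields used in the query above (per-tag paths like this only scale if the set of tags is small and known up front):

db.demo.ensureIndex({ 'tags.mixed.a' : 1, 'tags.mixed.b' : 1 })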