How is fine-grained search achievable with MongoDB, without the use of external engines? Take this object as an example:
{
genre: 'comedy',
pages: 380,
year: 2013,
bestseller: true,
author: 'John Doe'
}
That is being searched by the following:
db.books.find({
pages: { $gt: 100 },
year: { $gt: 2000 },
bestseller: true,
author: "John Doe"
});
Pretty straightforward so far. Now suppose that there are a few more values in the document, that I am making more refined searches, and that I have a pretty big collection.
The first thing I would do is create indexes. But how does that work? I have read that index intersection, as described in https://jira.mongodb.org/browse/SERVER-3071, is not supported. That means that if I create separate indexes on "year" and "pages", I will not really optimize the AND operations in searches.
So how can searches with many parameters be optimized?
Thanks in advance.
It seems like you are asking about compound indexes in MongoDB. Compound indexes allow you to create a single index on multiple fields in a document. By creating compound indexes you can run these large or complex queries while still using an index.
On a more general note, if you create a basic index on a field that is highly selective, your search can end up being very quick. Using your example, if you had an index on author, the query engine would use that index to find all the entries where author == "John Doe". Presumably there are not that many books by that specific author relative to the number of books in the entire collection. So, even if the rest of your query is fairly complex, it is only evaluated over those few documents with the matching author. Thus, by structuring your indexes properly you can get a significant performance gain without needing any complex indexes.
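As a rough sketch of how a compound index for the query in the question could be laid out, the helper below derives an index spec from a filter. The function names are made up for illustration, and putting equality fields before range fields is a common guideline I am assuming here, not something the question states; which equality field should lead (ideally the most selective one, such as author) is still your call.

```javascript
// Sketch: derive a compound index spec from a query filter.
// Assumption: equality fields go first, range fields last (a common guideline).
const RANGE_OPS = ["$gt", "$gte", "$lt", "$lte"];

// True when the condition is an operator object such as { $gt: 100 }.
function isRangeCondition(value) {
  return value !== null && typeof value === "object" &&
    Object.keys(value).some((op) => RANGE_OPS.includes(op));
}

// Hypothetical helper: returns an ascending index spec for the filter.
function suggestIndexSpec(filter) {
  const equality = [];
  const range = [];
  for (const [field, value] of Object.entries(filter)) {
    (isRangeCondition(value) ? range : equality).push(field);
  }
  const spec = {};
  for (const field of [...equality, ...range]) spec[field] = 1;
  return spec;
}

const filter = {
  pages: { $gt: 100 },
  year: { $gt: 2000 },
  bestseller: true,
  author: "John Doe",
};
const indexSpec = suggestIndexSpec(filter);
console.log(JSON.stringify(indexSpec));
```

In the mongo shell, a spec like this could then be passed to db.books.createIndex(indexSpec).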
I am a beginner at MongoDB.
I am using version 3.2.
I read in several places that MongoDB can use only one index in a query, but the pieces of information I found seem a little bit outdated, and I couldn't find something in the official docs.
I have a collection of ~500M products with this form:
{_id: ObjectId('574d92332a2b10d7618b4575'), title: 'A', category_id: ObjectId('574d92332a2b10d7618b4575'), price: 30.23, rating: 5 },
{_id: ObjectId('574d92332a2b10d7618b4575'), title: 'B', category_id: ObjectId('574d92332a2b10d7618b4575'), price: 20.23, rating: 3 },
{_id: ObjectId('574d92332a2b10d7618b4575'), title: 'C', category_id: ObjectId('574d92332a2b10d7618b4575'), price: 10.23, rating: 4 }
I need to find all products per category and sort them by rating, then by price, but the end user may also want to sort by price directly.
Every single query will need the category_id to be passed; it is compulsory.
I created 3 indexes: {category_id:1}, {rating:1} and {price:1}.
These queries are fast:
Most expensive products per category
db.products.find({category_id:ObjectId('574d92332a2b10d7618b4575')}).sort({price:-1})
Best products per category
db.products.find({category_id:ObjectId('574d92332a2b10d7618b4575')}).sort({rating:-1})
Worst products per category
db.products.find({category_id:ObjectId('574d92332a2b10d7618b4575')}).sort({rating:1})
But this query is incredibly slow
Best products per category, then cheapest
db.products.find({category_id:ObjectId('574d92332a2b10d7618b4575')}).sort({rating:-1, price:1})
If you were me, which indexes would you create, and why?
I'm starting to think that having price and rating alone is stupid, because every query will need the category_id, so maybe my indexes should include category_id, but what confuses me is the last paragraph of the official doc about compound indexes.
I already read this whole section on the official page of MongoDB but I can't find an answer to my specific problem.
You should create compound indexes to satisfy your queries, and they should in most cases include your query terms and your sort criteria.
The confusing paragraph that I believe you are referring to is about multiple sort criteria, i.e. a compound sort. When you have a compound sort, both the order and the direction of the index entries matter. If you're only sorting by a single value, the direction of the index (1 or -1, ascending or descending) does not matter.
See this SO question for more details and examples. Another good resource is this Optimizing Compound Indexes blog post.
You might want to consider whether you really need to allow such a compound sort. In your example, it seems more common on most e-commerce sites to sort by either rating or price, but not both.
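To make the order-and-direction rule concrete, here is a small sketch that checks whether a compound index can return results already sorted. The function name is hypothetical and the check is simplified: it assumes the equality-filtered fields are the leading keys of the index and that the sort keys must follow them directly, either all in the index direction or all reversed.

```javascript
// Sketch: can a compound index return documents already sorted for a query?
// Assumption: all equality-filtered fields form the leading keys of the index.
function indexSupportsSort(indexSpec, equalityFields, sortSpec) {
  const indexKeys = Object.keys(indexSpec);
  // Skip the leading index keys that are pinned by equality conditions.
  let pos = 0;
  while (pos < indexKeys.length && equalityFields.includes(indexKeys[pos])) pos++;

  const sortKeys = Object.keys(sortSpec);
  if (sortKeys.length === 0 || pos + sortKeys.length > indexKeys.length) return false;

  let sameDirection = true;    // index walked forwards
  let reverseDirection = true; // index walked backwards
  sortKeys.forEach((key, i) => {
    const indexKey = indexKeys[pos + i];
    if (indexKey !== key) {
      sameDirection = false;
      reverseDirection = false;
      return;
    }
    if (indexSpec[indexKey] !== sortSpec[key]) sameDirection = false;
    if (indexSpec[indexKey] !== -sortSpec[key]) reverseDirection = false;
  });
  return sameDirection || reverseDirection;
}

const compoundIndex = { category_id: 1, rating: -1, price: 1 };
const pinned = ["category_id"];
console.log(indexSupportsSort(compoundIndex, pinned, { rating: -1, price: 1 }));
console.log(indexSupportsSort(compoundIndex, pinned, { rating: -1, price: -1 }));
console.log(indexSupportsSort({ rating: 1 }, pinned, { rating: -1, price: 1 }));
```

Under these assumptions, an index such as {category_id: 1, rating: -1, price: 1} would cover the "best, then cheapest" sort from the question, while the single-field {rating: 1} index would not.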
Use a compound index.
MongoDB usually uses only one index when executing a query, unless there is an OR condition. Use .explain("executionStats") to check which one:
db.collection.find({your query}).explain("executionStats")
If you run the above query, the result contains a "queryPlanner" object with the details of winningPlan (the plan, and therefore the index, that was finally chosen) and rejectedPlans (the plans that were initially considered but lost to the winning one).
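If you want to pull the index names out of that result programmatically, a minimal sketch could look like the following. The helper name is made up, the sample object is hand-written to resemble explain output (it is not copied from a real server), and the code assumes the classic plan shape where stages nest through inputStage or inputStages.

```javascript
// Sketch: list the indexes used by the winning plan of an explain() result.
// Assumption: stages nest via "inputStage" (single child) or "inputStages" (e.g. OR).
function collectIndexNames(stage, found = []) {
  if (!stage) return found;
  if (stage.stage === "IXSCAN" && stage.indexName) found.push(stage.indexName);
  if (stage.inputStage) collectIndexNames(stage.inputStage, found);
  for (const child of stage.inputStages || []) collectIndexNames(child, found);
  return found;
}

// Hand-written sample shaped like explain output; not taken from a real server.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "category_id_1_rating_-1_price_1" },
    },
    rejectedPlans: [],
  },
};
const usedIndexes = collectIndexNames(sampleExplain.queryPlanner.winningPlan);
console.log(usedIndexes);
```

An empty result would suggest that the winning plan is a collection scan.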
I've read a lot of documentation and examples here on Stack Overflow, but I'm not really sure about my conclusions, so this is why I'm asking for help.
Imagine we have a collection Films and a collection Users, and we want to know which users have seen a film and which films a user has seen.
One way to design this in MongoDB is:
User:
{
"name":"User1",
"films":[filmId1, filmId2, filmId3, filmId4] //ObjectIds from Films
}
Film:
{
"name": "The incredible MongoDb Developer",
"watched_by": [userId1, userId2, userId3] //ObjectsIds from User
}
OK, this may work if the number of users/films is low, but if, for example, we expect that one film will have 800k users, the size of the array will be at least 800k * 12 bytes ~ 9.6 MB, which is close to the 16MB maximum for a BSON document.
In this case, is there another approach besides the typical relational-world way of creating an intermediate collection for the relations?
Also, I don't know whether reading and parsing a JSON document of about 10MB will perform better than the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
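To put a number on it: the 12 bytes per ObjectId is a lower bound, because BSON stores an array as a document keyed by the index ("0", "1", ...), so each element also carries a type byte and its key string. The sketch below (the function name is mine) estimates the encoded size of such an array on that basis.

```javascript
// Sketch: encoded size in bytes of a BSON array holding `count` ObjectIds.
// BSON arrays are documents whose keys are the decimal indexes "0", "1", ...
function bsonObjectIdArraySize(count) {
  let size = 4 + 1; // int32 length prefix + trailing null byte
  for (let i = 0; i < count; i++) {
    // type byte + index key as C string (digits + null) + 12-byte ObjectId
    size += 1 + String(i).length + 1 + 12;
  }
  return size;
}

const MAX_BSON_BYTES = 16 * 1024 * 1024; // 16MB document limit
const arrayBytes = bsonObjectIdArraySize(800000);
console.log(arrayBytes, arrayBytes < MAX_BSON_BYTES);
// prints: 15888895 true
```

So with 800k viewers the array alone would come to roughly 15.9 MB, which leaves very little headroom under the limit.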
Putting the films a user has seen into an array is a viable way, depending on your use cases. However, if you want relations with attributes (say, date and place of viewing), updates and statistical analysis become less performant with embedded arrays (you would need to $unwind your docs first, subsequent $match stages become more costly, and so on).
If your relations have or may have attributes, I'd go with what you describe as the classical relational way, since it answers your most likely use cases as well as embedding does and, in my experience, allows for higher performance:
Given a collection with a structure like
{
_id: someObjectId,
date: ISODate("2016-05-05T03:42:00Z"),
movie: "nameOfMovie",
user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies has he seen in the last 3 months, in descending order of date?
db.views.aggregate([
{$match:{user:userName, date:{$gte:threeMonthAgo}}},
{$sort:{date:-1}},
{$group:{_id:"$user",viewed:{$push:{movie:"$movie",date:"$date"}}}}
])
or, if you are ok with an iterator, even easier with:
db.views.find({user:userName, date:{$gte:threeMonthAgo}}).sort({date:-1})
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
{$match:{
movie:movieName,
date:{
$gte:ISODate("2016-05-30T00:00:00"),
$lt:ISODate("2016-05-31T00:00:00")}
}},
{$group:{
_id: "$movie",
views: {$sum:1}
}}
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645
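If you need this count for arbitrary days, the pipeline can be built by a small helper instead of hard-coding the dates. This is only a sketch: the function names are mine, it uses plain JavaScript Date objects instead of the shell's ISODate, and it assumes that "a day" means a UTC day.

```javascript
// Sketch: build the half-open UTC range [start of day, start of next day).
function utcDayRange(year, month, day) {
  const start = new Date(Date.UTC(year, month - 1, day));
  const end = new Date(Date.UTC(year, month - 1, day + 1)); // rolls over at month end
  return { $gte: start, $lt: end };
}

// Hypothetical helper: pipeline counting the views of one movie on one day.
function viewsPerDayPipeline(movieName, year, month, day) {
  return [
    { $match: { movie: movieName, date: utcDayRange(year, month, day) } },
    { $group: { _id: "$movie", views: { $sum: 1 } } },
  ];
}

const pipeline = viewsPerDayPipeline("nameOfMovie", 2016, 5, 30);
console.log(JSON.stringify(pipeline[0]));
```

The resulting array could then be passed to db.views.aggregate(pipeline).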
For a given movie, show all users who have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
There is a thing to note: Since we used the usernames and movie names, respectively, we do not need a JOIN (or something similar), which should give us good performance. Plus we do not have to do rather costly update operations when adding entries. Instead of an update, we simply insert the data.
Background
I am storing table rows as MongoDB documents, with each column having a name. Let's say the table has these columns of interest: Identifier, Person, Date, Count. The MongoDB document also has some extra fields separate from the table data, such as a timestamp. Columns are not fixed (which is why I use a schema-free database to store them in the first place).
There will be a need to run various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say, 1000 new MongoDB documents per day). So the amount of data will steadily grow over time.
Example
The straightforward approach is having a collection of MongoDB documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example in the accepted answer of this question), using an array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how do I choose which one to use? In particular, are there specific kinds of queries which are easy/cheap with one approach and hard/costly with the other? Any "rules of thumb" on which way to go, or pro-con lists for both? Real-life examples of one approach being inconvenient would be especially valuable.
In your specific example the first version is a lot more appropriate and simple. You have to think in terms of how you would query your document.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like this:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume MongoDB would generate better query plans because the structure is a lot simpler (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name
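To see how the two layouts differ at query time, here is a small sketch that converts the embedded object into the field/value array and builds the matching filter for each form. All three function names are made up for illustration; $elemMatch is the operator I would expect to need for the array form, so that field and value are matched on the same array entry.

```javascript
// Sketch: convert { Identifier: "AB002", ... } into [{ field, value }, ...].
function toFieldValueArray(data) {
  return Object.entries(data).map(([field, value]) => ({ field, value }));
}

// Filter for the first layout (embedded object): plain dot notation.
function embeddedQuery(field, value) {
  return { ["data." + field]: value };
}

// Filter for the second layout (array): both conditions on the same entry.
function fieldValueQuery(field, value) {
  return { data: { $elemMatch: { field, value } } };
}

const asArray = toFieldValueArray({ Identifier: "AB002", Count: 1 });
console.log(JSON.stringify(asArray));
console.log(JSON.stringify(embeddedQuery("Identifier", "AB002")));
console.log(JSON.stringify(fieldValueQuery("Identifier", "AB002")));
```

With the array layout, one compound index on {"data.field": 1, "data.value": 1} could serve lookups on any column, whereas the embedded layout needs an index per column you want to search.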
I don't think that the situation in the other example here and yours are the same. In the other example, they're creating a list of items with one of two answers, which is more appropriately stored in an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object, since the fields all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.
I have a list of about 50 tags in an array, and want to search through my documents to find records that match these tags.
Because they're user-submitted and MongoDB is case-sensitive, I'm using /wildcard/i as a means of searching. I know this is not the fastest way to do a search but I can't think of a better solution.
I can do my query in two ways. The first is to run a for loop over my tags array, and for each result, perform:
db.collection.find({tags: /<tag[x]>/i})
Or, I can collect all of the tags and run one single lookup using $or, like so:
db.collection.find({$or:[{tags:/<tag1>/i},{tags:/<tag2>/i},{tags:/<tag3>/i}, ... {tags:/<tag50>/i}]});
I have tried both, and found using $or to be significantly faster - but because of the work-in-progress state of my application, it's very difficult to tell whether this is because it's actually faster or whether my app is causing significant overhead in other areas (it is).
So for clarification, in MongoDB is a big query performed once faster than small queries performed many times?
EDIT: Another example would be whether looking up 3 individual records based on _id is faster than doing one lookup using {$or:[{_id: ObjectId([id1])},{_id: ObjectId([id2])},{_id: ObjectId([id3])}]}. Is less more?
I recommend you adjust your schema so it keeps a normalized array of tags. When you insert a new document, do it like this:
tags : [ "business", "Computing", "PayPal" ],
lowercaseTags : [ "business", "computing", "paypal" ]
Similarly when you update the tags, update both arrays.
Create an index on lowercaseTags, and then when you want to query them, use a single query with the $in operator, and the normalized form of the search terms.
For example, to search for business iTunes YouTube, use this query:
db.collection.find( { lowercaseTags : { $in: [ "business", "itunes", "youtube" ] } } )
This answer gives an example of this approach. It should be loads faster than what you have.
An alternative approach is to create a text index and query it with the text command (in MongoDB 2.6 and later, the $text query operator).
Both of these approaches are geared toward index optimization, and designing your schema to work well with Mongo. The payoff should be a lot higher than whatever difference there is between a single $or query and 50 simpler queries.
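A minimal sketch of the normalization step, assuming plain lowercasing is enough for your tags (the helper names are hypothetical): the same function is applied when writing a document and when building the search filter, so both sides always agree.

```javascript
// Sketch: lowercase every tag so that stored and searched forms match.
function normalizeTags(tags) {
  return tags.map((tag) => tag.toLowerCase());
}

// Adds the normalized array next to the original, user-submitted tags.
function withNormalizedTags(doc) {
  return { ...doc, lowercaseTags: normalizeTags(doc.tags) };
}

// Builds a single $in filter against the indexed, normalized array.
function tagQuery(searchTerms) {
  return { lowercaseTags: { $in: normalizeTags(searchTerms) } };
}

const stored = withNormalizedTags({ tags: ["business", "Computing", "PayPal"] });
console.log(JSON.stringify(stored));
console.log(JSON.stringify(tagQuery(["business", "iTunes", "YouTube"])));
```

The object returned by tagQuery could be passed straight to db.collection.find(), replacing the 50 regex lookups with one indexed query.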
Suppose I have a collection in a mongo database with the following documents
{
"name" : "abc",
"email": "abc#xyz.com",
"phone" : "+91 1234567890"
}
The collection has a lot of objects (a million or so), and my application, apart from regularly adding objects to this collection, does a few different types of finds on this data.
One method does a find with all three attributes (name, email and phone), so I can make a composite index for those three fields to make sure this find works efficiently.
db.mycollection.ensureIndex({name:1,email:1,phone:1})
Now, I also have methods in my application which fetch all the objects with the same name (bad example, I know). So I need an index for the name field.
db.mycollection.ensureIndex({name:1})
Gradually, my application grows to a point where I have to index the other fields.
Now, my question. If I have each of the attributes indexed individually, does it still make sense to maintain composite indices for all three attributes (or 2 of the attributes)?
Obviously, this is a bad example... If I were making a collection to store multiple contact info for a person, I'd use arrays. But, this question is purely about the indexes.
It depends on your queries.
If you are doing a query such as:
db.mycollection.find({"name": "abc", email: "abc#xyz.com", phone: "+91 1234567890"});
then a composite index would be the most efficient.
Just to answer my own question for the sake of completeness:
A compound index does not mean that each of the individual attributes is indexed. Only a leading prefix of the compound index (here name, or name and email) can be used efficiently in a find on its own. The idea is to strike a balance and optimize queries, as too many indexes increase disk storage and insertion time.
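The prefix rule can be illustrated with a tiny sketch (the function name is mine): it counts how many leading keys of the compound index are present in the query, which is the part of the index that can be used to narrow the search.

```javascript
// Sketch: number of leading index keys that appear in the query's fields.
// A result of 0 means the compound index cannot narrow this query.
function usablePrefixLength(indexSpec, queryFields) {
  let length = 0;
  for (const key of Object.keys(indexSpec)) {
    if (!queryFields.includes(key)) break;
    length++;
  }
  return length;
}

const compound = { name: 1, email: 1, phone: 1 };
console.log(usablePrefixLength(compound, ["name"]));                   // 1
console.log(usablePrefixLength(compound, ["email"]));                  // 0
console.log(usablePrefixLength(compound, ["name", "phone"]));          // 1
console.log(usablePrefixLength(compound, ["name", "email", "phone"])); // 3
```

So with {name: 1, email: 1, phone: 1} in place, a separate {name: 1} index is redundant, while finds on email or phone alone would still need their own indexes.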