MongoDB: Optimization of Performance: Aggregation Pipeline (one collection) vs. Aggregation plus Additional Query on Separate Collection

I would like to know which approach is faster for querying in MongoDB.
Let's say I would like to search for income information based on geographic areas. A person can have many residences in different states, and each polygon area has an associated income for that individual.
I have outlined the options for querying this information below; I would like to know which would be fastest to search.
1) Have a single collection containing two types of documents.
Document1 has polygons with a 2dsphere geospatial index on them. It will be searched with aggregation to return ids that link to Document2, essentially taking the place of a relation in MySQL.
Document2 has other information (let's say income amount) and different indexes, including an index on income amount, plus an id that the first document references.
The two document types are searched with a single aggregation pipeline.
Stage 1 of the pipeline: search Document1 geospatially and collect the matching id values.
Stage 2 of the pipeline: use the ids found in Document1 to search Document2, further filtered by income type.
2) Separate the documents so that each type has its own collection, avoiding aggregation: query collection1 geospatially, then use the person ids found to query collection2 for income info.
3) A third, polyglot option combining MongoDB and PostGIS: query PostGIS for the ids, then use them to search the MongoDB collection. I include this option because I believe PostGIS is faster than MongoDB for geospatial querying, but I am curious whether that speed advantage is cancelled out by the latency of now querying two databases.
The end goal is to pull back data based on a geospatial radius. Each geospatial polygon represents an area where the person lives and does business for that income. Many polygons map to one relational id, and each relational id maps to many sets of data; essentially I have a many-to-one-to-many relationship: many geospatial areas map to one person, who maps to many data sets.

You should generally keep collections limited to a single type of document.
Solution 1 will not work. You cannot use the aggregation pipeline the way you are describing (if I'm understanding you correctly). Also, it sounds as though you are thinking in a relational way about a solution using a non-relational database.
Solution 2 will work but it will not have optimum performance. This solution sounds even more like a relational database solution where the collections are being treated like tables.
Solution 3 will probably work but as you said it will now require two databases.
All three of these solutions progressively pull the two types of documents further and further away from one another. I believe the best solution for a document database like MongoDB is to embed this data. Without a real example of your documents and a clear understanding of your application it's impossible to suggest an exact solution, but in general embedding data is preferred over creating relationships between documents in MongoDB. As long as no document will ever exceed the 16MB limit, it's worth looking into whether embedding is the right solution.
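To make the embedding suggestion concrete, here is a minimal sketch. All field names and the document shape are my assumptions, not taken from the question; the mongo shell commands appear only as comments, and the document and query are plain objects so the snippet runs without a server.

```javascript
// Hypothetical embedded design: one document per person, with each income
// record and its coverage area nested inside (names are illustrative).
const person = {
  _id: "person1",
  name: "Jane Doe",
  incomes: [
    {
      amount: 52000,
      type: "salary",
      area: {                       // GeoJSON polygon; ring is closed
        type: "Polygon",
        coordinates: [[[-73.99, 40.73], [-73.98, 40.73],
                       [-73.98, 40.74], [-73.99, 40.74],
                       [-73.99, 40.73]]]
      }
    }
  ]
};

// In the shell you would index and query roughly like this:
//   db.people.createIndex({ "incomes.area": "2dsphere" })
//   db.people.find(geoQuery)
// geoQuery finds people whose area contains a search point; for a true
// radius search you could instead pass a circle-like polygon to $geoIntersects.
const geoQuery = {
  "incomes.area": {
    $geoIntersects: {
      $geometry: { type: "Point", coordinates: [-73.985, 40.735] }
    }
  }
};
```

With this shape, one query on one collection replaces both pipeline stages from option 1: the geospatial match and the income data come back in the same document.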

Related

Are there technical downsides to using a single collection over multiple collections in MongoDB?

Since MongoDB is schemaless, I could just drop all my documents into a single collection, with a "collection" key and an index on that key.
For example this:
db.getCollection('dogs').find()
db.getCollection('cars').find()
Would become this:
db.getCollection('all').find({'collection': 'dogs'})
db.getCollection('all').find({'collection': 'cars'})
Is there any technical downside to doing this?
There are multiple reasons to have different collections; perhaps the two most important are:
Performance: even though MongoDB has been designed to be flexible, you still need indexes on the fields that will be used during searches. You would see dramatic response times if the collection were too heterogeneous.
Maintainability/evolvability: the design should be driven by the use cases (usually you'll store the data as it's received by the application), and the design should be explicit to anyone looking at the database collections.
MongoDB University is a great, free e-learning platform; in particular there is this course:
M320: Data Modeling
Schema questions are often best understood by working backwards from the queries you'll rely on and how the data will get written. If you are going to query Field1 AND Field2 together in one query statement, you do want them in the same collection. Dogs and cars don't sound very related, while dogs and cats do, so really look at how you're going to want to query. Joining collections is not really ideal; it's doable via $lookup, but not ideal.
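As a sketch of the single-collection pattern from the question (every field beyond the "collection" discriminator is my own illustration), the discriminator can lead a compound index so each query touches only one document type:

```javascript
// Heterogeneous documents sharing one collection, distinguished by a
// "collection" (discriminator) field, as in the question.
const docs = [
  { collection: "dogs", name: "Rex",   breed: "beagle" },
  { collection: "cars", name: "Civic", make: "Honda"  }
];

// Shell equivalents (require a running mongod):
//   db.all.createIndex({ collection: 1, name: 1 })
//   db.all.find(dogQuery)
const indexSpec = { collection: 1, name: 1 };  // discriminator is the index prefix
const dogQuery  = { collection: "dogs", name: "Rex" };

// Because the query's equality fields match the index prefix, only "dogs"
// index entries are examined; "cars" documents are never scanned.
```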

Reading similar data from more than two collections in MongoDB

I am a novice MongoDB user. In our application the data size for each table is quite large, so I decided to split it into different collections even though it is all the same kind of data. The only difference is the "id" of each document (the documents in one collection fall under one category). So we decided to insert the data into a number of collections, each holding a certain number of documents; currently I have 10 collections of the same kind of document data.
My requirements are:
1) to get the data from all the collections in a single query to display on the application home page;
2) to sort and filter the data before fetching it.
I have gone through some posts on Stack Overflow saying to use MongoDB 3.2's $lookup aggregation for this requirement, but I suspect that if I use $lookup across 10 collections there might be performance issues and the query would be too complex, since I have divided my data of the same kind into a number of collections (each collection holds the documents of one category; I have 10 categories, so I need to use 10 collections).
Could anybody please suggest whether my approach is correct?
If you have a lot of data, how could you display all of it on a web page?
My understanding is that you will only display a portion of the dataset by querying the database. Since you didn't mention how many records you have, it's not easy to make a recommendation.
Based on the vague description, sharding is the solution; you should check out the official docs.
However, before you shard, and since you mentioned you are a novice user, you probably want to check your database's indexing and data models, and benchmark your performance first.
Hope this helps.
You should store all 10 types of data in 1 collection, not 10. Don't make things more difficult than they need to be.
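One way to sketch that advice (all names and values here are assumptions for illustration): a "category" field replaces the 10 collections, and one indexed query serves the home page. The shell commands need a server, so the sort/limit is also simulated in plain JavaScript:

```javascript
// The 10 collections collapse into one; a "category" field keeps the split.
const items = [
  { category: "books",  title: "A", createdAt: 3 },
  { category: "movies", title: "B", createdAt: 2 },
  { category: "books",  title: "C", createdAt: 1 }
];

// Shell sketch:
//   db.items.createIndex({ category: 1, createdAt: -1 })
//   db.items.find({}).sort({ createdAt: -1 }).limit(20)            // home page
//   db.items.find({ category: "books" }).sort({ createdAt: -1 })   // one category

// The same newest-first page, simulated in plain JavaScript:
const page = items
  .slice()                                    // don't mutate the source array
  .sort((a, b) => b.createdAt - a.createdAt)  // newest first
  .slice(0, 20);                              // page size 20
```

No $lookup is needed at all: both the "everything" view and the per-category view are single-collection queries.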

All vs All comparisons on MongoDB

We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However, we have one use case where we will need to compare every document (of which there could be tens of millions) with every other document. The 'distance measure' could be pre-computed offline by another system, but we are concerned about MongoDB's online performance when we want to query, e.g. when we want to see the top 10 closest documents in the entire collection to a list of specific documents.
Is this likely to be slow? Also, can this be done across collections (e.g. query for the top 10 closest documents in one collection to a document in another collection)?
Thanks in advance,
FK

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging the different actions users make on our website. Each action can be of a different type: a comment, a search query, a page view, a vote, etc. Each of these types has its own schema plus some common fields. For instance:
comment: {"_id": (mongoId), "type": "comment", "date": 4/7/2012, "user": "Franck", "text": "This is a sample comment"}
search: {"_id": (mongoId), "type": "search", "date": 4/6/2012, "user": "Franck", "query": "mongodb"}
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDB is schemaless, I'm inclined to set up a single collection ("Actions") where I would store these objects, instead of multiple collections (an Actions collection plus a Comments collection with a link key to its parent Action, etc.).
My question is: what about performance / response time if I try to search by type-specific fields?
As I understand indexing best practices, if I want "every user searching for mongodb", I would index the "type" and "query" fields. But the index would not cover the whole data set, only documents of type "search".
Will the MongoDB engine scan the whole collection, or only consider the documents having this specific schema?
If you create a sparse index, MongoDB will skip any documents that don't have the indexed key. One caveat: a compound sparse index includes a document as soon as it has any of the indexed fields, so if you need a precise filter such as type: "search", partial indexes (MongoDB 3.2+) let you express that directly with a filter expression.
However, if you are only going to query using common fields there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs - there's no reason to create multiple collections
Tip: use date objects for dates, use object ids not names where appropriate.
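To illustrate the indexing options discussed above (the "actions" collection name and the shell commands are assumptions; the specs are plain objects so the snippet runs anywhere):

```javascript
// Compound index answering "every user searching for mongodb":
//   db.actions.createIndex({ type: 1, query: 1 })
const searchIndex = { type: 1, query: 1 };
const searchQuery = { type: "search", query: "mongodb" };

// A partial index (MongoDB 3.2+) keeps only "search" actions in the index,
// which fits this query better than a sparse index would:
//   db.actions.createIndex({ query: 1 },
//     { partialFilterExpression: { type: "search" } })
const partialSpec = {
  key: { query: 1 },
  options: { partialFilterExpression: { type: "search" } }
};
```

Either way, queries that include the indexed equality fields never scan the comment/vote/page-view documents; the index restricts the work to the "search" subset.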
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data for a record is stored in a single document, the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document. For large media documents, such as video, consider using GridFS, a convention implemented by all the drivers that stores the binary data across many smaller documents.

should I create one or many collections in mongodb in order to insert and search faster?

I am fairly new to MongoDB. I am creating a web app which allows inserting and searching for multiple products such as laptops, hard drives, webcams... My question is: should I place all of them in the same collection, such as "computer", or should I place each product in its own collection, like "laptop", "hard drive", "webcam", so that it will be faster when users search for and insert products?
Thanks a lot
Generally speaking you should use one collection per "type" of thing you're storing. It sounds like all the examples you've given above fall under a product "type" and should be in the same collection. Documents in the same collection need not all have the same fields, though for products you will probably have several fields in common across all documents: name, price, manufacturer, etc. Each document "sub-type" might also have several fields in common; hard drives might all have RPM, storage capacity, form factor, and interface (SATA2/3, IDE, etc.).
The rationale for this advice is that MongoDB queries are performed on a single collection at a time. If you want to show search results that cover the different categories of products you have, then this is simple with one collection, but more difficult with several (and less performant).
As far as query performance is concerned, be sure to create indexes on the fields that you are searching on. If you allow search by product name or manufacturer, you would have an index on name, and another index on manufacturer. Your exact indexes will vary depending on the fields you have in your documents.
Insert speed, on the other hand, will be faster the fewer indexes you have (since each index potentially needs to be updated each time you save or update a document), so it is important not to create more indexes than you'll actually need.
For more on these topics, see the MongoDB docs on schema design and indexes, as well as the presentations on 10gen.com from 10gen speakers and MongoDB users.
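A small sketch of the indexing advice above; the collection name and field names (name, manufacturer, price, subtype) are assumptions, and the shell commands are shown only as comments:

```javascript
// One "products" collection holding every sub-type of product.
const product = {
  name: "ThinkPad X1", manufacturer: "Lenovo", price: 1400,
  subtype: "laptop"
};

// Shell equivalents, one index per searchable field:
//   db.products.createIndex({ name: 1 })
//   db.products.createIndex({ manufacturer: 1 })
const nameIndex  = { name: 1 };
const makerIndex = { manufacturer: 1 };

// The trade-off from the answer: each extra index speeds matching reads but
// adds work to every insert/update, so create only what your queries need.
const byMaker = { manufacturer: "Lenovo" };  // would be served by makerIndex
```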
I suggest starting with one collection. It is much simpler to search through one collection than through a collection per product. In the future, if your queries against the collection become slow, you can start thinking about how to speed them up.
MongoDB has fairly fast indexes and was designed to be scalable, so once you need to scale your database, replica sets and auto-sharding are in place.