Which mongo document schema/structure is correct? - mongodb

I have two document formats which I can't decide is the mongo way of doing things. Are the two examples equivalent? The idea is to search by userId and have userId be indexed. It seems to me the performance will be equal for either schemas.
multiple bookmarks as separate documents in a collection:
{
userId: 123,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
userId: 123,
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
},
{
userId: 456,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
multiple bookmarks within one document per user.
{
userId: 123,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
}
]
},
{
userId: 456,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
]
}

The problem with the second option is that it causes growing documents. Growing documents are bad for write performance, because the database will have to constantly move them around the database files.
To improve write performance, MongoDB always writes each document as a consecutive sequence to the database files with little padding between each document. When a document is changed and the change results in the document growing beyond the current padding, the document needs to be deleted and moved to the end of the current file. This is a quite slow operation.
Also, MongoDB has a hardcoded limit of 16MB per document (mostly to discourage growing documents). In your illustrated use-case this might not be a problem, but I assume that this is just a simplified example and your actual data will have a lot more fields per bookmark entry. When you store a lot of meta-data with each entry, that 16MB limit could become a problem.
So I would recommend you to pick the first option.

I would go with the option 2 - multiple bookmarks within one document per user because this schema would take advantage of MongoDB’s rich documents also known as “denormalized” models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations. Link

There are two tools that allow applications to represent these
relationships: references and embedded documents.
When designing data models, always consider the application usage of
the data (i.e. queries, updates, and processing of the data) as well
as the inherent structure of the data itself.
The Second type of structure represents an Embedded type.
Generally Embedded type structure should be chosen when our application needs:
a) better performance for read operations.
b) the ability to request and retrieve
related data in a single database operation.
c) Data Consistency, to update related data in a single atomic write operation.
In MongoDB, operations are atomic at the document level. No single
write operation can change more than one document. Operations that
modify more than a single document in a collection still operate on
one document at a time. Ensure that your application stores all fields
with atomic dependency requirements in the same document. If the
application can tolerate non-atomic updates for two pieces of data,
you can store these data in separate documents. A data model that
embeds related data in a single document facilitates these kinds of
atomic operations.
d) to issue fewer queries and updates to complete common operations.
When not to choose:
Embedding related data in documents may lead to situations where
documents grow after creation. Document growth can impact write
performance and lead to data fragmentation. (limit of 16MB per
document)
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation to be applied on all the documents.
minimum set of functions that would be required to get the aggregated results, $match,$group(with $push operator):
db.collection.aggregate([{$match:{"userId":123}},{$group:{"_id":"$userId","bookmarkNames":{$push:"$bookmarkName"},"bookMarkUrls:{$push:"$bookmarkUrl"}"}}])
or a find() which returns multiple documents to be iterated.
Wheras the Embedded type would allow us to fetch it using a $match in the find query.
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. We would view the first type as an unwinded form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection,
is normally used in case of logging. Where the log entries are huge and will have a TTL, time to live. The collections in that case, would be capped collections. Where documents would be automatically deleted after a particular period of time.
Bottomline, if your documents size would not grow beyond 16 MB at any particular time opt for the Embedded type. it would save developing effort as well.
See Also: MongoDB relationships: embed or reference?

Related

mongodb - combining multiple collections, but each individual query might not need all columns

Here is the scenario :
We have 2 tables (issues, anomalies) in BigQuery, which we plan to combine into a single document in MongoDB, since the 2 collections (issues, anomalies) is data about particular site.
[
{
"site": "abc",
"issues": {
--- issues data --
},
"anomalies": {
-- anomalies data --
}
}
]
There are some queries which require the 'issues' data, while others require 'anomalies' data.
In the future, we might need to show 'issues' & 'anomalies' data together, which is the reason why i'm planning to combine the two in a single document.
Questions on the approach above, wrt performance/volume of data read:
When we read the combined document, is there a way to read only specific columns (so the data volume read is not huge) ?
Or does this mean that when we read the document, the entire document is loaded in memory ?
Pls let me know.
tia!
UPDATE :
going over the mongoDB docs, we can use projections to pull only the required data from mongoDB documents.
Also, in this case - the data that is transferred over the network is only the data the specific fields that is read.
However the mongoDB server will still have to select the specific fields from the documents.

Mongodb Index or not to index

quick question on whether to index or not. There are frequent queries to a collection that looks for a specific 'user_id' within an array of a doc. See below -
_id:"bQddff44SF9SC99xRu",
participants:
[
{
type:"client",
user_id:"mi7x5Yphuiiyevf5",
screen_name:"Bob",
active:false
},
{
type:"agent",
user_id:"rgcy6hXT6hJSr8czX",
screen_name:"Harry",
active:false
}
]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured you data.

mongodb- schema design

I am using mongodb as my backend. I have data for movies, music, books and more which I am storing in one single collection. The compulsory fields for every bson entry are "_id", "name", "category". Rest of the fields depend upon the category to which the entry belongs.
For example, I have a movie record stored like.
{
"_id": <some_id>,
"name": <movie_name>,
"category": "movie",
"director": <director_name>,
"actors": <list_of_actors>,
"genre": <list_of_genre>
}
For music, I have,
{
"_id": <some_id>,
"name": <movie_name>,
"category": "music"
"record_label": <label_name>
"length": <length>
"lyrics": <lyrics>
}
Now I have 12 different categories for which only _id, name and category are common fields. Rest the fields are all different for different categories. Is my decision to store all data in one single collection fine or should I make different collections per category.
A single collection is best if you're searching across categories. Having the single collection might slow performance on inserts, but if you don't have a high write need, that shouldn't matter.
MongoDB allows you to store any field structure in a document even if every document is different, so that isn't a concern. By having those 3 consistent fields then you can use those as part of the index and to handle your queries. This is a good example of where a schemaless database helps because you can store everything in a single collection.
There is no performance hit for using a single collection in this way. Indeed, there is actually a benefit because you can shard the collection as a scaling strategy later. Sharding is done on a collection level so you could shard based on the _id field to have them evenly distributed, or use your category field to have certain categories per shard, or even a combination.
One thing to be aware of is future query requirements. If you do need to index the other fields then you can use sparse indexes which mean that documents without the indexed fields won't be in the index, so won't take any space in the index; a handy optimisation.
You should also be aware of growing the documents if you made updates. This does have a major performance impact.

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging different actions users make on our website. Each action can be of different type : a comment, a search query, a page view, a vote etc... Each of these types has its own schema and common infos. For instance :
comment : {"_id":(mongoId), "type":"comment", "date":4/7/2012,
"user":"Franck", "text":"This is a sample comment"}
search : {"_id":(mongoId), "type":"search", "date":4/6/2012,
"user":"Franck", "query":"mongodb"} etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDb is schema less, I'm inclined to set up a unique collection ("Actions") where I would store these objects instead of multiple collections (collection Actions + collection Comments with a link key to its parent Action etc...).
My question is : what about performance / response time if I try to search by specific columns ?
As I understand indexing best practices, if I want "every users searching for mongodb", I would index columns "type" + "query". But it will not concern the whole set of data, only those of type "search".
Will MongoDb engine scan the whole table or merely focus on data having this specific schema ?
If you create sparse indexes mongo will ignore any rows that don't have the key. Though there is the specific limitation of sparse indexes that they can only index one field.
However, if you are only going to query using common fields there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs - there's no reason to create multiple collections
Tip: use date objects for dates, use object ids not names where appropriate.
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data
for a record is stored in a single document the entire record can be
retrieved in a single seek operation, which is very efficient. In some
cases it may not be practical to store all data in a single document,
or it may negatively impact other operations. Make the trade-offs that
are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most
documents are a few kilobytes or less. Consider documents more like
rows in a table than the tables themselves. Rather than maintaining
lists of records in a single document, instead make each record a
document. For large media documents, such as video, consider using
GridFS, a convention implemented by all the drivers that stores the
binary data across many smaller documents.

Mongodb data storage performance - one doc with items in array vs multiple docs per item

I have statistical data in a Mongodb collection saved for each record per day.
For example my collection looks roughly like
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance, is this strategy superior than storing one document per record_id containing an array of statistical data just like the above dict:
{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}
On the pro side I will need one query to get the entire stat history of a record without resorting to the slower map-reduce method, and on the con side I'll have to sum up the stats for a given date range in my application code and if a record outgrows is current padding size-wise there's some disc reallocation that will go on.
I think this depends on the usage scenario. If the data set for a single aggregation is small like those 700 records and you want to do this in real-time, I think it's best to choose yet another option and query all individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain and it does not suffer from reallocation or size limits. Index use should be efficient and connection-wise, I doubt there's much of a difference: most drivers batch transfers anyway.
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per records would go down in the subdocument approach. It's also generally easier to work with db documents rather than subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.
I think you can reference to here, and also see foursquare how to solve this kind of problem here . They are both valuable.