JSON Schema with dynamic key field in MongoDB - mongodb

Want to have a i18n support for objects stored in mongodb collection
currently our schema is like:
{
_id: "id"
name: "name"
localization: [{
lan: "en-US",
name: "name_in_english"
}, {
lan: "zh-TW",
name: "name_in_traditional_chinese"
}]
}
but my thought is that field "lan" is unique, can I just use this field as a key, so the structure would be
{
_id: "id"
name: "name"
localization: {
"en-US": "name_in_english",
"zh-TW": "name_in_traditional_chinese"
}
}
which would be neater and easier to parse (just localization[language] would get the value i want for specific language).
But then the question is: Is this a good practice in storing data in MongoDB? And how to pass the json-schema check?

It is not a good practice to have values as keys. The language codes are values and as you say you can not validate them against a schema. It makes querying against it impossible. For example, you can't figure out if you have a language translation for "nl-NL" as you can't compare against keys and neither is it possible to easily index this. You should always have descriptive keys.
However, as you say, having the languages as keys makes it a lot easier to pull the data out as you can just access it by ['nl-NL'] (or whatever your language's syntax is).
I would suggest an alternative schema:
{
your_id: "id_for_name"
lan: "en-US",
name: "name_in_english"
}
{
your_id: "id_for_name"
lan: "zh-TW",
name: "name_in_traditional_chinese"
}
Now you can :
set an index on { your_id: 1, lan: 1 } for speedy lookups
query for each translation individually and just get that translation:
db.so.find( { your_id: "id_for_name", lan: 'en-US' } )
query for all the versions for each id using this same index:
db.so.find( { your_id: "id_for_name" } )
and also much easier update the translation for a specific language:
db.so.update(
{ your_id: "id_for_name", lan: 'en-US' },
{ $set: { name: "ooga" } }
)
Neither of those points are possible with your suggested schemas.

Obviously the second schema example is much better for your task (of course, if lan field is unique as you mentioned, that seems true to me also).
Getting element from dictionary/associated array/mapping/whatever_it_is_called_in_your_language is much cheaper than scanning whole array of values (and in current case it's also much efficient from the storage size point of view (remember that all fields are stored in MongoDB as-is, so every record holds the whole key name for json field, not it's representation or index or whatever).
My experience shows that MongoDB is mature enough to be used as a main storage for your application, even on high-loads (whatever it means ;) ), and the main problem is how you fight database-level locks (well, we'll wait for promised table-level locks, it'll fasten MongoDB I hope a lot more), though data loss is possible if your MongoDB cluster is built badly (dig into docs and articles over Internet for more information).
As for schema check, you must do it by means of your programming language on application side before inserting records, yeah, that's why Mongo is called schemaless.

There is a case where an object is necessarily better than an array: supporting upserts into a set. For example, if you want to update an item having name 'item1' to have val 100, or insert such an item if one doesn't exist, all in one atomic operation. With an array, you'd have to do one of two operations. Given a schema like
{ _id: 'some-id', itemSet: [ { name: 'an-item', val: 123 } ] }
you'd have commands
// Update:
db.coll.update(
{ _id: id, 'itemSet.name': 'item1' },
{ $set: { 'itemSet.$.val': 100 } }
);
// Insert:
db.coll.update(
{ _id: id, 'itemSet.name': { $ne: 'item1' } },
{ $addToSet: { 'itemSet': { name: 'item1', val: 100 } } }
);
You'd have to query first to know which is needed in advance, which can exacerbate race conditions unless you implement some versioning. With an object, you can simply do
db.coll.update({
{ _id: id },
{ $set: { 'itemSet.name': 'item1', 'itemSet.val': 100 } }
});
If this is a use case you have, then you should go with the object approach. One drawback is that querying for a specific name requires scanning. If that is also needed, you can add a separate array specifically for indexing. This is a trade-off with MongoDB. Upserts would become
db.coll.update({
{ _id: id },
{
$set: { 'itemSet.name': 'item1', 'itemSet.val': 100 },
$addToSet: { itemNames: 'item1' }
}
});
and the query would then simply be
db.coll.find({ itemNames: 'item1' })
(Note: the $ positional operator does not support array upserts.)

Related

Mongo Oplog - Extracting Specifics of Updates

Say I have an entry in the inventory collection that looks like
{ _id: 1, item: "polarizing_filter", tags: [ "electronics", "camera" ]}
and I issue the command
db.inventory.update(
{ _id: 1 },
{ $addToSet: { tags: "accessories" } }
)
I have an oplog tailer, and would like to know that, specifically, "accessories" has been added to this document's tags field. As far as I can tell, the oplog always normalizes commands to use $set and $unset to maintain idempotency. In this case, the field of the entry describing the update would show something like
{$set : { tags : ["electronics", "camera", "accessories"] } }
which makes it impossible to know which tags were actually added by this update. Is there anyway to do this? I'm also curious about the analogous case in which fields are modified through deletion, e.g. through $pull. Solutions outside of the realm of an oplog tailer are welcome, as well as pointers to documentation of this command normalization process - I can't find it.
Thanks!

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
_id : "Tree",
id: "18955875",
linksFrom: " [
{
name: "Forest",
count: 6
},
[...]
],
categories: [
"Trees",
"Forest_ecology"
[...]
]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
{
aggregate: "articles",
pipeline : [
{
$unwind: "$linksFrom"
},
{
$sort: { "linksFrom.count": -1 }
},
{
$project:
{
name: "$_id",
linksFrom: "$linksFrom"
}
},
{
$group:
{
_id: "$linksFrom.name",
linksTo: { $push: { name: "$name", count: { $sum : "$linksFrom.count" } } },
}
},
{
$out: "TEMPORARY"
}
] ,
allowDiskUse: true
}
)
However, on a large dataset being the english Wikipedia I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles, which link to United_States article and the corresponding document's size grows above 16MB, currently almost 24MB. Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationship between articles is stored with IDs rather than long names but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and then I can immiedately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for label with a given name.
Get all articles it might
point to.
Compare the incoming and outcoming links for these
articles.
This is where the main question comes. I thought that it's better if I store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me, that - as long as the lookups are indexed - the overall performance might be the same for one big document or many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongoshell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Lets begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link to?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link to?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to document the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTocould occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have use the following statements for creating a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks =[]
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that in case you already have answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 of use the .count()method. The aggregation again is simple
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields is probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do no always ask for the first field. So this is probably as optimized as it can get.
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to get the questions answered you have to the data.

How to apply constraints in MongoDB?

I have started using MongoDB and I am fairly new to it.
Is there any way by which I can apply constraints on documents in MongoDB?
Like specifying a primary key or taking an attribute as unique?
Or specifying that a particular attribute is greater than a minimum value?
MongoDB 3.2 Update
Document validation is now supported natively by MongoDB.
Example from the documentation:
db.createCollection( "contacts",
{ validator: { $or:
[
{ phone: { $type: "string" } },
{ email: { $regex: /#mongodb\.com$/ } },
{ status: { $in: [ "Unknown", "Incomplete" ] } }
]
}
} )
Original answer
To go beyond the uniqueness constraint available natively in indexes, you need to use something like Mongoose and its ability to support field-based validation. That will give you support for things like minimum value, but only when updates go through your Mongoose schemas/models.
Being a "schemaless" database, some of the things you mention must be constrained from the application side, rather than the db side. (such as "minimum value")
However, you can create indexes (keys to query on--remember that a query can only use one index at a time, so it's generally better to design your indexes around your queries, rather than just index each field you might query against):
http://www.mongodb.org/display/DOCS/Indexes#Indexes-Basics
And you can also create unique indexes, which will enforce uniqueness similar to a unique constraint (it does have some caveats, such as with array fields):
http://www.mongodb.org/display/DOCS/Indexes#Indexes-unique%3Atrue

$unwind an object in aggregation framework

In the MongoDB aggregation framework, I was hoping to use the $unwind operator on an object (ie. a JSON collection). Doesn't look like this is possible, is there a workaround? Are there plans to implement this?
For example, take the article collection from the aggregation documentation . Suppose there is an additional field "ratings" that is a map from user -> rating. Could you calculate the average rating for each user?
Other than this, I'm quite pleased with the aggregation framework.
Update: here's a simplified version of my JSON collection per request. I'm storing genomic data. I can't really make genotypes an array, because the most common lookup is to get the genotype for a random person.
variants: [
{
name: 'variant1',
genotypes: {
person1: 2,
person2: 5,
person3: 7,
}
},
{
name: 'variant2',
genotypes: {
person1: 3,
person2: 3,
person3: 2,
}
}
]
It is not possible to do the type of computation you are describing with the aggregation framework - and it's not because there is no $unwind method for non-arrays. Even if the person:value objects were documents in an array, $unwind would not help.
The "group by" functionality (whether in MongoDB or in any relational database) is done on the value of a field or column. We group by value of field and sum/average/etc based on the value of another field.
Simple example is a variant of what you suggest, ratings field added to the example article collection, but not as a map from user to rating but as an array like this:
{ title : title of article", ...
ratings: [
{ voter: "user1", score: 5 },
{ voter: "user2", score: 8 },
{ voter: "user3", score: 7 }
]
}
Now you can aggregate this with:
[ {$unwind: "$ratings"},
{$group : {_id : "$ratings.voter", averageScore: {$avg:"$ratings.score"} } }
]
But this example structured as you describe it would look like this:
{ title : title of article", ...
ratings: {
user1: 5,
user2: 8,
user3: 7
}
}
or even this:
{ title : title of article", ...
ratings: [
{ user1: 5 },
{ user2: 8 },
{ user3: 7 }
]
}
Even if you could $unwind this, there is nothing to aggregate on here. Unless you know the complete list of all possible keys (users) you cannot do much with this. [*]
An analogous relational DB schema to what you have would be:
CREATE TABLE T (
user1: integer,
user2: integer,
user3: integer
...
);
That's not what would be done, instead we would do this:
CREATE TABLE T (
username: varchar(32),
score: integer
);
and now we aggregate using SQL:
select username, avg(score) from T group by username;
There is an enhancement request for MongoDB that may allow you to do this in the aggregation framework in the future - the ability to project values to keys to vice versa. Meanwhile, there is always map/reduce.
[*] There is a complicated way to do this if you know all unique keys (you can find all unique keys with a method similar to this) but if you know all the keys you may as well just run a sequence of queries of the form db.articles.find({"ratings.user1":{$exists:true}},{_id:0,"ratings.user1":1}) for each userX which will return all their ratings and you can sum and average them simply enough rather than do a very complex projection the aggregation framework would require.
Since 3.4.4, you can transform object to array using $objectToArray
See:
https://docs.mongodb.com/manual/reference/operator/aggregation/objectToArray/
This is an old question, but I've run across a tidbit of information through trial and error that people may find useful.
It's actually possible to unwind on a dummy value by fooling the parser this way:
db.Opportunity.aggregate(
{ $project: {
Field1: 1, Field2: 1, Field3: 1,
DummyUnwindField: { $ifNull: [null, [1.0]] }
}
},
{ $unwind: "$DummyUnwindField" }
);
This will produce 1 row per document, regardless of whether or not the value exists. You may be able tinker with this to generate the results you want. I had hoped to combine this with multiple $unwinds to (sort of like emit() in map/reduce), but alas, the last $unwind wins or they combine as an intersection rather than union which makes it impossible to achieve the results I was looking for. I am sadly disappointed with the aggregate framework functionality as it doesn't fit the one use case I was hoping to use it for (and seems strangely like a lot of the questions on StackOverflow in this area are asking) - ordering results based on match rate. Improving the poor map reduce performance would have made this entire feature unnecessary.
This is what I found & extended.
Lets create experimental database in mongo
db.copyDatabase('livedb' , 'experimentdb')
Now Use experimentdb & convert Array to object in your experimentcollection
db.getCollection('experimentcollection').find({}).forEach(function(e){
if(e.store){
e.ratings = [e.ratings]; //Objects name to be converted to array eg:ratings
db.experimentcollection.save(e);
}
})
Some nerdy js code to convert json to flat object
var flatArray = [];
var data = db.experimentcollection.find().toArray();
for (var index = 0; index < data.length; index++) {
var flatObject = {};
for (var prop in data[index]) {
var value = data[index][prop];
if (Array.isArray(value) && prop === 'ratings') {
for (var i = 0; i < value.length; i++) {
for (var inProp in value[i]) {
flatObject[inProp] = value[i][inProp];
}
}
}else{
flatObject[prop] = value;
}
}
flatArray.push(flatObject);
}
printjson(flatArray);

Ways to implement data versioning in MongoDB

Can you share your thoughts how would you implement data versioning in MongoDB. (I've asked similar question regarding Cassandra. If you have any thoughts which db is better for that please share)
Suppose that I need to version records in an simple address book. (Address book records are stored as flat json objects). I expect that the history:
will be used infrequently
will be used all at once to present it in a "time machine" fashion
there won't be more versions than few hundred to a single record.
history won't expire.
I'm considering the following approaches:
Create a new object collection to store history of records or changes to the records. It would store one object per version with a reference to the address book entry. Such records would looks as follows:
{
'_id': 'new id',
'user': user_id,
'timestamp': timestamp,
'address_book_id': 'id of the address book record'
'old_record': {'first_name': 'Jon', 'last_name':'Doe' ...}
}
This approach can be modified to store an array of versions per document. But this seems to be slower approach without any advantages.
Store versions as serialized (JSON) object attached to address book entries. I'm not sure how to attach such objects to MongoDB documents. Perhaps as an array of strings.
(Modelled after Simple Document Versioning with CouchDB)
The first big question when diving in to this is "how do you want to store changesets"?
Diffs?
Whole record copies?
My personal approach would be to store diffs. Because the display of these diffs is really a special action, I would put the diffs in a different "history" collection.
I would use the different collection to save memory space. You generally don't want a full history for a simple query. So by keeping the history out of the object you can also keep it out of the commonly accessed memory when that data is queried.
To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:
{
_id : "id of address book record",
changes : {
1234567 : { "city" : "Omaha", "state" : "Nebraska" },
1234568 : { "city" : "Kansas City", "state" : "Missouri" }
}
}
To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save() method to make this change at the same time.
UPDATE: 2015-10
It looks like there is now a spec for handling JSON diffs. This seems like a more robust way to store the diffs / changes.
There is a versioning scheme called "Vermongo" which addresses some aspects which haven't been dealt with in the other replies.
One of these issues is concurrent updates, another one is deleting documents.
Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.
https://github.com/thiloplanz/v7files/wiki/Vermongo
Here's another solution using a single document for the current version and all old versions:
{
_id: ObjectId("..."),
data: [
{ vid: 1, content: "foo" },
{ vid: 2, content: "bar" }
]
}
data contains all versions. The data array is ordered, new versions will only get $pushed to the end of the array. data.vid is the version id, which is an incrementing number.
Get the most recent version:
find(
{ "_id":ObjectId("...") },
{ "data":{ $slice:-1 } }
)
Get a specific version by vid:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } } }
)
Return only specified fields:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } }, "data.content":1 }
)
Insert new version: (and prevent concurrent insert/update)
update(
{
"_id":ObjectId("..."),
$and:[
{ "data.vid":{ $not:{ $gt:2 } } },
{ "data.vid":2 }
]
},
{ $push:{ "data":{ "vid":3, "content":"baz" } } }
)
2 is the vid of the current most recent version and 3 is the new version getting inserted. Because you need the most recent version's vid, it's easy to do get the next version's vid: nextVID = oldVID + 1.
The $and condition will ensure, that 2 is the latest vid.
This way there's no need for a unique index, but the application logic has to take care of incrementing the vid on insert.
Remove a specific version:
update(
{ "_id":ObjectId("...") },
{ $pull:{ "data":{ "vid":2 } } }
)
That's it!
(remember the 16MB per document limit)
If you're looking for a ready-to-roll solution -
Mongoid has built in simple versioning
http://mongoid.org/en/mongoid/docs/extras.html#versioning
mongoid-history is a Ruby plugin that provides a significantly more complicated solution with auditing, undo and redo
https://github.com/aq1018/mongoid-history
I worked through this solution that accommodates a published, draft and historical versions of the data:
{
published: {},
draft: {},
history: {
"1" : {
metadata: <value>,
document: {}
},
...
}
}
I explain the model further here: http://software.danielwatrous.com/representing-revision-data-in-mongodb/
For those that may implement something like this in Java, here's an example:
http://software.danielwatrous.com/using-java-to-work-with-versioned-data/
Including all the code that you can fork, if you like
https://github.com/dwatrous/mongodb-revision-objects
If you are using mongoose, I have found the following plugin to be a useful implementation of the JSON Patch format
mongoose-patch-history
Another option is to use mongoose-history plugin.
let mongoose = require('mongoose');
let mongooseHistory = require('mongoose-history');
let Schema = mongoose.Schema;
let MySchema = Post = new Schema({
title: String,
status: Boolean
});
MySchema.plugin(mongooseHistory);
// The plugin will automatically create a new collection with the schema name + "_history".
// In this case, collection with name "my_schema_history" will be created.
I have used the below package for a meteor/MongoDB project, and it works well, the main advantage is that it stores history/revisions within an array in the same document, hence no need for an additional publications or middleware to access change-history. It can support a limited number of previous versions (ex. last ten versions), it also supports change-concatenation (so all changes happened within a specific period will be covered by one revision).
nicklozon/meteor-collection-revisions
Another sound option is to use Meteor Vermongo (here)