MongoDB Complex Record Performance

I have a large collection of 300K very large, complex records. The aggregation framework seems to struggle when working with nested arrays and objects. If I wanted to set up views with simplified records, is it better to have many records of very basic data that together comprise a single complex record, or a single flat record with hundreds of dynamic field names?
Here is an example:
var obj = {
    orderID: 1,
    items: [
        { sku: "abc123", price: "$1.21" },
        { sku: "aaa111", price: "$2.21" },
        { sku: "bbb222", price: "$3.21" }
    ]
};
var vertical = [ // each item is a record in the collection
    { orderID: 1, sku: "abc123", price: "$1.21" },
    { orderID: 1, sku: "aaa111", price: "$2.21" },
    { orderID: 1, sku: "bbb222", price: "$3.21" }
];
var flat = { // a single record in the collection
    orderID: 1,
    abc123: "$1.21",
    aaa111: "$2.21",
    bbb222: "$3.21"
};
In my case, each record includes anywhere from 200 to 1,000 items in its arrays, so think of this scaled up to the extreme.
With vertical, I have well-defined (indexable) fields and the records are small, but in my DB it would generate around 50M of these bite-sized records.
With flat, I only have to pull a single record, and as long as I know which fields I am looking for, it can work. But the fields are dynamic and not indexable.
Given that the only purpose of this is reporting, what is the best way to go?
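For reference, on MongoDB 3.4+ the vertical shape could also be exposed as a read-only view over the original collection instead of being materialized. A minimal sketch, assuming an orders collection shaped like obj above (the collection and view names are illustrative):

// expose the nested items as flat records without duplicating data;
// the view is computed at read time from the underlying collection
db.createView("orderItems", "orders", [
    { $unwind: "$items" },
    { $project: { _id: 0, orderID: 1, sku: "$items.sku", price: "$items.price" } }
]);

db.orderItems.find({ sku: "abc123" });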

Related

Calculating collection stats for a subset of documents in MongoDB

I know the cardinal rule of SE is to not ask a question without giving examples of what you've already tried, but in this case I can't find where to begin. I've looked at the documentation for MongoDB and it looks like there are only two ways to calculate storage usage:
db.collection.stats() returns statistics about the entire collection. In my case I need to know the amount of storage being used by a subset of data within a collection (the data for a particular user).
Object.bsonsize(<document>) returns the storage size of a single record, which would require a cursor function to calculate the size of each document, one at a time. My only concern with this approach is performance with large amounts of data. If a single user has tens of thousands of documents, this process could take too long.
Does anyone know of a way to calculate the aggregate document size of a set of records within a collection efficiently and accurately?
Thanks for the help.
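For reference, the cursor-based approach described above would look roughly like this in the shell (a sketch; the collection name and userId filter are illustrative):

var total = 0;
db.mycollection.find({ "userId": userId }).forEach(function (doc) {
    // Object.bsonsize reports the BSON size of a single document in bytes
    total += Object.bsonsize(doc);
});
print(total);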
This may not be the most efficient or accurate way to do it, but I ended up using a Mongoose plugin to get the size of the JSON representation of the document before it's saved:
var mongoose = require('mongoose');

module.exports = exports = function defaultPlugin(schema, options) {
    schema.add({
        userId: { type: mongoose.Schema.Types.ObjectId, ref: "User", required: true },
        recordSize: Number
    });

    schema.pre('save', function(next) {
        // store the size of the JSON representation on the document itself
        this.recordSize = JSON.stringify(this).length;
        next();
    });
};
This will convert the schema object to a JSON representation, get its length, then store the size in the document itself. I understand that this adds a tiny bit of extra storage to record the size, but it's the best I could come up with.
Then, to generate a storage report, I'm using a simple aggregate call to get the sum of all of the recordSize values in the collection, filtered by userId:
mongoose.model('YourCollectionName').aggregate([
    {
        $match: { userId: userId }
    },
    {
        $group: {
            _id: null,
            recordSize: { $sum: '$recordSize' },
            recordCount: { $sum: 1 }
        }
    }
], function (err, results) {
    // do something with your results
});
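As a side note, on newer servers (MongoDB 4.4 and up) the $bsonSize aggregation operator can compute the true BSON size server-side, without storing a size field at all. A sketch, using the same userId filter:

// MongoDB 4.4+ only: sum the BSON size of each matching document
db.yourcollectionname.aggregate([
    { $match: { userId: userId } },
    {
        $group: {
            _id: null,
            totalSize: { $sum: { $bsonSize: "$$ROOT" } },
            recordCount: { $sum: 1 }
        }
    }
]);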

How to store an ordered set of documents in MongoDB without using a capped collection

What's a good way to store a set of documents in MongoDB where order is important? I need to easily insert documents at an arbitrary position and possibly reorder them later.
I could assign each item an increasing number and sort by that, or I could sort by _id, but I don't know how I could then insert another document in between other documents. Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
My first guess would be to increment the sequence of all of the following elements so that there would be space for the new element, using a query something like db.items.update({"sequence": {$gte: 6}}, {$inc: {"sequence": 1}}, {multi: true}). My limited understanding of database administration tells me that a query like that would be slow and generally a bad idea, but I'm happy to be corrected.
I guess I could set the new element's sequence to 5.5, but I think that would get messy rather quickly. (Again, correct me if I'm wrong.)
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet again, I might be wrong about that one too.)
I could have each document contain a reference to the next document, but that would require a query for each item in the list. (You'd get an item, push it onto the results array, and get another item based on the next field of the current item.) Aside from the obvious performance issues, I would also not be able to pass a sorted mongo cursor to my {#each} spacebars block expression and let it live update as the database changed. (I'm using the Meteor full-stack javascript framework.)
I know that everything has its advantages and disadvantages, and I might just have to use one of the options listed above, but I'd like to know if there is a better way to do things.
Based on your requirements, one approach could be to design your schema such that each document can hold more than one document and in itself acts as a capped container.
{
    "_id": Number,
    "doc": Array
}
Each document in the collection will act as a capped container, and the documents will be stored as an array in the doc field. The doc field, being an array, will maintain the order of insertion.
You can limit the number of documents to n, so the _id field of each container document increases in steps of n, where n is the number of documents a container document can hold.
By doing this you avoid adding extra fields to the documents, extra indexes, and unnecessary sorts.
Inserting the very first record
i.e. when the collection is empty.
var record = {"name" : "first"};
db.col.insert({"_id":0,"doc":[record]});
Inserting subsequent records
Identify the last container document's _id and the number of documents it holds. If the number of documents it holds is less than n, update that container document with the new document; otherwise create a new container document.
Say that each container document can hold at most 5 documents, and we want to insert a new document.
var record = {"name" : "newlyAdded"};
// using aggregation, get the _id of the last inserted container and the
// number of records it currently holds
db.col.aggregate([ {
    $group : {
        "_id" : null,
        "max" : { $max : "$_id" },
        // note: $last relies on the containers' natural (insertion) order
        "lastDocSize" : { $last : "$doc" }
    }
}, {
    $project : {
        "currentMaxId" : "$max",
        "capSize" : { $size : "$lastDocSize" },
        "_id" : 0
    }
} ]).forEach(function(check) {
    // once obtained, check whether to update the last container or
    // create a new container and insert the document into it
    if (check.capSize < 5) {
        print("updating");
        db.col.update({ "_id" : check.currentMaxId },
                      { $push : { "doc" : record } });
    } else {
        print("inserting");
        db.col.insert({ "_id" : check.currentMaxId + 5,
                        "doc" : [ record ] });
    }
});
Note that the aggregation runs on the server side and is very efficient. Also note that in versions prior to 2.6 the aggregation returns a single document rather than a cursor, so you would need to modify the above code to read from that document instead of iterating a cursor.
Inserting a new document in between documents
Now, if you would like to insert a new document between documents 1 and 2, we know that the document should fall inside the container with _id=0 and should be placed in the second position in the doc array of that container.
So we make use of the $each and $position operators to insert at a specific position.
var record = {"name" : "insertInMiddle"};

db.col.update({ "_id" : 0 }, {
    $push : {
        "doc" : {
            $each : [ record ],
            $position : 1
        }
    }
});
Handling Overflow
Now we need to take care of documents overflowing each container. Say we insert a new document in between, in the container with _id=0. If the container already has 5 documents, we need to move its last document to the next container, and continue doing so until all the containers hold documents within their capacity; if required, we finally create a new container to hold the overflowing documents.
This complex operation should be done on the server side. To handle it, we can create a script such as the one below and register it with MongoDB.
db.system.js.save({
    "_id" : "handleOverFlow",
    "value" : function handleOverFlow(id) {
        var container = db.col.findOne({ "_id" : id });
        if (container == null) return; // the container doesn't exist yet
        var currDocArr = container.doc;
        var count = currDocArr.length;
        var nextColId = id + 5;
        // nothing to do if the container is within capacity
        if (count <= 5) return;
        // take the last doc and push it to the front of the next capped
        // container's array
        print("updating container: " + id);
        var record = currDocArr.splice(currDocArr.length - 1, 1);
        // prepend to the next container, creating it if it doesn't exist
        // yet (upsert)
        db.col.update({ "_id" : nextColId }, {
            $push : {
                "doc" : {
                    $each : record,
                    $position : 0
                }
            }
        }, { upsert : true });
        // write the trimmed array back to the original container
        db.col.update({ "_id" : id }, { $set : { "doc" : currDocArr } });
        // check overflow for the subsequent containers, recursively
        handleOverFlow(nextColId);
    }
});
After every such in-between insertion, we can invoke this function by passing the container id: handleOverFlow(containerId).
Fetching all the records in order
Just use the $unwind operator in the aggregate pipeline.
db.col.aggregate([{$unwind:"$doc"},{$project:{"_id":0,"doc":1}}]);
Re-Ordering Documents
You can store each document in a capped container with an "_id" field:
.."doc":[{"_id":0,","name":"xyz",...}..]..
Get hold of the "doc" array of the capped container whose items you want to reorder.
var docArray = db.col.findOne({ "_id" : 0 }).doc;
Update their _ids so that after sorting, the order of the items changes.
Sort the array based on their _ids.
docArray.sort(function(a, b) {
    return a._id - b._id;
});
Update the capped container with the new doc array.
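For completeness, the write-back might look like this (a sketch, reusing the container _id from the example):

db.col.update({ "_id" : 0 }, { $set : { "doc" : docArray } });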
But then again, everything boils down to which approach is feasible and suits your requirement best.
Coming to your questions:
What's a good way to store a set of documents in MongoDB where order is important? I need to easily insert documents at an arbitrary position and possibly reorder them later.
Documents as Arrays.
Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
Use the $each and $position operators in the db.collection.update() function, as depicted in my answer.
My limited understanding of database administration tells me that a query like that would be slow and generally a bad idea, but I'm happy to be corrected.
Yes, it would impact performance, unless the collection has very little data.
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet again, I might be wrong about that one too.)
Yes. With Capped Collections, you may lose data.
An _id field in MongoDB is a unique, indexed key, similar to a primary key in relational databases. If there is an inherent order to your documents, ideally you should be able to associate a unique key with each document, with the key value reflecting the order. So while preparing your document for insertion, explicitly add an _id field as this key (if you do not, Mongo creates one automatically as a BSON ObjectId).
As far as retrieving the results is concerned, MongoDB does not guarantee the order of returned documents unless you explicitly use .sort(). If you do not use .sort(), the results are usually returned in natural order (order of insertion). Again, there is no guarantee of this behavior.
I'd advise you to override _id with your order while inserting, and use a sort while retrieving. Since _id is a necessary and auto-indexed entity, you will not be wasting any space defining a sort key and storing an index for it.
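A minimal sketch of that idea (the gap of 10 between keys is an illustrative choice, leaving room for later in-between inserts):

db.items.insert({ "_id" : 10, "name" : "first" });
db.items.insert({ "_id" : 20, "name" : "second" });
db.items.insert({ "_id" : 15, "name" : "between them" });
db.items.find().sort({ "_id" : 1 });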
For arbitrary sorting of any collection, you'll need a field to sort on. I call mine "sequence".
schema:
{
    _id: ObjectID,
    sequence: Number,
    ...
}

db.items.ensureIndex({ sequence: 1 });
db.items.find().sort({ sequence: 1 });
Here is a link to some general sorting database answers that may be relevant:
https://softwareengineering.stackexchange.com/questions/195308/storing-a-re-orderable-list-in-a-database/369754
I suggest going with the floating-point solution: adding a position field.
Use a floating-point number for the position field. You can then reorder the list by changing only the position value of the "moved" document.
If your user wants to position "red" after "blue" but before "yellow", then you just need to calculate:
red.position = ((yellow.position - blue.position) / 2) + blue.position
After a few repositionings in the same spot (cutting the gap in half every time) you may hit the limits of floating-point precision, so it's better to re-sort the list once you reach a certain threshold.
When retrieving, you can simply sort on the position field to get the list in order, with no need for any client-side code (as there would be with a linked-list solution).
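A minimal shell sketch of the midpoint insert, assuming a collection named items with a numeric position field (the document names are illustrative):

// fetch the two neighbours the new document should sit between
var blue = db.items.findOne({ "name" : "blue" });
var yellow = db.items.findOne({ "name" : "yellow" });

// the midpoint preserves relative order without touching any other document
var newPos = blue.position + (yellow.position - blue.position) / 2;
db.items.insert({ "name" : "red", "position" : newPos });

// reading back in order needs no client-side bookkeeping
db.items.find().sort({ "position" : 1 });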

MongoDB: How can I execute multiple dereferences in a single query?

There are several collections, i.e. Country, Province, City, Univ.
Just like in the real world, every country has several provinces, and every province has several cities, and every city has several universities.
How can I know whether a university is in a given country? For example, country0 may have some universities; what are their _ids?
Documents in those collections are showed below:
{
    _id: "country0",
    provinces: [
        { $ref: "Province", $id: "province0" },
        ...
    ]
}
{
    _id: "province0",
    belongsTo: { $ref: "Country", $id: "country0" },
    cities: [
        { $ref: "City", $id: "city0" },
        ...
    ]
}
{
    _id: "city0",
    belongsTo: { $ref: "Province", $id: "province0" },
    univs: [
        { $ref: "Univ", $id: "univ0" },
        ...
    ]
}
{
    _id: "univ0",
    address: { $ref: "City", $id: "city0" }
}
If there are only two collections, I know fetch() may be useful.
Also, Python drivers may be useful, but I can't gauge their performance, because I can't use db.system.profile in a .py file.
MongoDB doesn't do joins. N queries are required to get information from N collections. In this situation, to get the _id's of universities in a given country in an array one could do the following (in the mongo shell):
> var country = db.countries.findOne({ "_id": "country0" });
> var province_ids = [];
> country.provinces.forEach(function(province) { province_ids.push(province["$id"]); });
> var provinces = db.provinces.find({ "_id": { "$in": province_ids } });
> var city_ids = [];
> provinces.forEach(function(province) { province.cities.forEach(function(city) { city_ids.push(city["$id"]); }); });
> var cities = db.cities.find({ "_id": { "$in": city_ids } });
> var univ_ids = [];
> cities.forEach(function(city) { city.univs.forEach(function(univ) { univ_ids.push(univ["$id"]); }); });
It's also possible to accomplish the same thing using the belongsTo field, with similar steps. This is cumbersome, and it seems like there should be a better way. There is: denormalize the data. Countries have provinces, which have cities, which have universities, but the relationships are fixed and not of huge cardinality. For queries like "what universities are in a given country?" I would suggest storing province documents entirely within countries and university documents entirely within city documents. You could store cities inside province documents, or inside country documents directly, but a province or country could have hundreds or thousands of cities, and that might be too much information for one document (MongoDB has a 16MB per-document limit). Having provinces in countries and universities in cities reduces the number of queries necessary to two.
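A sketch of the resulting two-query lookup, under the assumption that provinces are embedded in country documents and universities in city documents (the field names are illustrative):

var country = db.countries.findOne({ "_id" : "country0" });             // query 1
var provinceIds = country.provinces.map(function (p) { return p._id; });
var univIds = [];
db.cities.find({ "province" : { "$in" : provinceIds } })                // query 2
    .forEach(function (city) {
        city.univs.forEach(function (u) { univIds.push(u._id); });
    });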
Another option is to store more information in each child document. Essentially you have a forest (a collection of trees): countries are parents of provinces, which are parents of cities, which are parents of universities. The belongsTo field is a parent reference. You could store a reference to all ancestors instead of just the parent. Then finding all universities in a given country is a single query on the universities collection.
> db.universities.findOne();
{
    _id: "univ0",
    city: "city0",
    province: "province0",
    country: "country0"
}
> db.universities.find({ "country": "country0" });
The schema design that is best for you depends on the types of queries your application will need to answer and their relative frequencies and importance. I can't determine that from your question so I can't firmly recommend one schema over another.
As to your mini-question about performance and db.system.profile: note that db.system.profile is a collection, so you can query it from a .py file using a driver.
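For example, in the shell (the same query works from any driver once profiling is enabled; the namespace filter is illustrative):

db.setProfilingLevel(2);  // profile all operations
db.system.profile.find({ "ns" : "test.universities" })
    .sort({ "ts" : -1 })
    .limit(5);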

How can I model my Meteor collection to feed three different reactive views

I am having some difficulty structuring my data so that I can benefit from reactivity in Meteor. Mainly, nesting arrays of objects makes queries tricky.
The three main views I am trying to project this data onto are:
waiter: shows the order for one table, each person's meal (items nested, essentially what I have below)
kitchen manager: columns of orders by table (only needs table, note, and the items)
cook: columns of items, by category where started=true (only need item info)
Currently I have a Meteor collection of order objects like this:
Order {
    table: "1",
    waiter: "neil",
    note: "note from kitchen",
    meals: [
        {
            seat: "1",
            items: [
                { n: "potato", category: "fryer", started: false },
                { n: "water", category: "drink" }
            ]
        },
        {
            seat: "2",
            items: [ { n: "water", category: "drink" } ]
        }
    ]
}
Is there any way to query inside the nested array and apply some projection, or do I need to look at an entirely different data model?
Assuming you're building this app for one restaurant, there shouldn't be many active orders at any given time—presumably you don't have thousands of tables. Even if you want to keep the orders in the database after they're served, you could add a field active to separate out the current ones.
Then your query is simple: activeOrders = Orders.find({active: true}).fetch(). The fetch returns an array, which you can loop through several times, once for each of your views, using nested if and for loops as necessary to dig down into the child objects. See also Underscore's _.pluck. You don't need to get everything right with one complicated Mongo query; in fact your app will run faster if you query once and process the results however many times you need to.
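As an illustration, deriving the cook's view from that array might look like this (a sketch; it assumes the Order shape from the question and Underscore's _.groupBy):

var activeOrders = Orders.find({ active: true }).fetch();

// collect every started item, remembering its table
var startedItems = [];
activeOrders.forEach(function (order) {
    order.meals.forEach(function (meal) {
        meal.items.forEach(function (item) {
            if (item.started) {
                startedItems.push({ table: order.table, item: item });
            }
        });
    });
});

// group into the cook's columns by category
var byCategory = _.groupBy(startedItems, function (e) { return e.item.category; });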

How do I do a 'not-in' operation in MongoDB?

I have two collections: shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and a person can be on the beach, shopping, doing both, or doing neither on any given day. I now want to query for all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL DBs. I saw similar questions around joins, and in most cases it was suggested to denormalize. So one solution I could think of is to create a collection, activity, with an index on date and the user's actions embedded. Something like:
{
    user_id,
    date,
    actions: [action_type, ...]
}
Insertion now becomes costly, as I will have to query before each insert.
A few suggestions:
Figure out all the queries you'll be running and all the types of data you will need to store. For example, do you expect to add activities in the future, or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only ever have these two activities: one record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When a new activity comes in, you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, and activity, then your inserts are much faster, but your queries now have to do a lot of work querying across users, dates, and activities.
With proposed schema, here is the insert/update statement:
db.coll.update({"user": "username", "date": "somedate"}, {$inc: {"shopped": 1}}, true)
What that says is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist" - an "upsert" (that's the last true argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date": "somedate", "beached": 0, "shopped": {$gt: 1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for an explanation of this, and keep in mind that large documents will keep getting into your working set; if they are huge and full of useless (old) data, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes for each user, where you track how many friends they have or other semi-stable information about them, and also have a user_activity collection where you add a record per user per day recording the activities they did. How much you normalize or denormalize your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are is the first suggestion I made.
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with an RDBMS, insertion can be (relatively) costly when there are indexes on the table (i.e., usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
    user_id,
    date,
    action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27); // months are 0-indexed: June 27, 2012
var end = new Date(2012, 6, 3);    // July 3, 2012

db.activities.find({
    date: { $gte: start, $lt: end },
    action: { $in: ["beach", "shopping"] }
});
The last step would be in your client driver: find user ids for which records exist for "shopping" but not for "beach" activities.
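That client-side step could be sketched like this (using distinct; the names match the query above):

var range = { $gte: start, $lt: end };
var shoppers = db.activities.distinct("user_id", { action: "shopping", date: range });
var beachGoers = db.activities.distinct("user_id", { action: "beach", date: range });

// shoppers who never appear among the beach-goers
var shoppedNotBeached = shoppers.filter(function (id) {
    return beachGoers.indexOf(id) === -1;
});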
One possible structure is to use an embedded array of documents (a users collection):
{
    user_id: 1234,
    actions: [
        { action_type: "beach", date: "6/1/2012" },
        { action_type: "shopping", date: "6/2/2012" }
    ]
},
{ /* another user */ }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);

db.people.find({
    actions: {
        $elemMatch: {
            action_type: { $in: ["shopping"] },
            date: { $gt: start }
        }
    }
});
Expanding on this, you can use the $and operator to find all people who went shopping but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);

// note: each clause of $and must be wrapped in its own object
db.people.find({
    $and: [
        {
            actions: {
                $elemMatch: {
                    action_type: { $in: ["shopping"] },
                    date: { $gt: start }
                }
            }
        },
        {
            actions: {
                $not: {
                    $elemMatch: {
                        action_type: { $in: ["beach"] },
                        date: { $gt: start }
                    }
                }
            }
        }
    ]
});