Server-side paging and grouping of a large dataset - MongoDB

I'll try to explain the issue as best I can. The task is to implement a grid with server-side paging. On a request for N entities, the DB should return a set of data that is grouped, or better said transformed, in such a way that the transformation phase results in those N entities.
The best way, as far as I can see, is something like this:
Query_all_data() => Result; (10000000 documents)
Transform(Result) => Transformed (100 groups)
Transformed.Skip(page * N).Take(N)
Transformation phase should be something like this:
Result = [d0, d1, d2..., dN]
Transformed = [
{ info: "foo", docs: [d0, d2, d21, d67, d100042] },
{ info: "bar", docs: [d3, d28, d121, d6271, d100042] },
{ info: "baz", docs: [d41, d26, d221, d567, d100043] },
{ info: "waz", docs: [d22, d24, d241, d167, d1000324] }
]
Every object in Transformed is an entity in grid.
I'm not sure if it's important, but the DB in question is MongoDB and all documents are stored in one collection. Now, the huge pitfall of this approach is that it's way too slow on a large dataset, which will most certainly be the case.
Is there a better approach? Maybe a different DB design?

@dakt, you can store your data in a couple of different ways, based on how you are going to use it. In the process it may also be useful to store data in de-normalized form, where some duplication of data may occur:
1. Store data as individual documents, as mentioned in your problem statement.
2. Store the data in the transformed format from your problem statement. It looks like you have a consistent way of mapping the docs to some tag. If so, why not maintain the documents so that they are always embedded under those tags? This certainly limits the number of docs a group can contain, based on the 16 MB document limit.
I would suggest looking at the MongoDB use cases - http://docs.mongodb.org/ecosystem/use-cases/ - and seeing if any of those are similar to what you are trying to achieve.
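If the tag that determines a document's group can be computed from the document itself, a third option is to push both the transformation and the paging into a single aggregation pipeline, so that only the requested page of groups ever leaves the server. A minimal sketch in the mongo shell (assuming MongoDB 2.6+ for $$ROOT, a hypothetical docs collection whose info field is the grouping key, and page/pageSize supplied by the grid):
var page = 2, pageSize = 10;
db.docs.aggregate([
    { $group: { _id: "$info", docs: { $push: "$$ROOT" } } }, // one group per tag
    { $sort: { _id: 1 } },                                   // paging needs a stable group order
    { $skip: page * pageSize },
    { $limit: pageSize }
]);
Note that the $group stage still touches every document on each request, so with 10000000 documents this may remain too slow (and may need { allowDiskUse: true }); pre-computing the groups into their own collection, as suggested above, avoids re-grouping on every request.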

Related

MongoDB Collection Structure

I have a collection named User.
A user has many pieces of information that can be grouped together, for example:
{address : {street:"xx", city:"xx"}}
{history: {school:"xx", job: "xx"}}
{..}
So I want to know what the best practice is.
1. First way: using nested fields:
{user: {
    address: {street: "xx", city: "xx"},
    history: {school: "xx", job: "xx"},
    ...
}}
2. Second way: Just put them all together.
{user: {
    street: "xx",
    city: "xx",
    ...
    school: "xx",
    job: "xx",
    ...
}}
The first way is obviously more readable for humans and makes it easier to find the relevant information.
What are the downsides to grouping/nesting data like in the first way?
Does it make querying of nested fields slower? Indexing issues? Any ideas?
If you store the extra data under the user, it enables faster reading and writing of the entire user document.
If you store the extra data in separate collections, it may enable faster find/update access to that data (depending on your indexes). MongoDB does support indexing fields inside arrays.
My suggestion is to list your common data-access use cases, create a test database with a LOT of mock data, then test performance using queries and aggregations to decide between the different storage-modelling options.
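For what it's worth, nesting does not block querying or indexing: dot notation reaches into subdocuments, and such paths can be indexed like any top-level field. A small sketch using the fields from the question:
// Query a nested field with dot notation.
db.users.find({ "address.city": "xx" });
// Index the nested path the same way as a top-level field.
db.users.createIndex({ "address.city": 1 });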

Where to store users' followings/followers? User's document or a Followings collection? Fat document vs. polymorphic documents

What I'm talking about is:
Meteor.users.findOne() =
{
    _id: "...",
    ...
    followers: {
        users: Array[], // ["someUserId1", "someUserId2"]
        pages: Array[]  // ["somePageId1", "somePageId2"]
    }
}
vs.
Followings.findOne() =
{
    _id: "...",
    followeeId: "...",
    followeeType: "user",
    followerId: "..."
}
I find the second one totally inefficient because I need to use smartPublish to publish a user's followers:
Meteor.smartPublish('userFollowers', function(userId) {
    var cursors = [],
        followings = Followings.find({followeeId: userId});
    followings.forEach(function(following) {
        cursors.push(Meteor.users.find({_id: following.followerId}));
    });
    return cursors;
});
And I can't filter users inside iron-router. I cache subscriptions, so there may be more users than I need.
I want to do something like this:
data: function() {
    return {
        users: Meteor.users.find({_id: {$in: Meteor.user().followers.users}})
    };
},
A bad thing about using nested arrays inside the document is that if I add an item to followers.users[], the whole array will be sent back to the client.
So what do you think? Is it better to keep such data inside the user document, so that it becomes fat? Maybe that's the 'Meteor way' of solving such problems.
I think it's a better idea to keep it nested inside the user document. Storing it in a separate collection leads to a lot of unnecessary duplication, and every time the publish function runs you have to scan the entire collection again. If you're worried about the arrays growing too large, in most cases, don't be (a full-text novel generally takes only a few hundred kB). Plus, if you're already publishing your user document, you don't have to pull any new documents into memory; you already have everything you need.
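For the embedded design, the publish function becomes a single cursor built from the array on the user document. A rough sketch, using the followers.users field from the question (note that a plain publish like this won't re-run if the array changes after subscription; that reactivity is what packages like the one linked below take care of):
Meteor.publish('userFollowers', function() {
    // Look up the embedded follower ids and publish those users in one $in query.
    var user = Meteor.users.findOne(this.userId);
    if (!user || !user.followers) {
        return this.ready();
    }
    return Meteor.users.find({_id: {$in: user.followers.users}});
});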
This MongoDB blog post seems to advocate a similar approach (see the one-to-many section). It might be worth checking out.
You seem to be aware of the pros and cons of each option. Unfortunately, your question is mostly opinion based.
Generally, if your follower arrays will be small in size and don't change often, keep them embedded.
Otherwise a dedicated collection is the way to go.
For that case, you might want to take a look at https://atmospherejs.com/cottz/publish which seems very efficient in what it does and very easy to implement syntactically.

Mongo Collections and Meteor Reactivity

I'm trying to decide on the best approach for an app I'm working on. In my app each user has a number of custom forms; for example, user X will have custom forms and user Y will have 5 different forms customized to their needs.
My idea is to create a MongoDB collection for each custom form. At the start I won't have too many users, and I understand the Mongo collection limit is around 24000 (I think, not sure). If that's correct, I'm OK for now.
But I think this might create issues down the line, and I'm also not sure it's the best approach for performance, management and so forth.
The other option is to create one collection, "forms", and add the custom data under an object field, like so:
{
    _id: "dfdfd34df4efdfdfdf",
    data: {}
}
My concerns with this are Meteor reactivity and scale.
First, I'm expecting each user to fill out each form at least 30 to 50 times per week, so I'm expecting the collection size to increase very fast, which makes me question this approach and lean towards the collection-per-form option, which breaks down the size.
My second concern, or question, is whether Meteor will be able to identify changes in the first-level and second-level objects, as I need the data to be reactive.
First level:
{
    _id: "dfdfd34df4efdfdfdf",
    data: {}
}
Second level:
{
    _id: "dfdfd34df4efdfdfdf",
    data: {
        object: {
            name: "Y", _id: "random id"
        }
    }
}
The answer is somewhat covered here: limits of number of collections in databases.
It's not a yes or no, but it's clear regarding the Mongo collection limit. As for Meteor reactivity, that's another topic.
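As an illustration of the single-collection option (all names here are made up), one submissions collection keyed by user and form type grows in documents rather than in collections, and reactive queries can be scoped per user and per form. Meteor's reactivity works at the document level, so a change nested under data is detected; the caveat is that DDP resends the changed top-level field, i.e. the whole data object, to subscribed clients:
FormSubmissions = new Mongo.Collection('formSubmissions');
// One document per filled-out form; the custom shape lives under `data`.
FormSubmissions.insert({
    userId: 'someUserId',
    formType: 'customFormA',
    submittedAt: new Date(),
    data: {name: 'Y', score: 10}
});
// A reactive cursor scoped to one user's submissions of one form type.
FormSubmissions.find({userId: 'someUserId', formType: 'customFormA'});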

Links vs References in Document databases

I am confused by the term 'link' for connecting documents.
The OrientDB page http://www.orientechnologies.com/orientdb-vs-mongodb/ states that OrientDB uses links to connect documents, while in MongoDB documents are embedded.
Since in MongoDB (http://docs.mongodb.org/manual/core/data-modeling-introduction/) documents can be referenced as well, I cannot see the difference between linking documents and referencing them.
The goal of document-oriented databases is to reduce "impedance mismatch": the degree to which data must be split up to match some sort of database schema, versus the actual objects residing in memory at runtime. By using a document, the entire object is serialized to disk without the need to split things up across multiple tables and join them back together when retrieved.
That being said, a linked document is the same as a referenced document; they are simply two ways of saying the same thing. How those links are resolved at query time varies from one database implementation to another.
An embedded document, on the other hand, is simply an object type that somehow relates to a parent type, stored inside the parent. For example, I have a class as follows:
class User
{
    string Name;
    List<Achievement> Achievements;
}
Where Achievement is an arbitrary class (its contents don't matter for this example).
If I were to save this using linked documents, I would save User in a Users collection and each Achievement in an Achievements collection, with the user's list of achievements being links to the Achievement objects in the Achievements collection. This requires some sort of joining procedure to happen in the database engine itself. However, if you use embedded documents, you would simply save User in a Users collection with the achievements inside each User document.
A JSON representation of the data for an embedded document would look (roughly) like this:
{
    "name": "John Q Taxpayer",
    "achievements":
    [
        {
            "name": "High Score",
            "point": 10000
        },
        {
            "name": "Low Score",
            "point": -10000
        }
    ]
}
Whereas a linked document might look something like this:
{
    "name": "John Q Taxpayer",
    "achievements":
    [
        "somelink1", "somelink2"
    ]
}
Inside an Achievements collection:
{
    "somelink1":
    {
        "name": "High Score",
        "point": 10000
    },
    "somelink2":
    {
        "name": "Low Score",
        "point": -10000
    }
}
Keep in mind these are just approximate representations.
So to summarize: linked documents function much like RDBMS PK/FK relationships. They allow multiple documents in one collection to reference a single document in another collection, which can help deduplicate stored data. However, they add a layer of complexity, requiring the database engine to make multiple disk I/O calls to assemble the final document returned to user code. An embedded document more closely matches the object in memory; this reduces impedance mismatch and (in theory) the number of disk I/O calls.
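In MongoDB, resolving such links is left to the application, and the usual pattern is two queries, with the second using $in over the stored link ids. A minimal shell sketch against the Users/Achievements example above (assuming the links are stored as _id values of documents in an achievements collection):
// Fetch the parent, then resolve its links with a second query.
var user = db.users.findOne({name: "John Q Taxpayer"});
var achievements = db.achievements.find({_id: {$in: user.achievements}}).toArray();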
You can read up on Impedance Mismatch here: http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch
UPDATE
I should add that choosing the right database for your needs is very important from the start. If you have a lot of questions about each database, it might make sense to contact each vendor and get some of their training material. MongoDB offers 2 free courses you can take to learn more about their product and its best uses at MongoDB University. OrientDB does offer training, though it is not free; it might be best to contact them directly about some sort of pre-sales training (if you are looking to license the DB), and usually they will put you in touch with a pre-sales consultant to help you evaluate their product.
MongoDB works like an RDBMS here: the object id acts like a foreign key, which means a "JOIN" that is expensive at run time. OrientDB, instead, has direct links that are created only once and have a very low run-time cost.

How to model very-many to very-many relations in MongoDB

I'm working on porting a database to MongoDB and have run into some problems with the document size limit.
My understanding is that if you're always going to view one entity in the context of another, embedding is the way to go.
However, the (genomic) data has so many entities of each type that even just storing the _id field in the embedded document puts me over the 16 MB size limit:
Genome
{
...
has_reactions:[id1, id2, ... idn] // Where n is really large
}
I've also tried modelling it the other way, but hit the same limitation:
Reaction
{
...
in_genomes:[id1, id2, ... idn] // Still really large
}
The MongoDB documentation gives great examples for one-to-one and one-to-many relations, but doesn't have much to say about many-to-many.
In traditional SQL, I'd model this with Genome, Reaction, and GenomeReaction tables. Is that the only way to go here as well?
Edit:
As for the background: a reaction is a metabolic reaction, though it doesn't really matter what genomes and reactions mean in this context. It could just as well be a relationship between the types of gaskets in each of my widgets. It's a standard many-to-many relationship where both instances of "many" can be a very large number.
I'm aware that Mongo doesn't allow joins, but that's easily solved by using multiple queries, which is the recommended way of handling document references in Mongo.
We haven't chosen Mongo as the solution; we're just evaluating it as a possible one. It looked attractive because it's billed as being able to handle "huMONGOus datasets", so I was a bit surprised by this limitation.
In all of our other use cases, Mongo has worked well. It's just this particular relationship that I'm unable to port from MySQL to Mongo without using a Genome, Reaction, and GenomeReaction set of collections. I can easily do that, but I was hoping there was a more mongo-y way to handle it.
Perhaps Mongo doesn't handle many-to-many relationships well, which would explain their conspicuous absence from the list of data-model scenarios in its docs.
After asking about this on the official MongoDB mailing list, I discovered that the recommended way to handle scenarios like this is either to use the three-collection mapping I mentioned in my original post, or to use a "hybrid schema design" where one of the collections is stored in buckets.
So you'd have something like:
// genomes collection
{
    _id: 1,
    genome_thingee: 'blah blah'
    ...
}
// reaction_buckets collection
{
    _id: ObjectId(...),
    genome_id: 1,
    count: 100,
    reactions: [
        {"reaction-key1": value, "reaction-key2": value},
        {"reaction-key1": value, "reaction-key2": value},
        {"reaction-key1": value, "reaction-key2": value},
        {"reaction-key1": value, "reaction-key2": value},
        {"reaction-key1": value, "reaction-key2": value},
        ...
    ]
}
As you might imagine, this kind of model has all kinds of implications that your application has to take into account when adding or querying data.
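For example, appending a reaction typically targets the genome's current non-full bucket, falling back to creating a new one; roughly like this, assuming a cap of 100 reactions per bucket to match the count field above:
// Push into a bucket that still has room; upsert a fresh bucket if none does.
db.reaction_buckets.update(
    {genome_id: 1, count: {$lt: 100}},
    {
        $push: {reactions: {"reaction-key1": "value", "reaction-key2": "value"}},
        $inc: {count: 1}
    },
    {upsert: true}
);
// Reading the third "page" of a genome's reactions is then a single bucket fetch.
db.reaction_buckets.find({genome_id: 1}).sort({_id: 1}).skip(2).limit(1);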
While in the end this approach doesn't really appeal to me (and thus I decided to look at Neo4j, at @Philipp's suggestion), I thought I'd post the solution in case anyone else needs to solve a similar problem and doesn't mind the hybrid/bucket approach.