MongoDB Atlas Search index on normalized/indexed model - mongodb

I'd like to use the new Atlas Search index feature to search through my models.
It seems to me that the data model I use can't be combined with this MongoDB feature.
It works really well on embedded models, but for consistency reasons I can't nest objects; they are referenced by their id.
Example
Collection Product
{
name: "Foo product",
quantity: 3,
tags: [
"id_123"
]
}
Collection Vendor
{
name: "Bar vendor",
address: ...,
tags: [
"id_123"
]
}
Collection Tags
{
id: "id_123",
name: "food"
}
What I want
I want to type food in my search bar, and find the products associated to the tag food.
Detailed problem
I have multiple business objects that are labelled with the same tag. I'd like to build a search index to search through my products, but I would need a $lookup first to resolve the tag ids, so that I can find all the products that have the tag "food".
According to the documentation, the $search stage must be the first stage of the aggregation pipeline, which prevents me from running a $lookup before searching. I had the idea of building a view first, resolving each id to the corresponding tag to prepare the field. But it is impossible to build a search index on a view.
Is it completely impossible to make this work? Do I need to give up consistency on my tags by flattening them and embedding each one directly in every model that needs them, just to be able to use this feature? That would mean that if I want to update a tag, I need to find every business object that carries the tag and perform the update.

I got in touch with MongoDB support, and the Atlas Search team proposed three ways to resolve this problem. I want to share the solutions in case anybody runs into the same problem I did because of this model design.
Recommended: Transform the model into an embedded one
The ideal MongoDB way of doing this would be to denormalize the model and not use references between models. Each tag would be added directly to the Product and Vendor documents, so no $lookup operations are needed anymore. It has some drawbacks, like painful updates. For my part, it is a no-go: the tags are meant to be updatable, and will be shared by almost every business object I plan to model.
Collection Product
{
name: "Foo product",
quantity: 3,
tags: [
"food"
]
}
Collection Vendor
{
name: "Bar vendor",
address: ...,
tags: [
"food"
]
}
Not recommended but possible: Break the request in multiple parts
This would mean keeping the existing model, querying the collections individually, and resolving the sequential requests on the application side.
We could put an Atlas Search index on the Tags collection and use the search feature to find the id of the tag we want. Then we could use this id to query the Product/Vendor collections directly and find the products corresponding to the "food" tag. By tuning the search on the application side, we could obtain satisfying results.
It is not the recommended way of doing it.
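As a rough illustration of this two-step approach, the sketch below only builds the two query objects. The index name "default", the helper names, and the choice to search on the tag's name field are assumptions, not something the support team specified.

```javascript
// Sketch only: index name "default" and the helper names are assumptions.

// Step 1: Atlas Search pipeline to run against the Tags collection.
function buildTagSearchPipeline(term) {
  return [
    { $search: { index: "default", text: { query: term, path: "name" } } },
    // Keep only the application-level tag id used by Product/Vendor.
    { $project: { _id: 0, id: 1 } },
  ];
}

// Step 2: plain filter for Product (or Vendor), using the tags found in step 1.
function buildTaggedFilter(tagDocs) {
  return { tags: { $in: tagDocs.map((t) => t.id) } };
}

// With the Node.js driver this could be wired up roughly as:
//   const tagDocs = await db.collection("Tags")
//     .aggregate(buildTagSearchPipeline("food")).toArray();
//   const products = await db.collection("Product")
//     .find(buildTaggedFilter(tagDocs)).toArray();
```

The cost is one extra round trip per search, and any relevance ranking applies to the tags rather than to the products.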
Theoretically my preferred way: Use the Materialized View feature
This is an intermediate solution, and it is the one I will try out. It is not perfect, but from what I see, it tries to reconcile the capabilities of the referenced model and the embedded model.
Atlas Search indexes are not usable on regular views. The workaround that makes this possible is a materialized view (which is ultimately more a collection than a view). This is done with the $merge stage, which saves the results of an aggregation pipeline into a collection. By re-running the pipeline, we can update the materialized view. The trick is to perform all the required $lookup operations to denormalize the referenced model, then use $merge as the final stage to create the collection, which supports an Atlas Search index like any other collection.
The only concern is choosing the interval at which to refresh the materialized view, since the refresh can be performance-hungry. But on paper, it is a really good solution for people like me who cannot (won't?) pay the price of a painful update strategy on embedded models.
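A minimal sketch of what such a pipeline could look like for the Product collection above. The target collection name "ProductSearch" and the field names tagDocs and tagNames are hypothetical, and I have not benchmarked this.

```javascript
// Sketch only: "ProductSearch", "tagDocs" and "tagNames" are hypothetical names.

// Pipeline to run on Product; re-running it refreshes the materialized view.
function buildMaterializePipeline(targetCollection) {
  return [
    // Resolve each tag id in Product.tags against Tags.id.
    { $lookup: { from: "Tags", localField: "tags", foreignField: "id", as: "tagDocs" } },
    // Keep only the tag names, as a flat array that is easy to index.
    { $addFields: { tagNames: "$tagDocs.name" } },
    { $project: { tagDocs: 0 } },
    // Write the result into a real collection instead of returning it.
    { $merge: { into: targetCollection, on: "_id", whenMatched: "replace", whenNotMatched: "insert" } },
  ];
}

// Search pipeline to run on the materialized collection once it has a search index.
function buildProductSearchPipeline(term) {
  return [{ $search: { text: { query: term, path: "tagNames" } } }];
}

// Roughly, with the Node.js driver:
//   await db.collection("Product")
//     .aggregate(buildMaterializePipeline("ProductSearch")).toArray();
//   const hits = await db.collection("ProductSearch")
//     .aggregate(buildProductSearchPipeline("food")).toArray();
```

Note that $merge only inserts or replaces documents; products deleted from the source collection stay in the target until they are removed some other way.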

Related

MongoDB: Looking for advice on designing schema for improving query efficiency

I am fairly new to MongoDB and I’m looking for advice on designing the schema before I commit to going down this route. I’m developing a collaborative documentation system, where the user creates a document and invites other users to collaborate, much like Google docs.
There are two collections. The first one stores documents and the second one stores lists of collaborators. When the user creates a new document, they assign a list of collaborators to this document. In the simplest form, the schema would look something like this
The Document schema contains some data but it also maintains a reference to a document in the Collaborators collection
Document model
{
....
collaborators: ObjectId; // e.g. 0x507f1f77bcf86cd799439011
}
Collaborators collection contains documents that contain an array of roles for the collaborators.
Collaborators model
{
_id: 0x507f1f77bcf86cd799439011; // referenced by Document model
collaborators: [
{userId: 1, role: "editor"},
{userId: 2, role: "commenter"}
]
}
I will have an API that fetches all those documents where the logged-in user’s userId is in the list of collaborators referenced by the document. Without much experience with writing efficient queries, I think a two-step lookup will work but it won’t be very efficient.
Step 1 → Find all the collaborators lists which contain userId, and obtain their _id field
Step 2 → Find all documents that have collaborators field containing one of the values found in Step 1
Is there a more efficient way to construct this query particularly if the users fetch this list frequently?
If I should redesign the schema in some way so that the lookup can be efficient, I’d like to know.
I'm using mongoose client if that's relevant.
I realized the MongoDB aggregation framework is what I needed. I was able to use the $lookup and $match stages to achieve what I want. I'm still not sure how expensive this is, given that $lookup performs a left outer join.
Here’s an example if anybody wants to look.
https://mongoplayground.net/p/RPheBZESC0H
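For readers who cannot open the link, here is one way such a pipeline could be written. It is not necessarily what the playground contains. It is meant to run on the Collaborators collection, and the collection name "documents" is an assumption.

```javascript
// Sketch only: the collection name "documents" is an assumption.

// Pipeline to run on the Collaborators collection.
function buildDocumentsForUserPipeline(userId) {
  return [
    // Step 1: collaborator lists that include this user.
    { $match: { "collaborators.userId": userId } },
    // Step 2: documents whose `collaborators` field references one of those lists.
    { $lookup: { from: "documents", localField: "_id", foreignField: "collaborators", as: "docs" } },
    // Return the documents themselves rather than the collaborator lists.
    { $unwind: "$docs" },
    { $replaceRoot: { newRoot: "$docs" } },
  ];
}

// Roughly, with the Node.js driver:
//   const docs = await db.collection("collaborators")
//     .aggregate(buildDocumentsForUserPipeline(1)).toArray();
```

Indexes on collaborators.userId and on the collaborators field of the documents collection would likely help if this list is fetched frequently.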

Many to many relationship on Mongodb based e-learning webapp?

I am relatively new to No-SQL databases. I am designing a data structure for an e-learning web app. There would be X quantity of courses and Y quantity of users.
Every user will be able to take any number of courses.
Every course will be composed of many sections (each section may be a video or a quiz).
I will need to keep track of every section a user takes, so I think the whole course should be part of the user set (for each user), like so:
{
_id: "ed",
name: "Eduardo Ibarra",
courses: [
{
name: "Node JS",
progress: "100%",
section: [
{name: "Introduction", passed:"100%", field3:"x", field4:""},
{name: "Quiz 1", passed:"75%", questions:[...], field3:"x", field4:""},
]
},
{
name: "MongoDB",
progress: "65%",
...
}
]
}
Is this the best way to do it?
I would say: design your database around your queries. One thing is for sure: you will have to do some embedding.
If you are going to perform more queries on what a user is doing, then make the user the primary entity and embed the courses within it. You don't need to embed the entire course info. The info about a course is static. For example, the data about the Node JS course (the content, the author of the course, exercise files, etc.) will not change. So you can keep the courses' info separately in another collection. But how much of the course a user has completed depends on the individual user. So you should only keep the id of the course (which is stored in the separate 'course' collection), and for each user you can store the information related to that (User, Course) pair embedded in the user collection itself.
Now the most important question: what do you do if you have to perform queries that require a 'join' of the user and course collections? For this you can use JavaScript to first get the courses (and maybe store them in an array or list) and then fetch the users for each of those courses from the users collection, or vice versa. There are a few drivers available online to help you accomplish this. One is UnityJDBC which is available here.
From my experience, knowing what you are going to query from MongoDB is very helpful in designing your database, because the NoSQL nature of MongoDB means there is no single correct design. Any design is incorrect if it does not let you accomplish your task. So knowing beforehand what you will do with the database (i.e. what you will query) is the only guide.
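To make the split concrete, here is a small in-memory sketch of that layout and of an application-side join. The field names courseId and sections, and the helper withCourseInfo, are made up for the example.

```javascript
// Static course data, stored once in its own collection (hypothetical shape).
const courses = [
  { _id: "nodejs", name: "Node JS", sections: ["Introduction", "Quiz 1"] },
  { _id: "mongodb", name: "MongoDB", sections: ["Basics"] },
];

// Per-user progress embedded in the user document, keyed by course id.
const user = {
  _id: "ed",
  name: "Eduardo Ibarra",
  courses: [
    { courseId: "nodejs", progress: "100%" },
    { courseId: "mongodb", progress: "65%" },
  ],
};

// Application-side "join": attach the static course info to each progress entry.
function withCourseInfo(userDoc, courseDocs) {
  const byId = new Map(courseDocs.map((c) => [c._id, c]));
  return userDoc.courses.map((entry) => ({
    ...entry,
    course: byId.get(entry.courseId) || null,
  }));
}

const report = withCourseInfo(user, courses);
```

Against a real database, the courses array would come from a single query on the course collection filtered by the ids found in the user document.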

Mongo for Meteor data design: opposite of normalizing?

I'm new to Meteor and Mongo. Really digging both, but want to get feedback on something. I am digging into porting an app I made with Django over to Meteor and want to handle certain kinds of relations in a way that makes sense in Meteor. Given, I am more used to thinking about things in a Postgres way. So here goes.
Let's say I have three related collections: Locations, Beverages and Inventories. For this question though, I will only focus on the Locations and the Inventories. Here are the models as I've currently defined them:
Location:
_id: "someID"
beverages:
_id: "someID"
fillTo: "87"
name: "Beer"
orderWhen: "87"
startUnits: "87"
name: "Second"
number: "102"
organization: "The Second One"
Inventories:
_id: "someID"
beverages:
0: Object
name: "Diet Coke"
units: "88"
location: "someID"
timestamp: 1397622495615
user_id: "someID"
But here is my dilemma: I often need to retrieve one or many Inventories documents and need to render the "fillTo", "orderWhen" and "startUnits" per beverage. Doing things the MongoDB way, it looks like I should actually be embedding these properties as I store each Inventory. But that feels really non-DRY (and dirty).
On the other hand, it seems like a lot of effort & querying to render a table for each Inventory taken. I would need to go get each Inventory, then lookup "fillTo", "orderWhen" and "startUnits" per beverage per location then render these in a table (I'm not even sure how I'd do that well).
TIA for the feedback!
If you only need this for rendering purposes (i.e. no further queries), then you can use the transform hook like this:
var myAwesomeCursor = Inventories.find({ /* selector */ }, {
  transform: function (doc) {
    _.each(doc.beverages, function (bev) {
      // use whatever method you want to receive these data,
      // possibly from some cache or even another collection
      // bev.fillTo = ...
      // bev.orderWhen = ...
      // bev.startUnits = ...
    });
    // transform must return the (possibly modified) document
    return doc;
  }
});
Now the myAwesomeCursor can be passed to each helper, and you're done.
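The part left open in the comments could be filled in along these lines. This is only a sketch: it assumes Location.beverages is an array and that beverages can be matched by name, which the question does not confirm, and mergeBeverageSettings is a made-up helper.

```javascript
// Sketch only: matching beverages by name is an assumption about the data.

// Returns a copy of the inventory with fillTo/orderWhen/startUnits
// copied from the matching beverage of the given location.
function mergeBeverageSettings(inventory, location) {
  const settings = new Map(location.beverages.map((b) => [b.name, b]));
  return {
    ...inventory,
    beverages: inventory.beverages.map((bev) => {
      const s = settings.get(bev.name) || {};
      return {
        ...bev,
        fillTo: s.fillTo,
        orderWhen: s.orderWhen,
        startUnits: s.startUnits,
      };
    }),
  };
}

// Inside a Meteor transform this could be used roughly as:
//   transform: function (doc) {
//     var location = Locations.findOne(doc.location);
//     return location ? mergeBeverageSettings(doc, location) : doc;
//   }
```

Beverages with no match in the location simply get undefined for the three fields, so the template needs to handle that case.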
In your case you might find denormalizing the inventories so they are a property of locations could be the best option, especially since they are a one-to-many relationship. In MongoDB and several other document databases, denormalizing is often preferred because it requires fewer queries and updates. As you've noticed, joins are not supported and must be done manually. As apendua mentions, Meteor's transform callback is probably the best place for the joins to happen.
However, the inventories may contain many beverage records and could cause the location records to grow too large over time. I highly recommend reading this page in the MongoDB docs (and the rest of the docs, of course). Essentially, this is a complex decision that could eventually have important performance implications for your application. Both normalized and denormalized data models are valid options in MongoDB, and both have their pros and cons.

MongoDB - manual references example

I was reading the manual references part of the MongoDB Database References documentation, but I don't really understand the part about the "second query to resolve the referenced fields". Could you give me an example of this query, so I can get a better idea of what they are talking about?
"Manual references refers to the practice of including one document’s _id field in another document. The application can then issue a second query to resolve the referenced fields as needed."
The documentation is pretty clear in the manual section you are referring to which is the section on Database References. The most important part in comprehending this is contained in the opening statement on the page:
"MongoDB does not support joins. In MongoDB some data is denormalized, or stored with related data in documents to remove the need for joins. However, in some cases it makes sense to store related information in separate documents, typically in different collections or databases."
The further information covers the topic of how you might choose to deal with accessing data that you store in another collection.
There is the DBRef specification which, without going into too much detail, may be implemented in some drivers so that when DBRefs are found in your documents, the driver automatically retrieves (expands) the referenced document into the current document. This would be implemented "behind the scenes" with another query to that collection for the document with that _id.
In the case of manual references, this basically says that there is merely a field in your document whose content is the ObjectId of another document. This differs from a DBRef only in that it will never be processed by a base driver implementation; it leaves how you handle any further retrieval of that other document solely up to you.
In the case of:
> db.collection.findOne()
{
_id: <ObjectId>,
name: "This",
something: "Else",
ref: <AnotherObjectId>
}
The ref field in the document is nothing more than a plain ObjectId and does nothing special. What this allows you to do is submit your own query to get the Object details this refers to:
> db.othercollection.findOne({ _id: <AnotherObjectId> })
{
_id: <ObjectId>,
name: "That",
something: "I am a sub-document to This!"
}
Keep in mind that all of this processes on the client side via the driver API. None of this fetching other documents happens on the server in any case.

Links vs References in Document databases

I am confused with the term 'link' for connecting documents
In OrientDB page http://www.orientechnologies.com/orientdb-vs-mongodb/ it states that they use links to connect documents, while in MongoDB documents are embedded.
Since in MongoDB http://docs.mongodb.org/manual/core/data-modeling-introduction/, documents can be referenced as well, I can not get the difference between linking documents or referencing them.
The goal of Document Oriented databases is to reduce "Impedance Mismatch" which is the degree to which data is split up to match some sort of database schema from the actual objects residing in memory at runtime. By using a document, the entire object is serialized to disk without the need to split things up across multiple tables and join them back together when retrieved.
That being said, a linked document is the same as a referenced document. They are simply two ways of saying the same thing. How those links are resolved at query time varies from one database implementation to another.
An embedded document, by contrast, is simply an object that relates to a parent type and is stored inside the parent. For example, I have a class as follows:
class User
{
string Name
List<Achievement> Achievements
}
Where Achievement is an arbitrary class (its contents don't matter for this example).
If I were to save this using linked documents, I would save User in a Users collection and Achievement in an Achievements collection with the List of Achievements for the user being links to the Achievement objects in the Achievements collection. This requires some sort of joining procedure to happen in the database engine itself. However, if you use embedded documents, you would simply save User in a Users collection where Achievements is inside the User document.
A JSON representation of the data for an embedded document would look (roughly) like this:
{
"name":"John Q Taxpayer",
"achievements":
[
{
"name":"High Score",
"point":10000
},
{
"name":"Low Score",
"point":-10000
}
]
}
Whereas a linked document might look something like this:
{
"name":"John Q Taxpayer",
"achievements":
[
"somelink1", "somelink2"
]
}
Inside an Achievements Collection
{
"somelink1":
{
"name":"High Score",
"point":10000
},
"somelink2":
{
"name":"Low Score",
"point":-10000
}
}
Keep in mind these are just approximate representations.
So to summarize, linked documents function much like RDBMS PK/FK relationships. This allows multiple documents in one collection to reference a single document in another collection, which can help with deduplication of data stored. However it adds a layer of complexity requiring the database engine to make multiple disk I/O calls to form the final document to be returned to user code. An embedded document more closely matches the object in memory, this reduces Impedance Mismatch and (in theory) reduces the number of disk I/O calls.
You can read up on Impedance Mismatch here: http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch
UPDATE
I should add, that choosing the right database to implement for your needs is very important from the start. If you have a lot of questions about each database, it might make sense to contact each supplier and get some of their training material. MongoDB offers 2 free courses you can take to learn more about their product and best uses at MongoDB University. OrientDB does offer training, however it is not free. It might be best to try contacting them directly and getting some sort of pre-sales training (if you are looking to license the db), usually they will put you in touch with some sort of pre-sales consultant to help you evaluate their product.
MongoDB works like RDBMS where the object id is like a foreign key. This means a "JOIN" that is run-time expensive. OrientDB, instead, has direct links that are created only once and have a very low run-time cost.