I am building a real estate application using node and MongoDB. I have two major models
City
Property
I am now confused because I don't know whether I should create a separate collection for cities and one for properties, or put all the properties under their city.
I am confused because I think that as the application grows, large cities will become huge documents, and this is a design decision that should be made up front.
Please let me know if you have a best-practice way to handle this kind of situation.
As every property has only one city, this is a one-to-many relationship. In this case you have many options:
Firstly, remember the 16 MB size restriction per document. So, how big is the "many"? How many properties per city?
One-to-Few (just a few hundred): embedding. Embed the "few" (properties) in the "one" (city).
One-to-Many (no more than a couple of thousand): child-referencing.
Store the ObjectIds of the "many" (property) docs in an array in the "one" (city) document.
One-to-Squillions: parent-referencing.
Store the ObjectId of the "one" (city) in each "many" (property) document.
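To make the options concrete, here is a minimal sketch of what each pattern's documents might look like (the field names and values are illustrative, not from the question):

// One-to-Few: embed the "few" directly in the "one".
// Fine while the array stays small and well under the 16 MB cap.
{ _id: 1, name: "Springfield", properties: [ { address: "12 Elm St" }, { address: "34 Oak Ave" } ] }

// One-to-Many: child-referencing -- the city keeps an array of property ObjectIds.
{ _id: 1, name: "Springfield", propertyIds: [ ObjectId("..."), ObjectId("...") ] }

// One-to-Squillions: parent-referencing -- each property points back at its city.
{ _id: ObjectId("..."), address: "12 Elm St", cityId: 1 }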
Secondly, if there's a high ratio of reads to updates, you can consider denormalization: paying the price of slower, more complex updates in order to get more efficient queries.
A proposed solution: have only one collection (properties) and embed the city document in each property document.
Since you will probably retrieve the properties by city, don't forget to create an index on the city field.
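A minimal sketch of that layout in the mongo shell, with illustrative field names:

// A property document with its city embedded:
db.properties.insertOne({
  address: "12 Elm St",
  price: 250000,
  city: { name: "Springfield", state: "IL" }
});

// Index the embedded city name so lookups by city are fast:
db.properties.createIndex({ "city.name": 1 });

// Typical query: all properties in a given city.
db.properties.find({ "city.name": "Springfield" });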
Recommended posts:
http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
http://blog.mongodb.org/post/87892923503/6-rules-of-thumb-for-mongodb-schema-design-part-2
http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3
I assume the city center coordinates are not going to change very often. So embedding cities in properties is possible if the city documents are not very big and you need information about them each time you read a property. On the other hand, you can put cities into a separate collection and link to them from a property. Finally, you can embed the most useful information about a city, along with a link to it, in the property (mixed approach). Embedding a city in a property will introduce some duplication but may increase read performance. To choose the most appropriate option you need to understand what read/write requests the application will make most often.
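For example, the mixed approach might look like this (a sketch with illustrative field names):

// The full city document lives in its own collection...
db.cities.insertOne({ _id: "springfield", name: "Springfield", center: [39.78, -89.65], population: 116000 });

// ...while each property embeds only the city fields it needs on every read,
// plus a reference for everything else.
db.properties.insertOne({
  address: "12 Elm St",
  cityId: "springfield",                                  // link to the full city document
  city: { name: "Springfield", center: [39.78, -89.65] }  // duplicated "hot" fields
});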
Related
I'm working on an app with my team. It's based on Meteor and React. We have 2 collections: Rooms and Locations. Each room has a unique location. We have a page where we list all the rooms and can filter them. This is the most used feature. Inserting a new room or a new location can be done only by the admin.
We are designing our filter (by date, by floor, by time, by location name). All the properties we need are in the Rooms collection, except for the location name. We came up with two solutions:
duplicate the location name used in the filter also for each room in the Rooms collections.
keep the collections separate and get the list of rooms for each location.
I'm trying to figure out which one is best.
First option:
In this case we only need one collection: Rooms. Filtering will cost O(n). The cost of adding the location name to a new room will be the same, since we already need to add the location id. The extra cost will be the space on MongoDB to store it.
Second option:
In this solution we have all the data well structured in the DB, but to filter by location we need to scan each room and find the matching location in the Locations collection. That alone, I think, will cost O(n*m).
This is a simple case and we will never scale too much, but since I'm new to Mongo I would like to know which of the two approaches leads to better performance.
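For reference, here is a sketch of the two room shapes under discussion (fields other than the location name are illustrative):

// Option 1: duplicate the location name on every room, so the filter
// runs against a single collection.
Rooms.insert({ floor: 2, date: new Date(), locationId: "loc1", locationName: "Main Building" });

// Option 2: keep Rooms normalized and resolve the name per room
// from the Locations collection when filtering.
Rooms.insert({ floor: 2, date: new Date(), locationId: "loc1" });
Locations.insert({ _id: "loc1", name: "Main Building" });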
I am working on an app where I use core data.
I already have tried to do it with one entity but that didn't work.
But I now have around twenty entities, and my question is: is there a limit to the number of entities, or a recommended number?
Is there a better way to store that amount of data?
UPDATE:
What I am storing are grades from school, not letter grades (A, B, C, D, E, F) but numbers from 1 to 10. And each grade has its own weighting (the number of times it counts); some grades count twice because they are more important.
So I first thought to have an array with a string for the name of the subject and then two arrays: one stores the grades, the other the corresponding weightings.
Like this:
var subjects: [String,[Int],[Int]]
but this isn't possible, and I don't even know how I should put this in Core Data and get it back properly.
Because I couldn't figure it out, I thought of just making an entity for each subject, but there are a lot of them; hence this question.
There's no limit to the number of entities, but it's possible to go overboard and create more than you actually need. The recommended number is "as many as you need and no more", which obviously will vary a great deal depending on the nature of the data and how the app uses it. Whether there's a better way than your current approach is totally dependent on the fine details of exactly what you're doing, and so is impossible to answer without a far more detailed question.
You could set up a Subject entity that has one-to-many relationships to ordered sets of Grade and Weight entities.
However, each grade apparently has a corresponding weight, so it would be more accurate to store each grade's weight in the Grade entity itself.
This still may not represent your real-world model.
If your subject is something general, like math or English, you could have more than one subject per grade (e.g., algebra, geometry, trigonometry), or more than one level per subject (e.g., algebra 1, algebra 2), which may or may not have a different grade.
If your subject is very specific, your data may end up spread across unique one-to-one relationships, instead of one-to-many relationships.
You would also need to consider whether you can use ordered or unordered relationships, or whether an attribute exists that you can use to sort an entity.
You should consider these different facets of what you're trying to model (as well as the specific fetches you'd want to perform), before you try to design or implement the model, to allow you to efficiently represent this particular object graph.
I'm more used to relational databases and am having a hard time thinking about how to design my database in MongoDB. I'm even more unclear when taking into account some of the special considerations of database design for Meteor, where I understand you often prefer separate collections over embedded documents/data in order to make better use of some of the benefits you get from collections.
Let's say I want to track students progress in high school. They need to complete certain required classes each school year in order to progress to the next year (freshman, sophomore, junior, senior), and they can also complete some electives. I need to track when the students complete each requirement or elective. And the requirements may change slightly from year to year, but I need to remember for example that Johnny completed all of the freshman requirements as they existed two years ago.
So I have:
Students
Requirements
Electives
Grades (frosh, etc.)
Years
Mostly, I'm trying to think about how to set up the requirements. In a relational DB, I'd have a table of requirements, with className, grade, and year, and a table of student_requirements, that tracks the students as they complete each requirement. But I'm thinking in MongoDB/meteorjs, I'd have a model for each grade/level that gets stored with a studentID and initially instantiates with false values for each requirement, like:
{
  student: [studentID],
  class: 'freshman',
  year: 2014,
  requirements: {
    class1: false,
    class2: false
  }
}
and as the student completes a requirement, it updates like:
{
  student: [studentID],
  class: 'freshman',
  year: 2014,
  requirements: {
    class1: false,
    class2: [completionDateTime]
  }
}
So in this way, each student will collect four Requirements documents, which are somewhat dictated by their initial instantiation values. And instead of the actual requirements for each grade/year living in the database, they would essentially live in the code itself.
Some of the actions I would like to be able to support are marking off requirements across a set of students at one time, and showing a grid of users/requirements to see who needs what.
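A minimal sketch of the bulk mark-off under that schema, assuming the collection is named Requirements and selectedStudentIds is a hypothetical array of student ids:

// Mark class1 complete for a whole set of students in one update;
// {multi: true} makes Meteor update every matching document.
Requirements.update(
  { student: { $in: selectedStudentIds }, class: 'freshman', year: 2014 },
  { $set: { 'requirements.class1': new Date() } },
  { multi: true }
);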
Does this sound reasonable? Or is there a better way to approach this? I'm pretty early in this application and am hoping to avoid painting myself into a corner. Any help or suggestion is appreciated. Thanks! :-)
Currently I'm thinking about my application data design too. I've read the examples in the MongoDB manual
Look up the MongoDB manual on data model design: docs.mongodb.org/manual/core/data-model-design/
and the MongoDB manual on one-to-one relationships: docs.mongodb.org/manual/tutorial/model-embedded-one-to-one-relationships-between-documents/
(sorry I can't post more than one link at the moment in an answer)
They say:
In general, use embedded data models when:
you have “contains” relationships between entities.
you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are viewed in the context of the “one” or parent documents.
The normalized approach uses a reference in one document to another document, just like in the Meteor.js book. They create a web app which shows posts, and each post has a set of comments. They use two collections, posts and comments. When adding a comment, it's submitted together with the post_id.
So in your example you have a students collection, and each student has to fulfill requirements? And each student has their own requirements, like a post has its own comments?
Then I would handle it like they did in the book. With two collections. I think that should be the normalized approach, not the embedded.
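Following the posts/comments pattern from the book, a sketch could look like this (collection and field names are illustrative):

// One collection per entity; the child document carries a reference
// to its parent, just like a comment carries its post_id.
Students.insert({ _id: "johnny", name: "Johnny" });
Requirements.insert({
  studentId: "johnny",      // reference back to the student
  class: "freshman",
  year: 2014,
  className: "class1",
  completedAt: null         // set to a date once fulfilled
});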
I'm a little confused myself, so maybe you can tell me, if my answer makes sense.
Maybe you can help me too? I'm trying to make an app that manages a flea market.
Users of the app create events.
The creator of the event invites users to be cashiers for that event.
Users create lists of stuff they want to sell. There is a maximum number of lists/sellers per event, and a maximum number of positions on a list (25/50).
Cashiers type in the positions of those lists at the event, to track what is sold.
Event creators make billings for the sold items on each list, to hand out the money afterwards.
I'm confused how to set up the data design. I need Events and Lists. Do I use the normalized approach, or the embedded one?
Edit:
After reading percona.com/blog/2013/08/01/schema-design-in-mongodb-vs-schema-design-in-mysql/ I found the following advice:
If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data that is almost never used (passport information), and when you need all information for a given person, it may be acceptable to do the join in the application.
Same thing if you want to display the name of people on one screen and the passport information on another screen.
But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution.
I have a Student collection and a Person Collection.
Person contains the fields: name, address, etc.
Student contains: rollno, and a person field that stores the person._id for this student
Now I want to show the name of a student in the student template, but note that there's no name field in Student; I'll need to get it from that student's Person document.
Is there a way to get a MongoDB cursor on the client that has the student information as well as selected fields from that student's Person document?
Also, is there a better or more standard way of achieving what I'm trying to achieve?
Note: I don't want to introduce redundancy and store the name field on the Student document, so that's not a solution.
is there a better or more standard way of achieving what I'm trying to achieve?
It sounds like you are trying to read all information about a student in one read - the only way to do that is to have all that information in a single document.
The flexible schema of document databases allows you to have documents in a single collection which are not required to have the same schema, i.e., the same set of fields.
So I would recommend that you consider why you actually need separate collections for Person and Student. This causes writes to two collections when you add a student (and while a single write is atomic, two writes are not), and it also causes the issue you have now, where you need two separate reads to get all the information about a student.
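For instance, a single collection could hold both shapes side by side (a sketch with illustrative names):

// Plain person and student live in the same collection; the student
// document simply carries extra fields.
db.people.insertOne({ name: "Alice", address: "12 Elm St" });
db.people.insertOne({ name: "Bob", address: "34 Oak Ave", rollno: 42, role: "student" });

// One read now returns everything about a student:
db.people.findOne({ rollno: 42 });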
This SO question is somewhat related to your situation.
See the accepted answer in this thread:
Possible bug when observing a cursor, when deleting from collection
It involves using a modified version of the built-in _publishCursor titled publishModifiedCursor, which allows you to specify a callback to add properties to each document in the cursor you are publishing.
I would change your code to have a role / job attribute in the Person object.
It's semantic, at least for me. Also, think about the difficulty of handling someone changing jobs in your original model versus simply changing the role.
Then you could just search:
Persons.find({ role: 'student' })
and that would be totally analogous to having a Student object.
As Asya said, the students can just have extra fields the other ones don't have.
I am trying to figure out the equivalent of foreign keys and indexes in NoSQL key-value (KVP) or document databases. Since there are no pivot tables (to add keys marking a relation between two objects), I am really stumped as to how you would be able to retrieve data in a way that would be useful for normal web pages.
Say I have a user, and this user leaves many comments all over the site. The only ways I can think of to keep track of that user's comments are to:
Embed them in the user object (which seems quite useless)
Create and maintain a user_id:comments value that contains a list of each comment's key [comment:34, comment:197, etc.] so that I can fetch them as needed.
However, taking the second example, you will soon hit a brick wall when you use it for tracking other things, like a key called "active_comments" which might contain 30 million ids, making it cost a ton to query each page just to know some recent active comments. It is also very prone to race conditions, as many pages might try to update it at the same time.
How can I track relations like the following in a NoSQL database?
All of a user's comments
All active comments
All posts tagged with [keyword]
All students in a club - or all clubs a student is in
Or am I thinking about this incorrectly?
All the answers for how to store many-to-many associations in the "NoSQL way" reduce to the same thing: storing data redundantly.
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it. Use the same criteria you would use to denormalize a relational database: if it's more important for data to have cohesion (think of values in a comma-separated list instead of a normalized table), then do it that way.
But this inevitably optimizes for one type of query (e.g. comments by any user for a given article) at the expense of other types of queries (comments for any article by a given user). If your application has the need for both types of queries to be equally optimized, you should not denormalize. And likewise, you should not use a NoSQL solution if you need to use the data in a relational way.
There is a risk with denormalization and redundancy that redundant sets of data will get out of sync with one another. This is called an anomaly. When you use a normalized relational database, the RDBMS can prevent anomalies. In a denormalized database or in NoSQL, it becomes your responsibility to write application code to prevent anomalies.
One might think that it'd be great for a NoSQL database to do the hard work of preventing anomalies for you. There is a paradigm that can do this -- the relational paradigm.
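To illustrate the responsibility this shifts to the application, here is a minimal sketch using the Node.js MongoDB driver, with hypothetical collections that store the same comment redundantly for two different queries:

// The same comment is written to two query-optimized views;
// nothing in the database keeps them consistent for you.
async function addComment(db, comment) {
  // View 1: comments grouped per article (fast "comments for article X").
  await db.collection('commentsByArticle').updateOne(
    { _id: comment.articleId },
    { $push: { comments: comment } },
    { upsert: true }
  );
  // View 2: comments grouped per user (fast "comments by user Y").
  // If this second write fails, the views are out of sync - an anomaly
  // only your application code can prevent or repair.
  await db.collection('commentsByUser').updateOne(
    { _id: comment.userId },
    { $push: { comments: comment } },
    { upsert: true }
  );
}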
The CouchDB approach suggests emitting the proper classes of things in the map phase and summarizing them in reduce. So you could map all comments, emit them keyed by user, and later read out only that user's. It would, however, require lots of disk storage to build persistent views of all trackable data in CouchDB. By the way, they also have this wiki page about relationships: http://wiki.apache.org/couchdb/EntityRelationship.
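A sketch of such a view (the doc fields are illustrative):

// Map: emit each comment under its author's id.
function (doc) {
  if (doc.type === 'comment') {
    emit(doc.userId, 1);
  }
}

// Reduce: the built-in _count gives a per-user comment total;
// query the view with ?key=<userId>&reduce=false to list the comments themselves.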
Riak, on the other hand, has a tool to build relations: links. You can add the address of a linked document (here, a comment) to the 'root' document (here, the user document). There is one catch: since the store is distributed, the document may be modified in many locations at the same time. That will cause conflicts and, as a result, a huge vector clock tree :/ ...not so bad, not so good.
Riak also has yet another 'mechanism': a 2-layer key namespace, the so-called bucket and key. So, for the student example, if we have clubs A, B, and C and students StudentX and StudentY, you could maintain the following convention:
{ Key = {ClubA, StudentX}, Value = true },
{ Key = {ClubB, StudentX}, Value = true },
{ Key = {ClubA, StudentY}, Value = true }
and to read the relation, just list the keys in the given buckets. What's wrong with that? It is damn slow; listing buckets was never a priority for Riak, though it is getting better and better. By the way, you do not waste memory, because this example value ({true}) can be a link to the single full profile of StudentX or Y (here conflicts are not possible).
As you can see, NoSQL != NoSQL. You need to look at the specific implementation and test it for yourself.
The column stores mentioned before look like a good fit for relations... but it all depends on your A, C, and P (availability, consistency, partition tolerance) needs ;) If you do not need the A and you have less than petabytes of data, just leave it and go ahead with MySQL or Postgres.
good luck
user:userid:comments is a reasonable approach - think of it as the equivalent of a column index in SQL, with the added requirement that you cannot query on unindexed columns.
This is where you need to think about your requirements. A list with 30 million items is unreasonable not because it is slow, but because it is impractical to ever do anything with it. If your real requirement is to display some recent comments, you are better off keeping a very short list that gets updated whenever a comment is added - remember that NoSQL has no normalization requirement. Race conditions are an issue with lists in a basic key-value store, but generally either your platform supports lists properly, you can do something with locks, or you don't actually care about failed updates.
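A sketch of that short-list idea, assuming Redis through the ioredis client (the key name is illustrative):

const Redis = require('ioredis');
const redis = new Redis();

// Push the new comment id, then trim so the list only ever holds the
// 100 most recent ids - no 30-million-item list to scan per page view.
async function recordComment(commentId) {
  await redis.lpush('recent_comments', commentId);
  await redis.ltrim('recent_comments', 0, 99);
}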
Same as for user comments - create an index keyword:posts
More of the same - probably a list of clubs as a property of student and an index on that field to get all members of a club
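In MongoDB, for example, an index on an array field is a multikey index, so one field serves both directions of the club/student relation (a sketch):

// Index the clubs array; MongoDB builds a multikey index over its elements.
db.students.createIndex({ clubs: 1 });

// All clubs a student is in: read the array off the student document.
db.students.findOne({ _id: "studentX" }, { clubs: 1 });

// All students in a club: query the array field, served by the same index.
db.students.find({ clubs: "kendo" });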
You have
"user": {
"userid": "unique value",
"category": "student",
"metainfo": "yada yada yada",
"clubs": ["archery", "kendo"]
}
"comments": {
"commentid": "unique value",
"pageid": "unique value",
"post-time": "ISO Date",
"userid": "OP id -> THIS IS IMPORTANT"
}
"page": {
"pageid": "unique value",
"post-time": "ISO Date",
"op-id": "user id",
"tag": ["abc", "zxcv", "qwer"]
}
Well, in a relational database the normal thing to do in a one-to-many relation is to normalize the data. That is the same thing you would do in a NoSQL database as well: simply index the fields by which you will be fetching the information.
For example, the important indexes for you are
Comment.UserID
Comment.PageID
Comment.PostTime
Page.Tag[]
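In MongoDB, as one concrete example, those indexes would be created like so (collection names follow the documents sketched above):

db.comments.createIndex({ userid: 1 });        // Comment.UserID
db.comments.createIndex({ pageid: 1 });        // Comment.PageID
db.comments.createIndex({ "post-time": 1 });   // Comment.PostTime
db.page.createIndex({ tag: 1 });               // Page.Tag[] - multikey over the array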
If you are using NosDB (a .NET-based NoSQL database with SQL support), your queries will look like:
SELECT * FROM Comments WHERE userid = 'That user';
SELECT * FROM Comments WHERE pageid = 'That page';
SELECT * FROM Comments WHERE post-time > DateTime('2016, 1, 1');
SELECT * FROM Page WHERE tag = 'kendo';
Check all the supported query types from their SQL cheat sheet or documentation.
Although it is best to use an RDBMS in such cases instead of NoSQL, one possible solution is to maintain additional nodes or collections to manage mappings and indexes. This may have an additional cost in the form of extra collections/nodes and processing, but it will give a solution that is easy to maintain and avoids data redundancy.