NoSQL how to lookup id in a collection - google-cloud-firestore

NoSQL noob here. I'm building an app using Firestore NoSQL. I'm looping through items where every item has an owner id (creator user id).
I want to display the owner's name on the listing page. In traditional SQL, I have a foreign key, so I can just reference, say, Item.Owner.FirstName.
What's the best practice in NoSQL? Should I save the owner name as a field at the time of saving the item, or do a lookup of each owner id to get the user object while I'm looping through items?
The second option sounds expensive, so I'm assuming the first way is the way to go. Unless there's a better, more accepted way?

Both will work. You either reference the data in the other document in whatever way you see fit, or you duplicate information into the document that you intend to query to build the display. You just have to decide which problem you want to deal with:
If you duplicate data among documents (known as "denormalization"), then you'll have to put effort into keeping them all up to date with each other, if that's what you require. So, writing one document might actually turn into writing multiple documents.
If you normalize your data with no duplication, then each of your queries will require additional queries to get the related data from other documents. This could result in a drop in performance and an increase in cost for apps with heavy read loads.
Since we don't know the performance requirements and usage behavior of your app, there is no way to give specific advice. You will have to think carefully about which problem you want to have, perhaps based on complexity, performance, and overall cost.
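To make the two options concrete, here is a minimal Python sketch that uses plain dictionaries as stand-ins for Firestore collections. The data and the field names (`owner_id`, `owner_name`, `first_name`) are assumptions for illustration and are not part of any Firestore API. The normalized variant resolves each distinct owner once rather than once per item, which keeps the number of reads down.

```python
# In-memory stand-ins for two collections (hypothetical data and field names).
users = {
    "u1": {"first_name": "Ada"},
    "u2": {"first_name": "Linus"},
}
items = [
    {"title": "Desk", "owner_id": "u1"},
    {"title": "Lamp", "owner_id": "u2"},
    {"title": "Chair", "owner_id": "u1"},
]


def list_with_lookup(items, users):
    """Normalized: resolve each distinct owner id once, then join in memory."""
    owner_ids = {item["owner_id"] for item in items}
    owners = {oid: users[oid] for oid in owner_ids}  # one read per distinct owner
    return [(item["title"], owners[item["owner_id"]]["first_name"]) for item in items]


def save_denormalized(item, users):
    """Denormalized: copy the owner's name into the item at write time."""
    copy = dict(item)
    copy["owner_name"] = users[item["owner_id"]]["first_name"]
    return copy


listing = list_with_lookup(items, users)
stored = [save_denormalized(item, users) for item in items]
```

With the denormalized variant the listing page needs no second read at all, but a change to a user's name would have to be copied into every stored item.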

Related

creating schema vs adding an additional field?

I want to store featured products like staff picks, featured products of each category in my system that will hold at most 10 documents. My priority is read performance over write performance but I also want to have an efficient storage system and I have three ways to do it in my mind:
1. Create a boolean field such as is_bestseller, is_staffpick in Products schema and query for it.
   I think this is the simplest way to do it but I think it would require an additional query to check if the at most 10 limit has been reached.
2. Create a FeaturedProducts schema that holds references of product ids.
   This is useful in the sense that if I want to add some additional info such as a featured product within the featured products then I could simply add a field in this schema. It would also be easy to check the at most 10 documents limit. I think this makes it more scalable but at the cost of performance?
3. Create a FeaturedProducts schema that will hold all the needed data.
   I think performance wise this would be the best but I'm not sure if this is an efficient way to store data. Basically, I would just duplicate the data of a product and store it. Obviously, if I have to update product details then I have to update it in two places now but the read-to-write ratio heavily favors reading so I am willing to do this even if it's gonna require more logic regarding updating and deleting products. Also it would be easy to set at most 10 documents limit.
I tried to look for some examples regarding featured products but couldn't find anything useful. I am not sure what the best practice is here and which way to go about so any kind of help is appreciated.
The rule of thumb when modeling your data in MongoDB is:
Data that is accessed together should be stored together.
Having that in mind, I consider the Extended Reference Pattern a great option for your use case. Here is an example from the MongoDB Blog.
Consider an e-commerce application where you have a user collection, an order collection, and others. Since users and orders have a 1-N relation, embedding all of the information about a customer in each order just to avoid the JOIN operation results in a lot of duplicated information.
Instead of duplicating all of the information on the customer, we only copy the fields we access frequently.
This schema will have high read performance, because all the needed information will be stored in a single document, at the cost of some duplicate data. That is not entirely bad, considering that it can serve as historical data.
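As a rough illustration of copying only the frequently accessed fields, the following Python sketch builds an order document from a customer document. The field names and the choice of which fields to copy (`name`, `shipping_address`) are assumptions made up for this example and are not taken from the MongoDB Blog post.

```python
# Hypothetical customer document with more fields than an order needs.
customer = {
    "_id": "c1",
    "name": "Jane Doe",
    "email": "jane@example.com",
    "shipping_address": "1 Main St",
    "loyalty_points": 420,
}

# Assumed set of fields that the order screens read most often.
EXTENDED_REFERENCE_FIELDS = ("_id", "name", "shipping_address")


def build_order(order_id, customer, lines):
    """Embed a small copy of the customer instead of the whole document."""
    return {
        "_id": order_id,
        "customer": {field: customer[field] for field in EXTENDED_REFERENCE_FIELDS},
        "lines": lines,
    }


order = build_order("o1", customer, [{"sku": "A-1", "qty": 2}])
```

The order keeps the customer's `_id`, so the full customer document can still be fetched on the rare occasions when the other fields are needed.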
Useful information:
Patterns
Design Anti-pattern
A potential solution is to use an index here so that you can maximize your query performance. You would create an additional boolean flag (as you indicated in your first solution) then index that query, with a cursor that limits the number of returned values.
For more ways to increase your query performance check out the official Mongo docs here. If you're curious as to how much more performant your queries become, you can use Mongo's explain() method to get benchmarks (more info here) and compare approaches.
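A minimal Python sketch of the boolean-flag approach, with a list standing in for the Products collection. In MongoDB the filter would be served by an index on the flag and the cap applied with a limit on the cursor; here both are simulated in memory, and `MAX_FEATURED`, `staff_picks` and `mark_staff_pick` are hypothetical names.

```python
MAX_FEATURED = 10  # the "at most 10" rule from the question

# Hypothetical Products collection with a boolean flag per document.
products = [{"_id": i, "name": f"Product {i}", "is_staffpick": False} for i in range(25)]


def staff_picks(products, limit=MAX_FEATURED):
    """Read path: filter on the flag and cap the result, like find(...).limit(n)."""
    return [p for p in products if p["is_staffpick"]][:limit]


def mark_staff_pick(products, product_id, limit=MAX_FEATURED):
    """Write path: count the flagged products before flagging one more."""
    target = next(p for p in products if p["_id"] == product_id)
    if target["is_staffpick"]:
        return  # already featured, nothing to do
    if sum(1 for p in products if p["is_staffpick"]) >= limit:
        raise ValueError("staff pick limit reached")
    target["is_staffpick"] = True
```

The extra count on the write path is the additional query the question worries about; since writes are rare compared to reads here, that cost is likely acceptable.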
Best of luck!

Mass Update NoSQL Documents: Bad Practice?

I'm storing two collections in a MongoDB database:
==Websites==
id
nickname
url
==Checks==
id
website_id
status
I want to display a list of check statuses with the appropriate website nickname.
For example:
[Google, 200] << (basically a join in SQL-world)
I have thousands of checks and only a few websites.
Which is more efficient?
1. Store the nickname of the website within the "check" directly. This means if the nickname is ever changed, I'll have to perform a mass update of thousands of documents.
2. Return a multidimensional array where the site ID is the key and the nickname is the value. This is to be used when iterating through the list of checks.
I've read that #1 isn't too bad (in the NoSQL) world and may, in fact, be preferred? True?
If it's only a few websites I'd go with option 1 - not as clean and normalized as in the relational/SQL world but it works and much less painful than trying to emulate joins with MongoDB. The thing to remember with MongoDB or any other NoSQL database is that you are generally making some kind of trade off - nothing is for free. I personally really value the schema-less document oriented data design and for the applications I use it for I readily make the trade-offs (like no joins and transactions).
That said, this is a trade-off - so one thing to always be asking yourself in this situation is why am I using MongoDB or some other NoSQL database? Yes, it's trendy and "hot", but I'd make certain that what you are doing makes sense for a NoSQL approach. If you are spending a lot of time working around the lack of joins and foreign keys, no transactions and other things you're used to in the SQL world I'd think seriously about whether this is the best fit for your problem.
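For comparison, here is a small Python sketch of both options from the question, using lists of dictionaries in place of the two collections. The sample data and the function names are assumptions for illustration only.

```python
websites = [
    {"id": 1, "nickname": "Google", "url": "https://www.google.com"},
    {"id": 2, "nickname": "Example", "url": "https://example.com"},
]
checks = [
    {"id": 10, "website_id": 1, "status": 200},
    {"id": 11, "website_id": 2, "status": 404},
    {"id": 12, "website_id": 1, "status": 200},
]


def statuses_with_lookup(checks, websites):
    """Option 2: build an id -> nickname map once, then walk the checks."""
    nickname_by_id = {site["id"]: site["nickname"] for site in websites}
    return [[nickname_by_id[c["website_id"]], c["status"]] for c in checks]


def denormalize(checks, websites):
    """Option 1: store the nickname inside every check document."""
    nickname_by_id = {site["id"]: site["nickname"] for site in websites}
    return [dict(c, nickname=nickname_by_id[c["website_id"]]) for c in checks]


def rename_website(website_id, new_nickname, websites, denormalized_checks):
    """Option 1's cost: a rename touches every check of that website."""
    for site in websites:
        if site["id"] == website_id:
            site["nickname"] = new_nickname
    touched = 0
    for check in denormalized_checks:
        if check["website_id"] == website_id:
            check["nickname"] = new_nickname
            touched += 1
    return touched
```

With only a few websites, the lookup map in option 2 is tiny, while the mass update in option 1 grows with the number of checks but only runs when a nickname changes.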
You might consider a 3rd option: Get rid of the Checks collection and embed the checks for each website as an array in each Websites document.
This way you avoid any JOINs and you avoid inconsistencies, because it is impossible for a Check to exist without the Website it belongs to.
This, however, is only recommended when the checks array for each document stays relatively constant over time and doesn't grow constantly. Rapidly growing documents should be avoided in MongoDB, because every time a document doubles its size, it is moved to a different location in the physical file it is stored in, which slows down write operations. Also, MongoDB has a 16MB limit per document. This limit exists mostly to discourage growing documents.
You haven't said what a Check actually is in your application. If it is a list of tasks you perform periodically and only occasionally change, there would be nothing wrong with embedding. But if you collect the historical results of all checks you ever ran, I would recommend putting each result (set?) in its own document to avoid document growth.

MongoDB design Questions - 2 way references

I am new to MongoDB so I apologize if these questions are simple.
I am developing an application that will track specific user interactions and put information about the user and the interactions into a MongoDB. There are several types of interactions that will all collect different information from the user.
My First question is: Should all of these interaction be in the same collection or should I separate them out by types (as you would do in a RDBMS)?
Additionally I would like to be able to look up:
All the interactions a specific user has made
All the users that have made a specific interaction
I was thinking of putting a Manual reference to an interaction document for each interaction a user performs in his document and a manual reference to the user that performed the interaction in each interaction document.
My second questions is: Does this "doubling up" of Manual references make sense or is there a better way to do this?
Any thoughts would be greatly appreciated.
Thank you!
My First question is: Should all of these interaction be in the same collection or should I separate them out by types (as you would do in a RDBMS)?
Without knowing too much about your data size, write amount, read amount, querying needs etc I would say; yes, all in one collection.
I am not sure if separating them out is how I would design this in a RDBMS either.
"Does this "doubling up" of Manual references make sense or is there a better way to do this?"
No, it doesn't sound like good database design to me.
Putting a user_id on the interaction collection document sounds good enough.
So when you want to get all user interactions you just query by the interactions collection user_id.
When you want to do it the other way around you query for all interactions that fit your query area, pull out those user_ids and then do a $in clause on the user collection.
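The two lookups described above could be sketched in Python as follows, with lists standing in for the users and interactions collections. The sample data and the `type` field are assumptions; the membership test in `users_who` plays the role of the `$in` clause.

```python
users = [
    {"_id": 1, "name": "alice"},
    {"_id": 2, "name": "bob"},
    {"_id": 3, "name": "carol"},
]
# Each interaction carries only a user_id; users hold no back-references.
interactions = [
    {"_id": 100, "user_id": 1, "type": "click"},
    {"_id": 101, "user_id": 2, "type": "purchase"},
    {"_id": 102, "user_id": 1, "type": "purchase"},
]


def interactions_of(user_id):
    """All interactions a specific user has made: one query on user_id."""
    return [i for i in interactions if i["user_id"] == user_id]


def users_who(interaction_type):
    """All users that made a given interaction: collect ids, then an $in-style query."""
    user_ids = {i["user_id"] for i in interactions if i["type"] == interaction_type}
    return [u for u in users if u["_id"] in user_ids]
```

Only the interaction documents hold a reference, so there is a single place to keep consistent and the user documents do not grow.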
My First question is: Should all of these interaction be in the same collection or should I separate them out by types (as you would do in a RDBMS)?
The greatest advantage of a document store over a relational database is precisely that you can do that. Put all different interactions into one collection and don't be afraid to give them different sets of fields.
Additionally I would like to be able to look up:
All the interactions a specific user has made
I was thinking of putting a Manual reference to an interaction document for each interaction a user performs in his document and a manual reference to the user that performed the interaction in each interaction document.
Note that it's usually not a good idea to have documents which grow indefinitely. MongoDB has an upper limit for document size (16MB by default). MongoDB isn't good at handling large documents, because documents are loaded completely into the RAM cache. When you have many large objects, not much will fit into the cache. Also, when documents grow, they sometimes need to be moved to another hard drive location, which slows down updates (that also screws with natural ordering, but you shouldn't rely on that anyway).
All the users that have made a specific interaction
Are you referring to a specific interaction instance (assuming that multiple users can be part of one interaction) or all users which already performed a specific interaction type?
In the latter case I would add an array of performed interaction types to the user document, because otherwise you would have to perform a join-like operation, which would either require a MapReduce or some application-sided logic.
In the first case I would, contrary to what Sammaye suggests, recommend using not the _id field of the user collection, but rather the username. When you use an index with the unique flag on user.username, it's just as fast as searching by user._id and uniqueness is guaranteed.
The reason is that when you search for the interactions by a specific user, it's more likely that you know the username and not the id. When you only have the username and you are referencing the user by id, you first have to search the users collection to get the _id of the username, which is an additional database query.
This of course assumes that you don't always have the user._id at hand. When you do, you can of course use _id as reference.

MongoDB - One Collection Using Indexes

Ok so the more and more I develop in Mongodb i start to wonder about the need for multiple collections vs having one large collection with indexes (since columns and fields can be different for each document unlike tabular data). If i am trying to develop in the most efficient way possible (meaning less code and reusable code) then can I use one collection for all documents and just index on a field. By having all documents in one collection with indexes then i can reuse all my form processing code and other code since it will all be inserting into the same collection.
For Example:
Lets say i am developing a contact manager and I have two types of contacts "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses. But that was because im used to developing in sql where yes this would be appropriate since columns would be different for each table. The more i started to think about the flexibility of document dbs the more I started to think, "do I really need two collections for this?" If i just add a field to each document called "contact type" and index on that, do i really need two collections? Since the fields/columns in each document do not have to be the same for all (like in sql) then each document can have their own fields as long as i have a "document type" field and an index on that field.
So then i took that concept and started to think, if i only need one collection for "individuals" and "businesses" then do i even need a separate collection for "Users" or "Contact History" or any other data. In theory couldn't i build the entire solution in one collection and just have a field in each document that specified the "type" and index on it such as "Users", "Individual Contact", "Business Contacts", "Contact History", etc, and if it is a document related to another document i can index on the "parent key/foreign" Id field...
This would allow me to code the front end dynamically since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding but i want to make sure by using indexes and secondary indexes that the db would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands even millions of documents in this collection as the user base grows but it would have indexes and secondary indexes to optimize performance.
My question is: Is this a common method mongodb developers use? Why or why not? What are the downfalls, if any? If this is a commonly used method, please also give any positives to using this method. thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seems like businesses have enough meta-data associated that it could bulk up a lot. (Also, the individual-business relationship seems like a many-to-many.) However, an individual might have a Name object (with first and last properties). It would be a bad idea to make Name into a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions - in the form of atomic aggregates. When you insert an object into Mongo, the entire object is either inserted or not inserted. So if your application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted to have a screen that displayed all contacts, in alphabetical order. If you have one single collection for contacts, then it's really easy, but if you have two collections it becomes a more complicated proposition.
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it easy to extract that user's contacts.
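As a rough sketch of the single-collection idea, the Python below keeps individuals and businesses in one list with a `contact_type` field. The field names and sample data are assumptions; in MongoDB the type filter would be backed by an index on that field rather than a linear scan.

```python
# One collection, two document shapes, distinguished by contact_type.
contacts = [
    {"contact_type": "individual", "name": "Zoe Park", "birthday": "1990-04-01"},
    {"contact_type": "business", "name": "Acme Ltd", "employees": 50},
    {"contact_type": "individual", "name": "bob Stone", "birthday": "1985-11-23"},
]


def all_contacts_sorted(contacts):
    """The 'all contacts, alphabetically' screen is a single sorted read."""
    return sorted(contacts, key=lambda c: c["name"].lower())


def contacts_of_type(contacts, contact_type):
    """Filtering by type stands in for a query on the indexed type field."""
    return [c for c in contacts if c["contact_type"] == contact_type]


names = [c["name"] for c in all_contacts_sorted(contacts)]
```

With two collections, the same screen would need two reads followed by a merge in application code.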

MongoDB - Denormalization / model opinion

I've been getting into Mongo, but coming from an RDBMS background I'm facing the probably obvious questions with regard to denormalisation and general data modelling.
Say I have a document type with an array of sub docs, and each sub doc has a status code.
In the relational world I would add a foreign key to the record, StatusId, simple.
In MongoDB, would you denormalise the key pieces of data from the "status", e.g. code and desc, and hold an ObjectId referencing another collection of proper statuses? I guess the next question is one of design: if the status doc is modified, would I then need to modify the denormalised data?
Another question on the same theme is how you would model a transaction table. Say I have events and people; the events could be quite granular, say time sheets, which over time may lead to many records. Based on what I've seen, this would seem like a good candidate for a child / sub array of docs, which of course could be indexed for speed.
Therefore, is it possible to query / find just the sub array or part of it? And given the 16mb limit for doc size, have I just limited the transaction history of the person? Or should the transaction history be a separate collection with an ObjectId referencing the person?
Thanks for any input
Sam
Or should the transaction history be a separate collection with an ObjectId referencing the person?
Probably, I think this S/O question may help you understand why.
if the status doc is modified I'd then need to modify the denormalised data?
Yes, this is a standard trade-off in MongoDB. You will encounter this question a lot. You may need to leverage a queue structure to ensure that data remains consistent across multiple collections.
Therefore is it possible to query / find just the sub array or part of it?
This is a tough one specific to MongoDB. With the basic query syntax, you have only limited support for dealing with arrays of objects. The new "Aggregation Framework" is actually much better here, but it's not available in a stable build.
All your "how to model this or that" questions can't really be answered, because good schema design depends on so many factors (access patterns, hardware characteristics, whether a cluster is used, etc).
if the status doc is modified I'd then need to modify the denormalised data?
Usually yes, that's the drawback of denormalisation. But sometimes you don't have to (some social network site stores user name with a photo tag and doesn't update it when user changes his name).
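To show what keeping denormalised data in sync involves, here is a minimal Python sketch with dictionaries in place of collections. The collection layout, the `subdocs` field and the function names are hypothetical; each sub doc holds a copy of the status code and description plus a reference to the master status document.

```python
# Master status documents (hypothetical data).
statuses = {
    "s1": {"code": "OPEN", "desc": "Open"},
    "s2": {"code": "DONE", "desc": "Done"},
}


def embed_status(status_id):
    """Copy code and desc into the sub doc, keeping the id as a reference."""
    status = statuses[status_id]
    return {"status_id": status_id, "code": status["code"], "desc": status["desc"]}


documents = [
    {"_id": 1, "subdocs": [embed_status("s1"), embed_status("s2")]},
    {"_id": 2, "subdocs": [embed_status("s1")]},
]


def update_status_desc(status_id, new_desc):
    """Change the master status, then rewrite every denormalised copy."""
    statuses[status_id]["desc"] = new_desc
    updated = 0
    for doc in documents:
        for sub in doc["subdocs"]:
            if sub["status_id"] == status_id:
                sub["desc"] = new_desc
                updated += 1
    return updated
```

If stale descriptions are acceptable, as in the photo tag case, the second half of `update_status_desc` can simply be left out.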
to query / find just the sub array or part of it?
It is not currently possible to fetch only a part of array (unless using map/reduce, of course).
And given the 4mb limit
Where did you get this from? It's 16mb at the moment.
While it's true that schema design does take into account many factors, the need to denormalize data usually comes up somewhere. I tend to take advantage of denormalization in my apps that use MongoDB because I feel it lends itself well to storing denormalized data:
no additional column maintenance
support for hashes and arrays as field types (perfect for storing denormalized fields)
speedy, non-blocking writes make syncing data less expensive
document size growth only marginally affects performance up to limits (for the most part)
There are a few gems that help you manage denormalized data, including setting it up and keeping it in sync. If you're using Mongoid, you can try mongoid_alize. DISCLAIMER: I am the author and maintainer of mongoid_alize.