Creating new collections vs array properties - mongodb

Coming from a MySQL background, I've been questioning the some of the design patterns when working with Mongo. One question I keep asking myself is when should I create a new collection vs creating a property of an array type? My current situation goes as follows:
I have a collection of Users who all have at least 1 Inbox
Each inbox has 0 or more messages
Each message can have 0 or comments
My current structure looks like this:
{
username:"danramosd",
inboxes:[
{
name:"inbox1",
messages:[
{
message:"this is my message"
comments:[
{
comment:"this is a great message"
}
]
}
]
}
]
}
For simplicity I only listed 1 inbox, 1 message and 1 comment. Realistically though there could be many more.
An approach I believe that would work better is to use 4 collections:
Users - stores just the username
Inboxes - name of the inbox, along with the UID of User it belongs to
Messages - content of the message, along with the UID of inbox it belongs to
Comments - content of the comment, along with the UID of the message it belongs to.
So which one would be the better approach?

No one can help you with this question, because it is highly dependent on your application:
how many inboxes/messages/comments do you have on average
how often do you write/modify/delete these elements
how often do you read them
a lot of other things that I forgot to mention
When you are selecting one approach over another you are doing tradeofs.
If you store everything together (in one collection as your first case) you make it super easy to get all the things for a particular user. Taking apart the thing that most probably you do not need all the information at once, you at the same time makes it super hard to update some parts of the elements (try to write a query that will add a comment or remove the third comment). Even if this is easy - mongodb does not handle well growing documents because whenever you exceeds the padding factor it moves the document to another location (which is expensive) and increases the padding factor. Also keep in mind that this potentially can hit mongodb's limit on the size of the document.
It is always a good idea to read all mongodb use cases before trying to design any storage schema. Not surprisingly they have a comprehensive overview of your case as well.

Related

How to maximize existing number field in Cloud Firestore

I have several fields in one document that contain user records for mini games. After the user has played a few of them, I update the records in this document. I cannot be allowed to overwrite the existing record value with a smaller one. Therefore, I need to maximize them.
The solutions I have considered:
Transactions. This solution does not work for me, because it will not work without an Internet connection.
Cloud Functions. I can trigger the function when the document is updated or created. This solution works for me, but it complicates the logic in my application a lot.
Security Rules. I can prevent the document from being written if its new value is less than the old one. But this solution will only work well if you write one field at a time.
Ideally, I would like something like the following:
let data: [String: Any] = [
"game_id0": FieldValue.maximum(newRecord0),
"game_id1": FieldValue.maximum(newRecord1),
"game_id2": FieldValue.maximum(newRecord2),
]
let docRef = db.collection("user_records").document(documentId)
docRef.setData(data, merge: true)
But unfortunately FieldValue class only has methods: increment, arrayUnion, arrayRemove and delete.
In the description for the protocol, I found the maximum and minimum methods, but I doubt that this can be legally used.
Can anyone tell me any other feasible method?
UPD:
Let the following document be stored on the server:
{ "game_id": 13 }
The user plays the game from one device (which is offline) and scores 20 points.
Then the same user plays the same game from another device (which is online) and scores 22 points. An update request is sent. The server now stores the following information:
{ "game_id": 22 }
On the first device of the user, the Internet appears and the recording takes place. In this case, overwriting occurs and the document takes the following form:
{ "game_id": 20 }
That is, the previously collected user record is overwritten.
But I need the recording to occur only if the new value is greater than the current one. That is, the data after step two should not change.
If you can't use a transaction (which would normally be the right thing to do), then you have to use one of your other methods. I don't think you have any alternatives. You are probably going to have an easier time with Cloud Functions, and I don't think that's going to complicate things as much as you say. It should just be a few lines of code to check that any updated value should not be less than existing values.

Firestore query where document id not in array [duplicate]

I'm having a little trouble wrapping my head around how to best structure my (very simple) Firestore app. I have a set of users like this:
users: {
'A123': {
'name':'Adam'
},
'B234': {
'name':'Bella'
},
'C345': {
'name':'Charlie'
}
}
...and each user can 'like' or 'dislike' any number of other users (like Tinder).
I'd like to structure a "likes" table (or Firestore equivalent) so that I can list people who I haven't yet liked or disliked. My initial thought was to create a "likes" object within the user table with boolean values like this:
users: {
'A123': {
'name':'Adam',
'likedBy': {
'B234':true,
},
'disLikedBy': {
'C345':true
}
},
'B234': {
'name':'Bella'
},
'C345': {
'name':'Charlie'
}
}
That way if I am Charlie and I know my ID, I could list users that I haven't yet liked or disliked with:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
.where('dislikedBy.C345','==',false)
This doesn't work (everyone gets listed) so I suspect that my approach is wrong, especially the '==false' part. Could someone please point me in the right direction of how to structure this? As a bonus extra question, what happens if somebody changes their name? Do I need to change all of the embedded "likedBy" data? Or could I use a cloud function to achieve this?
Thanks!
There isn't a perfect solution for this problem, but there are alternatives you can do depending on what trade-offs you want.
The options: Overscan vs Underscan
Remember that Cloud Firestore only allows queries that scale independent of the total size of your dataset.
This can be really helpful in preventing you from building something that works in test with 10 documents, but blows up as soon as you go to production and become popular. Unfortunately, this type of problem doesn't fit that scalable pattern and the more profiles you have, and the more likes people create, the longer it takes to answer the query you want here.
The solution then is to find a one or more queries that scale and most closely represent what you want. There are 2 options I can think of that make trade-offs in different ways:
Overscan --> Do a broader query and then filter on the client-side
Underscan --> Do one or more narrower queries that might miss a few results.
Overscan
In the Overscan option, you're basically trading increased cost to get 100% accuracy.
Given your use-case, I imagine this might actually be your best option. Since the total number of profiles is likely orders of magnitude larger than the number of profiles an individual has liked, the increased cost of overscanning is probably inconsequential.
Simply select all profiles that match any other conditions you have, and then on the client side, filter out any that the user has already liked.
First, get all the profiles liked by the user:
var likedUsers = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
Then get all users, checking against the first list and discarding anything that matches.
var allUsers = firebase.firestore().collection('users').get()
Depending on the scale, you'll probably want to optimize the first step, e.g. every time the user likes someone, update an array in a single document for that user for everyone they have liked. This way you can simply get a single document for the first step.
var likedUsers = firebase.firestore().collection('likedUsers')
.doc('C345').get()
Since this query does scale by the size of the result set (by defining the result set to be the data set), Cloud Firestore can answer it without a bunch of hidden unscalable work. The unscalable part is left to you to optimize (with 2 examples above).
Underscan
In the Underscan option, you're basically trading accuracy to get a narrower (hence cheaper) set of results.
This method is more complex, so you probably only want to consider it if for some reason the liked to unliked ratio is not as I suspect in the Overscan option.
The basic idea is to exclude someone if you've definitely liked them, and accept the trade-off that you might also exclude someone you haven't yet liked - yes, basically a Bloom filter.
In each users profile store a map of true/false values from 0 to m (we'll get to what m is later), where everything is set to false initially.
When a user likes the profile, calculate the hash of the user's ID to insert into the Bloom filter and set all those bits in the map to true.
So let's say C345 hashes to 0110 if we used m = 4, then your map would look like:
likedBy: {
0: false,
1: true,
2: true,
3: false }
Now, to find people you definitely haven't liked, you need use the same concept to do a query against each bit in the map. For any bit 0 to m that your hash is true on, query for it to be false:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.1','==',false)
Etc. (This will get easier when we support OR queries in the future). Anyone who has a false value on a bit where your user's ID hashes to true definitely hasn't been liked by them.
Since it's unlikely you want to display ALL profiles, just enough to display a single page, you can probably randomly select a single one of the ID's hash bits that is true and just query against it. If you run out of profiles, just select another one that was true and restart.
Assuming most profiles are liked 500 or less times, you can keep the false positive ratio to ~20% or less using m = 1675.
There are handy online calculators to help you work out ratios of likes per profile, desired false positive ratio, and m, for example here.
Overscan - bonus
You'll quickly realize in the Overscan option that every time you run the query, the same profiles the user didn't like last time will be shown. I'm assuming you don't want that. Worse, all the ones the user liked will be early on in the query, meaning you'll end up having to skip them all the time and increase your costs.
There is an easy fix for that, use the method I describe on this question, Firestore: How to get random documents in a collection. This will enable you to pull random profiles from the set, giving you a more even distribution and reducing the chance of stumbling on lots of previously liked profiles.
Underscan - bonus
One problem I suspect you'll have with the Underscan option is really popular profiles. If someone is almost always liked, you might start exceeding the usefulness of a bloom filter if that profile has a size not reasonable to keep in a single document (you'll want m to be less than say 8000 to avoid running into per document index limits in Cloud Firestore).
For this problem, you want to combine the Overscan option just for these profiles. Using Cloud Functions, any profile that has more than x% of the map set to true gets a popular flag set to true. Overscan everyone on the popular flag and weave them into your results from the Underscan (remember to do the discard setup).

Firestore Rules: How to allow specific editing of someone else's record?

I have an app that has Users, and Jobs.
Each Job is owned by 1 User. Only that User can edit the job.
Any User can apply to any Job. When a User applies to a Job, I want to add that user to the job.appliedCandidates array. Basically the structure is this:
Jobs
- Job1
- owner = User1
- candidatesApplied
- User2
- User3
So User1 owns Job1. Let's say User4 comes along and applies to the job. Now I want to add him to candidatesApplied. But he doesn't have editing access to that job because he's not the owner!
And if I give him editing access, then he can change all the job data. Not what I want.
I'm pretty sure in Firestore you can do rules on a per-field basis, but this still doesn't solve my problem. If User4 can edit job1.candidatesApplied, that means he has access to delete the other users from the array!
I'm pretty sure that the array setup I've got going is not the way to go. One idea is to have "applicants" be a subcollection of a job, and allow any user to create a record in that subcollection, but not do anything else. But I'm not sure if this is right either.
How best should I handle this?
Given that you haven't set limits on how many applicants can apply for a job, let's just assume that the list of applicants can grow massively. If that's the case, then even modeling that data as a list in a single document is prone to error since there is a discrete max size to a document: 1 MB. And we're not even talking about security yet.
In order to avoid the max document size problem, the best way to deal with this is by putting each applicant in their own document. Whether or not that's a collection or subcollection is mostly irrelevant.
If you choose to store each applicant as a document in a collection, now you are not constrained to some arbitrary maximum of applicants, and it's easier to write rules for that. What those rules should be are outside of the scope of what you've proposed here so far. But flexible data modeling suggests that the subcollection approach is less prone to problems.

Database schema for a tinder like app

I have a database of million of Objects (simply say lot of objects). Everyday i will present to my users 3 selected objects, and like with tinder they can swipe left to say they don't like or swipe right to say they like it.
I select each objects based on their location (more closest to the user are selected first) and also based on few user settings.
I m under mongoDB.
now the problem, how to implement the database in the way it's can provide fastly everyday a selection of object to show to the end user (and skip all the object he already swipe).
Well, considering you have made your choice of using MongoDB, you will have to maintain multiple collections. One is your main collection, and you will have to maintain user specific collections which hold user data, say the document ids the user has swiped. Then, when you want to fetch data, you might want to do a setDifference aggregation. SetDifference does this:
Takes two sets and returns an array containing the elements that only
exist in the first set; i.e. performs a relative complement of the
second set relative to the first.
Now how performant this is would depend on the size of your sets and the overall scale.
EDIT
I agree with your comment that this is not a scalable solution.
Solution 2:
One solution I could think of is to use a graph based solution, like Neo4j. You could represent all your 1M objects and all your user objects as nodes and have relationships between users and objects that he has swiped. Your query would be to return a list of all objects the user is not connected to.
You cannot shard a graph, which brings up scaling challenges. Graph based solutions require that the entire graph be in memory. So the feasibility of this solution depends on you.
Solution 3:
Use MySQL. Have 2 tables, one being the objects table and the other being (uid-viewed_object) mapping. A join would solve your problem. Joins work well for the longest time, till you hit a scale. So I don't think is a bad starting point.
Solution 4:
Use Bloom filters. Your problem eventually boils down to a set membership problem. Give a set of ids, check if its part of another set. A Bloom filter is a probabilistic data structure which answers set membership. They are super small and super efficient. But ya, its probabilistic though, false negatives will never happen, but false positives can. So thats a trade off. Check out this for how its used : http://blog.vawter.com/2016/03/17/Using-Bloomfilters-to-Avoid-Repetition/
Ill update the answer if I can think of something else.

Blogs and Blog Comments Relationship in NoSQL

Going off an example in the accepted answer here:
Mongo DB relations between objects
For a blogging system, "Posts should be a collection. post author might be a separate collection, or simply a field within posts if only an email address. comments should be embedded objects within a post for performance."
If this is the case, does that mean that every time my app displays a blog post, I'm loading every single comment that was ever made on that post? What if there are 3,729 comments? Wouldn't this brutalize the database connection, SQL or NoSQL? Also there's the obvious scenario in which when I load a blog post, I want to show only the first 10 comments initially.
Document databases are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.
So in order to answer your question, what we really need to know is how you intend to use the data. Displaying comments associated with a post is a distinctly different scenario than displaying all comments from a particular author. Each one of those requirements will dictate a different design, as will supporting them both.
This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.
Adding more info:
There are a few "do" and "don'ts" when deciding on a strategy:
DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page is a classic example. First priority is to make sure that these are as efficient as possible. Make sure that your data model allows either A) loading those in either a single IO request or B) is cache-friendly
DONT: don't fall into the dreaded "N+1" trap. This pattern occurs when you data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with #3...
DO: Always cap (via the UX) the amount of data which you are willing to fetch. If the user has 3729 comments you obviously aren't going to fetch them all at once. Even it it was feasible from a database perspective, the user experience would be horrible. Thats why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure to the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get.
DO: Balance the Read and Write requirements. Some types of systems are read-heavy and you can assume that for each write there will be many reads (StackOverflow is a good example). So there it makes sense to make writes more expensive in order to gain benefits in read performance. For example, data denormalization and duplication. Other systems are evenly balanced or even write heavy and require other approaches
DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens all kinds of interesting optimization possibilities in the your data schema.
This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra)
Not sure if this answers you question, but anyhow you can throttle the amount of blog comments in two ways:
Load only the last 10 , or range of blog comments using $slice operator
db.blogs.find( {_id : someValue}, { comments: { $slice: -10 } } )
will return last 10 comments
db.blogs.find( {_id : someValue}, { comments: { $slice: [-10, 10] } } )
will return next 10 comments
Use capped array to save only the last n blog posts using capped arrays