I have a server storing 5,000 content documents. Let's say I have 1 million users who all query for 50 new documents at their own pace, until all of the content has been seen.
I want to make sure that each user only sees and interacts with the content once and never again, like Tinder.
My first thought was to tag each document with a list of the user-ids of the users who have seen it. However, this list would get really long - like a list of 1 million user-ids per document - and this sounds like it would really kill query performance.
Does anyone have a better idea of how I can return content to users just once and never again?
P.S. I am planning on building this out with MongoDB.
P.P.S. I also thought about making a list of 'document-ids-seen' and attaching it to the user's document, and then, with every query made by that user, filtering out results that match 'document-ids-seen'. But the same challenge applies here: the query length would grow linearly as the user keeps interacting and bringing in new content.
The solution depends on the exact meaning of "at their own pace".
Your second post suggests that the time schedule is up to the user, but she will be presented with the documents in an order determined by your application, e.g. getting news items in the order of the timestamp of news creation. In that case, your timestamp or auto-increment solution will work, and it has only a small impact on data volume and query complexity.
If, however, the user may also choose which documents to view, this won't work any more, as the documents already viewed may be scattered across the entire document set. A solution to handle this efficiently consists of two design ideas:
(a) Consider whether most users, at a given point in time, will have viewed a small or a large part of the entire document set. If only a small selection of documents is expected to be of interest to a particular user, then the count of documents the user has viewed will be rather small. (E.g. assume the documents are about IT and one user only wants to look at MongoDB docs, another mainly at Linux docs.) If all users will be interested in most or all of the documents, then the count of documents a particular user has not viewed will be small. (E.g. a set of news that everyone tries to follow.) Depending on which is the case, store only the small list of viewed/not-viewed document ids with each user, which will also simplify the query for the documents still to be viewed.
(b) With each user, don't store a list of single document ids (viewed or not viewed), but a list of intervals of such ids. E.g., if you store the ids of documents not yet viewed, and some documents get added to the database, then, when a user's record is next loaded, her highest interval gets updated from (someLowerId, formerHighestId) to (someLowerId, currentHighestId). When a user views a document, the interval containing its id gets split from (lowId, highId) into (lowId, viewedId - 1), (viewedId + 1, highId), where one or both of these intervals may be empty. Including or excluding intervals like these will also simplify the queries, as opposed to listing single ids.
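For illustration, here is a rough sketch of idea (b) in TypeScript with the MongoDB Node.js driver. The collection and field names (users, unseenRanges, documents, seq) are placeholders, not something from your question; it assumes each document carries a monotonically increasing seq number.

```typescript
import { MongoClient } from "mongodb";

interface Range { lo: number; hi: number }              // inclusive interval of unseen seq numbers
interface User  { _id: string; unseenRanges: Range[] }  // assumed user schema

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("app");

// Fetch the next batch of unseen documents for a user.
async function nextBatch(userId: string, batchSize = 50) {
  await client.connect(); // no-op if already connected
  const user = await db.collection<User>("users").findOne({ _id: userId });
  if (!user || user.unseenRanges.length === 0) return [];

  // Turn the intervals into an $or of range filters on the documents' seq field.
  const rangeFilters = user.unseenRanges.map(r => ({ seq: { $gte: r.lo, $lte: r.hi } }));
  return db.collection("documents")
    .find({ $or: rangeFilters })
    .sort({ seq: 1 })
    .limit(batchSize)
    .toArray();
}

// After a document with sequence number `seq` has been viewed, split the interval
// containing it: (lo, hi) -> (lo, seq - 1), (seq + 1, hi); empty pieces are dropped.
function markSeen(ranges: Range[], seq: number): Range[] {
  return ranges.flatMap(r => {
    if (seq < r.lo || seq > r.hi) return [r];                      // interval untouched
    const parts: Range[] = [];
    if (r.lo <= seq - 1) parts.push({ lo: r.lo, hi: seq - 1 });
    if (seq + 1 <= r.hi) parts.push({ lo: seq + 1, hi: r.hi });
    return parts;
  });
}
```

The result of markSeen would then be written back to the user document, e.g. with an updateOne that $sets unseenRanges.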
I just had the idea that I could avoid the many-to-many relationship of content-to-user interaction altogether if I put a timestamp on each document and only queried for documents after a particular timestamp 'X',
where 'X' could be stored in my 'users' table.
So when opening the app, I would sync my 'users' table, then issue queries for documents after timestamp 'X', and when the results are returned, I'd update my 'users' table again with my new timestamp 'X'.
Or 'X' doesn't even have to be a timestamp - it could just be an auto-incrementing id.
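For example, a rough sketch of this idea in TypeScript with the MongoDB Node.js driver (the seq field on documents and the lastSeenSeq field on the user are just placeholder names I'm using here):

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("app");

interface UserDoc { _id: string; lastSeenSeq?: number }  // 'X' lives on the user record

// Return the next page of documents this user has not been sent yet, then advance 'X'.
async function fetchNewDocuments(userId: string, pageSize = 50) {
  await client.connect(); // no-op if already connected

  const user = await db.collection<UserDoc>("users").findOne({ _id: userId });
  const lastSeenSeq = user?.lastSeenSeq ?? 0;

  const docs = await db.collection("documents")
    .find({ seq: { $gt: lastSeenSeq } })   // only documents after the cursor 'X'
    .sort({ seq: 1 })
    .limit(pageSize)
    .toArray();

  if (docs.length > 0) {
    await db.collection<UserDoc>("users").updateOne(
      { _id: userId },
      { $set: { lastSeenSeq: docs[docs.length - 1].seq } }  // move 'X' forward
    );
  }
  return docs;
}
```

The caveat is that this only hands out documents in seq order, which matches the "order determined by your application" case described in the answer above.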
TL;DR
I have created a Flutter Firestore posts application. I want to present the user with only new posts, which they haven't read yet.
How do I achieve that using a Firestore query?
The problem
Each time a user sees a post, their id is added to the post views field.
Next time the user opens the app, I want to present only posts they didn't read yet.
The problem is that an array-not-contains query is not supported. How do I achieve that functionality?
You're going to have a real hard time with this because Firestore can only give you documents where you know something about the contents of that document. That's how indexes work - by recording data present in the document. Indexes don't track data not present in a document because that's basically an infinite amount of data.
If you are trying to track documents seen by the user, you would think to mark the document as "seen" using a boolean per user, or to track the document ID somewhere. But as you can see, you can only query for documents that the user has seen, because that's the data present in the system.
What you can do is query for all documents, then query for all the documents the user has seen, then subtract the seen documents from all documents in order to get the unseen documents. But this probably doesn't scale in a way you'd like. (It's essentially the same problem with Firestore indexes not being able to surface documents without some known data present. Firestore won't do the equivalent of a SQL table scan, since that would be a lot of reads you'd have to pay for.)
You can kind of fake it by making sure there is a creation timestamp in each document, and recording for each user the timestamp of the most recently seen document. If you require that the user must view the documents in chronological order, then you can simply query for documents with a creation timestamp greater than the timestamp of the latest document seen by the user. This is really as good as it's going to get with Firestore, since you can't query for the absence of data.
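For reference, a minimal sketch of that timestamp approach with the Firebase web SDK in TypeScript (the posts collection, the createdAt field, and the per-user lastSeenAt value are assumed names; the same query shape applies in the Flutter cloud_firestore plugin):

```typescript
import { initializeApp } from "firebase/app";
import {
  getFirestore, collection, query, where, orderBy, limit, getDocs, Timestamp,
} from "firebase/firestore";

const db = getFirestore(initializeApp({ /* your firebase config */ }));

// Fetch posts created after the newest post this user has already seen.
// `lastSeenAt` would be read from, and written back to, the user's own document.
async function fetchUnreadPosts(lastSeenAt: Timestamp, pageSize = 20) {
  const q = query(
    collection(db, "posts"),
    where("createdAt", ">", lastSeenAt),   // only newer posts
    orderBy("createdAt", "asc"),           // range filter and first orderBy must use the same field
    limit(pageSize)
  );
  const snapshot = await getDocs(q);
  return snapshot.docs.map(d => ({ id: d.id, ...d.data() }));
}

// Usage for a brand-new user: fetchUnreadPosts(Timestamp.fromMillis(0));
```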
My application is used for creating production budgets for complex projects (construction, media productions etc.)
The structure of the budget is as follows:
The budget contains "sections",
the sections contain "accounts",
the accounts contain "subaccounts",
and the subaccounts contain line items.
Line items have a number of fields (units, rate, currency, tax etc.) and a calculated total.
Certain fields in line items may have alphanumeric codes which represent numeric values, which a user can use instead of a hard-coded number; e.g. the user can enter "=build-weeks" and define that with a formula that evaluates to, say, "7", which is then used in the calculation of a total.
Line items bubble up their totals, so a subaccount's total equals the sum of its line items,
an account's total equals the sum of its subaccounts,
a section's total equals the sum of its accounts' totals,
and the budget total is the sum of the section totals.
The question is how to aggregate this data into the documents comprising the budget.
Budgets may be fairly long, say 5,000 line items or more in total. Single accounts may have hundreds of line items.
Users will most likely look at all of the line items for a given account, so it occurred to me
to make individual documents for sections, accounts and subaccounts, and make the line items a map within a subaccount.
The main concern I have with this approach is that when the user changes, say, the exchange rate of a line item's currency, or changes the calculated value of a named value like "build-weeks", I will have to retrieve all the individual line items containing that currency or named value, recalculate the totals, and then bubble the changes up through the hierarchy.
This seems not that complicated if each line item is its own document: I can just search the collection for the presence of the code in question, recalculate the line item, and maybe use a cloud function to bubble up the changes.
But if all the line items are contained in an array of maps within each subaccount document,
it seems like it will be quite tedious to find and change them when necessary.
On the other hand, keeping these documents so small seems like a lot of document reads when somebody is reviewing a budget or, say, printing it. If somebody just clicks on a bunch of accounts, it might be hundreds of reads per click to retrieve all the line items, and hundreds or a thousand writes when somebody changes the value of an often-used named value like "build-weeks".
Or perhaps using Firestore to do these cascading calculations is the wrong approach? Should I just load a single complex budget document into my app, do all the calculations and updates on the client, and then write back the entire budget as a single document when the user presses "save budget"?
Does anybody have any thoughts on the obvious "right" answer to this? Or does it just depend on what I want to optimize for - Firestore costs, responsiveness of the app, complexity of code?
From my standpoint, there is no obvious answer to your problem, and indeed it does depend on what you want to optimize for.
However, there are a few points that you need to consider in your decision:
Documents in Firestore have a size limit of 1 MiB per document;
Documents in Firestore have a limit of 20,000 fields;
Queries are shallow, so you don't get data from subcollections on the same query;
For considerations 1 and 2, this means that if you choose to design your database as one big document containing everything: even though you said that your app will have lots of data, I doubt that it will exceed the limits mentioned, but still, do consider them. Also, ask how necessary it is to get all the data at once; this could cause performance and user battery/data usage issues (if you are making a mobile app).
For consideration 3, it means that you would have to make many reads if you choose to split the data for your sections into subdocuments; this will mean more cost to you but better performance for users.
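To make consideration 3 concrete, here is a hedged sketch with the Firebase web SDK in TypeScript. The path budgets/{budgetId}/accounts/{accountId}/subaccounts/{subaccountId}/lineItems and the codes field are assumptions for illustration: reading one subaccount's line items is its own shallow query, and finding every line item that references a named value such as "build-weeks" would need a collection-group query.

```typescript
import { initializeApp } from "firebase/app";
import {
  getFirestore, collection, collectionGroup, getDocs, query, where,
} from "firebase/firestore";

const db = getFirestore(initializeApp({ /* your firebase config */ }));

// Shallow read: line items of one subaccount only (each returned document is a billed read).
async function loadLineItems(budgetId: string, accountId: string, subaccountId: string) {
  const snap = await getDocs(
    collection(db, "budgets", budgetId, "accounts", accountId, "subaccounts", subaccountId, "lineItems")
  );
  return snap.docs.map(d => ({ id: d.id, ...d.data() }));
}

// Find every line item, in any budget, that uses a named value such as "build-weeks"
// (assumes each line item stores the codes it references in a `codes` array field,
//  and that a collection-group index exists for `lineItems`).
async function lineItemsUsingCode(code: string) {
  const snap = await getDocs(
    query(collectionGroup(db, "lineItems"), where("codes", "array-contains", code))
  );
  return snap.docs;
}
```

Every document returned here counts as a read, which is exactly the cost-versus-performance trade-off mentioned above.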
To make the right call on this problem, I suggest that you talk to possible users of your solution and understand the problem that you are trying to fix and what they expect of the app. Also, it might be interesting to take a look at the How to Structure Your Data and Maps, Arrays and Subcollections videos, as they explain in a more visual way how Firestore behaves, and it could help you anticipate problems that the approach you choose could cause.
Hope I was able to help with these considerations.
In my application the user gets to pick specific documents out of a list, for example: 1, 5, 8 from a list containing documents 1, 2, 3, 4, 5, 6, 7, 8, 9. When logged into the application the next time, I want to first fetch all of the chosen documents (with pagination, because the number of documents the user picked could be very high), and then start fetching the remaining documents as the user finishes viewing the picked documents by scrolling down the list.
As it turns out, the available Firestore querying methods are not capable of skipping specific documents.
My current idea:
Make single document references for the user-specific documents and fetch them.
Make single document references for the documents between the ranges of user-specific documents (from the example, those would be documents 2, 3, 4, 6 and 7).
After that start making 'big queries' for the remaining documents.
This looks like a working solution, but I'm sure that there is a better way to accomplish the goal, since what I've done is not asynchronous and is very slow. Help is appreciated!
Firestore doesn't have any way to exclude specific documents from queries. You may only include them using some existing field values. If you already know the documents to fetch, you can just get() them individually.
It sounds like you are already able to work around these requirements. I don't believe you have any alternatives.
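A sketch of that pattern with the Firebase web SDK in TypeScript (the posts collection name and the createdAt ordering field are assumptions): the hand-picked documents are fetched by ID in parallel with getDoc(), and the rest are paged with a normal cursor query, dropping the already-shown picks client-side.

```typescript
import { initializeApp } from "firebase/app";
import {
  getFirestore, doc, getDoc, collection, query, orderBy, startAfter, limit, getDocs,
  QueryDocumentSnapshot,
} from "firebase/firestore";

const db = getFirestore(initializeApp({ /* your firebase config */ }));

// 1) Fetch the user's picked documents by ID, in parallel.
async function fetchPickedDocs(pickedIds: string[]) {
  const snaps = await Promise.all(pickedIds.map(id => getDoc(doc(db, "posts", id))));
  return snaps.filter(s => s.exists()).map(s => ({ id: s.id, ...s.data() }));
}

// 2) Page through the remaining documents, skipping the picked ones client-side.
async function fetchRemainingPage(pickedIds: Set<string>, after?: QueryDocumentSnapshot, pageSize = 20) {
  const q = after
    ? query(collection(db, "posts"), orderBy("createdAt"), startAfter(after), limit(pageSize))
    : query(collection(db, "posts"), orderBy("createdAt"), limit(pageSize));
  const snap = await getDocs(q);
  return {
    docs: snap.docs.filter(d => !pickedIds.has(d.id)),  // drop already-shown picks
    cursor: snap.docs[snap.docs.length - 1],            // pass back in as `after` for the next page
  };
}
```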
I've got a collection Users, with fields like name, password, email etc.
I've also got a collection Groups; every group has its members - an array of users.
How should i design my database? I clearly see 2 ways of doing so:
Way 1 (MySQL-like): every user has an _id, so I just put it into the members array and so be it.
Way 2: copy the whole user document inside, plus add some fields.
On the MongoDB site they say that duplicate data is nothing to worry about because of the low price of storage. They also say that we should avoid JOINs when reading data.
"duplicate data is nothing to worry about"
This is something to worry about when it comes to updating. Suppose you have user details nested and duplicated in every document. What happens when a user changes their name? You'll have to update every instance of that user in every document.
Be careful to differentiate between data and entities. A user is an entity, think carefully before duplicating entities as fixing it later could be hard work.
Personally, I'd split them unless you find yourself in a situation where performance is too slow to do the joining in real time. Then, and only then, consider merging.
Actually, the answer to this question depends on what kind of screens you are designing and what kind of queries you are going to make to fetch data. Let's go through the pros and cons of each option, which will help you weigh them.
Way 1 :- Putting array of user_ids in group collection
Pros
1) If you have a screen which shows the group details of a particular group and a list of all members (user_ids) belonging to that group, then one query can fetch all the details needed for this screen, and it would be faster too.
Cons
1) If on the group detail screen you have to show the details of users along with the group details, then since MongoDB does not provide any joins, you would be fetching user details in a separate query and joining both on the client side (see the sketch below). This can have an impact on performance.
2) If you have a screen which shows user details and all the groups he/she belongs to, then you will be searching for the user_id in the members array of the group collection. If you are expecting the number of members in a group to be very high (millions), then searching inside the array can lead to a huge performance impact.
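For illustration, a small sketch of Way 1 in TypeScript with the MongoDB Node.js driver; the group document shape shown in the comment and the client-side join are just one way of doing what Cons 1 describes.

```typescript
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("app");

// Way 1: the group stores only its members' _ids, e.g.
// { _id: ..., name: "mongodb-fans", members: [ObjectId("..."), ObjectId("...")] }

async function getGroupWithMembers(groupId: ObjectId) {
  await client.connect(); // no-op if already connected

  const group = await db.collection("groups").findOne({ _id: groupId });
  if (!group) return null;

  // Client-side "join": a second query to pull the member documents.
  const members = await db.collection("users")
    .find({ _id: { $in: group.members } })
    .project({ password: 0 })        // don't ship password hashes to the client
    .toArray();

  return { ...group, members };
}
```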
Way 2 :- Copy the user document inside the group collection
Duplicating data is not a problem in MongoDB, but you should have a really good reason for it. The rule of thumb should be: duplicate data when the relationship is 1:few, not 1:many.
Pros
1) This approach will save you from joining the group and user collections at the client side, as one query can fetch all the details of a group along with its users.
Cons
1) Suppose you have a million groups and user_id_1 belongs to 100,000 of them; then whenever you have an update on user_id_1, you will have to update 100,000 documents (see the sketch below). This can again lead to a huge performance impact.
2) Also, if a large number of users subscribe to one group, then the document size of this group keeps increasing. In MongoDB the maximum BSON document size is 16 megabytes, which means you cannot have a document greater than 16 MB, so you cannot add users to a group indefinitely. This will limit your functionality.
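To illustrate Cons 1, a hedged sketch in TypeScript (MongoDB 3.6+ arrayFilters; the embedded members.userId / members.name field names are assumptions) of what a simple profile change turns into when user documents are copied into every group:

```typescript
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("app");

// Way 2: every group embeds copies of its members' user documents, e.g.
// { _id: ..., name: "mongodb-fans", members: [{ userId: ..., name: "Alice", email: "..." }, ...] }

// A user renames themselves: every embedded copy in every group has to be rewritten.
async function renameUserEverywhere(userId: ObjectId, newName: string) {
  await client.connect(); // no-op if already connected
  return db.collection("groups").updateMany(
    { "members.userId": userId },                  // touches every group the user belongs to
    { $set: { "members.$[m].name": newName } },
    { arrayFilters: [{ "m.userId": userId }] }     // only the matching embedded entry
  );
}
```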
Way 3 :- Embed group details in user collection
Pros
1) One query can fetch user details along with all the details of all the groups this user belongs to.
2) If you are expecting a user to belong to only a few groups, then you will have a small groups array in each user document. This will not exceed the 16 MB limit.
Cons
1) If you are expecting that a user can subscribe to a very large number of groups (millions), then the user document may exceed the 16 MB limit.
2) Also, if you have very frequent updates to group details, then you will have to make the same update in many user documents.
You can also go through the following link to get more details about data model design :-
https://docs.mongodb.org/manual/core/data-model-design/
It depends on how you will use data in your application.
If you have more than 2 groups and you will have to search for a user across all of the groups, then embedding the user document within the group (way 2) is not a good idea, so in this case I suggest using way 1.
If you have only 2 groups, or the user's group will already be known to your application when doing the query, then use way 2.
I guess that separating the data is the way to go, since it will be easier to update, get and delete user data directly.
I have a basic question about where I should embed a collection of followers/following in a MongoDB database. It makes sense to have an embedded collection of following in a user object, but does it also make sense to embed the converse followers collection as well? That would mean I would have to update the embedded list in the profile record of both:
following embedded list in the follower
And the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth embedding in both entities, or should I just update #1 (embed following in the follower's profile) and put an index on it so that I can query for the converse - followers - across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers shows up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, and so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split the following/followed-by relationship out into a separate collection of records, each having two fields, e.g., { _id : <follower id>, oid : <followed id> }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
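As a concrete (and slightly adapted) sketch of that edge collection in TypeScript with the MongoDB Node.js driver: instead of overloading _id, it uses a generated _id plus explicit follower/followee fields, each with its own index, so one user can appear in many edges; the follows collection name and the counter fields are assumptions.

```typescript
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("app");

// One document per edge: "follower follows followee".
interface FollowEdge { _id?: ObjectId; follower: ObjectId; followee: ObjectId }

async function setup() {
  await client.connect();
  // One index per query direction; the unique compound index also prevents duplicate edges.
  await db.collection<FollowEdge>("follows").createIndex({ follower: 1, followee: 1 }, { unique: true });
  await db.collection<FollowEdge>("follows").createIndex({ followee: 1 });
}

// Follow: insert one edge, then bump denormalized counters on both profiles.
async function follow(follower: ObjectId, followee: ObjectId) {
  await db.collection<FollowEdge>("follows").insertOne({ follower, followee });
  await db.collection("users").updateOne({ _id: follower }, { $inc: { followingCount: 1 } });
  await db.collection("users").updateOne({ _id: followee }, { $inc: { followerCount: 1 } });
}

// "Who am I following?" and "Who's following me?" each hit one index.
const iFollow     = (me: ObjectId) => db.collection<FollowEdge>("follows").find({ follower: me }).toArray();
const myFollowers = (me: ObjectId) => db.collection<FollowEdge>("follows").find({ followee: me }).toArray();
```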
You can embed them all but create a new document when you reach a certain limit. For example, you can limit a document to an array of 500 elements and then create a new one. Also, if it is about the feed, you don't have to keep publications once they have been viewed; you can replace them with new ones, so you don't have to create new documents for additional publication storage.
To maintain performance, I'd advise you to make a collection where you store whom each user is following, which you can query with the $graphLookup aggregation stage. Being followed can reach millions of followers, so you have to store whom people follow instead of who follows them.
I hope it helps you.
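A hedged sketch of that $graphLookup idea in TypeScript, assuming an edge collection follows with follower/followee fields (the same shape as in the answer above): starting from one user's own "following" edges, it walks one more hop to find whom those people follow, e.g. for suggestions.

```typescript
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("app");

// Whom do the people I follow also follow? (one extra hop over the "following" edges)
async function followingOfFollowing(userId: ObjectId) {
  await client.connect(); // no-op if already connected
  return db.collection("follows").aggregate([
    { $match: { follower: userId } },             // my own "following" edges
    {
      $graphLookup: {
        from: "follows",                          // walk the same edge collection
        startWith: "$followee",                   // start from the people I follow
        connectFromField: "followee",
        connectToField: "follower",               // ...and look up *their* edges
        as: "secondDegree",
        maxDepth: 0,                              // 0 = just one extra hop
      },
    },
  ]).toArray();
}
```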