Fetching large documents from MongoDB

I have a collection in MongoDB that stores customer activities such as product_view and added_to_cart, each with a productId. I need this data to display products to the customer on their next visit.
Right now I am thinking of storing all of a customer's data in a single document, with customer_id as the key and the activities grouped into arrays by type (product_view activities in a product_view array, and so on). This would be fast to fetch, since all of a customer's data would sit under one key, but my concern is that the document will keep growing indefinitely. Moreover, I may only need to check the last 50-100 activities of a customer, and even for that I would have to fetch the entire document.
What is the best way to store this data? Requests for it will be very frequent. How can I keep response times low?

Your question answers itself. Store customer activity in a separate collection, with each activity document referencing the customerId. Whenever a customer visits you know the customerId, so you can apply filter/aggregate operations to get whatever you need.
This way you can also fetch a customer's activities page by page.
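As a rough illustration of that layout, here is a minimal in-memory sketch in Python. It is not MongoDB driver code: the list stands in for the activities collection, and the field names (customerId, type, productId, ts) and the helper names are assumptions made for the example. In a real deployment the same filter, sort and limit would be expressed as a query against the collection.

```python
from datetime import datetime, timedelta

# Hypothetical activity record: one small document per event.
def make_activity(customer_id, kind, product_id, ts):
    return {"customerId": customer_id, "type": kind, "productId": product_id, "ts": ts}

def recent_activities(activities, customer_id, limit=50, before=None):
    """Return up to `limit` newest activities of one customer, older than `before`."""
    rows = [a for a in activities
            if a["customerId"] == customer_id and (before is None or a["ts"] < before)]
    rows.sort(key=lambda a: a["ts"], reverse=True)  # newest first
    return rows[:limit]

base = datetime(2024, 1, 1)
activities = [make_activity("c1", "product_view", f"p{i}", base + timedelta(minutes=i))
              for i in range(120)]
activities.append(make_activity("c2", "added_to_cart", "p7", base))

# First page: the 50 newest activities of customer c1.
page1 = recent_activities(activities, "c1", limit=50)
# Next page: continue from the timestamp of the last row already shown.
page2 = recent_activities(activities, "c1", limit=50, before=page1[-1]["ts"])
```

Each request touches only one page of small documents, so the cost of a fetch no longer depends on how much history the customer has accumulated.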

Related

Query documents in one collection that aren't referenced in another collection with Firestore

I have a firestore DB where I'm storing polls in one collection and responses to polls in another collection. I want to get a document from the poll collection that isn't referenced in the responses collection for a particular user.
The naive approach would be to get all of the poll documents and all of the responses filtered by user ID then filter the polls on the client side. The problem is that there may be quite a few polls and responses so those queries would have to pull down a lot of data.
So my question is, is there a way to structure my data so that I can query for polls that haven't been completed by a user without having to pull down the collections in their entirety? Or more generally, is there some pattern to use when you need to query for documents in one collection that aren't referenced by another?
The documents in each of the collections look something like this:
Polls:
{
  question: string;
  answers: Answer[];
}
Responses:
{
  userId: string;
  pollId: string;
  answerId: string;
}
Any help would be much appreciated!
Queries in Firestore can only return documents from one collection (or from all collections with the same name) and can only contain conditions on the data that they actually return.
Since there's no way to filter based on a condition in some other documents, you'll need to include the information that you want to filter on in the polls documents.
For example, you could include a completionCount field in each poll document, initially set to 0, and update it on every poll completion. With that in place, the query becomes a simple query on the completionCount field of the polls collection.
For a specific user, I'd actually add all polls to their profile document and remove each one from there once the user has completed it. Duplicating data is usually the easiest (and sometimes the only) way to implement use cases such as this.
If you're worried about having to add each new poll to every user profile when it is created, you can instead query all polls by their creation timestamp when you next load a user profile and perform the sync at that moment:
load user profile,
check when they were last active,
query for new polls,
add them to user profile.
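The four steps above can be sketched roughly as follows. This is plain Python over in-memory dictionaries, not Firestore SDK code; the field names (lastActive, pendingPolls, createdAt), the helper names and the integer timestamps are assumptions made for the example.

```python
def sync_new_polls(profile, polls, now):
    """Add polls created since the user's last activity to their pending list."""
    new_ids = [p["id"] for p in polls
               if p["createdAt"] > profile["lastActive"]
               and p["id"] not in profile["pendingPolls"]]
    profile["pendingPolls"].extend(new_ids)
    profile["lastActive"] = now
    return new_ids

def complete_poll(profile, poll_id):
    """Remove a poll from the pending list once the user has answered it."""
    profile["pendingPolls"].remove(poll_id)

polls = [{"id": "a", "createdAt": 1}, {"id": "b", "createdAt": 5}, {"id": "c", "createdAt": 9}]
profile = {"lastActive": 3, "pendingPolls": ["a"]}

added = sync_new_polls(profile, polls, now=10)  # picks up polls b and c
complete_poll(profile, "a")                     # user answers poll a
```

After this, the polls the user still has to answer can be read straight from the profile, without consulting the responses collection.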

Firestore: Pagination with Cursor

I am trying to paginate data with Firestore, ordering on fields where duplicate values are expected. If a page boundary falls among those duplicates, I expect pagination not to work correctly.
I can work around this by using StartAfter with the document ID, which is always unique.
One way to do this is to pass the ID of the last document in the REST API request to the server. This takes two steps: fetching the DocumentSnapshot by its document ID, then constructing the query from it:
// fetchSnapshot is a placeholder for loading the DocumentSnapshot by its ID
var lastSnapshot = fetchSnapshot(id);
var query = citiesRef.OrderBy("Population").StartAfter(lastSnapshot);
The other approach is to persist the document ID in the document at creation time. This would take two steps every time a document is created: one to create it and another to update it immediately with the generated ID (I don't see a way to persist the document ID during creation itself):
citiesRef.OrderBy("Population").StartAfter(lastId);
Which of these is the better approach: fetching the DocumentSnapshot without persisting the ID in the document, or persisting the document ID up front (at the cost of two operations) and using it as the key for StartAfter?
Decided to go with Option 1, instead of persisting Document ID in the Document itself as in Option 2.
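To illustrate why a unique tie-breaker matters when the ordering field has duplicates, here is a small in-memory sketch in Python. It does not use the Firestore API; the page_after helper and the (population, id) cursor tuple are assumptions made for the example.

```python
def page_after(docs, page_size, cursor=None):
    """Return the next page ordered by (population, id) and the cursor that follows it."""
    ordered = sorted(docs, key=lambda d: (d["population"], d["id"]))
    if cursor is not None:
        # Strictly after the last document seen; the id breaks ties on population.
        ordered = [d for d in ordered if (d["population"], d["id"]) > cursor]
    page = ordered[:page_size]
    next_cursor = (page[-1]["population"], page[-1]["id"]) if page else None
    return page, next_cursor

cities = [
    {"id": "a", "population": 100},
    {"id": "b", "population": 200},
    {"id": "c", "population": 200},
    {"id": "d", "population": 200},
    {"id": "e", "population": 300},
]

first, cur = page_after(cities, 2)
second, cur = page_after(cities, 2, cur)
third, cur = page_after(cities, 2, cur)
```

With a cursor on population alone, resuming after the value 200 would skip documents c and d. Including the ID in the cursor lets the second page pick up exactly where the first one stopped.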

Performance difference between storing the asset as subdocument vs single document in Mongoose

I have an API for synchronizing contacts from the user's phone to our database. The controller essentially iterates the data sent in the request body and if it passes validation a new contact is saved:
const contact = new Contact({ phoneNumber, name, surname, owner });
await contact.save();
Having a DB with 100 IOPS and considering the average user has around 300 contacts, when the server is busy this API takes a lot of time.
Since the frontend client is made in a way that a contact ID is necessary for other operations (edit, delete), I was thinking about changing the data structure to subdocuments, and instead of saving each Contact as a separate document, the idea is to save one document with many contacts inside:
const userContacts = new mongoose.Schema({
  // the id of the contacts' owner
  owner: { type: mongoose.Schema.Types.ObjectId },
  contacts: [new mongoose.Schema({
    name: { type: String },
    phone: { type: String }
  })]
});
This way I have to do just one save. But since Mongo has to generate an ID for each subdocument, is this really that much faster than the original approach?
Summary
This really depends on your exact usage scenarios:
Are contacts often updated?
What is the max / average number of contacts per user?
Are they ever partially loaded, or are they always fetched all together?
But for a fairly common collection such as contacts, I would not recommend storing them in subdocuments.
Instead you should be able to use insertMany for your initial sync scenario.
Explanation
Storing contacts as subdocuments makes the bulk write easier, but it makes querying and updating contacts slower and more awkward than with regular documents.
For example, if you have 100 contacts and want to view and edit 1 of them, the full 100 contacts need to be loaded. You can make the change via a partial update using $set, so the update itself will be OK. But when you add a new contact, you have to add a new contact subdocument to your Contacts document. This makes it a growing document, meaning your database will suffer from fragmentation, which can slow things down a lot (see this answer).
You will have to use aggregate with $project or $unwind to search through contacts in MongoDB. If you want to apply a specific sort order, this too would have to be done via aggregate or in code.
Matching via projection can also lead to problems with duplicate contacts being difficult to find.
And this won't scale. What if you get users with 1000s of contacts later? Then this single document will grow large and querying it will become very slow.
Alternatives
If the contacts to sync number in the hundreds, you might get away with splitting them into groups of ~50-100 and calling insertMany for each batch.
If they grow into the thousands, then I would suggest uploading all contacts, saving them as JSON / CSV files to disk, then slowly processing these in the background in batches.
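The batching step could be sketched like this in Python. The batches helper is a hypothetical name, and the batch size of 100 is taken from the range suggested above. Each yielded slice would then be passed to one bulk insert call instead of saving contacts one at a time.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical payload: 250 validated contacts from one sync request.
contacts = [{"name": f"n{i}", "phoneNumber": str(i)} for i in range(250)]

# One bulk write per batch instead of one write per contact.
sizes = [len(batch) for batch in batches(contacts, 100)]
```

For 250 contacts this turns 250 individual writes into 3 bulk operations.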

Entity Framework Plus clear cache for individual queries

I am using FromCache() method whenever I need to retrieve data from the SQL database. There will be a lot of unique queries executed in a single method since it is getting data based on userID. The data associated with the userID will be updated through a separate process which will also trigger an event in the method that controls retrieving. When the data for a specific user is updated, I want to expire the cache for that user so that the next query on that userID will get the most recent data.
I see that EF plus has the option to ExpireTag. Would it be feasible to create a single tag for each userID and then use that to expire the cache?
Yes, a tag can be used much like a cache key.
The best option is probably to use 2 tags:
Users
[UniqueUserId]
The Users tag will expire all caches related to "users".
The [UniqueUserId] tag will expire all caches related to that specific user.
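To show the idea behind the two tags, here is a minimal tag-based cache in Python. It is an illustration of the pattern only, not the Entity Framework Plus API: the TaggedCache class, its method names and the "user:1" style tag names are all assumptions made for the example.

```python
class TaggedCache:
    """Tiny in-memory cache where each entry can be expired through any of its tags."""

    def __init__(self):
        self._values = {}
        self._keys_by_tag = {}

    def get_or_add(self, key, loader, tags):
        # Run the loader only on a cache miss, then register the key under its tags.
        if key not in self._values:
            self._values[key] = loader()
            for tag in tags:
                self._keys_by_tag.setdefault(tag, set()).add(key)
        return self._values[key]

    def expire_tag(self, tag):
        # Drop every cached entry that was stored with this tag.
        for key in self._keys_by_tag.pop(tag, set()):
            self._values.pop(key, None)

calls = []

def load(uid):
    calls.append(uid)  # record each simulated database hit
    return f"data-{uid}-v{len(calls)}"

cache = TaggedCache()
cache.get_or_add("user:1", lambda: load(1), tags=["Users", "user:1"])
cache.get_or_add("user:2", lambda: load(2), tags=["Users", "user:2"])

cache.expire_tag("user:1")  # data for user 1 changed
a = cache.get_or_add("user:1", lambda: load(1), tags=["Users", "user:1"])  # reloaded
b = cache.get_or_add("user:2", lambda: load(2), tags=["Users", "user:2"])  # still cached
```

Expiring the per-user tag reloads only that user's data, while expiring the shared Users tag would clear every user's entries at once.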

MongoDB storing user-specific data on shared collection objects

I'm designing an application that processes RSS feeds using MongoDB. Currently my collections are as follows:
Entry
fields: content, feed_id, title, publish_date, url
Feed
fields: description, title, url
User
fields: email_address
subscriptions (embedded collection; fields: feed_id, tags)
A user can subscribe to feeds which are linked from the embedded subscription collection. From the subscriptions I can get a list of all the feeds a user should see and also the corresponding entries.
How should I store entry status information (isRead, isStarred, etc.) that is specific to a user? When a user views an entry I need to record isRead = 1. Two common queries I need to be able to perform are:
Find all entries for a specific feed where isRead = 0 or no status exists currently
For a specific user, mark all entries prior to a publish date with isRead = 1 (this could be hundreds or even thousands of records so it must be efficient)
Hmm, this is a tricky one!
It makes sense to me to store a record for entries that are unread, and delete them when they're read. I'm basing this on the assumption that there will be more read posts than unread for each individual user, so you might as well not have documents for all of those already-read entries sitting around in your DB forever. It also makes it easier to not have to worry about the 16MB document size limit if you're not having to drag around years of history with you everywhere.
For starred entries, I would simply add an array of Entry ObjectIds to User. No need to make these subscription-specific; it'll be much easier to pull a list of items a User has starred that way.
For unread entries, it's a little more complex. I'd still add it as an array, but to satisfy your requirement of being able to quickly mark as-read entries before a specific date, I would denormalize and save the publish-date alongside the Entry ObjectId, in a new 'UnreadEntry' document.
User
fields: email_address, starred_entries[]
subscriptions (embedded collection; fields: feed_id, tags, unread_entries[])
UnreadEntry
fields: id is Entry ObjectId, publish_date
You need to be conscious of the document limit, but 16MB is one hell of a lot of unread entries/feeds, so be realistic about whether that's a limit you really need to worry about. (If it is, it should be fairly straightforward to break out User.subscriptions to its own document.)
Both of your queries now become fairly easy to write:
All entries for a specific feed that are unread:
user.subscriptions.find(feedID).unread_entries
Mark all entries prior to a publish date read:
user.subscriptions.find(feedID).unread_entries.where(:publish_date.lte => my_date).delete_all
And, of course, if you simply need to mark all entries in a feed as read, that's very easy:
user.subscriptions.find(feedID).unread_entries.delete_all
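For readers who want to see the data shape spelled out, here is a rough in-memory version of the same three operations in Python. It is not MongoDB or ODM code; the dictionary layout and the helper names are assumptions made for the example, following the field names proposed above.

```python
from datetime import date

def unread_for_feed(user, feed_id):
    """All unread entries of one subscription."""
    return user["subscriptions"][feed_id]["unread_entries"]

def mark_read_before(user, feed_id, cutoff):
    """Delete unread markers published on or before `cutoff`; return how many were removed."""
    sub = user["subscriptions"][feed_id]
    before = len(sub["unread_entries"])
    sub["unread_entries"] = [e for e in sub["unread_entries"] if e["publish_date"] > cutoff]
    return before - len(sub["unread_entries"])

def mark_all_read(user, feed_id):
    """Delete every unread marker of one subscription."""
    user["subscriptions"][feed_id]["unread_entries"] = []

user = {
    "email_address": "reader@example.com",
    "starred_entries": [],
    "subscriptions": {
        "feed1": {"tags": [], "unread_entries": [
            {"id": "e1", "publish_date": date(2024, 1, 1)},
            {"id": "e2", "publish_date": date(2024, 2, 1)},
            {"id": "e3", "publish_date": date(2024, 3, 1)},
        ]},
    },
}

removed = mark_read_before(user, "feed1", date(2024, 2, 1))
remaining = [e["id"] for e in unread_for_feed(user, "feed1")]
```

Because the publish date is stored next to each unread marker, the cutoff filter needs no lookup in the Entry collection.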