In our application we have to display the most popular articles, but the same strategy could be required for trending, hot, relevant, etc. data.
We have two collections, [Articles] and [Comments], where each article has multiple comments.
We want to display most popular [Articles].
Popularity is computed based on some complex formula, but let's assume that the formula sums an article's total view count and its total [Comments] count. We also assume that when the formula computes the popularity of one article, it takes all [Articles] into account so it can also rank the article among the others.
As you can see, users are constantly adding views and comments. Every day different articles can be among the most popular ones.
The problem is as follows: how do we display up-to-date data without spamming the database with queries?
I was thinking about a scheduled cron job (in our backend app) that would update [Article] popularity, e.g. every hour, and then save it in the article itself. This way, when users visit the page, nothing would have to be computed and we could just work with the saved data.
There might also be a way to build a query that is fast enough to compute popularity on demand, but I don't know if that's possible.
What would be the best strategy: compute the data in the background and keep it up to date, build an advanced query, or something different?
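A minimal sketch of what such an hourly job could compute, assuming the simplified formula above (field names like views and articleId are illustrative, not from any real schema):

```python
# Sketch of the hourly recompute: popularity = views + comment count,
# then rank all articles against each other (rank 1 = most popular).
def recompute_popularity(articles, comments):
    """articles: list of {"_id", "views"}; comments: list of {"articleId"}."""
    counts = {}
    for c in comments:
        counts[c["articleId"]] = counts.get(c["articleId"], 0) + 1

    for a in articles:
        a["popularity"] = a["views"] + counts.get(a["_id"], 0)

    # Assign ranks among all articles, most popular first.
    ordered = sorted(articles, key=lambda a: -a["popularity"])
    for rank, a in enumerate(ordered, start=1):
        a["rank"] = rank
    return articles
```

In production the cron job would write the popularity and rank fields back with a bulk update, and the page would simply sort on the precomputed field, which is the "work on saved data" idea from the question.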
My application is used for creating production budgets for complex projects (construction, media productions, etc.).
The structure of the budget is as follows:
The budget contains "sections",
the sections contain "accounts",
the accounts contain "subaccounts",
the subaccounts contain line items.
Line items have a number of fields (units, rate, currency, tax, etc.) and a calculated total.
Or perhaps using Firestore to do these cascading calculations is the wrong approach? Should I just load a single complex budget document into my app, do all the calculations and updates on the client, and then write back the entire budget as a single document when the user presses "save budget"?
Certain fields in line items may have alphanumeric codes which represent numeric values, which a user can use instead of a hard-coded number; e.g. a user can enter "=build-weeks" and define that with a formula that evaluates to, say, "7", which is then used in the calculation of a total.
Line items bubble up their totals: subaccounts have a total equal to the sum of their line items,
accounts' totals equal the total of their subaccounts,
sections' totals equal the sum of their accounts' totals,
and the budget total is the total of the section totals.
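The bubble-up described above can be sketched as nested sums over the hierarchy. This is only an in-memory illustration with made-up field names (units, rate, lineItems, etc.), not a Firestore implementation:

```python
# Each level's total is the sum of its children's totals, starting from
# the line items. Named values resolve codes like "=build-weeks".
def line_item_total(item, named_values):
    units = item["units"]
    if isinstance(units, str) and units.startswith("="):
        units = named_values[units[1:]]  # e.g. "build-weeks" -> 7
    return units * item["rate"]

def budget_total(budget, named_values):
    total = 0
    for section in budget["sections"]:
        section_total = 0
        for account in section["accounts"]:
            account_total = 0
            for sub in account["subaccounts"]:
                sub["total"] = sum(
                    line_item_total(i, named_values) for i in sub["lineItems"]
                )
                account_total += sub["total"]
            account["total"] = account_total
            section_total += account_total
        section["total"] = section_total
        total += section_total
    budget["total"] = total
    return total
```

Whatever document layout is chosen, this is the recalculation that has to run somewhere (client or Cloud Function) when a line item or named value changes.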
The question is how to aggregate this data into the documents comprising the budget.
Budgets may be fairly long, say 5,000 line items or more in total. Single accounts may have hundreds of line items.
Users will most likely look at all of the line items for a given account, so it occurred to me to make individual documents for sections, accounts, and subaccounts, and make line items a map within a subaccount.
The main concern I have with this approach is that when the user changes, say, the exchange rate of a line item's currency, or changes the calculated value of a named value like "build-weeks", I will have to retrieve all the individual line items containing that currency or named value, recalculate their totals, and then bubble the changes up through the hierarchy.
This seems not that complicated if each line item is its own document: I can just search the collection for the presence of the code in question, recalculate the line item, and maybe use a Cloud Function to bubble up the changes.
But if all the line items are contained in an array of maps within each subaccount, it seems like it will be quite tedious to find and change them when necessary.
On the other hand, keeping these documents so small seems like a lot of document reads when somebody is reviewing a budget or, say, printing it. If somebody just clicks on a bunch of accounts, it might be hundreds of reads per click to retrieve all the line items, and hundreds or a thousand writes when somebody changes the value of an often-used named value like "build-weeks".
Does anybody have any thoughts on the obvious "right" answer to this? Or does it just depend on what I want to optimize for: Firestore costs, responsiveness of the app, or complexity of the code?
From my standpoint, there is no obvious answer to your problem, and indeed it does depend on what you want to optimize for.
However, there are a few points that you need to consider in your decision:
Documents in Firestore have a limit of 1 MB per document;
Documents in Firestore have a limit of 20,000 fields;
Queries are shallow, so you don't get data from subcollections in the same query;
Regarding considerations 1 and 2: if you choose to design your database as one big document containing everything, then even though you said your app will have lots of data, I doubt it will exceed the limits mentioned; still, do consider them. Also, how necessary is it to get all the data at once? Fetching everything could create performance and battery/data-usage issues (if you are making a mobile app).
Regarding consideration 3: if you choose to divide your sections into subdocuments, you will have to make many reads to get all the data. This will mean more cost to you, but better performance for users.
To make the right call on this problem, I suggest you talk to potential users of your solution and understand the problem you are trying to fix and what they expect of the app. Also, it might be interesting to take a look at the How to Structure Your Data and Maps, Arrays and Subcollections videos, as they explain in a more visual way how Firestore behaves, and they could help you anticipate problems that your chosen approach could cause.
Hope I was able to help with these considerations.
I want to perform a sorted query on a MongoDB collection, find a specific item, and retrieve its index in the result set.
For example: find all students, sort by grade in descending order, and get the position of student 'John Connor', i.e. John Connor's ranking in the class.
The only way that comes to mind is querying the whole collection and programmatically searching the result set. That doesn't look efficient at all, especially because it may cause memory issues as collections grow.
For large data sets, one answer to this class of problem is to calculate the ranking periodically - once a day, once an hour, whatever fits your use case - and then store that pre-calculated ranking. Stack Overflow does exactly that: if we were to look at your ranking on Stack Overflow - https://stackexchange.com/leagues/1/year/stackoverflow/2014-01-01/873641#873641 - this ranking is not being calculated dynamically. Your question has been upvoted, so your reputation has increased by 5, but it's not reflected yet in your ranking.
A variation on this theme is to force a re-ranking whenever anyone's grade changes. This is useful when grade changes are infrequent: by running the calculation right after each change, you minimize the time the ranking is stale, with minimal cost compared to constantly re-ranking.
Regarding the specifics of MongoDB: yes, you will need to walk the sorted ranking list in order to calculate it.
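That said, if only one student's rank is needed at a time, there is a cheaper alternative to walking the whole sorted list: count the documents with a strictly higher grade. A sketch of the idea in pure Python (field names are illustrative):

```python
# Rank = 1 + number of students with a strictly higher grade ("standard
# competition" ranking, so tied grades share a rank).
def rank_of(students, name):
    grade = next(s["grade"] for s in students if s["name"] == name)
    return 1 + sum(1 for s in students if s["grade"] > grade)

# Rough PyMongo equivalent (sketch, assumes an index on "grade"):
#   grade = db.students.find_one({"name": "John Connor"})["grade"]
#   rank = db.students.count_documents({"grade": {"$gt": grade}}) + 1
```

With an index on the grade field, the count can be answered from the index without loading every document into application memory.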
Is it a good idea to create per-day collections for data on a given day? (We could start with per-day and then move to per-hour if there is too much data.) Is there a limit on the number of collections we can create in MongoDB, and does a large number of collections have any adverse effect on performance, or is it an overhead for MongoDB to maintain so many collections?
To give you more context, the data will be more like Facebook feeds, and only the latest data (say, the last week or month) is important to us. Making per-day collections keeps the number of documents low and would probably result in fast access. Even if we need old data, we can fall back to the older collections. Does this make sense, or am I heading in the wrong direction?
What you actually need is to archive the old data. I would suggest you take a look at this thread on the mongodb-user mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
The last post there, from Michael Dirolf (10gen), says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
So I guess you can stay with a single collection, and good indexes will do the work.
Anyhow, if the collection gets too big you can always run a manual archive process.
Yes, there is a limit to the number of collections you can create. From the MongoDB documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even so, it would take something like 60 years to hit that limit.
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users had feeds updated in a given week, you would be in a bit of a tight spot: it's not easy or trivial to query across collections.
I would recommend instead making one collection to store the data and simply moving data out periodically, as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.
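Such a periodic move is essentially a partition of the live data at a cutoff date. A pure-Python sketch of the logic (the createdAt field name is illustrative; against MongoDB this would be an insert_many into the archive collection followed by a delete_many({"createdAt": {"$lt": cutoff}}) on the live one):

```python
from datetime import datetime

# Partition the live documents: everything older than the cutoff moves
# to the archive, everything else stays live.
def archive_old(live_docs, archive_docs, cutoff):
    keep = [d for d in live_docs if d["createdAt"] >= cutoff]
    move = [d for d in live_docs if d["createdAt"] < cutoff]
    archive_docs.extend(move)
    return keep, archive_docs
```

A cron job running this weekly or monthly keeps the hot collection small while still leaving the old data queryable in the archive collection.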
Creating a collection is not much overhead, but it is more overhead than creating a new document inside an existing collection.
There is a limit on the number of collections you can create: "http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces"
In my view, making new collections won't make any performance difference, because RAM caches only the data you actually query - in your case, the recent feeds etc.
But having per-day/hour collections will help you archive old data very easily.
I've got two collections: one with ~7,600,000 documents containing information about available trips, and one with ~5,000 documents containing information about hotels, with region, city, and country data. The trips collection has a field with the id of a certain hotel.
My problem is that I have to query both collections to get the information about a certain trip: the location information from the hotels collection, and other information like price, number of people, etc. from the trips collection.
I've read about the map-reduce strategy for merging two collections, but I think it won't fit my case, because it'll create only 5,000 documents if I link them using the hotel id? Is that right?
Another approach is to embed the hotel information in the trips collection, but I'm afraid of having to update the hotel information in that case.
Please give me some advice and tell me which approach would be best.
You have many options. It's all about deciding where to "join" the data. The options:
Join on the front end. Maybe bring back all trips first and then use AJAX calls to lazily load the hotel information. (Assuming a web application). Point being, two calls might not be the worst thing!
Use map/reduce in Mongo to output the data as you want it. It won't work in real time, but it will give you the right results. It wouldn't be limited to 5,000 documents. You could start with the bigger trip collection and bring in what you need. It's very flexible.
Embed the hotel information. As a note, you only want to embed hotel information if it's not changing all that often. If it changes constantly, I would consider leaving things as is.
For a lot of the work I do with Mongo, I've found that two calls isn't so bad - especially when dealing with fast changing data.
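The "join on the front end" in option 1 amounts to fetching the trips, batch-fetching the referenced hotels (a single {"_id": {"$in": ids}} query in MongoDB), and stitching them together in memory. A pure-Python sketch with illustrative field names:

```python
# Application-level join: index the small hotels collection by _id, then
# attach each trip's hotel document under a "hotel" key.
def join_trips_with_hotels(trips, hotels):
    hotel_by_id = {h["_id"]: h for h in hotels}
    return [dict(trip, hotel=hotel_by_id.get(trip["hotelId"])) for trip in trips]
```

Since the hotels collection is tiny (~5,000 documents) it can even be cached in application memory, making the second "call" effectively free.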
I am building an e-learning app that shows student activities as a timeline. Should I embed the activities in the user collection, or create a separate collection with a userId?
Constraints:
One-to-many relationship.
User activities are detailed and numerous.
For 90% of the time, we only need to see one user at a time; the other case is where a supervisor (teacher) needs to see a summary of the activities of users (maybe another collection?).
I haven't thought about the use case of searching activities to find students; maybe I'll have a use for this later on? (E.g. see who finished some particular activity first? But that changes the relationship to many-to-many and is a completely different question.)
I have found different schemas for the related problem in these two questions:
MongoDB schema design -- Choose two collection approach or embedded document recommends trying to embed as much as possible.
MongoDB schema for storing user location history warns not to bloat a collection, because querying elements nested deep below might be hard, especially if you're going to use lists.
Both of those articles are right and both are wrong.
To embed or not to embed? This is always the key question, and it comes down to your needs: querying, storage, and even your working set.
At the end of the day we can only give you pointers; no one can actually tell you which is best.
However, considering the size of an activity feed, I personally would not embed it, since it could easily grow past 16 MB (per user). For the speed and power of querying, though, you could aggregate, say, the last 20 activities of a user and embed those into the user's document (since the last 20 are normally what is queried the most).
But then, embedding an aggregate depends: sharding can take care of querying huge horizontally scaled collections, and using the right queries means you may not gain any real benefit from embedding, while potentially making your life harder by having to maintain the indexes, storage, and queries required to keep that subdocument up to date.
As for embedding to the point of death: a lot of MongoDB's querying at the moment relies on one or two levels of embedding, which is why it can get hard to maintain, say, 12 nested levels; at that point you start to see questions here and on the Google group about how to maintain such a huge document (the answer is: client-side, if you really want to).
For 90% of the time, we only need to see one user at a time; the other case is where a supervisor (teacher) needs to see a summary of the activities of users (maybe another collection?)
Considering this, I would house an aggregate on the user, which means a user can see their own or another user's activity singly, with one round trip.
However, considering that a teacher would most likely need paged results across all users, I would house a separate activities collection and query on that for them. Paging an aggregate of subdocuments requires several queries, and in this case it is better to just do it this way.
Hopefully that should get you started.
You should not embed activities in the student document.
The reason I'm pretty confident of this is the following statements:
"User activities are detailed and numerous"
"showing student activities as a timeline"
"teacher needs to see an summary of the activities of users"
It is bad practice to design a schema with ever-growing documents, so having a student document that keeps growing every time they complete or add another activity is a recipe for poor performance.
If you want to sort student's activities, it's a lot simpler if each is a separate document in an activity collection than if it's an array within a student document.
When you need to query activities across multiple students, having all activities in a single collection makes it trivial, while having activities embedded in student documents makes it difficult (you will most likely need the aggregation framework, which will make it slower).
You also say you might need in the future to "see who finished some particular activity first? But that changes the relationship to be Many to many and is a completely different question" - this is not the case. You do not need to treat this as a many-to-many relationship: you can still store multiple activities associated with a single user, then query for all records matching activity "X", sorting by time finished (or whatever), and see which student has the lowest time.
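To make that last point concrete: with a separate activities collection, "who finished activity X first" is just a filter plus a sort (in MongoDB, roughly db.activities.find({"activityId": x}).sort("finishedAt", 1)). A pure-Python sketch with illustrative field names:

```python
# Filter to one activity, then order by finish time ascending; the first
# element is the student who finished it first.
def fastest_finishers(activities, activity_id):
    matching = [a for a in activities if a["activityId"] == activity_id]
    return sorted(matching, key=lambda a: a["finishedAt"])
```

The same collection serves the timeline view (filter by userId, sort by time), so no schema change is needed for either use case.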