MongoDB aggregate for Dashboard

I want to show data from MongoDB on a dashboard. I implemented this by applying an aggregation pipeline.
I am constantly receiving the "Query Targeting: Scanned Objects / Returned has gone above 1000" alert. How do I solve this alert? The methods I thought of are as follows.
1. Remove the aggregation from the dashboard: if the aggregated data is needed, send a query at that time to obtain it.
2. Separate the aggregation and send queries from business logic: divide the data obtained at once through the aggregation into multiple queries, then combine the results.
If there is a better way, or a commonly used approach, I would like to know about it.

I am constantly receiving the "Query Targeting: Scanned Objects / Returned has gone above 1000" alert. How do I solve this alert?
What, specifically, are you trying to solve here?
The Query Targeting metric (and associated alert) provides general information regarding the efficiency of the workload against the cluster. It can help with identifying potential problems, most notably when relevant indexes are missing. More information about the metric, and the actions you can take in response, is described in the Atlas documentation.
That said, the metric itself is not perfect. The fact that the targeting ratio is high enough to trigger an alert does not necessarily mean that there is a problem or that any particular action needs to be taken. Particularly notable here is that aggregation operations can cause misleading targeting ratios depending on what types of transformations the pipeline is applying. So the existence of the alert indicates there may be some improvements that could be pursued, but it does not guarantee that there are. You can certainly take a look at the workload using the strategies described in that documentation to determine if any actions like index creation are needed in your specific situation.
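As a minimal sketch of that kind of check (the collection and field names "orders" and "status" here are assumptions, not taken from your dashboard), you can compare how many documents a pipeline examines with how many it returns, and add an index when the gap is large:

// Show execution statistics for the pipeline; compare
// executionStats.totalDocsExamined with the number of documents returned.
db.orders.explain("executionStats").aggregate([
  { $match: { status: "active" } },
  { $group: { _id: "$status", total: { $sum: 1 } } }
])

// If totalDocsExamined is far larger than the number of documents the
// $match stage keeps, an index on the matched field usually brings the
// Scanned Objects / Returned ratio back down.
db.orders.createIndex({ status: 1 })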
The two approaches that you mention in the question could be considered, but they don't directly address the alert itself. Certainly, if these are heavy aggregations that aren't needed for the application to function, then there may be good reason to reduce their frequency. But if they are needed by the application and are structured to be reasonably efficient, then I would not recommend drastic adjustments just to avoid triggering the alert. Rather, it may be that the default query targeting alert threshold is too low for your particular use case and workload, and you could consider raising it instead.

Related

Performance Implications of Accessing Single MongoDB Document vs Different MongoDB Documents in The Same Collection

Say I have a MongoDB Document that contains within itself a list.
This list gets altered a lot and there's no real reason why it couldn't have its own collection and each of the items became a document.
Would there be any performance implications of the former? I've got an inkling that document read/writes are going to be blocked while any given connection tries to read it, but the same wouldn't be true for accessing different documents in the same collection.
I find that these questions are effectively impossible to 'answer' here on Stack Overflow. Not only is there not really a 'right' answer, but it is impossible to get enough context from the question to frame a response that appropriately factors in the items that are most important for you to consider in your specific situation. Nonetheless, here are some thoughts that come to mind that may help point you in the right direction.
Performance is obviously an important consideration here, so it's good to have it in mind as you think through the design. Even within the single realm of performance there are various aspects. For example, would it be acceptable for the source document and the associated secondary documents in another collection to be out of sync? If not, and you had to pursue a route such as using transactions to keep them aligned, then that may be a much bigger performance hit overall and not worth pursuing.
As broad as performance is, it is also just a single consideration here. What about usability? Are you able to succinctly express the type of modifications that you would be doing to the array using MongoDB's query language? What about retrieving the data, would you always pull the information back as a single logical document? If so, then that would imply needing to use $lookup very frequently. Even doing so via a view may be cumbersome and could be both a usability as well as performance consideration. Indeed, an overreliance on $lookup can be considered an antipattern.
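For a rough sense of what that would mean (the collection and field names "parents", "items", and "parentId" below are placeholders, not from your question), reassembling the logical document would look something like this:

// Pull the externalised array items back into their parent document.
db.parents.aggregate([
  { $match: { _id: parentId } },   // parentId obtained elsewhere
  { $lookup: {
      from: "items",
      localField: "_id",
      foreignField: "parentId",
      as: "items"
  } }
])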
What does it mean when you say that the list gets "altered" a lot? Are you inserting new information, or updating existing entries? There has been a 16MB size limit for individual documents for a long time in MongoDB, so they generally recommend avoiding unbounded arrays. Indeed processing them can be costly in various ways depending on some specific factors.
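If you do keep the array embedded, one commonly used safeguard (a technique not mentioned in your question; the names below are placeholders) is to cap its length on every update so it cannot grow without bound:

// $push with $each and $slice keeps only the newest 100 entries.
db.parents.updateOne(
  { _id: parentId },   // parentId obtained elsewhere
  { $push: { list: { $each: [newItem], $slice: -100 } } }
)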
Also, where does your inkling about concurrency behavior come from? There is a FAQ on concurrency here which helps outline some of the expected behavior for various operations and their locking. Often (with any system) it can be most appropriate to build out an environment that appropriately represents your end state and stress test it directly. That often gives a good general sense for how the approach would work in your situation without having to become an expert in the particulars of how the database (or tool in general) works.
You can see that even in this short response, the "recommendation" fluctuates back and forth. Ultimately this question is about a trade-off which we are not in a good position to answer for you. Hopefully this response gives you some things to think about while doing so.

MongoDB watch single document [scalability]

This is MongoDB's api:
db.foo.watch([{$match: {"bar.baz": "qux" }}])
Let's say that collection foo contains millions of documents. The arguments passed into watch indicate that for every single document that gets updated the system will filter the ones that $match the query (but it will be triggered behind the scenes with any document change).
The problem is that as my application scales, my listeners will also scale and my intuition is that I will end up having n^2 complexity with this approach.
I think that as I add more listeners, database performance will deteriorate due to changes to documents that are not part of the $match query. There are other ways to deal with this, (web sockets & rooms) but before prematurely optimizing the system, I would like to know if my intuition is correct.
Actual Question:
Can I attach a listener to a single document, such that watch's performance isn't affected by sibling documents?
When I do collection.watch([$matchQuery]), does the MongoDB driver listen to all documents and then filter out the relevant ones? (This is what I am trying to avoid.)
The code collection.watch([$matchQuery]) actually means watching the change stream for that collection, rather than watching the collection directly.
As far as I know, there is no way to add a listener to a single document. Since I do not know of one, I will give you a couple of tips on how to avoid scalability problems with the approach you have chosen. Your code is using change streams, and that should not cause problems unless you open too many of them.
There are two ways to accomplish this by watching the entire collection from a single external process, neither of which should degrade database performance.
If you use change streams, open only a single change stream whose pipeline checks for all the conditions you need to filter on over time. The common mistake is opening many change streams for single-document filtering tasks, and that is when people run into problems.
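As a sketch of that single-stream approach (the second condition and the "updateLookup" option are illustrative additions, not from your code), you can fold all the filtering into one pipeline:

// One change stream for the whole collection; the $match stage filters
// change events server-side so only relevant ones reach the application.
var watchCursor = db.foo.watch(
  [ { $match: {
        operationType: "update",
        $or: [
          { "fullDocument.bar.baz": "qux" },
          { "fullDocument.owner": "someUserId" }   // hypothetical extra condition
        ]
    } } ],
  { fullDocument: "updateLookup" }   // include the full document on updates
)

while (!watchCursor.isClosed()) {
  if (watchCursor.hasNext()) {
    printjson(watchCursor.next())
  }
}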
The simpler way, since you mentioned Atlas, is to use Triggers. You can use a match expression in the trigger configuration so that the trigger only fires for operations on your collection when the match expression evaluates to true. As noted in the documentation, with the expression below the trigger function will not execute unless the field status is updated to "blocked"; many other match expressions are possible:
{
  "updateDescription.updatedFields": {
    "status": "blocked"
  }
}
I hope this helps. If not, I can keep digging. I think with change streams or Triggers, you are ok if you want to write a bit of code. :)

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard Mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
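For example, supplying one of the human-friendly values mentioned above is just a matter of setting _id explicitly on insert (the collection name is the one from your question; the values are illustrative):

// A human-friendly incrementing _id and a short random-string _id.
db.Articles.insertOne({ _id: 1, title: "First post" })
db.Articles.insertOne({ _id: "t3oSKd9q", title: "Second post" })

// Reusing an existing _id is rejected by the unique index on _id.
db.Articles.insertOne({ _id: 1, title: "Clash" })   // raises a duplicate key error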
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
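For instance, the embedded creation time is directly accessible in the shell, and sorting on _id gives a rough chronological order (assuming the documents have auto-generated ObjectId _ids):

// Extract the creation time from a document's ObjectId.
var doc = db.Articles.findOne()
doc._id.getTimestamp()          // returns an ISODate

// Newest documents first, using only the default _id index.
db.Articles.find().sort({ _id: -1 })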
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
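For illustration, the two strategies above translate into shard-key choices roughly like this (the database, collection, and field names are assumptions, and sharding is assumed to already be enabled on these databases):

// Hashed shard key: spreads writes evenly across shards (observatory-style load).
sh.shardCollection("telemetry.events", { _id: "hashed" })

// Ranged shard key on userId: keeps one user's documents together on one shard.
sh.shardCollection("socialApp.posts", { userId: 1 })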
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front: idempotency. For example, it makes it possible to tell whether something worked or not after a crash. This method works well when combined with retry logic.
Let me explain the reason people might consider retry logic:
Inter-app communication can sometimes fail for various reasons (especially in a microservice architecture). The app becomes more resilient and self-healing if it is coded to retry rather than give up right away. This rides over odd blips that might occur, without the consumer ever being affected.
For example, when dealing with Mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the "OK" is never received. The app assumes it didn't work, so it may end up retrying the same data and storing it twice, or worse, it just blows up.
Creating the ID up front is an easy, low-overhead way to help deal with retry logic. Of course, one could think of other schemes too.
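A minimal sketch of that idea, assuming an "orders" collection: generate the _id before sending anything, then make the write an upsert so retrying it after a lost acknowledgement cannot create a second document.

// Generated client-side, before the first attempt.
var id = new ObjectId()

// Safe to retry: the filter matches the pre-generated _id, so repeating the
// same call finds the document already inserted and changes nothing.
db.orders.updateOne(
  { _id: id },
  { $setOnInsert: { item: "widget", total: 42 } },
  { upsert: true }
)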
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can also simplify your indexes: no extra index is needed, since the default index on _id is sufficient.
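A small sketch of that pattern, with an assumed collection name and field: the date string is the _id, so range queries on it use the default _id index.

// One stats document per day, keyed by a fixed-format date string.
db.dailyStats.insertOne({ _id: "2024-01-15", pageViews: 1024 })

// A whole month in one range query, no extra index needed.
db.dailyStats.find({ _id: { $gte: "2024-01-01", $lte: "2024-01-31" } })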
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses, so that the ID shown in the URL is much shorter.
I'll use an example: I created a property management tool that had multiple collections. For simplicity, some fields were duplicated, for example the payment. When I needed to update those records, it had to happen simultaneously across all the collections they appeared in, so I would assign them a custom payment ID; when a delete or query action is performed, it targets every instance of that payment database-wide.

ElasticSearch Completion Suggester

In ElasticSearch, I'm using the completion suggester (docs here) with payloads that are very similar to the documents I'm inserting.
My question is - should I be doing this, or only inserting the IDs into the payloads and performing a follow-up Multi-GET to retrieve the real results? I'd much prefer the latter, but if the former is more performant (even if it takes more memory), I'll stick to that.
The completion suggester was implemented with speed as a major factor, to be used in autocomplete fields as you type, etc. By increasing the payload you will naturally increase the JSON response sizes and hence slow down the whole process.
However, I don't think there is a single right way; it's a toolset to be adapted to your specific requirements. If your current solution is more performant, and you can scale to manage the memory requirements, then it sounds like the right solution for you.

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between MongoDB and the client in case the client doesn't need the user information at all, just the offers. Am I wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
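A sketch of that two-query approach, projecting only the _id and using the collection names from your question:

// First query returns only the user _ids, keeping the transfer small.
var userIds = db.user.find({ state: 1 }, { _id: 1 }).map(function (u) { return u._id })

// Second query uses the materialised list of ids.
var offers = db.offer.find({ user: { $in: userIds } })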
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used the tag 'join' in the question. The idea is that I can make the first query more complex, using a bunch of fields and regexes, without storing user data in other collections except as references. In these cases I always have to perform two consecutive queries, but transferring the results of the first query is not necessary, neither for me nor for MongoDB itself. I just want to understand whether it can be done now, whether it will be possible in the future, or whether it cannot be implemented for some technical reason.
As far as I understand it, there is no immediate hurry to make this possible. Also, the way cursors are currently coded means this would be quite a big change to the way they work and are defined, a change big enough to possibly break implementations for other people. It is similar to the question of whether to make safe writes the default for inserts and updates in all future drivers: it is recognised that safe should be the default, but changing it would break things for people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all; however, since most networks are provisioned with high traffic in mind and the traffic is cheap, there hasn't been enough demand to support chained queries server-side in the cursor.
However, subselects (which this basically is: selecting a set of rows based upon a sub-selection of previous rows) have come up on mongodb-user a couple of times, and there might even be a JIRA ticket for it somewhere; if not, it might be useful to create one.
As for doing it right now: there is no way.