MongoDB large one-time query load on production system - mongodb

I have a MongoDB database holding tens of millions of documents.
Let's say I want to query a single value out of each document (see image below: the target key under the 0 key under the references key),
so it's a third-level nested key, and only if referenceType equals "CopiedFrom" (the references level doesn't exist in all documents).
About 10M documents will match this condition, and this is a one-time query.
The DBA in my org tells me this database is transactional (not for reporting) and serves many clients in production; hence, a query like the one I'm asking for will put great load on the system and compromise production response times.
I don't have much experience with MongoDB and cannot evaluate this claim (besides the fact that it's absurd to have historical data you cannot effectively access).
Is he right, or is he exaggerating?
Knowing this can help me deal with his claim and get the data I need.
Thanks!

Your use case is addressed by adding dedicated hidden nodes to the replica set for analytics queries. See here for example.
The DBA is generally correct in that an expensive analytical query is unsuitable for executing against servers that serve transactional workloads.
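For reference, here is a minimal sketch of the extraction the question describes. The document shape is an assumption, since the screenshot is not reproduced here: references is taken to be an array whose first element carries referenceType and target. QUERY_FILTER and PROJECTION are plain dictionaries in the form a MongoDB driver's find call accepts; extract_target is a hypothetical helper that pulls the value out of each returned document. Where the query runs (for example on a hidden analytics member) is a deployment decision and is not shown.

```python
# Assumed document shape (not confirmed by the question):
# {"_id": ..., "references": [{"referenceType": "CopiedFrom", "target": "..."}]}

# Filter: only documents whose first reference is of type "CopiedFrom".
# Documents without a "references" field simply do not match.
QUERY_FILTER = {"references.0.referenceType": "CopiedFrom"}

# Projection: keep only the target field of each reference, drop everything else.
PROJECTION = {"_id": 0, "references.target": 1}


def extract_target(doc):
    """Return references[0].target if the first reference is 'CopiedFrom', else None."""
    refs = doc.get("references") or []
    if not refs:
        return None
    first = refs[0]
    if first.get("referenceType") != "CopiedFrom":
        return None
    return first.get("target")


# Small in-memory stand-in for the documents a cursor would yield.
sample_docs = [
    {"references": [{"referenceType": "CopiedFrom", "target": "abc"}]},
    {"references": [{"referenceType": "LinkedTo", "target": "xyz"}]},
    {"title": "no references at all"},
]
targets = [t for t in map(extract_target, sample_docs) if t is not None]
print(targets)
```

Projecting only the needed field keeps the amount of data shipped to the client small, but the server still has to examine every candidate document unless a suitable index exists, which is the load the DBA is worried about.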

Related

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query against a collection, filtering by one field. I thought I could speed up the query by splitting the collection into many separate collections based on that field, with each collection's name containing the value I filtered on in the previous approach. In practice I could then drop the filter from the query, because I would only need to pick the right collection and return the documents in it as the response. But this way documents will be stored redundantly: a document that was previously stored only once might now be stored in several collections. Is this approach worth following? I use Heroku as my cloud provider. By increasing the number of dynos, it is easy to serve more user requests. As far as I know, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course, an index exists for that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and their associated indexes in a number of files. It will need to load these files as the collection is accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 secondary indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each of those collections also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
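The file arithmetic above can be written down as a small sketch. The function name is made up for illustration; it only encodes the counting used in this answer (one data file, one _id index file, and one file per secondary index for each collection).

```python
def wiredtiger_file_count(collections, secondary_indexes_per_collection):
    """Files needed, following the counting in the text: per collection,
    one data file, one _id index file, and one file per secondary index."""
    files_per_collection = 1 + 1 + secondary_indexes_per_collection
    return collections * files_per_collection


single = wiredtiger_file_count(1, 5)    # one big collection
split = wiredtiger_file_count(100, 5)   # the same data split 100 ways
print(single, split)
```

The count grows linearly with the number of collections, which is why a very large number of collections can run into the open file limit mentioned earlier.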
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.

MongoDB - Can I shard all new db's (Created by application) automatically?

My team will deploy a new version of our app (it captures social media posts, hashtags, etc.). It creates a different DB for each user, and we may have thousands of collections in each DB. I read all the MongoDB sharding documentation and saw that I can only shard one collection or one DB at a time. Am I missing something?
We will start this new version fresh, without any databases, and we will grow from 0 again (for now we have 23k users), but we expect this number to grow really quickly (100,000+ by the end of the year).
My question is: do I really need a sharded cluster? (My test setup has 3 shards with 3 microshards, 3 config servers and 2 mongos.) For now, in production, I have one large server doing all the hard work, but I don't want to keep scaling up; I think horizontal scaling is the best choice.
Can I shard all my databases automatically, or do I really need to do it one by one, going through the shard key procedure and so on?
Thanks in advance
You are reading correctly. What you intend to do is so far away from what any sensible person would do that MongoDB doesn't offer any tools to support this. If you really want to go with this WTF solution, your application will be responsible to set up sharding for each collection it creates. This forces you to give administration permission to the application (despite what any security guides recommend).
"Will you really need a sharded cluster" - that depends on how much data you will have and how often you query it with what kind of query. But it is unlikely to work anyway, because your sharded cluster will have to manage (100,000 databases * 1,000 collections) = a hundred million collections. MongoDB is not designed for scaling in that direction. The cluster will likely be so busy with bookkeeping that you won't really see any notable performance gain.
It is also questionable if clustering would even theoretically make sense. Clustering is usually only useful when you have very large collections. But in your scenario where your data is so heavily fragmented into a million collections, each individual collection is unlikely to be very large.
If you really want to go this route, it might in fact be a better solution to separate the databases physically by assigning each user to a database server.
Or you could just build a database architecture like a normal team would with one database for all users and one collection per type of document. You would then speed up lookups by creating a compound index on user and whatever criteria you used to tell which database a document belonged to. This index might also be a good shard key.

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
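To illustrate the point about creation time: the first 4 bytes of an ObjectId are a big-endian Unix timestamp in seconds, so the creation time can be read straight from the hex string. The sketch below uses only the standard library; objectid_creation_time is a hypothetical helper name, and the sample id is just an example value.

```python
from datetime import datetime, timezone


def objectid_creation_time(oid_hex):
    """Decode the creation time embedded in a 24-character hex ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(oid_hex[:8], 16)  # first 4 bytes: seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_creation_time("507f1f77bcf86cd799439011")
print(created.isoformat())
```

This is also why exposing an ObjectId exposes the creation date: anyone who sees the id can run the same decoding.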
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
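As a rough illustration of the "long enough to prevent collisions" point, here is a sketch that generates short random ids and estimates the collision risk. It is not how nanoid or mongoose-generate-unique-key work internally; it is a stand-in built on Python's secrets module, and the alphabet, function names, and default length are assumptions. The estimate uses the standard birthday-bound approximation.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols (assumed alphabet)


def random_id(length=8):
    """Generate a short random id from a cryptographically strong source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(num_ids, length, alphabet_size=len(ALPHABET)):
    """Birthday-bound approximation: p ≈ 1 - exp(-n(n-1) / (2N)),
    where N is the number of possible ids of the given length."""
    space = alphabet_size ** length
    return -math.expm1(-num_ids * (num_ids - 1) / (2 * space))


print(random_id())
print(collision_probability(1_000_000, 8))   # risk with a million 8-char ids
print(collision_probability(1_000_000, 12))  # much lower with 12 chars
```

Even with a low estimated probability, the unique index on _id remains the actual safeguard: an insert with a colliding id fails, and the application can generate a new id and try again.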
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
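The difference between sharding by creation time and sharding by user can be seen in a toy model. This is only a sketch: it does not reproduce MongoDB's chunks or balancer, the four-shard setup and both routing functions are made up for illustration, and the data is synthetic. It counts which shards would be touched when queries hit only the most recent documents.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def shard_by_time(doc, t_min, t_max):
    """Range partitioning on creation time: equal-width time slices per shard."""
    span = (t_max - t_min) / NUM_SHARDS
    return min(int((doc["created"] - t_min) / span), NUM_SHARDS - 1)


def shard_by_user(doc):
    """Hash partitioning on userId: a stable hash modulo the shard count."""
    digest = hashlib.md5(doc["userId"].encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# 1000 synthetic documents created at t = 0..999 by 50 hypothetical users.
docs = [{"created": t, "userId": f"user{t % 50}"} for t in range(1000)]
recent = [d for d in docs if d["created"] >= 900]  # "most recent" reads

time_load = Counter(shard_by_time(d, 0, 1000) for d in recent)
user_load = Counter(shard_by_user(d) for d in recent)
one_user = {shard_by_user(d) for d in docs if d["userId"] == "user7"}

print("recent reads, time-sharded:", dict(time_load))
print("recent reads, user-sharded:", dict(user_load))
print("shards holding user7's documents:", one_user)
```

In this model every recent document lands on the last time-range shard, while user-based routing spreads the same reads over several shards and still keeps each user's documents together on one shard.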
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app will be more resilient and self-healing if it is coded to re-try rather than give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
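The re-try scenario above can be sketched without a real database. FakeCollection below is an in-memory stand-in (not a driver API) that enforces unique _id values and can simulate a lost acknowledgement; insert_with_retry and DuplicateKey are hypothetical names. Because the id is generated up front, a duplicate-key error on re-try tells the client that its earlier write actually went through.

```python
import uuid


class DuplicateKey(Exception):
    """Raised when a document with the same _id already exists."""


class FakeCollection:
    """In-memory stand-in for a collection with a unique _id index."""

    def __init__(self):
        self.docs = {}
        self.drop_next_ack = False  # set to True to simulate a lost "OK"

    def insert_one(self, doc):
        if doc["_id"] in self.docs:
            raise DuplicateKey(doc["_id"])
        self.docs[doc["_id"]] = dict(doc)
        if self.drop_next_ack:
            self.drop_next_ack = False
            raise ConnectionError("write applied, but the acknowledgement was lost")


def insert_with_retry(collection, doc, attempts=3):
    """Insert a document whose _id was chosen by the client, re-trying on network errors."""
    for _ in range(attempts):
        try:
            collection.insert_one(doc)
            return "inserted"
        except DuplicateKey:
            # Same client-generated id already present: the earlier attempt succeeded.
            return "already stored"
        except ConnectionError:
            continue  # outcome unknown, so try again with the SAME _id
    raise RuntimeError("gave up after retries")


orders = FakeCollection()
orders.drop_next_ack = True  # the first response will be lost
order = {"_id": str(uuid.uuid4()), "item": "book"}
result = insert_with_retry(orders, order)
print(result, len(orders.docs))
```

Had the id been assigned by the server on each attempt instead, the re-try would have stored a second copy of the same data.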
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that using this approach can simplify your indexes as no extra index is needed, the basic cursor is sufficient.
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses, so that the ID shown in the URL is much shorter.
I'll use an example: I created a property management tool that had multiple collections. For simplicity, some records were duplicated across them, for example payments. When I needed to update these records, the change had to happen simultaneously in every collection they appeared in, so I assigned them a custom payment id; that way, when a delete/query action is performed, it affects all instances of the record database-wide.

120 mongodb collections vs single collection - which one is more efficient?

I'm new to mongodb and I'm facing a dilemma regarding my DB Schema design:
Should I create one single collection or put my data into several collections (we could call these categories I suppose).
Now I know many such questions have been asked, but I believe my case is different for 2 reasons:
If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).
So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?
Or should I just go for single collection + sharding?
Each collection is expected to hold millions of documents. If I use only one, it will store billions of docs.
Thanks in advance!
------- Edit:
Thanks for the great answers.
In fact the 120 collections is only a self made limit, it's not really optimal:
The data in the collections is related to web publishers. There could be millions of these (any web site can join).
I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously, this is not possible due to mongo limitations.
So I came up with the idea of a fixed number of collections to at least distribute the data somehow. Like: collection "A_XX" would hold XX Platform related data for publishers whose names start with "A".. etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
What do you think about this? Is there a better solution?
Sorry for not being specific enough in my original question.
Thanks in advance
Single Sharded Collection
The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.
In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload to multiple servers as required. Multiple collections is still a valid approach, but unnecessarily complicates your application code & deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be automatically balanced across your shards.
You do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (but knowing the option is there when you need it). You have other options before deciding to shard as well, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by workload before sharding, so you definitely need to monitor the growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.
Fixed number of collections (original answer)
If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.
NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.
The motivations for using separate collections would be:
Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.
Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.
You can use the compact command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for a HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).
MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven a problem for the vast majority of use cases, however multiple collections would allow you to more easily move high activity collections into separate databases if needed.
Further to the previous point .. if you have your data in separate collections, these will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).
The main problem here is that you will gain very little performance in the current MongoDB versions if you separate out collections into the same database. To get any sort of extra performance over a single collection setup you would need to move the collections out into separate databases, then you will have operational overhead for judging what database you should query etc.
So yes, you could easily go for 120 collections; however, you won't really gain anything currently, because https://jira.mongodb.org/browse/SERVER-1240 has not been implemented (and will not be anytime soon).
Housing billions of documents in a single collection isn't too bad. I presume that even if you were to house this in separate collections, it probably would not be on a single server either, just like sharding a single collection, so any speed reduction due to a multi-server setup will also not matter in this case.
In my personal opinion, using a single collection is easier on everything.

Is it better to model data as a Single collection or separate collections?

As an example, imagine a trivial "helpdesk" type app where there are support tickets, and the app supports multiple companies logging in and managing their tickets.
Given that companies won't interact with each others "Tickets"....
Is it better to have one collection of "Tickets" and query or is it better to create collections of Tickets per Company?
There are a couple of things to consider here.
The first thing is pre-allocation of space. You will find a couple of threads on the mongodb-user group where the OP is confused about why their database is taking so much space when their data is taking so little. This is because when you reach a certain point of pre-alloc within a collection, it will create files 2GB in size by default, even if you are only using 100 MB of that space.
Now imagine this pre-alloc pattern for 1000 companies; this quickly leads to inefficient use of disk space and, as in most of those threads, to performance and cost problems.
The second thing to consider here is the nssize, which is 2GB maximum. This may seem crazy but what if you do have more than 3 million members (assume a company is a "registered user")? You will quickly use up the maximum namespace file size that MongoDB can give.
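A back-of-the-envelope check of the "more than 3 million" figure, as a sketch under stated assumptions: the 628 bytes per namespace entry and the 2047 MB cap are taken from the limits documented for the older (MMAPv1-era) storage format and are not stated in this answer; the count of two namespaces per company assumes one collection with only its default _id index.

```python
NAMESPACE_ENTRY_BYTES = 628               # assumed size of one namespace entry
MAX_NS_FILE_BYTES = 2047 * 1024 * 1024    # assumed maximum namespace file size

max_namespaces = MAX_NS_FILE_BYTES // NAMESPACE_ENTRY_BYTES

# Each collection and each index counts as a namespace, so a per-company
# collection with only the default _id index consumes two of them.
namespaces_per_company = 2
max_companies = max_namespaces // namespaces_per_company

print(max_namespaces)  # upper bound on namespaces in one database
print(max_companies)   # companies that fit with one collection each
```

Under these assumptions the ceiling for one-collection-per-company is roughly half the namespace limit, and it drops further with every extra index per collection.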
Also you will gain no benefit from the lock (on DB level) without splitting them out into separate databases, this of course creates an operational overhead in maintaining the database connections for each company.
MongoDB is typically designed to scale through a cluster rather than scale vertically and scaling vertically is normally considered a bad idea for large websites.
I don't have much experience with MongoDB, but I'll give some arguments so we can discuss it. I think you should create just one Tickets collection, for the following reasons:
Creating a Collection for each company seems like redundancy.
You will have to create and configure a collection every time you add a new company to your system in order to create tickets, whereas with a single collection you only have to create the company.
I don't know how you are planning to create the link between your company document and its corresponding ticket collection, but I think it is more straightforward to create the link using the id of the company document, with an idcompany attribute in the Tickets collection.
I think one reason you might consider creating a ticket collection per company is that a large amount of data could slow down your queries (all the companies inserting into the same tickets collection). You could counter this by creating a sharded cluster and using a compound shard key made of idcompany and some useful attribute of the Tickets document. This way it is very likely that all the documents of a given company remain on the same shard, so the common queries will perform relatively quickly.
My $0.02:
By separating out each company into their own collections, or better, databases... it makes customer migration and individualized backups, restores, imports and exports much easier at the expense of making your code a tad crappier.
Isolating customer data may reduce your data storage requirements, as you won't need to embed the customer ID into every single document. Of course, with separate databases, most drivers will treat that as a separate network connection.
As with everything, there are tradeoffs.