MongoDB Atlas Serverless Database | Serverless Instance Costs | RPU cost explanation

Can someone explain how RPUs are calculated, with an example?
Let's say I have a Mongo collection that has 5 million documents. If I do a findOne on the collection, will the RPUs generated be 5M or 1?

It depends on a crucially important step that I could not find anywhere in Mongo's pricing information.
I had 4,000 daily visits with a user database of 25,000 users. I was calling findOne on the database twice per page load. What I thought would be 8,000 RPUs turned out to be 170 million RPUs. That's right: 42,500 RPUs per page visit.
So the critically important step is: add indexes for the keys you're matching on.
Yes, this is an obvious step for optimizing performance, but it's also easy to overlook on the first go-around of creating a database.
I was in the MongoDB dashboard daily to check user stats and was never alerted to any abnormalities. Searching online for over an hour, I could not find any mention of this from Mongo. But when I reached out to support, this was the first thing they pointed out. So they know it's an issue. They know it's a gotcha. They know how simple it would be to point out that a single call is resulting in 40,000 RPUs. It doesn't seem that they care. Rather, it seems to be a sleazy business model they've intentionally adopted.
So if you do a findOne on a 5 million document database, will you get 1 RPU charged or 5 Million RPUs charged? Depends if you remembered to index.
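For a concrete sketch of what the fix looks like (the collection and field names here are illustrative, not from the original post): create an index on the key you match in findOne, then confirm with explain that the query does an index scan examining roughly one document instead of a collection scan over all 5 million.
// hypothetical users collection queried by email
db.users.createIndex({ email: 1 })

// expect an IXSCAN with docsExamined near 1,
// not a COLLSCAN with docsExamined near 5,000,000
db.users.find({ email: "someone@example.com" }).explain("executionStats")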

Related

Dealing with a LARGE data in mongodb

It is going to be a "general-ish" question but I have a reason for that: I am not sure what kind of approach I should take to make things faster.
I have a MongoDB server running on a big AWS instance (r3.4xlarge: 16 vCPU cores and 122 GB of memory). The database has one huge collection with 293,017,368 documents in it. Some of them have a field called paid which holds a string value and some do not; likewise, some have an array called payment_history and some do not. I need to perform some tasks on that database, but all the documents missing either paid or payment_history, or both, are not relevant to me. So I thought of cleaning (shrinking) the DB before I proceed with the actual operations. Since the first step would be to check something like ({paid: {$exists: false}}) to delete records, I thought I should create an index on paid. I can see that at the present rate the index will take 80 days to finish.
I am not sure what my approach should be in this situation. Shall I write a map-reduce job to visit each and every document, perform whatever I need to perform in one shot, and write the resulting documents to a different collection? Or shall I somehow (not sure how, though) split up the huge DB into small sections, apply the transforms on each of them, and then merge the resulting records from each server into a final cleaned record set? Or shall I somehow (not sure how) put that data into Elastic MapReduce or Redshift to operate on it? In a nutshell, what do you think is the best route to take in such a situation?
I am sorry in advance if this question sounds a little bit vague. I tried to explain the real situation as much as I could.
Thanks a lot in advance for help :)
EDIT
According to the comment about sparse indexing, I am now building a partial index with this command: db.mycol.createIndex({paid: 1}, {partialFilterExpression: {paid: {$exists: true}}}). It indexes roughly 53 documents per second, and at this rate I am not sure how long the entire collection will take. But I am keeping it running overnight and will come back tomorrow to update this question. I intend to keep this question updated through the entire journey, for the sake of people in the future who hit the same problem in the same situation.
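For the "one shot into a different collection" idea above, one possible route is a single aggregation pass with $out (available since MongoDB 2.6). This is only a sketch: the target collection name is made up, and the $match assumes "relevant" means having at least one of paid or payment_history, so adjust it to the real criterion.
// one full scan that keeps only documents with paid or payment_history
// and writes them into a new collection (name is illustrative)
db.mycol.aggregate([
  { $match: { $or: [
      { paid: { $exists: true } },
      { payment_history: { $exists: true } }
  ] } },
  { $out: "mycol_relevant" }
])
Once the new collection is verified, the old one can be dropped, which is far cheaper than deleting hundreds of millions of documents in place.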

mongodb taking 500 ms in fetching 150 records

We are using MongoDB (3.0.6) for our application. We have around 150 entries in one of the collections, but it takes approximately 500 ms to fetch all of these records, which I didn't expect. Mongo is deployed on a company server. What can be done to reduce this time?
We are not getting too many reads and there is no CPU load, etc. What might be causing this, or what configuration should be changed to improve it?
Here is my schema: http://pastebin.com/x18iDPKf
I am just querying all the entries, which are 160 in number. I don't think the time taken is due to Mongoose or NodeJS, as when I query using RoboMongo it still takes the same time.
Output of db.<collection>.stats().size :
223000
The query I am doing is:
db.getCollection('collectionName').find({})
It definitely shouldn't be a problem with MongoDB itself. It could be a temporary issue, or it might be due to system constraints such as available memory and so on.
If the problem still exists, use indexes on the appropriate fields you are querying:
https://docs.mongodb.com/manual/indexes/
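One way to narrow down where the 500 ms goes (server-side execution vs. driver and network) is to run the same query with execution stats in the shell; if executionTimeMillis is small, the time is being spent outside MongoDB. The collection name below is just the placeholder from the question.
// compare executionStats.executionTimeMillis against the ~500 ms seen end to end
db.getCollection('collectionName').find({}).explain("executionStats")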

MongoDB Schema Suggestion

I am trying to pick MongoDB as my preferred database. I need help with the design of my table.
App background: an analytics app where contacts push their own events and related custom data. A contact can have many events, e.g. contact did this, did that, etc.
event_type, custom_data (json), epoch_time
eg:
event 1: event_type: page_visited, custom_data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom_data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan: lite, price: 35}
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to the pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing and watched the video but have NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?
In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Event collection and a Users collections
(2) I'd do 1 document per Event which has a userId in it.
(3) If you need realtime data you will want an index on what you want to query by (i.e. never do a scan over the whole collection); see the sketch after this list.
(4) if there are things which are needed for reporting only, I'd recommend making a reporting node (i.e. a different mongo instance) and using replication to copy data to that mongo instance. You can put additional indexes for reporting on that node. That way the additional indexes and any expensive queries will not affect production performance.
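To make (2) and (3) concrete, here is a minimal sketch; the document shape follows the event structure in the question, but the collection name and index keys are assumptions and should match whatever you actually filter on.
// one document per event, carrying the userId (shape is illustrative)
db.events.insert({
  userId: "user_123",
  event_type: "page_visited",
  custom_data: { url: "pricing", referrer: "google" },
  epoch_time: new Date()
})

// index to back "users who hit the pricing page in the last 7 days"
db.events.createIndex({ event_type: 1, "custom_data.url": 1, epoch_time: -1 })

db.events.distinct("userId", {
  event_type: "page_visited",
  "custom_data.url": "pricing",
  epoch_time: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) }
})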
Notes on sharding
If your events collection is going to become large, you may need to consider sharding, perhaps by userId. However, that is likely a longer-term solution, and I'd recommend not diving into it until you need it.
One thing to note is that Mongo currently (2.6) has a database-level write lock, which means only one write can be performed at a time per database, although many concurrent reads are allowed. So if you want a high-write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, one primary node with a secondary (and a reporting node) is administratively easier to set up. We currently handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes, and SSDs are recommended too, as a surge in users can result in cache misses (i.e. an index not being in memory), which forces reads off the hard disk.
One final note: there are a lot of NoSQL DBs and they all have their pros and cons. I personally found that high write, low read, and realtime analysis of lots of data is not really Mongo's strength, so it does depend on what you are doing. It sounds like you are still learning the fundamentals; it might be worth reading up on the available types to pick the right tool for the right job.

Is it a good idea to generate per day collections in mongodb

Is it a good idea to create per-day collections for the data of a given day? (We could start with per-day and then move to per-hour if there is too much data.) Is there a limit on the number of collections we can create in MongoDB, and does it result in performance loss (i.e. is it an overhead for MongoDB to maintain so many collections)? Does a large number of collections have any adverse effect on performance?
To give you more context, the data will be something like Facebook feeds, and only the latest data (say the last week or month) is more important to us. Making per-day collections keeps the number of documents low and would probably result in fast access. Even if we need old data, we can fall back to older collections. Does this make sense, or am I heading in the wrong direction?
What you actually need is to archive the old data. I would suggest you take a look at this thread on the mongodb-user mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
The last post there, from Michael Dirolf (10gen), says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
So I guess you can stay with a single collection, and good indexes will do the work.
Anyhow, if the collection gets too big you can always run a manual archive process.
Yes, there is a limit to the number of collections you can make. From the Mongo documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even so, at one collection per day those ~24,000 namespaces would last something like 60 years before you hit that limit.
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users have feeds updated in a week, you're in a bit of a tight spot. It's not easy/trivial to query across collections.
I would recommend instead making one collection to store the data and simply move data out periodically as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.
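To make the "move data out periodically" job concrete, here is a rough sketch in the shell; the collection and field names are made up for illustration, and in practice you would run this from a scheduled script and batch the work.
// copy feeds older than ~30 days into an archive collection, then remove them
var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
db.feeds.find({ created_at: { $lt: cutoff } }).forEach(function (doc) {
    db.feeds_archive.insert(doc);
});
db.feeds.remove({ created_at: { $lt: cutoff } });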
Creating a collection is not much overhead, but the overhead is larger than creating a new document inside an existing collection.
There is a limit on the number of collections that you can create: http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces
Making new collections, to me, won't make any performance difference, because RAM only caches the data that you actually query; in your case that will be the recent feeds, etc.
But having per-day/hour collections will make it very easy to archive old data.

Atomic counters Postgres vs MongoDB

I'm building a very large counter system. To be clear, the system is counting the number of times a domain occurs in a stream of data (that's about 50 - 100 million elements in size).
The system will individually process each element and make a database request to increment a counter for that domain and the date it is processed on. Here's the structure:
stats_table (or collection)
-----------
id
domain (string)
date (date, YYYY-MM-DD)
count (integer)
My initial inkling was to use MongoDB because of its atomic counter feature. However, as I thought about it more, I figured Postgres updates already occur atomically (at least that's what this question leads me to believe).
My question is this: is there any benefit of using one database over the other here? Assuming that I'll be processing around 5 million domains a day, what are the key things I need to be considering here?
All single operations in Postgres are automatically wrapped in transactions, and all operations on a single document in MongoDB are atomic. Atomicity isn't really a reason to prefer one database over the other in this case.
While the individual counts may get quite high, if you're only storing aggregate counts and not each instance of a count, the total number of records should not be too significant. Even if you're tracking millions of domains, either Mongo or Postgres will work equally well.
MongoDB is a good solution for logging events, but I find Postgres to be preferable if you want to do a lot of interesting, relational analysis on the analytics data you're collecting. To do so efficiently in Mongo often requires a high degree of denormalization, so I'd think more about how you plan to use the data in the future.
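For reference, the Mongo side of an atomic per-domain, per-day increment looks roughly like the sketch below; the collection name and the unique index are assumptions layered on top of the structure given in the question.
// one document per (domain, date); the unique index keeps concurrent upserts from duplicating
db.stats.createIndex({ domain: 1, date: 1 }, { unique: true })

// atomically increment (or create) the counter for this domain and day
db.stats.update(
  { domain: "example.com", date: "2016-05-01" },
  { $inc: { count: 1 } },
  { upsert: true }
)
The Postgres equivalent is typically an UPDATE setting count = count + 1 guarded by an insert-if-missing, so both databases handle this pattern; the choice comes down to the analysis you plan to run later, as noted above.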