MongoDB Schema Suggestion - mongodb

I am trying to pick MongoDB as my preferred database. I need help on the design of my table.
App background: an analytics app where contacts push their own events and related custom data. A contact can have many events, e.g. contact did this, did that, etc.
event_type, custom_data (json), epoch_time
eg:
event 1: event_type: page_visited, custom_data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom-data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan:lite, price:35}
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing, watched video but NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?

In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Event collection and a Users collection
(2) I'd do 1 document per Event which has a userId in it.
(3) If you need realtime data you will want an index on the fields you query by (i.e. never do a scan over the whole collection); see the sketch after this list.
(4) If there are things which are needed for reporting only, I'd recommend making a reporting node (i.e. a different Mongo instance) and using replication to copy data to it. You can put additional indexes for reporting on that node; that way the additional indexes and any expensive queries will not affect production performance.
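As a rough sketch of points (2) and (3) with pymongo: the field names (userId, eventType, customData, epochTime), the database name, and the connection string are assumptions, not the asker's actual schema.

```python
import time
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["analytics"]                           # hypothetical database name

# One document per event, with a userId reference instead of embedding.
db.events.insert_one({
    "userId": "contact-42",                        # hypothetical user id
    "eventType": "page_visited",
    "customData": {"url": "pricing", "referrer": "google"},
    "epochTime": int(time.time()),
})

# Index the fields you filter on so queries never scan the whole collection.
db.events.create_index([("eventType", ASCENDING), ("epochTime", DESCENDING)])

# "Users who visited the pricing page in the last 7 days"
seven_days_ago = int(time.time()) - 7 * 24 * 3600
user_ids = db.events.distinct("userId", {
    "eventType": "page_visited",
    "customData.url": "pricing",
    "epochTime": {"$gte": seven_days_ago},
})
```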
Notes on sharding
If your events collection is going to become large, you may need to consider sharding, perhaps by userId. However, I'd treat that as a longer-term solution and not dive into it until you need it.
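If sharding does become necessary later, enabling it by userId might look roughly like this against a mongos router (pymongo; the database/collection names and address are assumptions):

```python
from pymongo import MongoClient

# Connect to the mongos router of an existing sharded cluster (assumed address).
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding on the database, then shard the events collection by userId.
client.admin.command("enableSharding", "analytics")
client.admin.command("shardCollection", "analytics.events", key={"userId": "hashed"})
```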
One thing to note is that Mongo currently (2.6) uses database-level write locking, which means only one write can be performed at a time per database, although many reads are allowed. So if you want a high-write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, one primary node with a secondary (and a reporting node) is administratively easier to set up. We currently handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes, and SSDs are recommended too, since a surge in users can result in cache misses (i.e. an index not in memory), which forces reads from the hard disk.
One final note - there are a lot of NoSQL DBs and they all have their pros and cons. I personally found that high-write, low-read, realtime analysis of lots of data is not really Mongo's strength, so it does depend on what you are doing. It sounds like you are still learning the fundamentals; it might be worth reading up on the available database types to pick the right tool for the job.

Related

Querying a large mongodb collection in real-time

We have a service that allows people to open a room and play YouTube songs while others listen in real-time.
Among other collections in our MongoDB we have one that stores the songs users add to a room's playlists; it is called userSong.
This collection holds records for all songs added for the combination of: user-room-song.
The code makes frequent queries against this collection for these major operations:
Loading the current playlist (a regular find with a trivial condition)
Loading a random song for a room (using the Mongo aggregation framework)
Loading a room's top songs (using the Mongo aggregation framework); a sketch of the latter two is shown after this list
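For context, the latter two operations might look roughly like this in pymongo; the field names (roomId, songId) and database name are assumptions about the userSong schema, not the asker's actual code.

```python
from pymongo import MongoClient

db = MongoClient()["musicapp"]   # hypothetical database name
room_id = "room-1"               # hypothetical room id

# Load one random song for a room ($sample requires MongoDB 3.2+).
random_song = list(db.userSong.aggregate([
    {"$match": {"roomId": room_id}},
    {"$sample": {"size": 1}},
]))

# Load a room's top songs, e.g. ranked by how often each song was added.
top_songs = list(db.userSong.aggregate([
    {"$match": {"roomId": room_id}},
    {"$group": {"_id": "$songId", "adds": {"$sum": 1}}},
    {"$sort": {"adds": -1}},
    {"$limit": 10},
]))
```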
Now this collection has become big (1M+ records) and things have started to slow down. AWS sends us CPU utilization notifications more often, and according to mongotop the userSong collection accounts for most of the CPU consumption, mostly in READ operations.
We made some modifications to the collection's indexes and it helped a bit, but it's still not a solution; we need to find some other way to arrange the data because it keeps growing exponentially.
We thought about splitting the userSong data into a lower-level segmentation: instead of one collection keyed by user-room-song, keep user-song records per room. This should shorten the time to fetch data from the DB. Now we need to decide how to do that:
Make a new collection for each room (roomUserSong) that holds all user-song records for a particular room. This might be good for quick fetching, but it will create an unlimited number of new collections in the database (roomusersong-1, roomusersong-2, ..., roomusersong-n), and we don't know whether that is good practice or whether there are other Mongo limitations with that kind of solution.
Create just one more collection in the DB with the following shape:
{room: <roomId>, userSongs: [userSong1, userSong2, ..., userSongN]}, so each room has its own document and inside it a sub-document (an array) that holds all user-song records for that room. This solves the previous issue (creating unlimited collections), but it will be hard to work with in Mongoose (our ODM), because (as far as I know) we cannot define a schema in advance for such a data structure. It may also hit the 16MB document size limit, as far as I understand. (A sketch of this shape follows.)
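A minimal pymongo sketch of option 2, with hypothetical field names; the unbounded $push into the array is exactly what can run into the 16MB document limit:

```python
from pymongo import MongoClient

db = MongoClient()["musicapp"]   # hypothetical database name

# One document per room, with every user-song record embedded in an array.
db.roomUserSongs.insert_one({
    "room": "room-1",
    "userSongs": [],
})

# Each new user-song record grows the room document; in an active room this
# array grows without bound and can approach the 16MB document limit.
db.roomUserSongs.update_one(
    {"room": "room-1"},
    {"$push": {"userSongs": {"userId": "u-7", "songId": "yt-abc123", "addedAt": 1700000000}}},
)
```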
It would be nice to hear some advice from people who have Mongo experience with these kinds of situations:
Is 1M+ records really considered big and expected to cause these CPU utilization issues? (We are using an AWS m3.medium, one core.)
Which of the proposed approaches is the better solution?
Any other ideas for smart caching without changing the code too much?
Thanks, helpers!

Mass Update NoSQL Documents: Bad Practice?

I'm storing two collections in a MongoDB database:
==Websites==
id
nickname
url
==Checks==
id
website_id
status
I want to display a list of check statuses with the appropriate website nickname.
For example:
[Google, 200] << (basically a join in SQL-world)
I have thousands of checks and only a few websites.
Which is more efficient?
Store the nickname of the website within the "check" directly. This means if the nickname is ever changed, I'll have to perform a mass update of thousands of documents.
Return a multidimensional array where the site ID is the key and the nickname is the value. This is to be used when iterating through the list of checks.
I've read that #1 isn't too bad (in the NoSQL) world and may, in fact, be preferred? True?
If it's only a few websites I'd go with option 1 - not as clean and normalized as in the relational/SQL world, but it works and is much less painful than trying to emulate joins with MongoDB. The thing to remember with MongoDB or any other NoSQL database is that you are generally making some kind of trade-off - nothing is for free. I personally really value the schema-less document-oriented data design, and for the applications I use it for I readily make the trade-offs (like no joins and transactions).
That said, this is a trade-off - so one thing to always be asking yourself in this situation is why am I using MongoDB or some other NoSQL database? Yes, it's trendy and "hot", but I'd make certain that what you are doing makes sense for a NoSQL approach. If you are spending a lot of time working around the lack of joins and foreign keys, no transactions and other things you're used to in the SQL world I'd think seriously about whether this is the best fit for your problem.
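For reference, option 1 (denormalizing the nickname into each check) and the occasional mass rename could look like this in pymongo; the database name and example values are assumptions, the field names follow the question.

```python
from pymongo import MongoClient

db = MongoClient()["monitoring"]   # hypothetical database name

# Each check carries the website nickname, so listing statuses needs no join.
db.checks.insert_one({
    "website_id": 1,
    "nickname": "Google",
    "status": 200,
})

# If a nickname ever changes, rename it across all of that site's checks.
result = db.checks.update_many(
    {"website_id": 1},
    {"$set": {"nickname": "Google Search"}},
)
print(result.modified_count, "checks updated")
```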
You might consider a 3rd option: Get rid of the Checks collection and embed the checks for each website as an array in each Websites document.
This way you avoid any JOINs and you avoid inconsistencies, because it is impossible for a Check to exist without the Website it belongs to.
This, however, is only recommended when the checks array for each document stays relatively constant over time and doesn't grow constantly. Rapidly growing documents should be avoided in MongoDB, because every time a document doubles its size, it is moved to a different location in the physical file it is stored in, which slows down write operations. Also, MongoDB has a 16MB limit per document. This limit exists mostly to discourage growing documents.
You haven't said what a Check actually is in your application. When it is a list of tasks you perform periodically and only make occasional changes to, there would be nothing wrong with embedding. But when you collect the historical results of all checks you ever did, I would rather recommend putting each result (set?) in its own document to avoid document growth.
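A minimal pymongo sketch of that third option, with assumed example values; it only stays safe while the embedded array remains small and fairly static:

```python
from pymongo import MongoClient

db = MongoClient()["monitoring"]   # hypothetical database name

# Embed the checks directly in the website document.
db.websites.insert_one({
    "nickname": "Google",
    "url": "https://www.google.com",
    "checks": [
        {"status": 200},
        {"status": 301},
    ],
})

# Adding a check later grows the document, which is fine only while
# the checks array stays small (well below the 16MB document limit).
db.websites.update_one(
    {"nickname": "Google"},
    {"$push": {"checks": {"status": 500}}},
)
```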

How to group on referenced entities attributes in mongodb?

I am trying to build an event tracking system for my mobile apps. I am evaluating MongoDB for my needs, and I don't have any hands-on experience with NoSQL databases. I have read the MongoDB documentation thoroughly and have come up with the following schema design for my needs.
1. Must have a horizontally scalable data store
2. Data store must execute group queries quickly in sharded environment
3. Must have extremely high write throughput
Collections:
Events:
{name: '<name>', happened_at: '<timestamp>', user: {imei: '<imei>', model_id: '<model_id>'}}
Devices:
{model_id:'<model_id>', device_width:<width>, memory: '<memory>', cpu: '<cpu>'}
I do not want to store devices as an embedded document within events.user, to save storage space in my fastest-growing collection, i.e. events. The Devices collection is not going to grow much and should never exceed about 30k records, while the Events collection will have a few million documents added every day.
My data growth needs a sharded environment, and we will plan for that from day 1, so we will not use anything which doesn't work in a sharded system.
E.g. group functions don't work with shards, so we will always write Mongo map/reduce commands for such needs.
Problem: What is the best way to get all users who did a particular event (name='abc happened') on devices having device_width < 300?
My solution: Find all models having device_width < 300 and use the result to filter the events documents down to those models.
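That two-step "join in the application" could look roughly like this in pymongo (collection and field names follow the question; the database name and event name placeholder are assumptions):

```python
from pymongo import MongoClient

db = MongoClient()["tracking"]   # hypothetical database name

# Step 1: the small Devices collection gives the models that are narrow enough.
narrow_models = db.devices.distinct("model_id", {"device_width": {"$lt": 300}})

# Step 2: filter events by event name and those model ids.
users = db.events.distinct("user.imei", {
    "name": "abc happened",
    "user.model_id": {"$in": narrow_models},
})
```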
Problem: Return the count of users for whom a particular event (name='abc happened') happened on a device, grouped by the device's cpu.
My solution: Get the count of users for the given event, grouped by model_id (<30k records, I know). Then group those model_ids by their related cpu and return the final result.
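A sketch of that second query using the aggregation framework (which does run on sharded collections), grouping by model_id in Mongo and mapping model_id to cpu in application code; as in the proposed approach, a user seen on two models with the same cpu is counted twice:

```python
from collections import defaultdict
from pymongo import MongoClient

db = MongoClient()["tracking"]   # hypothetical database name

# Count distinct users per device model for the given event.
per_model = db.events.aggregate([
    {"$match": {"name": "abc happened"}},
    {"$group": {"_id": "$user.model_id", "users": {"$addToSet": "$user.imei"}}},
    {"$project": {"user_count": {"$size": "$users"}}},
])

# The Devices collection is small (~30k docs), so map model_id -> cpu in memory
# and roll the per-model counts up by cpu.
cpu_of = {d["model_id"]: d["cpu"] for d in db.devices.find({}, {"model_id": 1, "cpu": 1})}
users_by_cpu = defaultdict(int)
for row in per_model:
    users_by_cpu[cpu_of.get(row["_id"], "unknown")] += row["user_count"]
```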
Please let me know if I am doing it the right way. If not, what is the right way to do it at scale?
EDIT: Please also point out any possible caveats, such as indexes not being used to maximum effect with map/reduce.

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tends to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
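A minimal pymongo sketch of supplying your own _id; the id scheme here (owner prefix plus short random suffix) is purely illustrative, and a DuplicateKeyError on the unique _id index is treated as a collision to retry:

```python
import secrets
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

db = MongoClient()["app"]   # hypothetical database name

def insert_article(owner_id, doc, max_attempts=3):
    """Insert with a custom, human-friendlier _id; retry on the rare collision."""
    for _ in range(max_attempts):
        # Prefixing with the owner groups an owner's ids together (illustrative only).
        doc["_id"] = f"{owner_id}:{secrets.token_urlsafe(6)}"   # e.g. "u42:t3oSKd9q"
        try:
            db.articles.insert_one(doc)
            return doc["_id"]
        except DuplicateKeyError:
            continue  # collision on the random part; generate a new id and retry
    raise RuntimeError("could not generate a unique _id")
```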
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
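For example, with the bson package that ships with pymongo, the embedded timestamp can be read back directly (the collection and fields are assumptions):

```python
from datetime import datetime, timedelta, timezone

from bson import ObjectId
from pymongo import MongoClient

db = MongoClient()["app"]   # hypothetical database name

result = db.articles.insert_one({"title": "hello"})

# The creation time is embedded in the ObjectId itself (UTC, second resolution).
print(result.inserted_id.generation_time)

# So you can range-query by time without storing a separate created_at field.
week_ago = ObjectId.from_datetime(datetime.now(timezone.utc) - timedelta(days=7))
recent = db.articles.find({"_id": {"$gte": week_ago}})
```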
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons, (especially in a microservice architecture). The app would be more resilient and self-healing by codifying the app to re-try and not give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
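A minimal sketch of that retry pattern in pymongo: because the _id is fixed before the first attempt, a retried insert that already succeeded surfaces as a DuplicateKeyError instead of a second document (the collection name and error handling are illustrative only).

```python
from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, DuplicateKeyError

db = MongoClient()["app"]   # hypothetical database name

def store_order(order, retries=3):
    """Idempotent insert: the _id is fixed up front, so retries cannot double-store."""
    order.setdefault("_id", ObjectId())   # generated client-side, before the first attempt
    for _ in range(retries):
        try:
            db.orders.insert_one(order)
            return order["_id"]
        except DuplicateKeyError:
            # A previous attempt actually succeeded; the write is already there.
            return order["_id"]
        except AutoReconnect:
            # Network blip: the ack may have been lost, so just try again.
            continue
    raise RuntimeError("insert did not complete after retries")
```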
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can simplify your indexes, as no extra index is needed; the default _id index is sufficient.
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones Mongodb uses so that the ID shown in the URL is much shorter.
I'll use an example: I created a property management tool that had multiple collections. For simplicity, some fields were duplicated across collections, for example the payment. When I needed to update those records, the change had to happen simultaneously across all the collections it appeared in, so I assigned them a custom payment id; when a delete/query action is performed it changes all instances of it database-wide.

What is pre-distilled data or data aggregated at runtime, and why is MongoDB not good at it?

What is an example of data that is "predistilled or aggregated in runtime"? (And why isn't MongoDB very good with it?)
This is a quote from the MongoDB docs:
Traditional Business Intelligence. Data warehouses are more suited to new, problem-specific BI databases. However note that MongoDB can work very well for several reporting and analytics problems where data is pre-distilled or aggregated in runtime -- but classic, nightly batch load business intelligence, while possible, is not necessarily a sweet spot.
Let's take something simple like counting clicks. There are a few ways to report on clicks.
Store the clicks in a single place. (file, database table, collection) When somebody wants stats, you run a query on that table and aggregate the results. Of course, this doesn't scale very well, so typically you use...
Batch jobs. Store your clicks as in #1, but only summarize them every 5 minutes or so. When people want stats, they query the summary table. Note that "clicks" may have millions of rows, but "summary" may only have a few thousand rows, so it's much quicker to query.
Count the clicks in real-time. Every time there's a click you increment a counter somewhere. Typically this means incrementing the "summary" table(s).
Now most big systems use #2. There are several systems that are very good for this specifically (see Hadoop).
#3 is difficult to do with SQL databases (like MySQL), because there's a lot of disk locking happening. However, MongoDB isn't constantly locking the disk and tends to have much better write throughput.
So MongoDB ends up being very good for such "real-time counters". This is what they mean by predistilled or aggregated in runtime.
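Such a real-time counter is typically just an upserted $inc per click; a minimal pymongo sketch, with assumed collection and field names:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["analytics"]   # hypothetical database name

def record_click(page):
    """Increment a per-page, per-hour counter instead of querying raw clicks later."""
    hour = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H")
    db.click_summary.update_one(
        {"page": page, "hour": hour},   # one summary doc per page per hour
        {"$inc": {"clicks": 1}},        # atomic increment, cheap to write
        upsert=True,                    # create the doc on the first click
    )

record_click("/pricing")
```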
But if MongoDB has great write throughput, shouldn't it be good at doing batch jobs?
In theory, this may be true and MongoDB does support Map/Reduce. However, MongoDB's Map/Reduce is currently quite slow and not on par with other Map/Reduce engines like Hadoop. On top of that, the Business Intelligence (BI) field is filled with many other tools that are very specific and likely better-suited than MongoDB.
What is an example of data that is "predistilled or aggregated in runtime"?
An example of this can be any report that requires data from multiple collections.
And why isn't MongoDB very good with it?
In document databases you can't do a join, and because of this it's hard to build reports; usually a report aggregates data from many tables/collections.
And since MongoDB (and document databases in general) is a good fit for data distribution and denormalization, it's better to prebuild reports whenever possible and just display data from that collection at runtime.
For some tasks/reports it is not possible to prebuild the data; in these cases MongoDB gives you map/reduce, grouping, etc.