How to group on referenced entities' attributes in MongoDB?

I am trying to build an event tracking system for my mobile apps. I am evaluating MongoDB for my needs, and I don't have any hands-on experience with NoSQL databases. I have read the MongoDB documentation thoroughly and have come up with the schema design below. My requirements:
1. Must have a horizontally scalable data store
2. Data store must execute group queries quickly in sharded environment
3. Must have extremely high write throughput
Collections:
Events:
{name:'<name>', happened_at:'<timestamp>', user : { imei: '<imei>', model_id: '<model_id>'}}
Devices:
{model_id:'<model_id>', device_width:<width>, memory: '<memory>', cpu: '<cpu>'}
I do not want to store devices as an embedded document within events.user, to save storage space in my fastest-growing collection, i.e. events. The devices collection is not going to grow much and should never hold more than about 30k records, while the events collection is going to have a few million documents added every day.
My data growth requires a sharded environment, and we will plan for that from day 1, so we will not use anything that doesn't work in a sharded system.
E.g. the group() command doesn't work with shards, so we will always write Mongo map/reduce commands for such needs.
Problem: What is the best way to get all users who did a particular event (name='abc happened') on devices having device_width < 300?
My solution: Find all models having device_width < 300 and use the result to filter the events documents on those models.
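A minimal mongo shell sketch of that two-step lookup (assuming lowercase collection names events and devices that match the schemas above):

// Step 1: collect the model_ids of all devices narrower than 300px.
var narrowModels = db.devices.find(
    { device_width: { $lt: 300 } },
    { model_id: 1, _id: 0 }
).map(function (d) { return d.model_id; });

// Step 2: distinct users who fired the event on any of those models.
var users = db.events.distinct(
    'user.imei',
    { name: 'abc happened', 'user.model_id': { $in: narrowModels } }
);

An index on {name: 1, 'user.model_id': 1} would keep the second step from scanning the whole events collection.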
Problem: Return the count of users who did a particular event (name='abc happened') on devices, grouped by the cpu of the device.
My solution: Get the count of users for the given event, grouped by model_id (<30k records, I know). Then group those model_ids by their related cpu and return the final result.
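A sketch of that two-phase roll-up using the aggregation framework, which, unlike group(), does run on sharded collections; the client-side join against the small devices collection is an assumption of this sketch:

// Phase 1: distinct users per model for the event (computed on the shards).
var perModel = db.events.aggregate([
    { $match: { name: 'abc happened' } },
    { $group: { _id: { model: '$user.model_id', imei: '$user.imei' } } },
    { $group: { _id: '$_id.model', users: { $sum: 1 } } }
]);

// Phase 2: build a model_id -> cpu lookup from the ~30k-record devices collection.
var cpuByModel = {};
db.devices.find({}, { model_id: 1, cpu: 1 }).forEach(function (d) {
    cpuByModel[d.model_id] = d.cpu;
});

// Phase 3: roll the per-model user counts up by cpu on the client.
var usersByCpu = {};
perModel.forEach(function (m) {
    var cpu = cpuByModel[m._id];
    usersByCpu[cpu] = (usersByCpu[cpu] || 0) + m.users;
});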
Please let me know if I am doing it the right way. If not, what is the right way to do it at scale?
EDIT: Please also point out any possible caveats, e.g. that indexes might not be used to maximum effect with map/reduce.

Related

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDB (64-bit version) to handle a large number of users (around 100,000), and each user will have large amounts of data (around 1 million records).
What is the best design strategy?
Dump all records in a single collection
Have a collection for each user
Have a database for each user.
Many Thanks,
So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers, presented as a single logical unit via the mongo client.
Therefore the answer to your question is: put all your records in a single sharded collection.
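For illustration, a minimal shell sketch (the app database, records collection, and compound shard key are placeholders; the right key depends on your read/write patterns):

// Enable sharding on the database, then shard the single records collection.
sh.enableSharding('app');

// A compound key keeps each user's records together while staying splittable.
sh.shardCollection('app.records', { userId: 1, _id: 1 });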
The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up, and testing the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.
So you are looking at around 100 billion detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling, which normally means scaling a huge single collection of data across many (many) servers in a huge cluster.
So if you use a single collection for common data (i.e. one collection called user and one called detail) you are already suiting MongoDB's core purpose and build.
MongoDB, as mentioned by others, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though the estimate is about 12K initial collections, in reality, due to index size, you can have as few as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having singular collections per user.
I have never encountered anyone failing to scale MongoDB into the billions, or even close to the 100s of billions (or maybe beyond), on an optimised set-up; either way, I do not see why it could not be done: after all, Facebook is able to make MySQL scale into the 100s of billions of records (across 32K+ shards), and the sharding concept is similar between the two databases.
So the theory and possibility of doing this are there. It is all about choosing the right schema, shard concept and key (and servers and network etc. etc.).
If you were to witness problems, you could split archive collections or deleted items away from the main collection, but I think that is overkill. Instead, you want to make sure MongoDB knows where each segment of your huge dataset is at any given point in time, and ensure that this data is always hot; that way, queries that don't have to do a global scatter-gather operation should be quite fast.
About a collection for each user:
By default configuration, MongoDB is limited to 12k collections. You can increase this with --nssize, but it's not unlimited.
And indexes count toward this 12k limit (check the "namespaces" concept in the Mongo documentation).
About a database for each user:
From a modeling point of view, that's very curious.
Technically, there is no limit in Mongo, but you will probably hit a file descriptor limit (set by your OS/settings).
So as @Rohit says, the last two options are not good.
Maybe you should explain more about your case.
Maybe you can split users into different collections (e.g. one for each first letter of the name, or one for each service of the company...).
And, of course use sharding.
Edit: maybe MongoDB is not the best database for your use case.

Mongo Architecture Efficiency

I am currently working on designing a location-based content sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling and overall long-term maintainability.
Our system has a library of topics, each topic is available in specific cities/metropolitan areas. When a person creates a piece of content it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (And open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: A collection name would be TopicID123CityID456, and each entry would obviously be a document within that collection.
Option 2 (Single Topic Collection):
Example: A collection name would be Topic123, and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection):
Example: A collection name would be City456, and each entry would create a document that contains an indexed topicID.
When querying the DB, I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems to be the best; however, I am concerned with the long-term performance of this approach. Option 1 seems the most performant, but it also forces multiple queries when selecting more than one topic.
Another thing I need to consider is that some topics will be far more active and grow much larger than other topics, which will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is ideal before coding all of the logic around writing and retrieving the data. And I don't know how well Mongo performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty in approach.
From experience, which is the most optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a one-DB-server environment. @profesor79 provided a great scaling solution for once we need to move to a multi-server (sharded) environment.
From your 3 proposals I will pick number 4 :-)
Have one collection sharded over multiple servers.
Instead of one TopicCity collection per topic/city pair, we can have a single collection for all topics and all cities.
That topicCities collection will then hold all the documents, sharded.
Sharding on the key {topic: 1, city: 1} will balance the load across the shard servers, and any time you need more power you can add another shard to the cluster, as sketched below.
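A minimal shell sketch of this setup and the feed query it enables (the content database, the createdAt field, and the example IDs are placeholders):

// One sharded collection instead of a collection per topic/city.
sh.enableSharding('content');
sh.shardCollection('content.topicCities', { topic: 1, city: 1 });

// Supporting index for date-ordered feeds per city and topic set.
db.topicCities.createIndex({ city: 1, topic: 1, createdAt: -1 });

// A member's feed: one city, several selected topics, newest first.
db.topicCities.find({
    city: 'CityID456',
    topic: { $in: ['TopicID123', 'TopicID789'] }
}).sort({ createdAt: -1 }).limit(50);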
Any comments welcome!

Querying a large mongodb collection in real-time

We have a service that allows people to open a room and play YouTube songs while others listen in real time.
Among other collections in our MongoDB we have one that stores the songs users add to rooms' playlists; it is called userSong.
This collection holds a record for each combination of user-room-song.
The code makes frequent queries to the collection for these major operations:
Loading the current playlist (a regular find with a trivial condition)
Loading a random song for a room (using the Mongo aggregation framework; see the sketch after this list)
Loading a room's top songs (using the Mongo aggregation framework)
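For the random-song case, a minimal sketch of an aggregation of this kind, using $sample (available from MongoDB 3.2; the roomId field and someRoomId variable are placeholders):

// Pick one random song document from a given room's playlist.
db.userSong.aggregate([
    { $match: { roomId: someRoomId } },
    { $sample: { size: 1 } }
]);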
Now this collection has become big (1m+ records) and things have started to slow down. AWS has started sending us CPU utilization notifications more often, and according to mongotop the userSong collection is responsible for most of the CPU consumption, mostly in READ operations.
We made some modifications to the collection's indexes and it helped a bit, but it's still not a solution; we need to find some other way to arrange the data, because it is growing exponentially.
We thought about splitting the userSong data into lower-level segments: instead of one collection keyed by user-room-song, keep a collection of user-song records for each room in the system. This should shorten the time to fetch data from the DB; now we need to decide how to do that:
Make a new collection for each room (roomUserSong) that holds all user-song records for a particular room. This might be good for quick fetching, but it will create an unlimited number of new collections in the database (roomusersong-1, roomusersong-2, ..., roomusersong-n), and we don't know if that is good in practice or whether there are other Mongo limitations on that kind of solution.
Create just 1 more collection in the DB with the following fields:
{room: <roomId>, userSongs: [userSong1, userSong2, ..., userSongN]}, so each room has its own document with a sub-document (an array) that holds all user-song records for that room. This solves the previous issue (creating unlimited collections), but it will be very hard to work with in Mongoose (our ODM), because (as far as I know) we cannot define a schema in advance for such a data structure. It may also run into the 16MB document size limit, as far as I understand.
It would be nice to hear some advice from people who have Mongo experience with these kinds of situations:
Is 1m+ records really considered big, and should it cause these CPU utilization issues? (We are using an AWS m3.medium, one core.)
Which of the approaches introduced above is the better solution?
Any other ideas for smart caching without changing the code too much?
Thanks, helpers!

MongoDB Schema Suggestion

I am trying to pick MongoDB as my preferred database. I need help on the design of my table.
App background: an analytics app where contacts push their own events and related custom data. A contact can have many events, e.g. contact did this, did that etc.
Each event stores: event_type, custom_data (JSON), epoch_time
e.g.:
event 1: event_type: page_visited, custom_data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom_data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan: lite, price: 35}, current_time
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to the pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing, watched video but NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?
In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Event collection and a Users collection.
(2) I'd do 1 document per Event, which has a userId in it.
(3) If you need realtime data you will want an index on what you want to query by (i.e. never do a scan over the whole collection); a sketch follows this list.
(4) If there are things which are needed for reporting only, I'd recommend making a reporting node (i.e. a different mongo instance) and using replication to copy data to that instance. You can put additional indexes for reporting on that node. That way the additional indexes and any expensive queries will not affect production performance.
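A minimal shell sketch of points (2) and (3), matched to the event fields from the question (someUserId and the index shape are illustrative assumptions):

// One document per event, carrying the user's id (someUserId is a placeholder).
db.events.insert({
    userId: someUserId,
    event_type: 'page_visited',
    custom_data: { url: 'pricing', referrer: 'google' },
    epoch_time: Math.floor(Date.now() / 1000)
});

// Index matching the "who did X in the last N days" query shape.
db.events.createIndex({ event_type: 1, epoch_time: -1 });

// Use case 1: users who visited the pricing page in the last 7 days.
var sevenDaysAgo = Math.floor(Date.now() / 1000) - 7 * 24 * 3600;
db.events.distinct('userId', {
    event_type: 'page_visited',
    'custom_data.url': 'pricing',
    epoch_time: { $gte: sevenDaysAgo }
});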
Notes on sharding
If your events collection is going to become large, you may need to consider sharding, perhaps by userId. However, I'd recommend treating that as a longer-term solution and not diving into it until you need it.
One thing to note is that Mongo currently (2.6) has a database-level write locking implementation, which means you can only perform 1 write at a time per database (it allows many concurrent reads). This means that if you want a high-write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, 1 primary node with a secondary (and a reporting node) is administratively easier to set up. We currently handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes, and SSDs would be recommended too, as a surge in users can result in cache misses (i.e. an index not in memory), which forces reads from the hard disk.
One final note: there are a lot of NoSQL DBs and they all have their pros and cons. I personally found that high write, low read, and realtime analysis of lots of data is not really Mongo's strength, so it does depend on what you are doing. It sounds like you are still learning the fundamentals; it might be worth reading up on all the available types to pick the right tool for the right job.

Monitoring Service MongoDB Schema Design

I have 1000+ streaming audio devices deployed around the world, all of which currently check in with the factory monitoring service about every 15 minutes. I'm rolling my own monitoring service and will be storing the status updates in Mongo.
Question:
What would be a better design schema:
1) One massive collection named device_updates, where every status update document includes a device serial_number key?
2) 1000+ collections, each named with the device's serial number (e.g. 65FE9), with each device's status updates siloed in its own collection. If going this route, I would cap the collections at about 2000 status update documents.
Both would need to be indexed by the created_at date key.
Any ideas on which would be better performance-wise? Or any thoughts on what would be the preferred method?
Thanks!
I would definitely go for one massive collection, since all the documents are of the same type.
As a general rule, think of a collection in MongoDB as a set of homogeneous documents. Having just one collection, moreover, makes it much easier to scale out horizontally (i.e. with shards), using for example the serial_number as the shard key; a minimal sketch follows.
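A minimal shell sketch of that layout (the monitoring database name and the index shape are assumptions drawn from the question):

// Compound index supporting per-device, time-ordered status queries.
db.device_updates.createIndex({ serial_number: 1, created_at: -1 });

// Recent updates for one device (serial number from the question).
db.device_updates.find({ serial_number: '65FE9' })
    .sort({ created_at: -1 })
    .limit(100);

// When one server is no longer enough, shard on the serial number.
sh.enableSharding('monitoring');
sh.shardCollection('monitoring.device_updates', { serial_number: 1 });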