Distributing big data storage for non-relational data - MongoDB

The problem consists of a large volume (approx. 500 million per day) of non-relational messages of relatively small size (approx. 1 KB). The messages are written once and never modified again. The messages have various structures, though there are patterns that each message must fit. This data must then be searchable. The search may be done on any field of the message; the only field that is always present is the date, so a search is always done for a specific day.
The approach I have come up with so far is to use MongoDB. Each day I create a number of collections (approx. 2,000) and distribute the day's messages to those collections according to their pattern. The patterns matter because of indexing: each collection is limited to 64 indexes.
This strategy results in 500 GB of data + 150 GB of indexes = 650 GB per day. The question, of course, is how to distribute that data. The obvious solution is to use MongoDB sharding and spread the collections over the shards. However, I have not found any scenario close to my problem described in the MongoDB manuals. Moreover, I am not even sure whether I can add new collections to shards dynamically (rather than manually) every day. Any knowledge/suggestions from experienced users? Should I change my design?
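For reference, collections can be created and sharded programmatically from application code, so a per-day setup does not have to be a manual step. A minimal pymongo sketch, assuming a mongos endpoint, a hypothetical naming scheme (a messages database with msgs_<day>_pattern<N> collections), and a hashed _id shard key for the write-once workload:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Connect to a mongos router rather than an individual shard (hypothetical host).
client = MongoClient("mongodb://mongos.example.com:27017")

# Sharding is enabled once per database.
client.admin.command("enableSharding", "messages")

def create_daily_collection(pattern_id: int) -> str:
    """Create and shard today's collection for one message pattern."""
    day = datetime.now(timezone.utc).strftime("%Y%m%d")
    name = f"msgs_{day}_pattern{pattern_id}"            # hypothetical naming scheme
    # A hashed _id spreads a write-once workload evenly; the right key
    # depends on how the data will actually be queried.
    client.admin.command("shardCollection", f"messages.{name}", key={"_id": "hashed"})
    return name

coll = client.messages[create_daily_collection(17)]
coll.insert_one({"date": datetime.now(timezone.utc), "payload": {"k": "v"}})
```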

Related

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDB (64-bit version) to handle a large number of users (around 100,000), and each user will have a large amount of data (around 1 million records).
What is the best design strategy?
1. Dump all records in a single collection
2. Have a collection for each user
3. Have a database for each user
Many Thanks,
So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers, which are presented as a single logical unit via the mongo client.
Therefore the answer to your question is to put all your records in a single sharded collection.
The number of shards required and the configuration of the cluster depend on the size of the data and other factors, such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up, and test the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.
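As a rough illustration of what a single sharded collection looks like from the application side, here is a pymongo sketch; the database, collection and field names and the hashed userId shard key are assumptions for the example, not a recommendation for your exact workload:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://mongos.example.com:27017")  # connect via mongos

# One logical collection for all detail records, spread over the shards.
client.admin.command("enableSharding", "app")
client.admin.command(
    "shardCollection",
    "app.details",
    # Hashing the user id spreads users evenly while still routing
    # single-user queries to one shard.
    key={"userId": "hashed"},
)

details = client.app.details
details.create_index([("userId", ASCENDING), ("createdAt", ASCENDING)])

# Reads and writes go through mongos exactly as they would against one server.
recent = details.find({"userId": 12345}).sort("createdAt", -1).limit(50)
```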
So you are looking for 100,000,000 detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling normally means scaling huge single collections of data across many (many) servers in a huge cluster.
So if you use a single collection for common data (i.e. one collection called user and one called detail), you are already suiting MongoDB's core purpose and build.
MongoDB, as others have mentioned, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though roughly 12K initial collections is the estimate, in reality, due to index size, you can have as few as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having singular collections per user.
I have never encountered someone not being able to scale MongoDB to the billions, or even close to the 100s of billions (or maybe beyond), on an optimised set-up; however, I do not see why it cannot. After all, Facebook is able to make MySQL scale into the 100s of billions per user (across 32K+ shards), and the sharding concept is similar between the two databases.
So the theory and possibility of doing this is there. It is all about choosing the right schema and shard concept and key (and servers and network, etc.).
If you were to see problems, you could split archive collections or deleted items away from the main collection, but I think that is overkill. Instead, you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time, and ensure that this data is always hot; that way, queries that don't do a global scatter-gather operation should be quite fast.
About a collection for each user:
In the default configuration, MongoDB is limited to 12k collections. You can increase this with --nssize, but it's not unlimited.
And indexes count toward this 12k as well (check the "namespaces" concept in the MongoDB documentation).
About a database for each user:
From a modelling point of view, that's very curious.
Technically, there is no limit in Mongo, but you will probably hit a limit on file descriptors (a limit from your OS/settings).
So, as @Rohit says, the last two are not good.
Maybe you should explain more about your case.
Maybe you can split users into different collections (e.g. one for each first letter of the name, or one for each service of the company...).
And, of course use sharding.
Edit: maybe MongoDB is not the best database for your use case.

Time series storage in MongoDB

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
1. Each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db.
2. Each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring option 1, but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I now have one document containing 128 measurements, a startDate and a nextDate. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
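For illustration, the bucketing idea might be implemented along these lines with pymongo; the database, collection and field names are assumptions (and I track an endDate rather than the article's nextDate), and concurrent writers would need a bit more care than this sketch gives them:

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
buckets = client.telemetry.measurements        # hypothetical db/collection names

BUCKET_SIZE = 128                              # measurements per document

def append_measurement(sensor_id: str, ts: datetime, value: float) -> None:
    """Append to the sensor's open bucket; a full bucket upserts a new one."""
    buckets.update_one(
        {"sensor": sensor_id, "count": {"$lt": BUCKET_SIZE}},
        {
            "$push": {"samples": {"t": ts, "v": value}},
            "$inc": {"count": 1},
            "$min": {"startDate": ts},
            "$max": {"endDate": ts},
        },
        upsert=True,
    )

# Index supporting the (sensor, day) lookups described above.
buckets.create_index([("sensor", ASCENDING), ("startDate", ASCENDING)])

append_measurement("s1", datetime.now(timezone.utc), 21.7)
```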
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up the reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think that it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
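A minimal sketch of approach 2 with that kind of shard key, assuming pymongo, a mongos router, and the bucket layout from the question (names are illustrative):

```python
from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://mongos.example.com:27017")

client.admin.command("enableSharding", "telemetry")
client.admin.command(
    "shardCollection",
    "telemetry.measurements",
    key={"sensor": 1, "startDate": 1},   # matches the (sensor, day) read pattern
)

buckets = client.telemetry.measurements

# Because the filter covers the shard key, mongos can route this query to a
# single shard instead of scattering it across all of them.
cursor = buckets.find({
    "sensor": "s1",
    "startDate": {"$gte": datetime(2017, 3, 14), "$lt": datetime(2017, 3, 15)},
}).sort("startDate", ASCENDING)
```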
Whilst MongoDB has no limit on collections, I tried a similar approach to 2 but moved away from it to a single collection for all sensor values, because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
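To make the idea concrete, here is a small, hypothetical deadband filter of the kind described above: a point is only kept when its value moves more than a tolerance away from the last kept point, and run boundaries are preserved so intermediate values can be interpolated at read time:

```python
from typing import Iterable, Iterator, Tuple

Point = Tuple[float, float]   # (timestamp, value)

def deadband(points: Iterable[Point], tolerance: float) -> Iterator[Point]:
    """Drop points whose value stays within +/- tolerance of the last kept point.

    The last point before a jump and the final point of the series are kept,
    so the value at any time 't' can later be reconstructed by interpolation.
    """
    last_kept = None
    previous = None
    for point in points:
        if last_kept is None or abs(point[1] - last_kept[1]) > tolerance:
            if previous is not None and previous != last_kept:
                yield previous            # close off the flat run before the jump
            yield point
            last_kept = point
        previous = point
    if previous is not None and previous != last_kept:
        yield previous                    # keep the final point of the series

samples = [(0, 20.0), (1, 20.01), (2, 20.02), (3, 22.5), (4, 22.5)]
print(list(deadband(samples, tolerance=0.1)))
# -> [(0, 20.0), (2, 20.02), (3, 22.5), (4, 22.5)]
```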
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
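One possible shape of that, sketched with pymongo (names hypothetical); note that _id must be unique across the whole collection, so with a single shared collection the sensor id has to be part of it:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
values = client.telemetry.sensor_values    # hypothetical one-doc-per-point layout

# The compound _id doubles as the (sensor, timestamp) key, saving a separate
# unique index; a bare timestamp _id only works if timestamps never collide.
values.insert_one({"_id": {"s": "s1", "t": datetime.now(timezone.utc)}, "v": 21.7})
```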
When I query the data I'm only interested in getting data from a
given sensor on a given day. I don't do any cross sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of my projects we were stuck with legacy MongoDB and a scenario similar to yours, except that the amount of new data per day was even lower.
We tried to change the data structure, granulate the data over multiple MongoDB collections, change replica set configurations, etc.
But we were still disappointed: as the data grew, performance degraded under the unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the data-retrieval performance gain was visible to the naked eye. If you need complex data analysis and aggregation, you can always use a Spark (map-reduce) job.
Moreover, thinking about the future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy is good as long as it suits well, but if not, it's more effective to change the technology stack.
If I understand correctly, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I guess MongoDB is the wrong choice for this. In MongoDB there is no way to query documents across collections; you would have to write a complex mechanism to retrieve the data. In my opinion, you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017. There you could do a wildcard search across indices (e.g. sensor-data-s1* or sensor-data-*). See here for wildcard search.
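A rough sketch of that index-per-sensor-per-day layout, assuming the 7.x-style Elasticsearch Python client (host, index names and field names are just for illustration):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# One index per sensor per day, e.g. sensor-data-s1-2017-03-14.
index = "sensor-data-s1-" + datetime.now(timezone.utc).strftime("%Y-%m-%d")
es.index(index=index, body={
    "sensor": "s1",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "value": 21.7,
})

# A wildcard index pattern searches every matching day for sensor s1 ...
hits = es.search(index="sensor-data-s1-*",
                 body={"query": {"range": {"value": {"gte": 20}}}})
# ... or every sensor and day with index="sensor-data-*".
```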
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so that you get optimal performance that does not degrade over time.
Approach #1 is not ideal; the key to speeding things up is to divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place one signal in one collection and shard the signals over nodes to speed up reads. Multiple collections or signals can live on the same node.
How this will assist
Usually signal processing works over a time span, e.g. process a signal for 3 days; in that case you can read from 3 nodes for the signal in parallel and do parallel Apache Spark processing.
Cross-signal processing: typically most signal-processing algorithms use the same period of 2 or more signals for analysis (such as cross-correlation), and as these 2 or more signals are fetched in parallel this will also be fast, and the pre-processing of individual signals can be parallelized.

Mongo Architecture Efficiency

I am currently working on designing a local content-based sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling and overall long-term maintainability.
Our system has a library of topics, and each topic is available in specific cities/metropolitan areas. When a person creates a piece of content, it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (and I am open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: a collection name would be TopicID123CityID456 and each entry would obviously be a document within that collection.
Option 2 (Single Topic Collection):
Example: A collection name would be Topic123 and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection):
Example: A collection name would be City456 and each entry would create a document that contains an indexed topicID
When querying the DB I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems to be the best; however, I am concerned about the long-term performance of this approach. Option 1 seems like it would be the most performant, but it also forces multiple queries whenever more than one topic is selected.
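For comparison, if topic and city are fields in one content collection, the multi-topic, single-city feed can be one indexed query; a pymongo sketch with illustrative names:

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
content = client.app.content   # hypothetical single content collection

# Compound index matching the feed query: filter by city and topic, newest first.
content.create_index([("cityId", ASCENDING),
                      ("topicId", ASCENDING),
                      ("createdAt", DESCENDING)])

feed = (content
        .find({"cityId": 456, "topicId": {"$in": [123, 124, 125]}})
        .sort("createdAt", DESCENDING)
        .limit(50))
```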
Another thing that I need to consider is that some topics will be far more active and grow much larger than others, which will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is ideal before coding all of the logic around writing and retrieving the data. I also don't know how well Mongo performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty about the approach.
From experience which is the most optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a single-DB-server environment. #profesor79 provided a great scaling solution for when we need to move to a multi-server (sharded) environment.
From your 3 proposals I will pick number 4 :-)
Have one collection sharded over multiple servers.
Instead of one collection per topic/city pair, we can have a single collection covering all topics and all cities.
That topicCities collection will then hold all the documents, sharded.
Sharding on the key {topic: 1, city: 1} will balance the load across the shard servers, and any time you need more power you can add another shard to the cluster.
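A sketch of that setup with pymongo against a mongos endpoint (the app database name is an assumption; the topicCities collection and compound key follow the answer):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")

client.admin.command("enableSharding", "app")
client.admin.command(
    "shardCollection",
    "app.topicCities",
    key={"topic": 1, "city": 1},   # the compound shard key described above
)

# Adding capacity later is an administrative addShard on the cluster,
# not a schema change; chunks are rebalanced across shards automatically.
```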
Any comments welcome!

Is it better to have one collection with a billion or one thousand with one million objects?

How much will performance differ between one NoSQL database (MongoDB) containing a single collection - logs - with 1 billion entries, and one containing one thousand collections (logs_source0, logs_source1, ...) with one million entries each? Will this change if the data is sharded across multiple servers? Objects contain between 6 and 10 keys and sometimes one array of 3-5 objects. The design of the application can use either of these, as _sourceX can easily be turned into an extra key or vice versa.
As long as all that data is on a single server, having a single big collection or many small ones should not make too much of a difference. As any performance question, a thorough answer would have to take your intended usage of that data into account. Are you frequently accessing all of that data? Or do you have a comparatively small working set of data that is frequently accessed, while the rest is very rarely looked at?
Having many small collections could be better when it comes to selectively paging some of that data into memory. A single big collection can, of course, also be paged into memory selectively, but at least the indexes would have to be entirely within memory if at all possible, to ensure quick access to the data. With many smaller collections, that would be easier since each collection would have its own, small indexes.
However, MongoDB's sharding is meant to solve exactly that problem (maintaining huge amounts of data), and it does so by keeping everything in a single logical collection, but distributing that collection automatically over as many shards as you like. This is far more flexible than creating those individual collections yourself. Among other things, it allows data to be rebalanced over time to make sure that each shard has an equal portion of that data. It is also more flexible to adapt to different numbers of shards, while your multi-collection scheme seems to rely on a rather fixed partitioning of the data (according to source #).
With sharding, the application would be completely unaware of the distribution patterns, and you could add or remove as many shards as you want, transparently, to handle the volume of your data.

Is it a good idea to generate per-day collections in MongoDB

Is it a good idea to create per-day collections for the data of a given day (we could start with per-day and then move to per-hour collections if there is too much data)? Is there a limit on the number of collections we can create in MongoDB, and is maintaining so many collections an overhead for MongoDB? Does a large number of collections have any adverse effect on performance?
To give you more context, the data will be more like Facebook feeds, and only the latest data (say the last week or month) really matters to us. Making per-day collections keeps the number of documents low and would probably result in fast access. Even if we need old data, we can fall back to the older collections. Does this make sense, or am I heading in the wrong direction?
What you actually need is to archive the old data. I would suggest you take a look at this thread on the MongoDB mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
The last post there, from Michael Dirolf (10gen), says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
So I guess you can stay with a single collection, and good indexes will do the work.
Anyhow, if the collection gets too big you can always run a manual archive process.
Yes, there is a limit to the number of collections you can make. From the Mongo documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even so, it would take something like 60 years to hit that limit.
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users have feeds updated in a week, you're in a bit of a tight spot. It's not easy/trivial to query across collections.
I would recommend instead making one collection to store the data and simply move data out periodically as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.
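Such a job could be as simple as the following pymongo sketch (the collection names and the createdAt field are assumptions), run from cron weekly or monthly:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.app
feeds, archive = db.feeds, db.feeds_archive    # hypothetical collection names

def archive_older_than(days: int, batch_size: int = 1000) -> None:
    """Copy documents older than the cutoff into the archive, then remove them."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    while True:
        batch = list(feeds.find({"createdAt": {"$lt": cutoff}}).limit(batch_size))
        if not batch:
            break
        archive.insert_many(batch)             # documents keep their _id values
        feeds.delete_many({"_id": {"$in": [doc["_id"] for doc in batch]}})

archive_older_than(days=30)
```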
Creating a collection is not much overhead, but the overhead is larger than that of creating a new document inside a collection.
There is a limit on the number of collections that you can create: http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces
To me, making new collections won't bring any performance difference, because in RAM you cache only the data that you actually query; in your case that will be the recent feeds, etc.
But having per-day/hour collections will make it very easy to archive old data.