Creating multiple collections from a very large collection and using those smaller collections for every transaction - mongodb

In MongoDB, we have one large collection, which has every possible bit of information we use and it is being updated on a daily basis from different sources.
Now my question is: Is it a good idea to have multiple collections created from the original collection and to use these collection for the transaction from UI. These small collections are kept updated on daily basis using CRON.

Related

What is the best way to archive history data in mongo

I have a collection in mongo that stores every user action of my application, and its very huge in size (3Million documents per day). On UI I have a requirement to show the user actions for max. 6months period.
And the queries on this collection are becoming very slow with all the historic data, though there are indexes in place. So, I want to move the documents that are older than 6months to a separate collection.
Is it the right way to handle my issue?
Following are some of the techniques you can use to manage data growth in MongoDB:
Using capped collection
Using TTLs
Using mulitple collections for months
Using different databases on same host

Meteor MongoDB Server Aggregation into new Collection

I'm currently experimenting with a test collection on a LAN-accessible MongoDB server and data in a Meteor (v1.6) application. View layer of choice is React and right now I'm using the createContainer to bind the subscriptions to props.
The data that gets put in the MongoDB storage is updated on a daily basis and consists of a big set of data from several SQL databases, netting up to about 60000 lines of JSON per day. The data has been ever-so-slightly reshaped to be turned into a usable format whilst remaining as RAW as I'd like it to be.
The working solution right now is fetching all this data and doing further manipulations client-side to prepare the data for visualization. The issue should seem obvious: each client is fetching a set of documents that grows every day and repeats a lot of work on earlier entries before being ready to display. I want to do this manipulation on the server, through MongoDB's Aggregation Framework.
My initial idea is to do the aggregations on the server and to create new Collections containing smaller, more specific datasets without compromising the RAWness of the original Collection. That would mean the "reduced" Collections can still be reactive, as I've been able to confirm through testing in a Remote Desktop, subscribing to an aggregated Collection which I can update through Robo3T.
I don't know if this would be ideal. As far as storage goes, there's plenty of room for the extra Collections. But I have no idea how to set up an automated aggregation script on said server. And regarding Meteor, I've tried using meteorhacks:aggregate and jcbernack:reactive-aggregate but couldn't figure out how to deal with either one of them. If anyone is dealing, or has dealt with, something similar; I'd love to hear ideas / suggestions.

All vs All comparisons on MongoDB

We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However we have one use case where we will need to compare every document (of which there could be 10s of millions) with every other document. The 'distance measure' could be pre computed offline by another system but we are concerned about the online performance of MongoDB when we want to query - eg when we want to see the top 10 closest documents in the entire collection to a list of specific documents ...
Is this likely to be slow? Also can this be done across documents (eg query for the top10 closest documents in one collection to a document in another collection)...
Thanks in advance,
FK

When should I create a new collections in MongoDB?

So just a quick best practice question here. How do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be store within one collection with relevant data in the same document. Please explain why you chose the approach you did. (I'm still very new to MongoDB. I'm used to MySql.)
The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB doesn't support joins across collections, and I won't get into all of them here. But the main reason why we don't need joins is because we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
Normalizing Data
Even though MongoDB doesn't support joins, we can still store related data across multiple collections and still get to it all, albeit in a round about way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce any of key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support transactions, though it does guarantee that a single document write is atomic. It'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur for how many ever number of scenarios we can imagine. This is no longer the case since v4.0, however it is still more typical to denormalize data to avoid the need for transactions.
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Uhg! What data is used together? What data is used mostly as read-only? What data is written to frequently? Let your applications data access patterns drive your schema, not the other way around.
The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike in SQL, MongoDB lets you avoid thinking in terms of "JOINs"--in fact MongoDB doesn't even support them natively.
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary to SQL and RDBMS.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once

Best way to group meteor collections

The Meteor project I'm working on right now has multiple collections: Projects, Sensors, and Readings. Looking at it relationally, Users have many Projects, Projects have many Sensors, and Sensors have_many Readings.
What's the best way to organize this within Meteor? Right now I have created three discrete collections, and each item in the Sensors and Readings collections has a "parent" field with a mongo _id to relate it to the parent.
I'm not sure if this is the best way to go about it, the alternative being having one all-encompassing collection "Projects" that contains all the information about the sensors and readings that belong to it. When adding a new Sensor to a project, all you'd need to do is append a newly instantiated Sensor object to the Project.Sensors array.
What do you think?
It all depends on your use case. Will your app be more write or read heavy?
The way you've set it up right now is called a normalized fashion. You don't have redundant data and data is related to other data in other collections. (Like a table join in SQL). When you update data, you only have to update the small record it is part of. The downside is when you need to read a lot of data it takes a while longer because of all the different db queries.
The alternative you're suggesting is called denormalization, where you add redundant data to records or group collections into larger collections. This has the advantage that when reading data you don't need that many db look-ups as all the data you need is mostly in one record. On the other hand, writing to the db becomes more complex since you don't need to update one small record but one or more large records.
Either way, both methods are used, it just depends on the use case and how you want your app to perform (more read or more write performance).