mongodb - combining multiple collections, but each individual query might not need all columns

Here is the scenario:
We have 2 tables (issues, anomalies) in BigQuery, which we plan to combine into a single document in MongoDB, since both collections (issues, anomalies) hold data about a particular site.
[
{
"site": "abc",
"issues": {
--- issues data --
},
"anomalies": {
-- anomalies data --
}
}
]
There are some queries which require the 'issues' data, while others require 'anomalies' data.
In the future, we might need to show 'issues' & 'anomalies' data together, which is why I'm planning to combine the two in a single document.
Questions on the approach above, wrt performance/volume of data read:
When we read the combined document, is there a way to read only specific columns (so the data volume read is not huge)?
Or does this mean that when we read the document, the entire document is loaded in memory?
Pls let me know.
tia!
UPDATE:
Going over the MongoDB docs, we can use projections to pull only the required fields from MongoDB documents.
Also, in this case the data that is transferred over the network is only the specific fields that are read.
However, the MongoDB server will still have to select the specific fields from the documents.
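For example, a projection sketch against the combined document above (the collection name "sites" is just an assumption for illustration) returns only the 'issues' part and leaves 'anomalies' out of the result entirely:
// Hypothetical collection name "sites"; field names follow the combined document above.
// Only "site" and "issues" are returned, so "anomalies" is never sent over the network.
db.sites.find(
    { site: "abc" },                  // query filter
    { site: 1, issues: 1, _id: 0 }    // projection: include site + issues, exclude _id
)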

Related

Two mongodb collections in one query

I have a big collection of clients and a huge collection of the clients' data. The collections are separate, and I don't want to combine them into a single collection (because of the other already-working servlets), but now I need to "join" data from both collections in a single result.
Since the query should return a large number of results, I don't want to query the server once and then use the result to query again. I'm also concerned about the traffic between the server and the DB and the memory that the result set will occupy in the server RAM.
The way it's working now is that I get the relevant client list from the 'clients' collection and send this list to the query of the 'client data' collection and only then I get the aggregated results.
I want to cut out fetching the client list and sending it right back to the server; I want the server to do this by itself, letting the query on the client data collection ask the clients collection for the relevant client list.
How can I use a stored procedure (JavaScript functions) to do the query in the DB and return only the relevant clients out of the collection?
Alternatively, is there a way to write a query that joins results from another collection?
"Good news everyone", this aggregation query work just fine in the mongo shell as a join query
db.clientData.aggregate([{
$match: {
id: {
$in: db.clients.distinct("_id",
{
"tag": "qa"
})
}
}
},
$group: {
_id: "$computerId",
total_usage: {
$sum: "$workingTime"
}
}
}]);
The key idea with MongoDB data modelling is to be write-heavy, not read-heavy: store the data in the format that you need for reading, not in some format that minimizes/avoids redundancy (i.e. use a de-normalized data model).
I don't want to combine them to a single collection
That's not a good argument
I'm also concerned about the traffic between the server and the DB [...]
If you need the data, you need the data. How does the way it is queried make a difference here?
[...] and the memory that the result set will occupy in the server RAM.
Is the amount of data so large that you want to stream it from the server to the client, such that is transferred in chunks? How much data are we talking, and why does the client read it all?
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection
There are no stored procedures in MongoDB, but you can use server-side map/reduce to 'join' collections. Generally, code that is stored in and run by the database violates the separation of concerns between layers. I consider it one of the ugliest hacks of all time - but that's debatable.
Also, less debatable, keep in mind that M/R has huge overhead in MongoDB and is not geared towards real-time queries made e.g. in a web server call. These calls will take hundreds of milliseconds.
Is there a way to write a query that joins result from another collection ?
No, operations are constrained to a single collection. You can perform a second query and use the $in operator there, however, which is similar to a subselect and reasonably fast, but of course requires two round-trips.
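A rough sketch of that two-round-trip pattern, using the collection and field names from the question (treat them as assumptions):
// Round-trip 1: get the relevant client ids from the 'clients' collection.
var qaClientIds = db.clients.distinct("_id", { tag: "qa" });

// Round-trip 2: use $in to aggregate only those clients' data.
db.clientData.aggregate([
    { $match: { id: { $in: qaClientIds } } },
    { $group: { _id: "$computerId", total_usage: { $sum: "$workingTime" } } }
]);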
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection?
There are no stored procedures in MongoDB.
Alternatively, is there a way to write a query that joins results from another collection?
You normally don't need to do any joins in MongoDB, and there is no such operation. The flexibility of the document model already covers the typical need for joins. You should think about your document model first; asking how to design the joins out of your schema should always be your first port of call. As an alternative, you may need to use aggregation or Map-Reduce on the server side to handle this.
First of all, mnemosyn and Michael9 are right. But if I were in your shoes, also assuming that the client data collection is one document per client, I would store the document ID of the client data document in the client document to make the "join" (still no joins in Mongo) easier.
If you have multiple client data documents per client, then store an array of document IDs.
But all of this does not save you from having to implement the "join" in your application code; if it's a Rails app, then probably in your controller.
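A minimal sketch of that idea, with made-up documents and field names (e.g. dataIds) purely for illustration:
// Each client document stores the ids of its client-data documents.
db.clients.insertOne({ _id: 1, name: "Acme", tag: "qa", dataIds: [101, 102] });
db.clientData.insertMany([
    { _id: 101, computerId: "pc-1", workingTime: 5 },
    { _id: 102, computerId: "pc-2", workingTime: 3 }
]);

// The application-level "join": read the client, then fetch its data documents with $in.
var client = db.clients.findOne({ _id: 1 });
db.clientData.find({ _id: { $in: client.dataIds } });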

Which mongo document schema/structure is correct?

I have two document formats, and I can't decide which is the Mongo way of doing things. Are the two examples equivalent? The idea is to search by userId and have userId be indexed. It seems to me the performance will be equal for either schema.
multiple bookmarks as separate documents in a collection:
{
userId: 123,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
userId: 123,
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
},
{
userId: 456,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
multiple bookmarks within one document per user.
{
userId: 123,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
}
]
},
{
userId: 456,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
]
}
The problem with the second option is that it causes growing documents. Growing documents are bad for write performance, because the database will have to constantly move them around the database files.
To improve write performance, MongoDB writes each document as one contiguous block in the database files, with little padding between documents. When a document is changed and the change makes it grow beyond the current padding, the document needs to be deleted and moved to the end of the current file. This is quite a slow operation.
Also, MongoDB has a hardcoded limit of 16MB per document (mostly to discourage growing documents). In your illustrated use-case this might not be a problem, but I assume that this is just a simplified example and your actual data will have a lot more fields per bookmark entry. When you store a lot of meta-data with each entry, that 16MB limit could become a problem.
So I would recommend you to pick the first option.
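To illustrate why, here is a small sketch of option 1 in the shell (the collection name "bookmarks" is an assumption): adding a bookmark is a plain insert and never grows an existing document, while reads stay cheap with an index on userId.
// One document per bookmark; an index on userId keeps per-user reads fast.
db.bookmarks.createIndex({ userId: 1 });

// Adding a bookmark is an insert, not an in-place growth of an existing document.
db.bookmarks.insertOne({ userId: 123, bookmarkName: "bing", bookmarkUrl: "www.bing.com" });

// All bookmarks for one user: an indexed query returning several small documents.
db.bookmarks.find({ userId: 123 });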
I would go with option 2 - multiple bookmarks within one document per user - because this schema takes advantage of MongoDB's rich documents, also known as "denormalized" models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations. Link
There are two tools that allow applications to represent these relationships: references and embedded documents.
When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
The second type of structure represents an embedded type.
Generally, the embedded structure should be chosen when our application needs:
a) better performance for read operations.
b) the ability to request and retrieve related data in a single database operation.
c) data consistency, i.e. updating related data in a single atomic write operation.
In MongoDB, operations are atomic at the document level. No single write operation can change more than one document. Operations that modify more than a single document in a collection still operate on one document at a time. Ensure that your application stores all fields with atomic dependency requirements in the same document. If the application can tolerate non-atomic updates for two pieces of data, you can store these data in separate documents. A data model that embeds related data in a single document facilitates these kinds of atomic operations.
d) to issue fewer queries and updates to complete common operations (see the sketch right after this list).
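As a sketch of points c) and d) with the embedded model (collection name "users" assumed for illustration), adding a bookmark for a user is a single atomic update on one document:
// Embedded model: one atomic write updates the user's bookmark list.
db.users.updateOne(
    { userId: 123 },
    { $push: { bookmarks: { bookmarkName: "bing", bookmarkUrl: "www.bing.com" } } }
);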
When not to choose:
Embedding related data in documents may lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation (limit of 16MB per document).
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation to be applied on all the documents.
The minimum set of stages required to get the aggregated results is $match and $group (with the $push operator):
db.collection.aggregate([
    { $match: { "userId": 123 } },
    { $group: {
        "_id": "$userId",
        "bookmarkNames": { $push: "$bookmarkName" },
        "bookmarkUrls": { $push: "$bookmarkUrl" }
    } }
])
or a find() which returns multiple documents to be iterated.
Whereas the embedded type allows us to fetch it with a single find query on userId:
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. We could view the first type as an unwound form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection, is normally used for logging, where the log entries are huge and have a TTL (time to live). The collections in that case would be capped collections, or use TTL indexes, so that documents are automatically deleted after a particular period of time.
Bottom line: if your document size will not grow beyond 16 MB at any particular time, opt for the embedded type; it will save development effort as well.
See Also: MongoDB relationships: embed or reference?

Using Large number of collections in MongoDB

I am considering MongoDB to hold data of our campaign logs,
{
"domain" : "",
"log_time" : "",
"email" : "",
"event_type" : "",
"data" : {
"campaign_id" : "",
"campaign_name" : "",
"message" : "",
"subscriber_id" : ""
}
}
The above is our event structure; each event is associated with one domain.
One domain can contain any number of events, and there is no relation between one domain and another.
Most of our queries are specific to one domain at a time.
For quick query responses, I'm planning to create one collection per domain, so that I can query a particular domain's collection instead of querying the whole data set that contains every domain's data.
We will have at least 100k+ domains in the future, so I would need to create 100k+ collections.
We are expecting 1 million+ documents per collection.
Our main intention is to index only the required collections; we don't want to index the whole data set, which is why we are planning to have one collection per domain.
Which approach is better for my case?
1. Storing all domains' events in one collection
(or)
2. Each domain's events in a separate collection
I have seen some questions on the maximum number of collections that MongoDB can support, but I didn't get clarity on this topic. As far as I know, we can extend the default limit of 24k namespaces, but if I create 100k+ collections, will performance be affected?
Is this solution (using a huge number of collections) the right approach for my case?
Please advise on my approach, thanks in advance.
Without some hard numbers, this question would be probably just opinion based.
However, if you do some calculations with the numbers you provided, you will get to a solution.
So your total document count is:
100K collections x 1M documents = 100G (100,000,000,000) documents.
From your document structure, I'm going to do a rough estimate and say that the average size for each document will be 240 bytes (it may be even higher).
Multiplying those two numbers, you get ~21.82 TB of data. You can't store this amount of data on just one server, so you will have to split your data across multiple servers.
With this amount of data, your problem isn't anymore one collection vs multiple collections, but rather, how do I store all of this data in MongoDB on multiple servers, so I can efficiently do my queries.
If you have 100K collections, you can probably do some manual work and store e.g. 10 K collections per MongoDB server. But there's a better way.
You can use sharding and let the MongoDB do the hard work of splitting your data across servers. With sharding, you will have one collection for all domains and then shard that collection across multiple servers.
I would strongly recommend you to read all documentation regarding sharding, before trying to deploy a system of this size.
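As a rough sketch of what that could look like (database, collection, and shard-key names are assumptions, and the commands are run through a mongos router):
// Enable sharding for the (hypothetical) "campaigns" database.
sh.enableSharding("campaigns");

// A shard key starting with "domain" keeps one domain's events together,
// so per-domain queries can be routed to a subset of shards.
db.events.createIndex({ domain: 1, log_time: 1 });
sh.shardCollection("campaigns.events", { domain: 1, log_time: 1 });

// Per-domain queries then include the shard key prefix:
db.events.find({ domain: "example.com", event_type: "click" });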

Is it encouraged to query a MongoDB database in a loop?

In an SQL database, if I wanted to access some sort of nested data, such as a list of tags or categories for each item in a table, I'd have to use some obscure form of joining in order to send the SQL query once and then only loop through the result cursor.
My question is, in a NoSQL database such as MongoDB, is it OK to query the database repeatedly such that I can do the previous task as follows:
cursor = query for all items
for each item in cursor do
tags = query for item's tags
I know that I can store the tags in an array in the item's document, but I'm assuming that it is somehow not possible to store everything inside the same document. If that is the case, would it be expensive to requery the database repeatedly or is it designed to be used that way?
No, neither in Mongo nor in any other database should you query in a loop. One good reason for this is performance: in most web apps the database is the bottleneck, and developers try to make as few database calls as possible, whereas here you are trying to make as many as possible.
In Mongo you can do what you want in many ways. Some of them are:
putting your tags inside the document {itemName : 'item', tags : [1, 2, 3]}
knowing the list of elements, you do not need a loop to find information about them. You can fetch all results in one query with $in : db.tags.find({ field: { $in: [<value1>, <value2>, ... <valueN> ] }})
You should always try to fulfill a request with as few queries as possible. Keep in mind that each query, even when the database can answer it entirely from cache, requires a network roundtrip between application server, database and back.
Even when you assume that both servers are in the same datacenter and only have a latency of microseconds, these latency times will add up when you query for a large number of documents.
Relational databases solve this issue with the JOIN command. But unfortunately MongoDB has no support for joins. For that reason you should try to build your documents in a way that the most common queries can be answered by a single document. That means you should denormalize your data. When you have a 1:n relation, you should consider embedding the referencing documents as an array in the main document. Redundancy in your database is usually much more acceptable in MongoDB than it is in relational databases.
When you still have good reasons to keep the child-documents as separate documents, you should use a query with the $in operator to query them all at once, as Salvador Dali suggested in his answer.
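For the tag example from the question, here is a sketch of that pattern (collection and field names such as tagIds are assumptions): two round-trips in total instead of one query per item.
// Round-trip 1: fetch the items and collect their tag ids.
var items = db.items.find({}, { tagIds: 1 }).toArray();
var tagIds = [];
items.forEach(function (item) {
    tagIds = tagIds.concat(item.tagIds || []);
});

// Round-trip 2: resolve all tags at once with $in instead of querying in a loop.
db.tags.find({ _id: { $in: tagIds } });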

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging different actions users make on our website. Each action can be of a different type: a comment, a search query, a page view, a vote, etc... Each of these types has its own schema plus some common info. For instance:
comment : {"_id": (mongoId), "type": "comment", "date": 4/7/2012, "user": "Franck", "text": "This is a sample comment"}
search : {"_id": (mongoId), "type": "search", "date": 4/6/2012, "user": "Franck", "query": "mongodb"}
etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDB is schemaless, I'm inclined to set up a single collection ("Actions") where I would store these objects, instead of multiple collections (an Actions collection + a Comments collection with a link key to its parent Action, etc...).
My question is: what about performance / response time if I try to search by specific columns?
As I understand indexing best practices, if I want "every user searching for mongodb", I would index the columns "type" + "query". But that index would not cover the whole set of data, only the documents of type "search".
Will the MongoDB engine scan the whole collection, or merely focus on the data having this specific schema?
If you create sparse indexes, Mongo will ignore any documents that don't have the key. Though there is the specific limitation of sparse indexes that they can only index one field.
However, if you are only going to query using common fields there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs - there's no reason to create multiple collections
Tip: use date objects for dates, use object ids not names where appropriate.
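A small sketch of the single-collection approach for the question's example query ("every user searching for mongodb"), with an assumed collection name "actions" and the type + query index the question suggests:
// Compound index covering the common filter on the single "actions" collection.
db.actions.createIndex({ type: 1, query: 1, date: -1 });

// "Every user searching for mongodb": the (type, query) index prefix is used,
// so only the matching "search" documents are examined.
db.actions.find(
    { type: "search", query: "mongodb" },
    { user: 1, date: 1, _id: 0 }
);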
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data for a record is stored in a single document the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document. For large media documents, such as video, consider using GridFS, a convention implemented by all the drivers that stores the binary data across many smaller documents.