MongoDB text search for a large collection

Given the collection below, which could grow to ~18 million documents, I need search functionality on the payload part of the document.
Because of the large volume of data, will it create performance issues if I create a text index on the payload field? Are there any known performance issues when the collection contains millions of documents?
{
"_id" : ObjectId("5575e388e4b001976b5e570d"),
"createdDate" : ISODate("2015-06-07T05:00:34.040Z"),
"env" : "prod",
"messageId" : "my-message-id-1",
"payload" : "message payload typically 500-1000 bytes of string data"
}
I use MongoDB 3.0.3

I believe that is exactly what NoSQL databases were designed to do: give you quick access to a piece of data via an [inverted] index. Mongo is designed for that. NoSQL databases like Mongo are built to handle massive sets of data distributed across multiple nodes in a cluster; 18 million documents is pretty small in Mongo's terms. You should not have any performance problems if you index properly. You might also want to read up on sharding; it is key to getting the best performance out of your MongoDB.
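For reference, a minimal sketch of what that text index and query could look like (the collection name messages is just a placeholder):
// Create a text index on the payload field.
db.messages.createIndex({ payload: "text" })
// Query the text index via $text and sort by relevance score.
db.messages.find(
  { $text: { $search: "payload bytes" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })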

You can use the MongoDB Atlas Search feature, which lets you search your text using the different analyzers that MongoDB provides. You can then do a fuzzy search, where text close to your query text will also be returned:
PS: For an exact full-text match, and to ignore fuzziness, just exclude the fuzzy object from below.
db.collection.aggregate([
  {
    $search: {
      index: 'analyzer_name_created_from_atlas_search',
      text: {
        query: 'message payload typically 500-1000 bytes of string data',
        path: 'payload',
        fuzzy: {
          maxEdits: 2
        }
      }
    }
  }
])

Related

Combining multiple collections, but each individual query might not need all columns

Here is the scenario:
We have 2 tables (issues, anomalies) in BigQuery, which we plan to combine into a single document in MongoDB, since the 2 collections (issues, anomalies) contain data about a particular site.
[
{
"site": "abc",
"issues": {
--- issues data --
},
"anomalies": {
-- anomalies data --
}
}
]
There are some queries which require the 'issues' data, while others require the 'anomalies' data.
In the future, we might need to show the 'issues' & 'anomalies' data together, which is why I'm planning to combine the two in a single document.
Questions on the approach above, with respect to performance/volume of data read:
When we read the combined document, is there a way to read only specific fields (so the volume of data read is not huge)?
Or does this mean that when we read the document, the entire document is loaded into memory?
Pls let me know.
tia!
UPDATE:
Going over the MongoDB docs, we can use projections to pull only the required fields from MongoDB documents.
In that case, the data transferred over the network is only the specific fields that are read.
However, the MongoDB server will still have to select the specific fields from the documents.
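For example, a projection that reads only the issues part of a site's document might look like this (the collection name sites is an assumption):
// Return only the "issues" sub-document for site "abc"; suppress _id as well.
db.sites.find(
  { site: "abc" },
  { issues: 1, _id: 0 }
)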

MongoDB for BI use

Actually we want to use MongoDB for some BI processing and we don't know which schema is better suited to our case. Imagine we have 100,000 records describing the sales of a certain network; do we have to put all this data in one array? (like this)
{
"_id" : ObjectId()
"dataset" : "set1",
"values" : [
{"property":"value_1"},
.
.
.
.
{"property":"value_100000"}
]
}
Or for each entry a document? (like this)
{"_id: ObjectId(), "property":"value_1"}
.
.
.
{"_id: ObjectId(), "property":"value_100000"}
Or, simply, what is the ideal way to model this use case?
Embedding is better for:
Small subdocuments
Data that does not change regularly
When eventual consistency is acceptable
Documents that grow by a small amount
Data that you'll often need a second query to fetch
Fast read speed
References are better for:
Large subdocuments
Volatile data
When immediate consistency is necessary
Documents that grow by a large amount
Data that you often exclude from the document
Fast write speed
- From "MongoDB: The Definitive Guide"
A reference looks something like this:
{'_id':ObjectId("123"),'cousin':ObjectId("456")}
It refers to its cousin through that document's ObjectId, something like a foreign key in SQL.
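Resolving such a reference typically takes a second query. A small sketch, assuming a collection called people:
// Fetch a document that holds the reference...
var person = db.people.findOne()
// ...then follow the reference with a second query.
var cousin = db.people.findOne({ _id: person.cousin })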

Which mongo document schema/structure is correct?

I have two document formats and I can't decide which is the Mongo way of doing things. Are the two examples equivalent? The idea is to search by userId and have userId indexed. It seems to me the performance will be equal for either schema.
multiple bookmarks as separate documents in a collection:
{
userId: 123,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
userId: 123,
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
},
{
userId: 456,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
multiple bookmarks within one document per user:
{
userId: 123,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
}
]
},
{
userId: 456,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
]
}
The problem with the second option is that it causes growing documents. Growing documents are bad for write performance, because the database will have to constantly move them around the database files.
To improve write performance, MongoDB always writes each document as a consecutive sequence to the database files with little padding between each document. When a document is changed and the change results in the document growing beyond the current padding, the document needs to be deleted and moved to the end of the current file. This is quite a slow operation.
Also, MongoDB has a hardcoded limit of 16MB per document (mostly to discourage growing documents). In your illustrated use-case this might not be a problem, but I assume that this is just a simplified example and your actual data will have a lot more fields per bookmark entry. When you store a lot of meta-data with each entry, that 16MB limit could become a problem.
So I would recommend you to pick the first option.
I would go with option 2 - multiple bookmarks within one document per user - because this schema takes advantage of MongoDB's rich documents, also known as "denormalized" models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
There are two tools that allow applications to represent these
relationships: references and embedded documents.
When designing data models, always consider the application usage of
the data (i.e. queries, updates, and processing of the data) as well
as the inherent structure of the data itself.
The second type of structure is an embedded type.
Generally, an embedded structure should be chosen when your application needs:
a) better performance for read operations.
b) the ability to request and retrieve related data in a single database operation.
c) data consistency, i.e. the ability to update related data in a single atomic write operation.
In MongoDB, operations are atomic at the document level. No single
write operation can change more than one document. Operations that
modify more than a single document in a collection still operate on
one document at a time. Ensure that your application stores all fields
with atomic dependency requirements in the same document. If the
application can tolerate non-atomic updates for two pieces of data,
you can store these data in separate documents. A data model that
embeds related data in a single document facilitates these kinds of
atomic operations.
d) to issue fewer queries and updates to complete common operations.
When not to choose:
Embedding related data in documents may lead to situations where
documents grow after creation. Document growth can impact write
performance and lead to data fragmentation. (limit of 16MB per
document)
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation to be applied on all the documents.
The minimum set of stages required to get the aggregated result is $match and $group (with the $push operator):
db.collection.aggregate([{$match:{"userId":123}},{$group:{"_id":"$userId","bookmarkNames":{$push:"$bookmarkName"},"bookmarkUrls":{$push:"$bookmarkUrl"}}}])
or a find() which returns multiple documents to be iterated.
Whereas the embedded type allows us to fetch the bookmarks using a simple find query:
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. We can view the first type as an unwound form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection,
is normally used for logging, where the log entries are huge and have a TTL (time to live). The collections in that case would be capped collections, where documents are automatically deleted after a particular period of time.
Bottom line: if your document size will not grow beyond 16 MB at any particular time, opt for the embedded type; it will save development effort as well.
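For illustration, a rough sketch of the embedded option in practice (the collection name users and the added bookmark are just placeholders): looking up a user's bookmarks and adding a new one each touch a single document.
// Index the field used for lookups.
db.users.createIndex({ userId: 1 })
// Adding a bookmark is one atomic update on one document.
db.users.update(
  { userId: 123 },
  { $push: { bookmarks: { bookmarkName: "bing", bookmarkUrl: "www.bing.com" } } }
)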
See Also: MongoDB relationships: embed or reference?

Store Records without keys in mongodb or document mapping with keys

MongoDB always stores records in a form like this:
{
'_id' : '1' ,
'name' : 'nidhi'
}
But I want to store it like this:
{ 123 , 'nidhi'}
I do not want to store the keys again and again in the database.
Is this possible with MongoDB, or with any other database?
Is there anything SQL-like possible in NoSQL, where I define the schema first, as in MySQL, and then start inserting values into documents?
That is not possible with MongoDB. Documents are defined by key/value pairs. That is because BSON (Binary JSON) – the internal storage format of MongoDB – was developed from JSON (JavaScript Object Notation). And without keys, you couldn't query the database at all, except by rather awkward positional parameters.
However, if disk space is so precious, you could revise your modeling to something like:
{ _id:1,
values:['nidhi','foo','bar','baz']
}
However, since disk space is relatively cheap compared to computing power (not a resource MongoDB uses a lot, though) and RAM (rule of thumb for MongoDB: the more, the better), your approach doesn't make sense. For a REST API to return a record to the client, all you have to do is (pseudo code):
var docToReturn = collection.findOne({_id:requestedId});
if(docToReturn){
response.send(200,docToReturn);
} else {
response.send(404,{'state':404,'error':'resource '+requestedId+' not available'});
}
Even if the data were queryable with your approach, you would have to map the returned values back to meaningful keys. And how would you guarantee that every document stores its values in the same order? Or deal with the fact that MongoDB has dynamic schemas, so that one doc in the collection may have a totally different structure than another?
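For illustration, a hypothetical sketch of what the client side would have to do with such a keyless array model (the field names name and city are made up):
// The document only carries positions, so the application must know,
// out of band, which position means what.
var doc = db.collection.findOne()   // e.g. { _id: 1, values: ['nidhi', 'foo', 'bar', 'baz'] }
var record = { name: doc.values[0], city: doc.values[1] }   // the meaning lives in the code, not in the data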

Using Large number of collections in MongoDB

I am considering MongoDB to hold the data of our campaign logs:
{
"domain" : ""
"log_time" : ""
"email" : ""
"event_type" : "",
"data" : {
"campaign_id" : "",
"campaign_name" : "",
"message" : "",
"subscriber_id" : ""
}
}
The above is our event structure. Each event is associated with one domain;
one domain can contain any number of events, and there is no relation between one domain and another.
Most of our queries are specific to one domain at a time.
For quick query responses I'm planning to create one collection per domain, so that I can query a particular domain's collection instead of querying the whole data set containing all domains' data.
We will have at least 100k+ domains in the future, so I would need to create 100k+ collections.
We are expecting 1 million+ documents per collection.
Our main intention is to index only the required collections; we don't want to index the whole data set, which is why we are planning one collection per domain.
Which approach is better for my case?
1. Storing all domains' events in one collection
(or)
2. Each domain's events in a separate collection
I have seen some questions on the maximum number of collections that MongoDB can support, but I didn't get clarity on this topic. As far as I know, we can extend the default namespace limit of 24k, but if I create 100k+ collections, how will performance be affected?
Is this solution (using a very large number of collections) the right approach for my case?
Please advise on my approach; thanks in advance.
Without some hard numbers, this question would probably be just opinion based.
However, if you do some calculations with the numbers you provided, you will get to a solution.
So your total document count is:
100K collections x 1M documents = 100 G (100,000,000,000) documents.
From your document structure, I'm going to make a rough estimate and say that the average size of each document will be 240 bytes (it may be even higher).
Multiplying those two numbers, you get ~21.82 TB of data. You can't store this amount of data on just one server, so you will have to split your data across multiple servers.
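Spelled out as a quick back-of-the-envelope calculation (the 240 bytes/document average is an assumption):
var totalDocs  = 100000 * 1000000   // 100K collections x 1M documents = 1e11 documents
var totalBytes = totalDocs * 240    // ~2.4e13 bytes
totalBytes / Math.pow(1024, 4)      // ~21.8 TiB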
With this amount of data, your problem isn't anymore one collection vs multiple collections, but rather, how do I store all of this data in MongoDB on multiple servers, so I can efficiently do my queries.
If you have 100K collections, you can probably do some manual work and store e.g. 10 K collections per MongoDB server. But there's a better way.
You can use sharding and let the MongoDB do the hard work of splitting your data across servers. With sharding, you will have one collection for all domains and then shard that collection across multiple servers.
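A rough sketch of what that could look like (the database name campaigns, collection name events, and the compound shard key are assumptions; this runs against a mongos):
// Enable sharding for the database and shard one events collection on domain.
sh.enableSharding("campaigns")
// log_time is added to the shard key so a single very large domain can still be split into chunks.
db.getSiblingDB("campaigns").events.createIndex({ domain: 1, log_time: 1 })
sh.shardCollection("campaigns.events", { domain: 1, log_time: 1 })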
I would strongly recommend you to read all documentation regarding sharding, before trying to deploy a system of this size.