I need to tag documents in a collection, let's call it 'Contacts'.
The first idea I had was to create an attribute called "tags" for each document.
Well, in this case we have something like:
{
_id:'1',
contact_name:'Asya Kamsky',
tags:['mongodb', 'maths', 'travels']
}
Now, let's suppose that we have users that want to tag any document in 'Contacts'.
If we keep the decision to save the tags attribute for each document, as the tags are personal, we need to use the userId for each tag.
So our document would be something like that (or not):
{
_id:'1',
contact_name:'Asya Kamsky',
tags:[
{userId:'alex',tags:['mongodb', 'maths', 'travels']},
{userId:'eric',tags:['databases', 'friends', 'japan']},
]
}
Now, let's complicate it a bit. Let's imagine that we have A LOT of users and each one want to tag documents with his personal tags.
How to deal with that?
Ok, we could create thousands of tags for each document:
{
_id:'1',
contact_name:'Asya Kamsky',
tags:[
{userId:'alex',tags:['mongodb', 'maths', 'travels']},
{userId:'eric',tags:['databases', 'friends', 'japan']},
{.....................................................}
{.....................................................}
{......................................................}
]
}
But, what if we have millions of users? In this case we have a 16mg limitation for each document, as I know....
At this point, worrying about the future growth of my application, I decided
to create a nice separated collection called 'tags' that would contain documents similar to:
{
"contact_name" : "Asya Kamsky",
"useriId" : "alex",
"tags" : ['mongodb', 'maths', 'travels'],
"timestamp" : "2017-08-08 14:33:28"
},
{
"contact_name" : "Asya Kamsky",
"useriId" : "eric",
"tags" : ['databases', 'friends', 'japan'],
"timestamp" : "2017-08-08 14:33:28"
}
That's, we have a separated documents that represent a tag of each user.
Cool and clean, right?
Well, i this case, we face 2 problems:
Minor problem: We return to the SQL logic that I don't like anymore but I accept in some cases.
Big (for me) problem: how to search a contact by PERSONAL tags? In this case we have a nice 'JOIN' problem that MongoDB resolves well using $lookup.
"Resolves well" for 10000, 20000, or even 500000 documents. But as I want to ensure a good performance in the future, I think about 10000000 contacts. So, as I researched recently, the $lookup works well for a "small part" of universe and, even with indexes, this search would take a lot of time to be executed.
How to resolve this challenge?
Thanks all
If your usage is such that the number of users X number/size of tags per contact (plus whatever other data is in a contacts document) is likely to bring you near the 16MB document size limit then storing the tags ins a separate collection seems valid. But before you go down that route are you sure this is likely? Have you tried creating contact documents in a bid to see how many tags, how many users per contact would get you near the 16MB limit. If the answer implies a number of users and/or tags which you are unlikely ever to reach then maybe your concerns are strictly theoretical and you could consider sticking with the simplest solution which is to embed the user specific tags inside contacts.
The rest of this answer assumes that the size estimates and your knowledge about the likely number of tags and users per contact are such that the size constraints are valid. On this basis, you stated this specific concern about join performance ...
But as I want to ensure a good performance in the future, I think about 10000000 contacts. So, as I researched recently, the $lookup works well for a "small part" of universe and, even with indexes, this search would take a lot of time to be executed.
Have you tried measuring this performance? Generate seed documents for contacts and tags and then persist variations of these and then run queries using $lookup and measure the performance. You could do this for a few benchmarks, for example:
1,000 contacts and 10,000 tags
100,000 contacts and 1,000,000 tags
1,000,000 contacts and 10,000,000 tags
10,000,000 contacts and 100,000,000 tags
When running your benchmark tests you can additionally use explain() to understand what's going on inside MongoDB.
You might find that performance is acceptable, only you can know this since you understand what expectations the users of your system have with respect to performance.
One last point, if the use case here is that a given user wants to find all of their contacts and tags then this could be handled with a 'client side join' i.e. two queries (1) to get the tags for "userId" : "..." and (2) to find the contacts referenced by those tags. Depending on what your use cases are, this could be more performant that a server side join (aka $lookup).
Related
I have a collection named Company which has the following structure:
{
"_id" : ObjectId("57336ea1a7454c0100d889e4"),
"currentMonth" : 62,
"variables1": { ... },
...
"variables61": { ... },
"variables62" : {
"name" : "Test",
"email": "email#test.com",
...
},
"country" : "US",
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
Do you have any suggestions that would help me solve this problem? Any suggestion on the steps needed to set up the above system?
Usually with MongoDB you can add new fields to documents and existing applications would simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
Create a task that is regularly executed which goes through all documents in your collection, figures out the name for each document from its fields, then writes the name into a top-level field.
Add an index on that field.
In your search code, look up by the values of that field.
Compare the calculated name to the source-of-truth name. If different, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name and step 4 is not needed.
Using the change detection pattern with monstache, I was able to synchronise in real time MongoDB with ElasticSearch, performing a Filter based on the current month and then Map the result of the variables to be indexed 🎊
I have a collection called calls containing properties DateStarted, DateEnded, IdAccount, From, To, FromReversed, ToReversed. In other words this is how a call document looks like:
{
_id : "LKDJLDKJDLKDJDLKJDLKDJDLKDJLK",
IdAccount: 123,
DateStarted: ISODate('2020-11-05T05:00:00Z'),
DateEnded: ISODate('2020-11-05T05:20:00Z'),
From: "1234567890",
FromReversed: "0987654321",
To: "1231231234",
ToReversed: "4321321321"
}
On our website we want to give customers the option to search by custom calls. When they search for calls they must specify the DateStarted and DateEnded Those fields are required the other ones are optional. The IdAccount will be injected on our end so that the customer can only get calls that belong to his account.
Because we have about 5 million records we have created the following indexes
db.calls.ensureIndex({"IdAccount":1});
db.calls.ensureIndex({"DateStarted":1});
db.calls.ensureIndex({"DateEnded":1});
db.calls.ensureIndex({"From":1});
db.calls.ensureIndex({"FromReversed":1});
db.calls.ensureIndex({"To":1});
db.calls.ensureIndex({"ToReversed":1});
The reason why we did not created a compound index is because we want to be able to search by custom criteria. For example we may want to search by all calls with date smaller than December 11 and from a specific account.
Because of the indexes all these queries execute very fast:
db.calls.find({'DateStarted' : {'$gte': ISODate('2020-11-05T05:00:00Z')}).limit(200).explain();
db.calls.find({'DateEnded' : {'$lte': ISODate('2020-11-05T05:00:00Z')}).limit(200).explain();
db.calls.find({'IdAccount' : 123 ).limit(200).explain();
// etc...
Even queries that use regexes execute very fast. They only work fast if I use ^... meaning that it must start with a search pattern as:
db.calls.find({ 'From' : /^305/ ).limit(200).explain();
and that is the reason why we created the field FromReversed and ToReversed. If I want to search for a To phone number that ends with 3985 I will execute:
db.calls.find({ 'ToReversed' : /^5893/ ).limit(200).explain(); // note I will have to reverse the search option to
So the only queries that are slow are the ones that do not start with something such as this query:
db.calls.find({ 'ToReversed' : /1234/ ).limit(200).explain();
Question
Why is it that if I combine all the queries it is very slow? For example this query is very slow:
db.calls.find({
'DateStarted':{'$gte':ISODate('2018-11-05T05:00:00Z')},
'DateEnded':{'$lte':ISODate('2020-11-05T05:00:00Z')},
'IdAccount':123,
'ToReversed' : /^5893/
}).limit(200).explain();
The problem is the 'ToReversed' : /^5893/. If I execute that query by itself it is really fast. Even if I put something that does not give me the limit of 200 results fast. Should I add a compound index as well? just for the scenario where it is slow
I need to give our customers the option to search by phone numbers that end with or start with a specific criteria. The moment I add extra stuff to the query it becomes really slow.
Edit
By researching on the internet if I use the hint option it is faster. It goes from 20 seconds to 5 seconds.
db.calls.find({
'DateStarted':{'$gte':ISODate('2018-11-05T05:00:00Z')},
'DateEnded':{'$lte':ISODate('2020-11-05T05:00:00Z')},
'IdAccount':123,
'ToReversed' : /^5893/
}).hint({'ToReversed':1}).limit(200).explain();
This is still slow and it will be great if I can lower it to 1 second just like the simple queries take milliseconds.
For the find query you showed us involving filtering on 4 fields, ideally the optimal index would cover all 4 fields:
db.calls.createIndex( {
"DateStarted": 1,
"DateEnded": 1,
"IdAccount": 1,
"ToReversed": 1
} )
As to which columns should appear first, you should generally place the most restrictive columns first. Check the cardinality of your data to determine this.
I've read a lot of documentation and examples here in Stackoverflow but I'm not really sure about my conclusions so this is why I'm askingfor help.
Imagine we have a collection Films and a collection Users and we want to know, which users have seen a film, and which films has seen an user.
One way to design this in MongoDb is:
User:
{
"name":"User1",
"films":[filmId1, filmId2, filmId3, filmId4] //ObjectIds from Films
}
Film:
{
"name": "The incredible MongoDb Developer",
"watched_by": [userId1, userId2, userId3] //ObjectsIds from User
}
Ok, this may work if the amount of users/films is low, but for example if we expect that one film will have a 800k users the size of the array will be near to: 800k * 12 bytes ~ 9.5MB which is nearly to the 16MB max for a BSON file.
In this case, there are other approach than the typical relational-world way that is create an intermediate collection for the relations?
Also I don't know if read and parse a JSON about 10MB will have a better performance in comparison with the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
Putting the films a user has seen into an array is a viable way, depending on your use cases. Especially if you want to have relations with attributes (say date and place of viewing), doing updates and statistical analysis becomes less performant (you would need to $unwind your docs first, subsequent $matches become more costly and whatnot).
If your relations have or may have attributes, I'd go with what you describe as the classical relational way, since it answers your most likely use cases as good as embedding and allow for higher performance from my experience:
Given a collection with a structure like
{
_id: someObjectId,
date: ISODate("2016-05-05T03:42:00Z"),
movie: "nameOfMovie",
user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies has he seen in the last 3 month, in descending order of date?
db.views.aggregate([
{$match:{user:userName, date:{$gte:threeMonthAgo}}},
{$sort:{date:-1}},
{$group:{_id:"$user",viewed:{$push:{movie:"$movie",date:"$date"}}}}
])
or, if you are ok with an iterator, even easier with:
db.views.find({user:username, date:{$get:threeMonthAgo}}).sort({date:-1})
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
{$match:{
movie:movieName,
date{
$gte:ISODate("2016-05-30T00:00:00"),
$lt:ISODate("2016-05-31T00:00:00")}
}},
{$group:{
_id: "$movie",
views: {$sum:1}
}}
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645
For a given movie, show all users which have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
There is a thing to note: Since we used the usernames and movie names, respectively, we do not need a JOIN (or something similar), which should give us good performance. Plus we do not have to do rather costly update operations when adding entries. Instead of an update, we simply insert the data.
I have a list of about 50 tags in an array, and want to search through my documents to find records that match these tags.
Because they're user-submitted and mongoDB is case-sensitive, I'm using /wildcard/i as a means of searching. I know this is not the fastest way to do a search but I can't think of a better solution.
I can do my query in two ways. The first is to run a for loop over my tags array, and for each result, perform:
db.collection.find({tags: /<tag[x]>/i})
Or, I can collect all of the tags and run one single lookup using $or, like so:
db.collection.find({$or:[{tags:/<tag1>/i},{tags:/<tag2>/i},{tags:/<tag3>/i}, ... {tags:/<tag50>/i}]});
I have tried both, and found using $or to be significantly faster - but because of the work-in-progress state of my application, it's very difficult to tell whether this is because it's actually faster or whether my app is causing significant overhead in other areas (it is).
So for clarification, in MongoDB is a big query performed once faster than small queries performed many times?
EDIT: Another example would be whether looking up 3 individual records based on _id is faster than doing one lookup using {$or:[{_id: ObjectId([id1])},{_id: ObjectId([id2])},{_id: ObjectId([id3])}]}. Is less more?
I recommend you adjust your schema so it keeps a normalized array of tags. When you insert a new document, do it like this:
tags : [ "business", "Computing", "PayPal" ],
lowercaseTags : [ "business", "computing", "paypal" ]
Similarly when you update the tags, update both arrays.
Create an index on lowercaseTags, and then when you want to query them, use a single query with the $in operator, and the normalized form of the search terms.
For example, to search for business iTunes YouTube, use this query:
db.collection.find( { tags : $in: [ "business", "itunes", "youtube" ] } )
This answer gives an example of this approach. It should be loads faster than what you have.
An alternate approach you can take is to create a text index and use the text command.
Both of these approaches are geared toward index optimization, and designing your schema to work well with Mongo. The payoff should be a lot higher than whatever difference there is between a single $or query and 50 simpler queries.
I'm building simple Web App where users can vote.
What is the fastest way for checking if user has already voted. I'm interested in both relation databases and document based databases (mongodb,...)
I have few ideas but I am sure they can be improved:
Relation databases
Create a seperate table for voting:
|userid|articleid|
Before incrementing articles vote check if there is a row including both userid and articleid. We have two queries. Is possible to improve this with triggers? For example:
|useridarticleid| unique column
Before vote generate useridarticleid on application side. Try to insert useridarticleid. Trigger will fire if field is new and it will increment our vote column in article.
Document based
This is a bit more trickier. So having document structured like so:
{
"id": "123",
"content": "something",
"num_votes": 2,
"votes" : [
"userid1",
"userid2"
]
}
First "query" - check if userid is in votes array. Second "query" - Increment num_votes if not.
Again two queries. So I thought we can change this but I don't know really if it will increase performance:
Insert userid in votes array. When user want to check article "count" votes in array. But I think it possible that performance will drop because if traffic is high counting every article is a bit of waste. Imagine Reddit here.
Actually, it's a lot simpler in a document database. Your document structure is perfect for it.
{
"id": "123",
"content": "something",
"num_votes": 2,
"votes" : [
"userid1",
"userid2"
]
}
db.collection.update(
{id:"123", votes:{$ne:"userid"}},
{$push:{"votes":"userid"},$inc:{"num_votes":1}}
);
This will atomically update record id=123 adding userid to list of voters and incrementing votes by one only if userid is not already in the list of votes on this document.
So there is only one query and one update - and they are actually the same operation.
In a relational database |userid|articleid| would be the best approach, using both fields as primary keys.
In the second one you can also consider wther putting the votes in the user document, or in the article document.
Anyway, I'd suggest you really focus on creating a design, where changing all this decisions later is easy.
The different ways of designing this, favor things like "A lot of users at the same article at the same time" or "A lot of users in different articles", etc... Until you can see the real usage, you won't have enough information to decide which approach will work best and fastest... So create something that you can easily adapt to whatever information you learn later.
BTW: You might also consider don't counting the votes synchronically. I remember an article (which I can't find) where it mentioned that you tube votes numbers weren't actually "accurate"... They put an estimation of the current votes, and calculated the real number in a background worker thread.