Dating app Schema pattern nosql, is array extraction viable? - mongodb

I'm trying to build a dating app, and for my backend, I'm using a nosql database. When it comes to the user's collection, some relations are happening between documents of the same collection. For example, a user A can like, dislike, or may haven’t had the choice yet. A simple schema for this scenario is the following:
database = {
"users": {
"UserA": {
"_id": "jhas-d01j-ka23-909a",
"name": "userA",
"geo": {
"lat": "",
"log": "",
"perimeter": ""
},
"session": {
"lat": "",
"log": ""
},
"users_accepted": [
"j2jl-564s-po8a-oej2",
"soo2-ap23-d003-dkk2"
],
"users_rejected": [
"jdhs-54sd-sdio-iuiu",
"mbb0-12md-fl23-sdm2",
],
},
"UserB": {...},
"UserC": {...},
"UserD": {...},
"UserE": {...},
"UserF": {...},
"UserG": {...},
},
}
Here userA has a reference from the users it has seen and made a decision, and stores them either in “users_accepted” or “users_rejected”. If User C hasn’t been seen (either liked or disliked) by userA, then it is clear that it won’t appear in both of the arrays. However, these arrays are unbounded and may exceed the max size that a document can handle. One of the approaches may be to extract both of these arrays and create the following schema:
database = {
"users": {
"UserA": {
"_id": "jhas-d01j-ka23-909a",
"name": "userA",
"geo": {
"lat": "",
"log": "",
"perimeter": ""
},
"session": {
"lat": "",
"log": ""
},
},
"UserB": {...},
"UserC": {...},
"UserD": {...},
"UserE": {...},
"UserF": {...},
"UserG": {...},
},
"likes": {
"id_27-82" : {
"user_give_like" : "userB",
"user_receive_like" : "userA"
},
"id_27-83" : {
"user_give_like" : "userA",
"user_receive_like" : "userC"
},
},
"dislikes": {
"id_23-82" : {
"user_give_dislike" : "userA",
"user_receive_dislike" : "userD"
},
"id_23-83" : {
"user_give_dislike" : "userA",
"user_receive_dislike" : "userE"
},
}
}
I need 4 basic queries
Get the users that have liked UserA (Show who is interested in userA)
Get the users that UserA has liked
Get the users that UserA has disliked
Get the matches that UserA has
The query 1. is fairly simple, just query the likes collection and get the users where "user_receive_like" is "userA".
Query 2. and 3. are used to get the users that userA has not seen yet, get the users that are not in query 2. or query 3.
Finally query 4. may be another collection
"matches": {
"match_id_1": {
"user_1": "referece_user1",
"user_2": "referece_user2"
},
"match_id_2": {
"user_1": "referece_user3",
"user_2": "referece_user4"
}
}
Is this approach viable and efficient?

You are right to notice, that these arrays are unbounded and pose a serious scalability problem for your application. If you were to assign 2-3 user roles to a user with the 1st approach it would be totally fine, but it is not the case for you. The official MongoDB documentation suggests that you should not use unbounded arrays: https://www.mongodb.com/docs/atlas/schema-suggestions/avoid-unbounded-arrays/
Your second approach is the superior implementation choice for you, because:
you can build indices of form (user_give_dislike, user_receive_like) which will improve your query performance even in case when you have 1M+ documents
you can store additional metadata, (like timestamps etc) on the likes collection without affecting the design of the user collection
the query for "matches" will be much simpler to write with this approach: https://mongoplayground.net/p/sFRvUniHKn8
More about NoSQL data modelling:
https://www.mongodb.com/docs/manual/data-modeling/ and
https://www.mongodb.com/docs/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/

To answer your question , let me write some more assumptions about the domain and then lets try to answer it.
Assumptions:
System should support scale for 100 million users
A single user might like or dislike ~100k users in its lifetime
Also some thoery about nosql , if our queries go to all the shards of the collection then max scale of the system depends on the scale of the single shard
Now with these assumptions see the query performance of the question that you asked :
Get the users that have liked UserA (Show who is interested in userA) -
Assuming we are doing sharding or user_give_like column then if we filter on user_receive_like then it will do query on all shards , which is not the right thing for scalability
Get the users that UserA has liked
This will work fine as we have created shard based on user_give_like
Get the users that UserA has disliked
This will work fine as we have created shard based on user_give_dislike
Get the matches that UserA has
In this case if we do a join between existing users and all users which UserA has liked and disliked this will create a parallel query on all shard and is not scalable when UserA like or dislike has huge count
Now to conclude this dosen't look like a reasonable approach to me.

Related

Index and sort on nested variable field MongoDB

I have a document that looks like this :
{
"_id": "chatID"
"presence": {
"userID1": 1647627240464,
"userID2": 1647227540464
},
}
I need to query for each userID the chats where he is present and order by the timestamp in the presence map.
I am aware that this is probably not the best way to do, before i had 1 element per user meaning duplicating the chatIDs, but it's a pain to update them all because it would look like :
{
"_id": "userID1chatID",
"at": 1647627240464,
"ids": "30EYwO01_Nyq7dMqe_O3vfL3AH",
"members": ["userID1", "userID2", "userID3"],
"owner": "userID1",
"present": true,
"uid": "chatID",
"url": "databaseURL"
}
This would allow me to find the chats where userID1 is present: true and order by at DESCENDING.
The problem with this is that i need to update the at attribute for all the documents (one per user) for this same chat room.
How can i do this same query while maintaining a single document with present as a map ?
Problem : the index would be on a variable : userID1, userID2, etc...
like : present.userID1 and seems to not be convenient for use when userID1 can be removed from the present map if the user leaves the chat.
Please let me know if this is unclear, thanks in advance.

MongoDB Schema optimization for contacts

I am storing my contacts in mongodb like this but main drawback of this schema is I am not able to store 40k-50k contacts in one document due to limit of 16mb.
I want to change my schema now. So can anyone please suggest me best way to redesign this.
Here is my sample doucument
{
"_id" : ObjectId("5c53653451154c6da4623a77"),
"contacts" : [{
name:"",
email:"",
group:[5c53653451154c6da4623a79]
}],
"groups" : [{
_id: ObjectId("5c53653451154c6da4623a79"),
group_name:"test"
}],
}
According to you document sample, contacts belongs to a group.
In that scenario, there are different ways to end up with a better schema:
1- Document embedding:
You will have an array of contacts inside each group document.
collection groups:
{
"_id": ObjectId("5c53653451154c6da4623a79"),
"group_name":"test",
"contacts": [
{
"name":"something",
"email":"something",
},
{
"name":"something else",
"email":"something else",
}
]
}
2- Document referencing:
You will have two collections - contacts and groups - and store a group reference inside each contact.
collection contacts:
{
"_id" : ObjectId("5c53653451154c6da4623a77"),
"name":"something",
"email":"something",
"groups":["5c53653451154c6da4623a79"]
},
{
"_id" : ObjectId("5c536s7df9sd7f987d9s7d98"),
"name":"something else",
"email":"something else",
"groups":["5c53653451154c6da4623a79"]
}
collection groups:
{
"_id": ObjectId("5c53653451154c6da4623a79"),
"group_name":"test"
}
Why are we referencing group inside contact and not the contrary? Because we probably will have more contacts than groups. This way we have smaller documents with smaller "reference arrays".
The path you will follow depends a lot on how many contacts you have per group. If this number is small, I would take the Document Embedding approach, for the sake of simplicity and access easiness. If you have a large number of contacts per group, I would use Document Reference, to have smaller documents.

Should I use selector or views in Cloudant?

I'm having confusion about whether to use selector or views, or both, when try to get a result from the following scenario:
I need to do a wildsearch for a book and return the result of the books plus the price and the details of the store branch name.
So I tried using selector to do wildsearch using regex
"selector": {
"_id": {
"$gt": null
},
"type":"product",
"product_name": {
"$regex":"(?i)"+search
}
},
"fields": [
"_id",
"_rev",
"product_name"
]
I am able to get the result. The idea after getting the result is to use all the _id's from the result set and query to views to get more details like price and store branch name on other documents, which I feel is kind of odd and I'm not certain is that the correct way to do it.
Below is just the idea once I get the result of _id's and insert it as a "productId" variable.
var input = {
method : 'GET',
returnedContentType : 'json',
path : 'test/_design/app/_view/find_price'+"?keys=[\""+productId+"\"]",
};
return WL.Server.invokeHttp(input);
so I'm asking for input from an expert regarding this.
Another question is how to get the store_branch_name? Can it be done in a single view where we can get the product detail, prices and store branch name? Or do I need to have several views to achieve this?
expected result
product_name (from book document) : Book 1
branch_name (from branch array in Store document) : store 1 branch one
price ( from relationship document) : 79.9
References:
Book
"_id": "book1",
"_rev": "1...b",
"product_name": "Book 1",
"type": "book"
"_id": "book2",
"_rev": "1...b",
"product_name": "Book 2 etc",
"type": "book"
relationship
"_id": "c...5",
"_rev": "3...",
"type": "relationship",
"product_id": "book1",
"store_branch_id": "Store1_branch1",
"price": "79.9"
Store
{
"_id": "store1",
"_rev": "1...2",
"store_name": "Store 1 Name",
"type": "stores",
"branch": [
{
"branch_id": "store1_branch1",
"branch_name": "store 1 branch one",
"address": {
"street": "some address",
"postalcode": "33490",
"type": "addresses"
},
"geolocation": {
"coordinates": [
42.34493,
-71.093232
],
"type": "point"
},
"type": "storebranch"
},
{
"branch_id": "store1_branch2",
"branch_name":
**details ommit...**
}
]
}
In Cloudant Query, you can specify two different kinds of indexes, and it's important to know the differences between the two.
For the first part of your question, if you're using Cloudant Query's $regex operator for wildcard searches like that, you might be better off creating a Cloudant Query index of type "text" instead of type "json". It's in the Cloudant docs, but see the intro blog post for details: https://cloudant.com/blog/cloudant-query-grows-up-to-handle-ad-hoc-queries/ There's a more advanced post on this that covers the tradeoffs between the two types of indexes https://cloudant.com/blog/mango-json-vs-text-indexes/
It's harder to address the second part of your question without understanding how your application interacts with your data, but there are a couple pieces of advice.
1) Consider denormalizing some of this information so you're not doing the JOINs to begin with.
2) Inject more logic into your document keys, and use the traditional MapReduce View indexing system to emit a compound key (an array), that you can use to emulate a JOIN by taking advantage of the CouchDB/Cloudant index sorting rules.
That second one's a mouthful, but check out this example on YouTube: https://youtu.be/0al1KnCKjlA?t=23m39s
Here's a preview (example map function) of what I'm talking about:
'map' : function(doc)
{
if (doc.type==="user") {
emit( [doc._id], null );
}
else if (doc.type==="edge:follower") {
emit( [doc.user, doc.follows], {"_id":doc.follows} );
}
}
The resulting secondary index here would take advantage of the rules outlined in http://wiki.apache.org/couchdb/View_collation -- that strings sort before arrays, and arrays sort before objects. You could then issue range queries to emulate the results you'd get with a JOIN.
I think that's as much detail that's appropriate for here. Hope it helps!

Querying MongoDB (Using Edge Collection - The most efficient way?)

I've written Users, Clubs and Followers collections for the sake of an example the below.
I want to find all user documents from the Users collection that are following "A famous club". How can I find those? and Which way is the fastest?
More info about 'what do I want to do - Edge collections'
Users collection
{
"_id": "1",
"fullname": "Jared",
"country": "USA"
}
Clubs collection
{
"_id": "12",
"name": "A famous club"
}
Followers collection
{
"_id": "159",
"user_id": "1",
"club_id": "12"
}
PS: I can get the documents using Mongoose like the below way. However, creating followers array takes about 8 seconds with 150.000 records. And second find query -which is queried using followers array- takes about 40 seconds. Is it normal?
Clubs.find(
{ club_id: "12" },
'-_id user_id', // select only one field to better perf.
function(err, docs){
var followers = [];
docs.forEach(function(item){
followers.push(item.user_id)
})
Users.find(
{ _id:{ $in: followers } },
function(error, users) {
console.log(users) // RESULTS
})
})
There is no an eligible formula to manipulate join many-to-many relation on MongoDB. So I combined collections as embedded documents like the below. But the most important taks in this case creating indexes. For instance if you want to query by followingClubs you should create an index like schema.index({ 'followingClubs._id':1 }) using Mongoose. And if you want to query country and followingClubs you should create another index like schema.index({ 'country':1, 'followingClubs._id':1 })
Pay attention when working with Embedded Documents: http://askasya.com/post/largeembeddedarrays
Then you can get your documents fastly. I've tried to get count of 150.000 records using this way it took only 1 second. It's enough for me...
ps: we musn't forget that in my tests my Users collection has never experienced any data fragmentation. Therefore my queries may demonstrated good performance. Especially, followingClubs array of embedded documents.
Users collection
{
"_id": "1",
"fullname": "Jared",
"country": "USA",
"followingClubs": [ {"_id": "12"} ]
}
Clubs collection
{
"_id": "12",
"name": "A famous club"
}

How to store option data in mongo?

I have a site that I'm using Mongo on. So far everything is going well. I've got several fields that are static option data, for example a field for animal breeds and another field for animal registrars.
Breeds
Arabian
Quarter Horse
Saddlebred
Registrars
AQHA
American Arabians
There are maybe 5 or 6 different collections like this that range from 5-15 elements.
What is the best way to put these in Mongo? Right now, I've got a separate collection for each group. That is a breeds collection, a registrars collection etc.
Is that the best way, or would it make more sense to have a single static data collection with a "type" field specifying the option type?
Or something else completely different?
Since this data is static then it's better to just embed the data in documents. This way you don't have to do manual joins.
And also store it in a separate collection (one or several, doesn't matter, choose what's easier) to facilitate presentation (render combo-boxes, etc.)
I believe creating multiple collections has collection size implications? (something about MongoDB creating a collection file on disk as twice the size of the previous file [ db.0 = 64MB, db.1 = 128MB and so on)
Here's what I can think of:
1. Storing as single collection
The benefits here are:
You only need one call to Mongo to fetch, and if you can cache the call you quickly have the data.
You avoid duplication: create a single schema that deals with all your options. You can just nest suboptions if there are any.
Of course, you also avoid duplication in statics/methods to modify options.
I have something similar on a project that I'm working on. I have categories and subcategories all stored in one collection. Here's a JSON/BSON dump as example:
In all the data where I need to store my 'options' (station categories in my case) I simply use the _id.
{
"status": {
"code": 200
},
"response": {
"categories": [
{
"cat": "Taxi",
"_id": "50b92b585cf34cbc0f000004",
"subcat": []
},
{
"cat": "Bus",
"_id": "50b92b585cf34cbc0f000005",
"subcat": [
{
"cat": "Bus Rapid Transit",
"_id": "50b92b585cf34cbc0f00000b"
},
{
"cat": "Express Bus Service",
"_id": "50b92b585cf34cbc0f00000a"
},
{
"cat": "Public Transport Bus",
"_id": "50b92b585cf34cbc0f000009"
},
{
"cat": "Tour Bus",
"_id": "50b92b585cf34cbc0f000008"
},
{
"cat": "Shuttle Bus",
"_id": "50b92b585cf34cbc0f000007"
},
{
"cat": "Intercity Bus",
"_id": "50b92b585cf34cbc0f000006"
}
]
},
{
"cat": "Rail",
"_id": "50b92b585cf34cbc0f00000c",
"subcat": [
{
"cat": "Intercity Train",
"_id": "50b92b585cf34cbc0f000012"
},
{
"cat": "Rapid Transit/Subway",
"_id": "50b92b585cf34cbc0f000011"
},
{
"cat": "High-speed Rail",
"_id": "50b92b585cf34cbc0f000010"
},
{
"cat": "Express Train Service",
"_id": "50b92b585cf34cbc0f00000f"
},
{
"cat": "Passenger Train",
"_id": "50b92b585cf34cbc0f00000e"
},
{
"cat": "Tram",
"_id": "50b92b585cf34cbc0f00000d"
}
]
}
]
}
}
I have a call to my API that gets me that document (app.ly/api/v1/stationcategories). I find this much easier to code with.
In your case you could have something like:
{
"option": "Breeds",
"_id": "xxx",
"suboption": [
{
"option": "Arabian",
"_id": "xxx"
},
{
"option": "Quarter House",
"_id": "xxx"
},
{
"option": "Saddlebred",
"_id": "xxx"
}
]
},
{
"option": "Registrars",
"_id": "xxx",
"suboption": [
{
"option": "AQHA",
"_id": "xxx"
},
{
"option": "American Arabians",
"_id": "xxx"
}
]
}
Whenever you need them, either loop through them, or pull specific options from your collection.
2. Storing as a static JSON document
This as #Sergio mentioned, is a viable and more simplistic approach. You can then either have separate docs for separate options, or put them in one document.
You do lose some flexibility here because you can't reference options by Id (which I prefer because changing option name doesn't affect all your other data).
Prone to typos (though if you know what you're doing this shouldn't be a problem).
For Node.js users: this might leave you with a headache from require('../../../options.json') similar to PHP.
The reader will note that I'm being negative about this approach, it works, but is rather inflexible.
Though we're discouraged from using joins unnecessarily on MongoDB, referencing by ObjectId is sometimes useful and extensible.
An example is if your website becomes popular in one region of the world, and say people from Poland start accounting for say 50% of your site visits. If you decide to add Polish translations. You would need to go back to all your documents, and add Polish names (if exists) to your options. If using approach 1, it's as easy as adding a Polish name to your options, and then plucking the Polish name from your options collection at runtime.
I could only think of 2 options other than storing each option as a collection
UPDATE: If someone has positives or negatives for either approach, may you please add them. My bias might be unhelpful to some people as there are benefits to storing static JSON files
MongoDB is schemaless and also no JOIN is supported. So you have to move out of the RDBMS and normalization given the fact that this is purely a different kind of database.
Few rules which you can apply while designing as a guidelines. Of course, you have the choice of keeping it in a separate collection when needed.
Static Master/Reference Data:
You have to always embed them in your documents wherever required. Since the data is not going to be changed, it is not at all bad idea to keep in the same collection. If the data is too large, group them and store them in a separate collection instead of creating multiple collection for the this master data itself.
NOTE: When embedding the collections as sub-documents, always make sure that you are never going to exceed the 16MB limit. That is the limit (at this point) for each collection can take in MongoDB.
Dynamic Master/Reference Data
Try to keep them in a separate collection as the master data is tend to change often.
Always remember, NO join support, so keep them in a way that you can easily access it without querying the database too many times.
So there is NO suggested best way, it always changes based on your need and the design can go either way.