I have a site that I'm using Mongo on. So far everything is going well. I've got several fields that are static option data, for example a field for animal breeds and another field for animal registrars.
Breeds
Arabian
Quarter Horse
Saddlebred
Registrars
AQHA
American Arabians
There are maybe 5 or 6 different collections like this, each with 5-15 elements.
What is the best way to put these in Mongo? Right now, I've got a separate collection for each group. That is a breeds collection, a registrars collection etc.
Is that the best way, or would it make more sense to have a single static data collection with a "type" field specifying the option type?
Or something else completely different?
Since this data is static, it's better to just embed it in your documents; this way you don't have to do manual joins.
Also store it in a separate collection (one or several, it doesn't matter, choose whatever is easier) to facilitate presentation (rendering combo-boxes, etc.).
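For example (a minimal sketch; the collection and field names are assumptions, not part of your schema), an animal document carries the static value directly, while a small options collection drives the UI:

// animal document: the breed is embedded, so reads need no join
{ "_id": 1, "name": "Misty", "breed": "Arabian", "registrar": "AQHA" }

// options collection: one document per option group, used to render combo-boxes
{ "_id": "breeds", "values": ["Arabian", "Quarter Horse", "Saddlebred"] }
{ "_id": "registrars", "values": ["AQHA", "American Arabians"] }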
I believe creating multiple collections has disk-size implications? (Something about MongoDB preallocating each new data file at twice the size of the previous one: db.0 = 64MB, db.1 = 128MB, and so on.)
Here's what I can think of:
1. Storing as a single collection
The benefits here are:
You only need one call to Mongo to fetch, and if you can cache the call you quickly have the data.
You avoid duplication: create a single schema that deals with all your options. You can just nest suboptions if there are any.
Of course, you also avoid duplication in statics/methods to modify options.
I have something similar on a project that I'm working on. I have categories and subcategories all stored in one collection. Here's a JSON/BSON dump as example:
Anywhere I need to store one of my 'options' (station categories in my case), I simply use its _id.
{
"status": {
"code": 200
},
"response": {
"categories": [
{
"cat": "Taxi",
"_id": "50b92b585cf34cbc0f000004",
"subcat": []
},
{
"cat": "Bus",
"_id": "50b92b585cf34cbc0f000005",
"subcat": [
{
"cat": "Bus Rapid Transit",
"_id": "50b92b585cf34cbc0f00000b"
},
{
"cat": "Express Bus Service",
"_id": "50b92b585cf34cbc0f00000a"
},
{
"cat": "Public Transport Bus",
"_id": "50b92b585cf34cbc0f000009"
},
{
"cat": "Tour Bus",
"_id": "50b92b585cf34cbc0f000008"
},
{
"cat": "Shuttle Bus",
"_id": "50b92b585cf34cbc0f000007"
},
{
"cat": "Intercity Bus",
"_id": "50b92b585cf34cbc0f000006"
}
]
},
{
"cat": "Rail",
"_id": "50b92b585cf34cbc0f00000c",
"subcat": [
{
"cat": "Intercity Train",
"_id": "50b92b585cf34cbc0f000012"
},
{
"cat": "Rapid Transit/Subway",
"_id": "50b92b585cf34cbc0f000011"
},
{
"cat": "High-speed Rail",
"_id": "50b92b585cf34cbc0f000010"
},
{
"cat": "Express Train Service",
"_id": "50b92b585cf34cbc0f00000f"
},
{
"cat": "Passenger Train",
"_id": "50b92b585cf34cbc0f00000e"
},
{
"cat": "Tram",
"_id": "50b92b585cf34cbc0f00000d"
}
]
}
]
}
}
I have a call to my API that gets me that document (app.ly/api/v1/stationcategories). I find this much easier to code with.
In your case you could have something like:
{
"option": "Breeds",
"_id": "xxx",
"suboption": [
{
"option": "Arabian",
"_id": "xxx"
},
{
"option": "Quarter House",
"_id": "xxx"
},
{
"option": "Saddlebred",
"_id": "xxx"
}
]
},
{
"option": "Registrars",
"_id": "xxx",
"suboption": [
{
"option": "AQHA",
"_id": "xxx"
},
{
"option": "American Arabians",
"_id": "xxx"
}
]
}
Whenever you need them, either loop through them, or pull specific options from your collection.
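For instance (a hedged shell sketch; the options collection name is an assumption, and the _id string is taken from the dump above), both access patterns are one-liners:

// fetch the whole "Breeds" group, e.g. to render a combo-box
db.options.findOne({ "option": "Breeds" })

// fetch one suboption by _id, using positional projection
db.options.find(
    { "suboption._id": "50b92b585cf34cbc0f00000b" },
    { "suboption.$": 1 }
)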
2. Storing as a static JSON document
This, as @Sergio mentioned, is a viable and simpler approach. You can then either have separate docs for separate options, or put them all in one document.
You do lose some flexibility here because you can't reference options by id (which I prefer, because changing an option's name doesn't affect all your other data).
It's also prone to typos (though if you know what you're doing this shouldn't be a problem).
For Node.js users: this might leave you with a headache from require('../../../options.json'), similar to PHP.
The reader will note that I'm being negative about this approach; it works, but it's rather inflexible.
Though we're discouraged from using joins unnecessarily in MongoDB, referencing by ObjectId is sometimes useful and extensible.
For example, suppose your website becomes popular in one region of the world, and people from Poland start accounting for, say, 50% of your site visits, so you decide to add Polish translations. With static JSON you would need to go back to all your documents and add Polish names (where they exist) to your options. With approach 1, it's as easy as adding a Polish name to each option and then plucking the Polish name from your options collection at runtime.
These are the only two options I could think of other than storing each option group as its own collection.
UPDATE: If anyone has positives or negatives for either approach, please add them. My bias might be unhelpful to some people, as there are real benefits to storing static JSON files.
MongoDB is schemaless and does not support JOINs, so you have to move away from RDBMS-style normalization; this is simply a different kind of database.
Here are a few rules you can apply as guidelines while designing. Of course, you still have the choice of keeping data in a separate collection when needed.
Static Master/Reference Data:
Always embed this data in your documents wherever it is required. Since the data is not going to change, it is not at all a bad idea to keep it in the same collection. If the data is too large, group it and store it in a separate collection instead of creating multiple collections for this master data.
NOTE: When embedding collections as sub-documents, always make sure you never exceed the 16MB limit; that is the limit (at this point) for a single document in MongoDB.
Dynamic Master/Reference Data
Try to keep this in a separate collection, as master data of this kind tends to change often.
Always remember: there is no JOIN support, so store the data in a way that lets you access it without querying the database too many times.
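As an illustration (a hypothetical shell sketch; the collection and field names are made up), referencing by _id and resolving many references in one round trip keeps the query count low:

// each animal references its registrar by _id
var animals = db.animals.find({ "owner": "jane" }).toArray();

// resolve all referenced registrars with a single $in query
var ids = animals.map(function (a) { return a.registrar_id; });
var registrars = db.registrars.find({ "_id": { "$in": ids } }).toArray();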
So there is NO single suggested best way; it always depends on your needs, and the design can go either way.
Related
I have a DynamoDB with data that looks like this:
{
"userId": {
"S": "someuserID"
},
"listOfMaps": {
"L": [
{
"M": {
"neededVal": {
"S": "ThisIsNeeded1"
},
"id": {
"S": "1"
}
}
},
{
"M": {
"neededVal": {
"S": "ThisIsNeeded2"
},
"id": {
"S": "2"
}
}
},
...
]
},
"userName": {
"S": "someuserName"
}
}
The listOfMaps can contain more than just 2 maps, as is denoted by the ellipsis. Is there a PartiQL query I can put together that will let me get the neededVal based on the userId and the id of the item in the map itself?
I know that I can query for the n-th item's neededVal like this:
SELECT "listOfMaps"[N]."neededVal"
FROM "table-name"
WHERE "userId" = 'someuserID'
But is it possible to make it do something like this:
SELECT "listOfMaps"."neededVal"
FROM "table-name"
WHERE "userId" = 'someuserID' AND "listOfMaps"."id" = '4'
It looks like you're modeling the one-to-many relationship using a complex attribute (e.g. a list of objects). This is a completely valid approach to modeling one-to-many relationships and is best used when 1) the data doesn't change (or doesn't change often) and 2) you don't have any access patterns around the data within the complex attribute.
However, since you do want to perform searches based on data within the complex attribute, you'd be better off modeling the data differently.
For example, you might consider modeling results in the user partition with a PK=user_id SK=neededVal#. This would allow you to fetch items by User ID (QUERY where PK=USER#user_id SK begins_with neededVal#).
I don't know the specifics around your access patterns, but can say that you'll need to move results into their own items if you want to support access patterns around the data within your complex attribute.
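For illustration (a hedged sketch; the key format and attribute names here are assumptions, not your actual schema), the remodeled items and the resulting PartiQL query could look like:

-- remodeled: one item per value instead of a list of maps
-- PK = 'USER#someuserID', SK = 'NEEDEDVAL#1', neededVal = 'ThisIsNeeded1'
-- PK = 'USER#someuserID', SK = 'NEEDEDVAL#2', neededVal = 'ThisIsNeeded2'

SELECT "neededVal"
FROM "table-name"
WHERE "PK" = 'USER#someuserID' AND "SK" = 'NEEDEDVAL#4'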
I'm trying to build a dating app, and for my backend I'm using a NoSQL database. In the users collection, some relations exist between documents of the same collection. For example, user A can like or dislike another user, or may not have made a choice yet. A simple schema for this scenario is the following:
database = {
"users": {
"UserA": {
"_id": "jhas-d01j-ka23-909a",
"name": "userA",
"geo": {
"lat": "",
"log": "",
"perimeter": ""
},
"session": {
"lat": "",
"log": ""
},
"users_accepted": [
"j2jl-564s-po8a-oej2",
"soo2-ap23-d003-dkk2"
],
"users_rejected": [
"jdhs-54sd-sdio-iuiu",
"mbb0-12md-fl23-sdm2",
],
},
"UserB": {...},
"UserC": {...},
"UserD": {...},
"UserE": {...},
"UserF": {...},
"UserG": {...},
},
}
Here userA keeps a reference to each user it has seen and made a decision about, storing it either in "users_accepted" or "users_rejected". If UserC hasn't been seen (either liked or disliked) by userA, then clearly it won't appear in either array. However, these arrays are unbounded and may exceed the maximum size a document can hold. One approach is to extract both of these arrays and create the following schema:
database = {
"users": {
"UserA": {
"_id": "jhas-d01j-ka23-909a",
"name": "userA",
"geo": {
"lat": "",
"log": "",
"perimeter": ""
},
"session": {
"lat": "",
"log": ""
},
},
"UserB": {...},
"UserC": {...},
"UserD": {...},
"UserE": {...},
"UserF": {...},
"UserG": {...},
},
"likes": {
"id_27-82" : {
"user_give_like" : "userB",
"user_receive_like" : "userA"
},
"id_27-83" : {
"user_give_like" : "userA",
"user_receive_like" : "userC"
},
},
"dislikes": {
"id_23-82" : {
"user_give_dislike" : "userA",
"user_receive_dislike" : "userD"
},
"id_23-83" : {
"user_give_dislike" : "userA",
"user_receive_dislike" : "userE"
},
}
}
I need 4 basic queries
Get the users that have liked UserA (Show who is interested in userA)
Get the users that UserA has liked
Get the users that UserA has disliked
Get the matches that UserA has
Query 1 is fairly simple: just query the likes collection and get the users where "user_receive_like" is "userA".
Queries 2 and 3 also let me derive the users that userA has not seen yet: the users that appear in neither result.
Finally, query 4 may use another collection:
"matches": {
"match_id_1": {
"user_1": "referece_user1",
"user_2": "referece_user2"
},
"match_id_2": {
"user_1": "referece_user3",
"user_2": "referece_user4"
}
}
Is this approach viable and efficient?
You are right to notice that these arrays are unbounded and pose a serious scalability problem for your application. If you were assigning 2-3 user roles to a user, the first approach would be totally fine, but that is not the case for you. The official MongoDB documentation suggests that you should not use unbounded arrays: https://www.mongodb.com/docs/atlas/schema-suggestions/avoid-unbounded-arrays/
Your second approach is the superior implementation choice for you, because:
you can build indexes of the form (user_give_like, user_receive_like) (and the analogous index for dislikes), which will keep query performance good even when you have 1M+ documents
you can store additional metadata (like timestamps, etc.) in the likes collection without affecting the design of the users collection
the query for "matches" will be much simpler to write with this approach: https://mongoplayground.net/p/sFRvUniHKn8
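For example, here is a minimal aggregation sketch for query 4 (assuming the likes collection and field names from your second schema, and MongoDB 3.6+ for the $lookup pipeline form; an illustration, not the only way):

db.likes.aggregate([
    // likes sent by userA
    { "$match": { "user_give_like": "userA" } },
    // look for the reciprocal like in the same collection
    { "$lookup": {
        "from": "likes",
        "let": { "liked": "$user_receive_like" },
        "pipeline": [
            { "$match": { "$expr": { "$and": [
                { "$eq": [ "$user_give_like", "$$liked" ] },
                { "$eq": [ "$user_receive_like", "userA" ] }
            ] } } }
        ],
        "as": "reciprocal"
    } },
    // keep only pairs where the like goes both ways
    { "$match": { "reciprocal": { "$ne": [] } } },
    { "$project": { "_id": 0, "match": "$user_receive_like" } }
])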
More about NoSQL data modelling:
https://www.mongodb.com/docs/manual/data-modeling/ and
https://www.mongodb.com/docs/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
To answer your question, let me write down some assumptions about the domain and then try to answer it.
Assumptions:
System should support scale for 100 million users
A single user might like or dislike ~100k users in its lifetime
Also, some theory about NoSQL: if our queries must go to all shards of a collection, then the maximum scale of the system is bounded by the scale of a single shard.
Now, with these assumptions, consider the query performance of each query you asked about:
Get the users that have liked UserA (Show who is interested in userA) -
Assuming we shard on the user_give_like field, a filter on user_receive_like will have to query all shards, which is not the right thing for scalability.
Get the users that UserA has liked
This will work fine, since the shard key is user_give_like.
Get the users that UserA has disliked
This will work fine, since the dislikes collection is sharded on user_give_dislike.
Get the matches that UserA has
In this case, if we do a join between existing users and all users that UserA has liked and disliked, this creates a parallel query on all shards, which is not scalable when UserA's like or dislike count is huge.
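A hypothetical shell sketch of the targeting behavior described above (the database, collection, and shard key names are assumed):

// shard the likes collection on the sender
sh.shardCollection("datingapp.likes", { "user_give_like": 1 })

// query 2: targeted, the router sends this to a single shard
db.likes.find({ "user_give_like": "userA" })

// query 1: scatter-gather, every shard must be consulted
db.likes.find({ "user_receive_like": "userA" })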
To conclude, this doesn't look like a scalable approach to me.
I'm confused about whether to use a selector, views, or both when trying to get a result for the following scenario:
I need to do a wildcard search for a book and return the matching books plus the price and the store branch name.
So I tried using a selector to do the wildcard search with a regex:
"selector": {
"_id": {
"$gt": null
},
"type":"product",
"product_name": {
"$regex":"(?i)"+search
}
},
"fields": [
"_id",
"_rev",
"product_name"
]
I am able to get the result. The idea is then to take all the _id's from the result set and query views to get more details (like price and store branch name) from other documents, which feels kind of odd, and I'm not certain it's the correct way to do it.
Below is just the idea: once I have the _id results, I insert one as a "productId" variable.
var input = {
method : 'GET',
returnedContentType : 'json',
path : 'test/_design/app/_view/find_price'+"?keys=[\""+productId+"\"]",
};
return WL.Server.invokeHttp(input);
So I'm asking for input from an expert regarding this.
Another question is how to get the store_branch_name. Can it be done in a single view that returns the product details, price, and store branch name, or do I need several views to achieve this?
expected result
product_name (from book document) : Book 1
branch_name (from branch array in Store document) : store 1 branch one
price ( from relationship document) : 79.9
References:
Book
"_id": "book1",
"_rev": "1...b",
"product_name": "Book 1",
"type": "book"
"_id": "book2",
"_rev": "1...b",
"product_name": "Book 2 etc",
"type": "book"
relationship
"_id": "c...5",
"_rev": "3...",
"type": "relationship",
"product_id": "book1",
"store_branch_id": "Store1_branch1",
"price": "79.9"
Store
{
"_id": "store1",
"_rev": "1...2",
"store_name": "Store 1 Name",
"type": "stores",
"branch": [
{
"branch_id": "store1_branch1",
"branch_name": "store 1 branch one",
"address": {
"street": "some address",
"postalcode": "33490",
"type": "addresses"
},
"geolocation": {
"coordinates": [
42.34493,
-71.093232
],
"type": "point"
},
"type": "storebranch"
},
{
"branch_id": "store1_branch2",
"branch_name":
**details omitted...**
}
]
}
In Cloudant Query, you can specify two different kinds of indexes, and it's important to know the differences between the two.
For the first part of your question, if you're using Cloudant Query's $regex operator for wildcard searches like that, you might be better off creating a Cloudant Query index of type "text" instead of type "json". It's in the Cloudant docs, but see the intro blog post for details: https://cloudant.com/blog/cloudant-query-grows-up-to-handle-ad-hoc-queries/ There's a more advanced post on this that covers the tradeoffs between the two types of indexes https://cloudant.com/blog/mango-json-vs-text-indexes/
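As a rough sketch (the database name "test" is taken from your snippet below; see the posts above for specifics), a "text" index over all fields can be created by POSTing this body to /test/_index, after which the selector can use $text instead of $regex:

{ "index": {}, "type": "text" }

{
    "selector": { "type": "product", "$text": search },
    "fields": [ "_id", "_rev", "product_name" ]
}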
It's harder to address the second part of your question without understanding how your application interacts with your data, but there are a couple pieces of advice.
1) Consider denormalizing some of this information so you're not doing the JOINs to begin with.
2) Inject more logic into your document keys, and use the traditional MapReduce view indexing system to emit a compound key (an array) that you can use to emulate a JOIN by taking advantage of the CouchDB/Cloudant index sorting rules.
That second one's a mouthful, but check out this example on YouTube: https://youtu.be/0al1KnCKjlA?t=23m39s
Here's a preview (example map function) of what I'm talking about:
'map' : function(doc)
{
    // user docs emit a one-element key, which collates before
    // any two-element [user, follows] key sharing the same prefix
    if (doc.type==="user") {
        emit( [doc._id], null );
    }
    // follower edges emit a compound key so they sort directly after
    // their user; the {"_id": ...} value lets include_docs resolve
    // the followed user's document
    else if (doc.type==="edge:follower") {
        emit( [doc.user, doc.follows], {"_id":doc.follows} );
    }
}
The resulting secondary index here would take advantage of the rules outlined in http://wiki.apache.org/couchdb/View_collation -- that strings sort before arrays, and arrays sort before objects. You could then issue range queries to emulate the results you'd get with a JOIN.
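For example (a hedged sketch; the database, design doc, and view names are assumptions), a single range query pulls a user row and all of its follow edges in key order:

GET /mydb/_design/app/_view/follows?startkey=["user123"]&endkey=["user123",{}]&include_docs=true

// the ["user123"] user row sorts first, followed by every
// ["user123", follows] edge row; include_docs=true resolves the
// {"_id": doc.follows} value into the followed user's document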
I think that's as much detail that's appropriate for here. Hope it helps!
When I issue an ordered bulk operation in MongoDB 3, is the bulk operation as a whole written to the oplog so that it can be replayed as a whole after a server crash?
The rationale for this question is the following:
I know that there are no real transactions but that I can use the $isolated keyword to have some read consistency (in some cases).
Setting aside whether this is good schema design, let's assume that I have to update multiple documents, possibly in different collections, in one go, in what would be a transaction in SQL. I do not care about the data being in an inconsistent state at any given moment, but I do require the data to be consistent eventually. So, while I may not care about errors and missing rollbacks during the operation, I require the sequence of updates to be performed entirely or not at all at some point, so that they survive unexpected server failures or shutdowns in the middle of the bulk operation (because of, say, random CoreOS updates).
I will enter into this with the general caveat that I admit I have not even looked at the results, but the basic principles seem valid to me from the start.
What you need to consider here is what is actually happening under the hood of the nice syntactic sugar you are presented with in general calls. That basically means looking at what the "command form" of the operations you are calling actually does; in this case, "update".
So, if you had a look at that link already, then consider the following "Bulk" update form:
var bulk = db.collection.initializeOrderedBulkOp();
bulk.find({ "_id": 1 }).updateOne({ "$set": { "a": 1 } });
bulk.find({ "_id": 2 }).updateOne({ "$set": { "b": 2 } });
bulk.execute();
Now you already know this is being sent to the server as one request, but what you are likely not considering is that the actual "request" made under the hood is this:
db.runCommand({
"update": "collection",
"updates": [
{ "q": { "_id": 1 }, "u": { "$set": { "a": 1 } } },
{ "q": { "_id": 2 }, "u": { "$set": { "b": 2 } } }
],
"ordered": true
})
Therefore it stands to reason that what you actually see in the logs under the "update" operation is something like this (abbreviated from the full output to just the query):
{ "q": { "_id": 1 }, "u": { "$set": { "a": 1 } } }
{ "q": { "_id": 2 }, "u": { "$set": { "b": 2 } } }
Which therefore means that each of those actions with the associated command is in the oplog for "replay" on replication and/or on other actions you might perform such as specifically "replaying" the oplog entries.
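If you inspect local.oplog.rs you should accordingly find one entry per update, along these lines (a hand-written approximation, not actual output):

{ "ts": Timestamp(1440000000, 1), "op": "u", "ns": "test.collection", "o2": { "_id": 1 }, "o": { "$set": { "a": 1 } } }
{ "ts": Timestamp(1440000000, 2), "op": "u", "ns": "test.collection", "o2": { "_id": 2 }, "o": { "$set": { "b": 2 } } }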
I'd be sure that is what actually happens without even looking, because I know that is how the drivers implement the calls, and it makes sense that each call is kept in the oplog in this way.
Therefore "as a whole", then no. These are not "transactions" and are always distinct operations even if their submission and return are within a singular request. But they are not a singular operation, and therefore will not and should not be recorded as such.
I'm trying to design a schema paradigm in MongoDB which would support multilingual values for variable attributes in documents.
For example, I would have a product catalog where each product may require storing its name, title or any other attribute in various languages.
This same paradigm should probably hold for other locale-specific properties, such as price/currency variations.
I've been considering a key-value approach where the key is the language code and the value is the corresponding translation:
{
sku: "1011",
name: { "en": "cheese", "de": "Käse", "es": "queso", etc... },
price: { "usd": 30.95, "eur": 20, "aud": 40, etc... }
}
The problem is, I believe this would prevent me from using indexes on the multilingual fields.
Eventually, I'd like a generic, yet intuitive, index-able design.
Any suggestion would be appreciated, thanks.
Wholesale recommendations on your schema design may be too broad a topic for discussion here. I can, however, suggest that you consider putting the elements you are showing into an array of sub-documents, rather than a singular sub-document with a field for each item.
{
sku: "1011",
name: [{ "en": "cheese" }, {"de": "Käse"}, {"es": "queso"}, etc... ],
price: [{ "usd": 30.95 }, { "eur": 20 }, { "aud": 40 }, etc... ]
}
The main reason for this is consideration of the access paths to your elements, which should make things easier to query. I went through this in some detail here, which may be worth reading.
It could also be a possibility to expand on this for something like your name field:
name: [
{ "lang": "en", "value": "cheese" },
{ "lang": "de", "value: "Käse" },
{ "lang": "es", "value": "queso" },
etc...
]
All would depend on your indexing and access requirements. It all really depends on what exactly your application needs, and the beauty of MongoDB is that it allows you to structure your documents to your needs.
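For instance (a minimal shell sketch, assuming a products collection), the lang/value form indexes and queries cleanly:

// multikey compound index over the embedded language entries
db.products.createIndex({ "name.lang": 1, "name.value": 1 })

// $elemMatch keeps lang and value bound to the same array element
db.products.find({ "name": { "$elemMatch": { "lang": "en", "value": "cheese" } } })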
P.S. As for anything where you are storing money values, I suggest you do some reading, starting maybe with this post:
MongoDB - What about Decimal type of value?