NoSQL inconsistent data structure - MongoDB

I'm new to NoSQL (MongoDB), so go easy on me.
I'm scraping JSON-LD from various web pages and want to store and recall the data. However, the value types keep changing. For instance, sometimes the "author" field uses an "organization" type, other times it's a "person" type, sometimes it's simply a string, and sometimes it's just missing.
Should I convert the data to some type of standard?
Should each object be put into its own collection and referenced?
How do you deal with the displays being different?
Looking for words of experience or links to good articles on how to deal with inconsistent data structure.

The whole point of a NoSQL database is that it's schemaless and the structure can vary from one document to the next, so I see no issue here.
I think you are asking how you should deal with it in your application's business logic, so here is my suggestion:
You can save the author as an embedded sub-document which always has a field called "type" (an enum of values: String, Person, Organization, etc.) and act accordingly when you fetch the data.
For example, if the author is simply a String, then the document would look something like:
{
  …,
  "author": {
    "type": "String",
    "text": <text>
  }
}
If it's a Person type, then:
{
  …,
  "author": {
    "type": "Person",
    "first_name": <first name>,
    "last_name": <last name>
  }
}
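The tagged sub-document idea above can be sketched as a small normalization step that runs before the scraped data is stored. This is only a sketch under assumptions: the function names, the "Missing" and "Unknown" type values, the Organization "name" field, and the use of the schema.org properties givenName/familyName as the source of first_name/last_name are not from the answer.

```javascript
// Hypothetical helper: maps a raw JSON-LD "author" value to a tagged sub-document.
function normalizeAuthor(raw) {
  if (raw === undefined || raw === null) {
    return { type: "Missing" }; // assumed marker for an absent author
  }
  if (typeof raw === "string") {
    return { type: "String", text: raw };
  }
  const ldType = raw["@type"];
  if (ldType === "Person") {
    return {
      type: "Person",
      first_name: raw.givenName || null, // assumed source property
      last_name: raw.familyName || null, // assumed source property
    };
  }
  if (ldType === "Organization") {
    return { type: "Organization", name: raw.name || null };
  }
  // Unrecognized shape: keep the raw value so nothing is lost.
  return { type: "Unknown", raw: raw };
}

// Display code branches on the tag instead of probing the shape of the value.
function displayAuthor(author) {
  switch (author.type) {
    case "String":
      return author.text;
    case "Person":
      return [author.first_name, author.last_name].filter(Boolean).join(" ");
    case "Organization":
      return author.name || "";
    default:
      return "";
  }
}
```

With this in place, the display side only needs one switch on "type", whatever the scraped page originally contained.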

Related

Good DB-design to reference different collections in MongoDB

I regularly face a similar problem: how to reference several different collections in the same property in MongoDB (or any other NoSQL database). I usually use Meteor.js for my projects.
Take, for example, a notes collection whose documents include some tagIds:
{
  _id: "XXXXXXXXXXXXXXXXXXXXXXXX",
  message: "This is an important message",
  dateTime: "2018-03-01T00:00:00.000Z",
  tagIds: [
    "123456789012345678901234",
    "abcdefabcdefabcdefabcdef"
  ]
}
So a given id referenced in tagIds might be a person, a product or even another note.
Of course, the most obvious solution for this, IMO, is to save the type as well:
...
tagIds: [
  {
    type: "note",
    id: "123456789012345678901234",
  },
  {
    type: "person",
    id: "abcdefabcdefabcdefabcdef",
  }
]
...
Another solution I'm also thinking about is to use several fields for each collection, but I'm not sure if this has any other benefits (apart from the clear separation):
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me, as they need a lot of extra information (it would be nice to have this information implicit). So I wanted to ask whether this is the way to go, or whether you know any other solution for this.
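With the first variant above (the type saved next to each id), resolving the references can be batched into one lookup per collection. A minimal sketch, assuming a hypothetical helper name and a made-up third id for illustration:

```javascript
// Hypothetical helper: groups typed references so that each referenced
// collection can be queried once with all of its ids.
function groupIdsByType(tagIds) {
  const byType = {};
  for (const tag of tagIds) {
    if (!byType[tag.type]) {
      byType[tag.type] = [];
    }
    byType[tag.type].push(tag.id);
  }
  return byType;
}

const tagIds = [
  { type: "note", id: "123456789012345678901234" },
  { type: "person", id: "abcdefabcdefabcdefabcdef" },
  { type: "person", id: "fedcbafedcbafedcbafedcba" }, // made-up id for the example
];

const grouped = groupIdsByType(tagIds);
// Each entry can then feed one query per collection, for example a filter
// of the form { _id: { $in: grouped.person } } on the persons collection.
```

The second variant (tagIdsNotes, tagIdsPersons) stores the same grouping directly in the document, so the choice is mostly about whether the order of mixed tags matters.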
If you use Meteor Methods to pull this data, you have a chance to run some code: fetch from the DB, run some mappings, fetch from the DB again, and so on, then return a result. With pub/sub, things are different: you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed (and how much), should I not embed and build relations instead (keep only the id of a tag in the message object) and use aggregations later, or should I denormalize (duplicate data)? See http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All of these are OK in Mongo, depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a Tags collection indexed by messageId and, eventually, date (for sorting). When you have a message, you get all of its tags by querying the Tags collection, rather than mapping over the tags in your message object and sending 3 different queries to 3 different collections (person, product, note).
If you embed the tags data in the message object, say your UX shows that there are 3 tags and loads them on click. You can either pull the tags along with the message (and possibly not need that data) or pull them on an action such as a click. So consider what data your view needs and pull only that. You could keep an integer count of tags on the message object and save the tags either in a Tags collection or embedded in the message object.
Following the principles of NoSQL, it is OK and advisable to save some data multiple times in different collections to make your queries super fast.
So in a Tags collection you could also save things related to your original objects. Let's say:
// Tags
{
  ...
  messageId: 'xxx',
  createdAt: Date,
  person: {
    firstName: 'John',
    lastName: 'Smith',
    userId: 'yyyy',
    ...etc
  }
},
{
  ...
  messageId: 'xxy',
  createdAt: Date,
  product: {
    name: 'product_name',
    productId: 'yyzz',
    ...etc
  }
}
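The denormalized Tags documents above could be produced at write time, when a message is tagged. The following is a rough sketch, not the answerer's code: buildTagDocs, the "type:id" lookup keys and the in-memory sources object (a stand-in for reads from the person/product collections) are all assumptions.

```javascript
// Hypothetical helper: builds one denormalized tag document per reference.
// `sources` maps "type:id" to the referenced document and stands in for
// lookups in the person/product/note collections.
function buildTagDocs(message, sources, now) {
  return message.tagIds.map(function (ref) {
    const source = sources[ref.type + ":" + ref.id] || {};
    const doc = { messageId: message._id, createdAt: now };
    // Store a copy of the referenced data under a key named after its type.
    doc[ref.type] = Object.assign({}, source);
    return doc;
  });
}

const message = {
  _id: "xxx",
  tagIds: [
    { type: "person", id: "yyyy" },
    { type: "product", id: "yyzz" },
  ],
};
const sources = {
  "person:yyyy": { userId: "yyyy", firstName: "John", lastName: "Smith" },
  "product:yyzz": { productId: "yyzz", name: "product_name" },
};

const tagDocs = buildTagDocs(message, sources, new Date(0));
```

Reading all tags of a message is then a single query on the Tags collection by messageId, at the cost of updating the copies when a person or product changes.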

NoSQL db schema design

I'm trying to find a way to design the DB schema. Most operations on the database will be reads.
Say I'm selling books in the app, so the schema might look like this:
[
  {
    title : "Adventures of Huckleberry Finn",
    author : ["Mark Twain", "Thomas Becker", "Colin Barling"],
    pageCount : 366,
    genre: ["satire"],
    release: "1884"
  },
  {
    title : "The Great Gatsby",
    author : ["F.Scott Fitzgerald"],
    pageCount : 443,
    genre: ["Novel", "Historical drama"],
    release: "1924"
  },
  {
    title : "This Side of Paradise",
    author : ["F.Scott Fitzgerald"],
    pageCount : 233,
    genre: ["Novel"],
    release: "1920"
  }
]
So most operations would be something like:
1) Grab all books by "F.Scott Fitzgerald"
2) Grab books under genre "Novel"
3) Grab all books with page count less than 400
4) Grab books with page count more than 100 no later than 1930
Should I create separate collections just for authors and genres and then reference them, as in a relational database, or embed them as above? It seems that if I embed them, I have to type in the author name manually when storing data, so I could misspell F.Scott Fitzgerald in a document and wouldn't get that document back in the results.
First of all, I would say that's a nice DB choice.
As far as Mongo is concerned, the schema should be defined so that it best serves your access patterns. While designing the schema we must also keep in mind that Mongo doesn't handle joins and transactions the way SQL databases do. Considering all this, I would suggest that your choice of schema is the best one, as it serves your access patterns. Usually, whenever we pull a book's details, we need all of the information: author, pages, genre, year, price, etc. It is just like object-oriented programming, where a class should hold all of its own properties and anything that doesn't belong to it should be kept in another class.
Moving authors into a separate collection just adds an extra collection, and then you have to take care of joins and transactions in your own code. As for your concern about typing the author name manually, I don't quite follow. Say a user wants to see books by author "xyz": they click on the author name "xyz" (like a tag), and you run a query that fetches all books having the selected name as one of their authors. If the user types the author name manually, it is still just a matter of finding documents by the entered string. I don't see anything manual here.
One addition: a price key would also fit into every document.
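To make the point about access patterns concrete, here is a sketch of the four read operations from the question against the embedded schema. The filters object shows the query documents as they could be passed to a find() call (the collection and variable names are assumptions); the Array.filter lines are in-memory equivalents so the example runs without a database. Because release is stored as a four-digit string, the comparison in the last query is a string comparison.

```javascript
const books = [
  { title: "Adventures of Huckleberry Finn", author: ["Mark Twain", "Thomas Becker", "Colin Barling"], pageCount: 366, genre: ["satire"], release: "1884" },
  { title: "The Great Gatsby", author: ["F.Scott Fitzgerald"], pageCount: 443, genre: ["Novel", "Historical drama"], release: "1924" },
  { title: "This Side of Paradise", author: ["F.Scott Fitzgerald"], pageCount: 233, genre: ["Novel"], release: "1920" },
];

// Query documents for MongoDB; an equality filter on an array field
// matches when any element of the array equals the value.
const filters = {
  byAuthor: { author: "F.Scott Fitzgerald" },
  byGenre: { genre: "Novel" },
  under400Pages: { pageCount: { $lt: 400 } },
  over100PagesBy1930: { pageCount: { $gt: 100 }, release: { $lte: "1930" } },
};

// In-memory equivalents of the four filters above.
const byAuthor = books.filter((b) => b.author.includes("F.Scott Fitzgerald"));
const byGenre = books.filter((b) => b.genre.includes("Novel"));
const under400Pages = books.filter((b) => b.pageCount < 400);
const over100PagesBy1930 = books.filter(
  (b) => b.pageCount > 100 && b.release <= "1930"
);

// Small helper to show results by title.
const titles = (list) => list.map((b) => b.title);
```

All four reads touch a single collection, which is the argument for embedding here; an index on author, genre or pageCount would be the next thing to consider for larger data sets.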

MongoDB schema diagram

Would diagramming a MongoDB schema as a class diagram (UML format) be feasible, given that ER diagrams relate more to SQL?
When representing the id in a high-level schema, which of the following 3 has the correct type (int, ObjectId or _id)?
id: int
OR
id: bson.ObjectId
OR
id: _id
When representing a subdocument object in a schema diagram, which of the following 2 has the correct type (String or Object)?
comments : String [
{
userName : String
date : String
actualComment : String
}
]
OR
comments : Object [
{
userName : String
date : String
actualComment : String
}
]
UPDATE
If I have the following subdocument (here is JSON representation), how does Mongo store the replies - what type would it be?
comments : String [
{
userName : String
date : String
actualComment : String
replies : comment [ ] // how does mongo store nested replies
}
]
A UML class diagram is for classes in object-oriented programming and an ER diagram is for relations in a relational database. MongoDB is neither an object database nor a relational database, so neither tool is really a good fit for MongoDB. But given only those two tools, I would rather use UML class diagrams, because ER emphasizes something which should best be avoided in MongoDB: relations between documents.
By default, the _id field is filled with a generated value of type BSON ObjectId, so your second example, bson.ObjectId, would be technically correct if you use the default. However, you don't have to use the default. You can also explicitly set your _id fields to a value of your own, of almost any type you want. So if you want to use integers for your _id values for some reason (remember that you then need to take care of keeping them unique), you can of course do so, and you should say so in your documentation. When you don't use custom values for your _id fields and don't otherwise make any use of them, you might consider just omitting them from your diagrams, because they are implied.
In my opinion, embedded documents are best expressed in UML class diagrams by using composition (black-diamond arrow), while referenced documents are expressed using aggregation (white-diamond arrow). A sub-document is definitely not a String. Object is better, but even better would be to use the correct type.
Regarding your follow-up question: infinitely nested data structures (comments with an array of comments with an array of comments...) can be visualized in UML through a composition arrow pointing back at the box it comes from. But keep in mind that such data structures are a bad fit for MongoDB and usually best avoided. I would rather recommend that you put each comment into its own document, which references the topic it belongs to and the parent comment (aggregation). But even that's not a particularly elegant solution. MongoDB isn't built for storing graphs.
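The "one document per comment" layout suggested above can be sketched as follows. The field names topicId and parentId and the helper buildCommentTree are assumptions for illustration; the nesting is rebuilt in application code after fetching all comments of a topic.

```javascript
// Flat comment documents: each one references its topic and, for replies,
// its parent comment. Top-level comments have parentId null.
const comments = [
  { _id: 1, topicId: "t1", parentId: null, userName: "ann", actualComment: "First" },
  { _id: 2, topicId: "t1", parentId: 1, userName: "bob", actualComment: "Reply to first" },
  { _id: 3, topicId: "t1", parentId: 2, userName: "cy", actualComment: "Reply to reply" },
  { _id: 4, topicId: "t1", parentId: null, userName: "dee", actualComment: "Second" },
];

// Hypothetical helper: rebuilds the reply tree from the flat list.
function buildCommentTree(flatComments) {
  const nodes = new Map();
  for (const c of flatComments) {
    nodes.set(c._id, Object.assign({}, c, { replies: [] }));
  }
  const roots = [];
  for (const node of nodes.values()) {
    const parent = node.parentId === null ? undefined : nodes.get(node.parentId);
    if (parent) {
      parent.replies.push(node);
    } else {
      // Top-level comments, and replies whose parent is not in the list.
      roots.push(node);
    }
  }
  return roots;
}

const tree = buildCommentTree(comments);
```

Fetching the flat list is one query by topicId; the cost is the extra pass in the application, which is the inelegance the answer alludes to.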
Feasible, but not suitable in all cases. FK relationships can be represented the same way. For arrays, embedded documents, etc. you'd have to establish a representation/interpretation.
ObjectId is the type; it's a BSON type. _id is the field name. I have no idea how you arrived at int; the BSON integer types are 32-bit integer and 64-bit integer.
Neither of them. It's simply a (sub)document.
UPDATE
It's an array technically. No specific type. In that case you probably were thinking of an array of ids of comment entities, but could be anything you want I think (including subdocuments).

Confusion regarding Mongo db Schema. How to make it better?

I am using mongoose with node.js for this.
My current Schema is this:
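var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var linkSchema = new Schema({
  text: String,
  tags: Array, // `array` is not defined in JavaScript; the Mongoose type is Array
  body: String,
  user: String
});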
My use case is this: there is a list of users, and each user has a list of links associated with them. Users and links are different schemas, of course. So how does one get that sort of one-to-one relationship done using MongoDB?
Should I make a User schema and embed linkSchema in it? Or the other way around?
Another doubt regarding that: tags would always be an array of strings, which I can use to browse through links later. Should it be an array data type, or is there a better way to represent it?
If it's 1:1 then nest one document inside the other. Which way around depends on the queries, but you could easily do both if you need to.
For tags, you can index an array field and use that for searching/filtering documents and from the information you've given that sounds reasonable IMHO.
If you had a fixed set of tags it would make sense to represent those as a nested object with named fields perhaps, depending on queries. Don't forget you not only can create nested documents in Mongo but you can also search on sub-fields and even use entire nested documents as searchable/indexable fields. For instance, you could have a username like this;
email: "joe@somewhere.com"
as a string, and you could also do;
email: {
user: "joe",
domain: "somewhere.com"
}
You could index email in both cases and use either for matching. In the latter case, though, you could also search on domain or user only, without resorting to RegEx-style queries. You could also store both variants, so there are lots of flexible options in Mongo.
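As a small illustration of storing both variants, the nested form can be derived from the string form when the document is written. The helper name splitEmail and the field name emailText are assumptions, not part of the answer.

```javascript
// Hypothetical helper: splits "user@domain" into the nested form.
function splitEmail(address) {
  const at = address.lastIndexOf("@");
  if (at <= 0 || at === address.length - 1) {
    throw new Error("not a user@domain address: " + address);
  }
  return { user: address.slice(0, at), domain: address.slice(at + 1) };
}

// Document carrying both the plain string and the nested representation.
const doc = {
  emailText: "joe@somewhere.com",
  email: splitEmail("joe@somewhere.com"),
};

// Sub-fields are addressed with dot notation in a query filter,
// so this filter matches every document in the given domain.
const byDomain = { "email.domain": doc.email.domain };
```

The duplication costs a little space but lets exact-match lookups use the string and per-domain lookups use the sub-field.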
Going back to tags, I think your array of strings is a fine model given what you've described, but if you were doing more complex bulk aggregation, it wouldn't be crazy to store a document for every tag with the same document contents, since that's essentially what you'd have to do for every query during aggregation.

How to deal with Many-to-Many relations in MongoDB when Embedding is not the answer?

Here's the deal. Let's suppose we have the following data schema in MongoDB:
items: a collection with large documents that hold some data (it's absolutely irrelevant what it actually is).
item_groups: a collection with documents that contain a list of items._id called item_groups.items plus some extra data.
So, these two are tied together with a Many-to-Many relationship. But there's one tricky thing: for a certain reason I cannot store items within item groups, so -- just as the title says -- embedding is not the answer.
The query I'm really worried about is intended to find some particular groups that contain some particular items (i.e. I've got a set of criteria for each collection). In fact, it also has to say how many items within each found group fitted the criteria (no items means the group is not found).
The only viable solution I have come up with so far is to use a Map/Reduce approach with a dummy reduce function:
function map () {
  // imagine that item_criteria came from the scope.
  // it's a mongodb query object.
  item_criteria._id = {$in: this.items};
  var group_size = db.items.count(item_criteria);
  // this group holds no relevant items, skip it
  if (group_size == 0) return;
  var key = this._id.str;
  var value = {size: group_size, ...};
  emit(key, value);
}
function reduce (key, values) {
  // since the map function emits each group just once,
  // values will always be a list with length=1
  return values[0];
}
db.runCommand({
  mapreduce: "item_groups", // the collection name is passed as a string
  map: map,
  reduce: reduce,
  query: item_groups_criteria,
  scope: {item_criteria: item_criteria},
});
The problem line is:
item_criteria._id = {$in: this.items};
What if this.items.length == 5000 or even more? My RDBMS background cries out loud:
SELECT ... FROM ... WHERE whatever_id IN (over 9000 comma-separated IDs)
is definitely not a good way to go.
Thank you sooo much for your time, guys!
I hope the best answer will be something like "you're stupid, stop thinking in RDBMS style, use $its_a_kind_of_magicSphere from the latest release of MongoDB" :)
I think you are struggling with the separation of domain/object modeling from database schema modeling. I too struggled with this when trying out MongoDB.
For the sake of semantics and clarity, I'm going to substitute the word Categories for Groups.
Essentially, your theoretical model is a "many to many" relationship in that each Item can belong to many Categories, and each Category can in turn possess many Items.
This is best handled in your domain object modeling, not in the DB schema, especially when implementing a document database (NoSQL). In your MongoDB schema you "fake" a "many to many" relationship by using a combination of top-level document models and embedding.
Embedding is hard to swallow for folks coming from SQL persistence back-ends, but it is an essential part of the answer. The trick is deciding whether or not it is shallow or deep, one-way or two-way, etc.
Top Level Document Models
Because your Category documents contain some data of their own and are heavily referenced by a vast number of Items, I agree with you that fully embedding them inside each Item is unwise.
Instead, treat both Item and Category objects as top-level documents. Ensure that your MongoDB schema allots a collection for each one so that each document has its own ObjectId.
The next step is to decide where and how much to embed... there is no right answer as it all depends on how you use it and what your scaling ambitions are...
Embedding Decisions
1. Items
At minimum, your Item objects should have a collection property for its categories. At the very least this collection should contain the ObjectId for each Category.
My suggestion would be to add to this collection, the data you use when interacting with the Item most often...
For example, if I want to list a bunch of items on my web page in a grid, and show the names of the categories they are part of. It is obvious that I don't need to know everything about the Category, but if I only have the ObjectId embedded, a second query would be necessary to get any detail about it at all.
Instead what would make most sense is to embed the Category's Name property in the collection along with the ObjectId, so that pulling back an Item can now display its category names without another query.
The biggest thing to remember is that the key/value objects embedded in your Item that "represent" a Category do not have to match the real Category document model... It is not OOP or relational database modeling.
2. Categories
In reverse you might choose to leave embedding one-way, and not have any Item info in your Category documents... or you might choose to add a collection for Item data much like above (ObjectId, or ObjectId + Name)...
In this direction, I would personally lean toward having nothing embedded... more than likely, if I want Item information for my Category, I want lots of it, more than just a name... and deep-embedding a top-level document (Item) makes no sense. I would simply resign myself to querying the database for the Items whose collection of Categories contains the ObjectId of my Category.
Phew... confusing for sure. The point is, you will have some data duplication and you will have to tweak your models to your usage for best performance. The good news is that this is what MongoDB and other document databases are good at...
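The shallow, one-way embedding described above (ObjectId plus Name of each Category inside the Item) might look like the following sketch. The helper name, the sample data and the plain-string ids are invented for illustration; real documents would carry ObjectId values.

```javascript
// Hypothetical helper: copies only the fields an Item listing needs
// (id and name) from the full Category documents into the Item.
function withCategorySummaries(item, categories) {
  return Object.assign({}, item, {
    categories: categories.map((c) => ({ _id: c._id, name: c.name })),
  });
}

const categories = [
  { _id: "c1", name: "Tools", description: "long text ...", createdBy: "u1" },
  { _id: "c2", name: "Garden", description: "long text ...", createdBy: "u2" },
];

const items = [
  withCategorySummaries({ _id: "i1", name: "Spade" }, categories),
  withCategorySummaries({ _id: "i2", name: "Hammer" }, [categories[0]]),
];

// Listing an item with its category names needs no second query.
const spadeCategoryNames = items[0].categories.map((c) => c.name);

// Reverse direction: "all Items of a Category" is a query on the Items
// collection, e.g. with a filter like { "categories._id": "c1" }.
// In-memory equivalent:
const inTools = items.filter((i) => i.categories.some((c) => c._id === "c1"));
```

The embedded summary deliberately does not match the full Category model; if a Category is renamed, the copies inside the Items have to be updated as well.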
Why not use the opposite design?
You are storing items and item_groups. If your first idea was to store items in item_group entries, then maybe the opposite is not a bad idea :-)
Let me explain:
In each item you store the groups it belongs to. (You are in NoSQL, data duplication is OK!)
For example, let's say you store in each item entry a list called groups, so your items look like:
{ _id : ....
, name : ....
, groups : [ ObjectId(...), ObjectId(...),ObjectId(...)]
}
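Then the map/reduce idea becomes much more powerful:
map = function() {
  // Keep a reference to the document: inside forEach, `this` is no longer the item.
  var item = this;
  this.groups.forEach(function(groupKey) {
    emit(groupKey, {items: [item]});
  });
}
reduce = function(key, values) {
  // Merge the item lists emitted for the same group. The result has the
  // same shape as an emitted value, as map/reduce requires.
  var items = [];
  values.forEach(function(value) {
    items = items.concat(value.items);
  });
  return {items: items};
}
db.runCommand({
  mapreduce : "items",
  map : map,
  reduce : reduce,
  query : {_id : {$in : [...,....,.....] }} // put your item ids here
})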
You can add some parameters (finalize, for instance, to modify the output of the map/reduce), but this might help you.
Of course, you need another collection to store the details of item_groups if you need them, but in some cases (if this information about item_groups does not exist, or doesn't change, or you don't mind not having the most up-to-date version of it) you don't need it at all!
Does that give you a hint about a solution to your problem?
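With groups stored on each item, the count the question asks for (how many matching items each group contains) reduces to a group-by over the matching items. The sketch below runs in memory; the helper name and the price criterion are invented for illustration, and the pipeline constant only shows what an equivalent aggregation with the $match, $unwind and $group stages could look like, not code from the answer.

```javascript
// Hypothetical helper: counts, per group id, the items that satisfy `matches`.
// Groups without any matching item do not appear in the result.
function countMatchingItemsPerGroup(items, matches) {
  const counts = new Map();
  for (const item of items) {
    if (!matches(item)) continue;
    for (const groupId of item.groups) {
      counts.set(groupId, (counts.get(groupId) || 0) + 1);
    }
  }
  return counts;
}

const items = [
  { _id: "i1", name: "a", price: 5, groups: ["g1", "g2"] },
  { _id: "i2", name: "b", price: 50, groups: ["g1"] },
  { _id: "i3", name: "c", price: 7, groups: ["g2", "g3"] },
];

// Example item criterion (an assumption): price below 10.
const counts = countMatchingItemsPerGroup(items, (item) => item.price < 10);

// The same grouping expressed as aggregation pipeline stages.
const pipeline = [
  { $match: { price: { $lt: 10 } } },
  { $unwind: "$groups" },
  { $group: { _id: "$groups", size: { $sum: 1 } } },
];
```

Criteria on the groups themselves would still be applied against the item_groups collection, using the group ids found here.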