NoSQL - wrapping my head around the duplication

NoSQL - wrapping my head around the duplication - mongodb

I come from a strong RDBMS background and I'm trying to wrap my head around how I'd structure a NoSQL project. From what I can see, normalisation is disregarded in favour of speed.
Say for example I have a blog and categories. In MySQL, I would have a blog table and a categories table joined by a blog_categories table to maintain the many-to-many relationship. I would simply use JOINs to display the data on the website accordingly.
Easy enough.
But with NoSQL:
Would I store the blog category along with the blog article in the same document? So if categories "general", "news" and "articles" were assigned to an article, I would save this data along with the article for each article applicable?
Does this then mean that if I wanted to rename "news" to "latest news", I would need to cycle through all blog articles and their categories and manually update each instance of the "news" category?

You can store blog categories with the blog but it depends on how your blog will work. It will be faster but you can not:
1) Easily get a list of all tags
2) Change the name of a tag
How i tend to do it is have a table of tags and use a technique called manual referencing to populate data on screen.
In short it means having a collection of tags like :
{
_id : tag1,
tagName : "news"
}
Then in your blog:
{
_id : blog1,
blogTitle : "Title",
blogBody : "..."
tags : [
tag1,
....
]
}
That way you get the functionality you desire. for more information on manual referencing see mongodb documentation:
https://docs.mongodb.com/manual/reference/database-references/#manual-references

Related

NoSQL db schema design

I'm trying to find a way to create the db schema. Most operations to the database will be Read.
Say I'm selling books on the app so the schema might look like this
{
{ title : "Adventures of Huckleberry Finn"
author : ["Mark Twain", "Thomas Becker", "Colin Barling"],
pageCount : 366,
genre: ["satire"] ,
release: "1884",
},
{ title : "The Great Gatsby"
author : ["F.Scott Fitzgerald"],
pageCount : 443,
genre: ["Novel, "Historical drama"] ,
release: "1924"
},
{ title : "This Side of Paradise"
author : ["F.Scott Fitzgerald"],
pageCount : 233,
genre: ["Novel] ,
release: "1920"
}
}
So most operations would be something like
1) Grab all books by "F.Scott Fitzgerald"
2) Grab books under genre "Novel"
3) Grab all book with page count less than 400
4) Grab books with page count more than 100 no later than 1930
Should I create separate collections just for authors and genre and then reference them like in a relational database or embed them like above? Because it seems like if I embed them, to store data in the db I have to manually type in an author name, I could misspell F.Scott Fitzgerald in a document and I wouldn't get back the result.

First of all i would say a nice DB choice.
As far as mongo is concerned the schema should be defined such that it serves your access patterns best. While designing schema we also must observe that mongo doesn't support joins and transactions like SQL. So considering all these and other attributes i would suggest that your choice of schema is best as it serves your access patterns. Usually whenever we pull any book detail, we need all information like author, pages, genre, year, price etc. It is just like object oriented programming where a class must have all its properties and all non- class properties should be kept in other class.
Taking author in separate collection will just add an extra collection and then you need to take care of joins and transactions by your code. Considering your concern about manually typing the author name, i don't get actually. Let's say user want to see books by author "xyz" so he clicks on author name "xyz" (like some tag) and you can fetch a query to bring all books having that selected name as one of the author. If user manually types user name then also it is just finding the document by entered string. I don't see anything manual here.
Just adding on, a price key shall also fit in to every document.

MongoDB - Tag based search with autocomplete

I am looking to implement a tag search feature and was looking for some advice in terms of efficiency. I am new to MongoDB so I am unsure of best practices for performance.
Okay so I want to create a link sharing app which users tag the links based on their content. For instance a funny dog image would be tagged with "funny" and "dog". A link would have a:
title,
url,
user_id,
tags: array of tags
Now in order for me to allow users to search for links I need a list of all the tags used. For usability this needs to have auto-complete functionality. So I researched a bit and tested out using a collection of tags where I index the tag value e.g. "funny" and then use a regex.
db.tags.find({value:/^search/})
With a collection of 600,000 documents it searched for all documents beginning with "s" in 63 milliseconds. As the length of the search term increases the execution time decreases.
Now comes the part I'm unsure of. Say for instance I want to find all the links with have the tags "funny" and "dog" (need to use intersects). How should I store the tags? Should I store the object id of each tag? Can I index these object ids? Is there another way to structure the whole database?
Also id like to be able suggest tags based on tags they already entered. I was thinking of just having a related field in the tag document for instance:
tag
----
id
value
related: [{
tag_id
count
}]
(again unsure as it would suggest tags that could be related to one of the already entered tags and not to another. With an intersect this would return no results.)
Any advice would be much appreciated.
Edit: mistake

Create a text index on the tag array. This will enable you to search quickly for funny, dog, and funny or dog.
https://docs.mongodb.com/manual/core/index-text/
db.tags.createIndex( { tags: "text" }, {background:true} )
As to the related tags, I don't think that you want to reference the _id values. You can probably embed an array of related tags such as:
relatedTags: [{tag1}, {tag2}]

How to design product category and products models in mongodb?

I am new to mongo db database design,
I am currently designing a restaurant products system, and my design is similar to a simple eCommerce database design where each productcategory has products so in a relational database system this will be a one (productcategory) to many products.
I have done some research I do understand that in document databses.
Denationalization is acceptable and it results in faster database reads.
therefore in a nosql document based database I could design my models this way
//product
{
name:'xxxx',
price:xxxxx,
productcategory:
{
productcategoryName:'xxxxx'
}
}
my question is this, instead of embedding the category inside of each product, why dont we embed the products inside productcategory, then we can have all the products once we query the category, resulting in this model.
//ProductCategory
{
name:'categoryName',
//array or products
products:[
{
name:'xxxx',
price:xxxxx
},
{
name:'xxxx',
price:xxxxx
}
]
}
I have researched on this issue on this page http://www.slideshare.net/VishwasBhagath/product-catalog-using-mongodb and here http://www.stackoverflow.com/questions/20090643/product-category-management-in-mongodb-and-mysql both examples I have found use the first model I described (i.e they embed the productCategory inside product rather than embed array of products inside productCategory), I do not understand why, please explain. thanks

For the DB design, you have to consider the cardinality (one-to-few, one-to-many & one-to-gazillions) and also your data access patterns (your frequent queries, updates) while designing your DB schema to make sure that you get optimum performance for your operations.
From your scenario, it looks like each Product has a category, however it also looks like you need to query to find out Products for each category.
So, in this case, you could do with something like :
Product = { _id = "productId1", name : "SomeProduct", price : "10", category : ObjectId("111") }
ProductCategory = { _id = ObjectId("111"), name : "productCat1", products : ["productId1", productId2", productId3"]}
As i said about data access patterns, if you always read the Category-name and "Category-name" is something which is very infrequently updated then you can go for denormalizing with this two-way referencing by adding the product-category-name in the product:
Product = { _id = "productId1", name : "SomeProduct", price : "10", category : { ObjectId("111"), name:"productCat1" }
So, with embedding the documents, specific queries would be faster if no joins would be required, however other queries in which you need to access embedded details as stand-alone entities would be difficult.
This is a link from MongoDB which explains DB design for one-to-many scenario like you have with very nice examples in which you would realize that there are much more ways of doing it and many other things to think about.
http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 (also has links for parts 1 & 2)
This link also describes the pros & cons for each scenario enabling you to reach a narrow down on a DB schema design.

It all depends on what queries you have in mind.
Suppose a product can only belong to one category, and one category applies to many products. Then, if you expect to retrieve the category together with the product, it makes sense to store it directly:
// products
{_id:"aabbcc", name:"foo", category:"bar"}
and if you expect to query all the products in a given category then it makes sense to create a separate collection
// categories
{_id:"bar", products=["aabbcc"]}
Remember that you cannot atomically update both the products and categories database (MongoDB is eventually consistent), but you can occasionally run a batch job which will make sure all categories are up-to-date.
I would recommend to think in terms of what kind of information will I often need, as opposed to how to normalize/denormalize this data, and make your collections reflect what you actually want.

Denormalization Data in MongoDb Doctrine Symfony 2

I'm Following this Doc
http://docs.doctrine-project.org/projects/doctrine-mongodb-odm/en/latest/tutorials/getting-started.html
And
http://symfony.com/doc/current/bundles/DoctrineMongoDBBundle/index.html
When I Save My Document, I have two Collection
like this:
{
"_id" : ObjectId("5458e370d16fb63f250041a7"),
"name" : "A Foo Bar",
"price" : 19.99,
"posts" : [
{
"$ref" : "Embedd",
"$id" : ObjectId("5458e370d16fb63f250041a8"),
"$db" : "test_database"
}
]
}
I'd like have
{
"_id" : ObjectId("5458e370d16fb63f250041a7"),
"name" : "A Foo Bar",
"price" : 19.99,
"posts" : [
{
"mycomment" :"dsdsds"
" date" : date
}
]
}
I want denormalization my data. How Can i Do it?
Can I use Methods like $push,$addToSet etc of mongoDb?
Thanks

Doctrine ODM supports both references and embedded documents.
In your first example, you're using references. The main document (let's assume it's called Product) references many Post documents. Those Post documents live in their own collection (for some reason this is named Embedd -- I would suggest renaming that if you keep this schema). By default, ODM uses the DBRef convention for references, so each reference is itself a small embedded document with $ref, $id, and $db fields.
Denormalization can be achieved by using embedded documents (an #EmbedMany mapping in your case). If you were embedding a Post document, the Post class should be mapped as an #EmbeddedDocument. This tells ODM that it's not a first-class document (belonging to its own collection), so it won't have to worry about tracking it by _id and the like (in fact, embedded documents won't even need identifiers unless you want to map one).
My rule of thumb for deciding to embed or references has generally been asking myself, "Will I need this document outside of the context of the parent document?" If a Post will not have an identity outside of the Product record, I'm comfortable embedding it; however, if I find later that my application also wants to show users a list of all of their Posts, or that I need to query by Posts (e.g. a feed of all recent Posts, irrespective of Product), then I may want to reference documents in a Posts collection (or simply duplicate embedded Posts as needed).
Alternatively, you may decide that Posts should exist in both their own collection and be embedded on Product. In that case, you can create an AbstractPost class as a #MappedSuperclass and define common fields there. Then, extend this with both Post and EmbeddedPost sub-classes (mapped accordingly). You'll be responsible for creating some code to generate an EmbeddedPost from a Post document, which will be suitable for embedding in the Product.posts array. Furthermore, you'll need to handle data synchronization between the top-level and embedded Posts (e.g. if someone edits a Post comment, you may want all the corresponding embedded versions updated as well).
On the subject of references: ODM also supports a simple option for reference mappings, in which case it will just store the referenced document's _id instead of the larger DBRef object. In most cases, having DBRef store the collection and database name for each referenced document is quite redundant; however, DBRef is actually useful if you're using single-collection inheritance, as ODM uses the object to store extra discriminator information (i.e. the class of the referenced object).

Unique vote, disable revote

I'm building simple Web App where users can vote.
What is the fastest way for checking if user has already voted. I'm interested in both relation databases and document based databases (mongodb,...)
I have few ideas but I am sure they can be improved:
Relation databases
Create a seperate table for voting:
|userid|articleid|
Before incrementing articles vote check if there is a row including both userid and articleid. We have two queries. Is possible to improve this with triggers? For example:
|useridarticleid| unique column
Before vote generate useridarticleid on application side. Try to insert useridarticleid. Trigger will fire if field is new and it will increment our vote column in article.
Document based
This is a bit more trickier. So having document structured like so:
{
"id": "123",
"content": "something",
"num_votes": 2,
"votes" : [
"userid1",
"userid2"
]
}
First "query" - check if userid is in votes array. Second "query" - Increment num_votes if not.
Again two queries. So I thought we can change this but I don't know really if it will increase performance:
Insert userid in votes array. When user want to check article "count" votes in array. But I think it possible that performance will drop because if traffic is high counting every article is a bit of waste. Imagine Reddit here.

Actually, it's a lot simpler in a document database. Your document structure is perfect for it.
{
"id": "123",
"content": "something",
"num_votes": 2,
"votes" : [
"userid1",
"userid2"
]
}
db.collection.update(
{id:"123", votes:{$ne:"userid"}},
{$push:{"votes":"userid"},$inc:{"num_votes":1}}
);
This will atomically update record id=123 adding userid to list of voters and incrementing votes by one only if userid is not already in the list of votes on this document.
So there is only one query and one update - and they are actually the same operation.

In a relational database |userid|articleid| would be the best approach, using both fields as primary keys.
In the second one you can also consider wther putting the votes in the user document, or in the article document.
Anyway, I'd suggest you really focus on creating a design, where changing all this decisions later is easy.
The different ways of designing this, favor things like "A lot of users at the same article at the same time" or "A lot of users in different articles", etc... Until you can see the real usage, you won't have enough information to decide which approach will work best and fastest... So create something that you can easily adapt to whatever information you learn later.
BTW: You might also consider don't counting the votes synchronically. I remember an article (which I can't find) where it mentioned that you tube votes numbers weren't actually "accurate"... They put an estimation of the current votes, and calculated the real number in a background worker thread.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse