Which of CouchDB or MongoDB suits my needs?

Where I work, we use Ruby on Rails to create both backend and frontend applications. Usually, these applications interact with the same MySQL database. It works great for a majority of our data, but we have one situation which I would like to move to a NoSQL environment.
We have clients, and our clients have what we call "inventories"--one or more of them. An inventory can have many thousands of items. This is currently done through two relational database tables, inventories and inventory_items.
The problems start when two different inventories have different parameters:
# Inventory item from inventory 1, televisions
{
  inventory_id: 1,
  sku: 12345,
  name: "Samsung LCD 40 inches",
  model: "582903-4",
  brand: "Samsung",
  screen_size: 40,
  type: "LCD",
  price: 999.95
}
# Inventory item from inventory 2, accommodation
{
  inventory_id: 2,
  sku: "48cab23fa",
  name: "New York Hilton",
  accommodation_type: "hotel",
  star_rating: 5,
  price_per_night: 395
}
Since we obviously can't use brand or star_rating as the column name in inventory_items, our solution so far has been to use generic column names such as text_a, text_b, float_a, int_a, etc., and introduce a third table, inventory_schemas. The tables now look like this:
# Inventory schema for inventory 1, televisions
{
  inventory_id: 1,
  int_a: "sku",
  text_a: "name",
  text_b: "model",
  text_c: "brand",
  int_b: "screen_size",
  text_d: "type",
  float_a: "price"
}
# Inventory item from inventory 1, televisions
{
  inventory_id: 1,
  int_a: 12345,
  text_a: "Samsung LCD 40 inches",
  text_b: "582903-4",
  text_c: "Samsung",
  int_b: 40,
  text_d: "LCD",
  float_a: 999.95
}
This has worked well... up to a point. It's clunky, it's unintuitive, and it lacks scalability. We have to devote resources to setting up inventory schemas. Using separate tables is not an option.
Enter NoSQL. With it, we could let each and every item have its own parameters and still store them together. From the research I've done, it certainly seems like a great alternative for this situation.
Specifically, I've looked at CouchDB and MongoDB. Both look great. However, there are a few other bits and pieces we need to be able to do with our inventory:
We need to be able to select items from only one (or several) inventories.
We need to be able to filter items based on their parameters (e.g. get all items from inventory 2 where type is 'hotel').
We need to be able to group items based on their parameters (e.g. get the lowest price from items in inventory 1 where brand is 'Samsung').
We need to (potentially) be able to retrieve thousands of items at a time.
We need to be able to access the data from multiple applications; both backend (to process data) and frontend (to display data).
Rapid bulk insertion is desired, though not required.
Based on the structure, and the requirements, are either CouchDB or MongoDB suitable for us? If so, which one will be the best fit?
Thanks for reading, and thanks in advance for answers.
EDIT: One of the reasons I like CouchDB is that it would be possible for us, in the frontend application, to request data via JavaScript directly from the server after page load, and display the results without having to use any backend code whatsoever. This would lead to faster page loads and less server strain, as the fetching/processing of the data would be done client-side.
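For what it's worth, a minimal browser-side sketch of that idea, assuming a CouchDB database named inventories with a hypothetical design document items exposing a by_inventory view (all names here are made up):

// Query a CouchDB view over plain HTTP, straight from the browser.
// Database name, design document, and view name are hypothetical.
fetch('/inventories/_design/items/_view/by_inventory?key=2')
  .then(function (response) { return response.json(); })
  .then(function (result) {
    // Each row holds whatever the view's map function emitted as its value.
    result.rows.forEach(function (row) {
      console.log(row.value);
    });
  });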

I work on MongoDB, so you should take this with a grain of salt, but this looks like a great fit for Mongo.
We need to be able to select items from only one (or several) inventories.
It's easy to run ad hoc queries on any field.
We need to be able to filter items based on their parameters (e.g. get all items from inventory 2 where type is 'hotel').
The query for this would be: {"inventory_id" : 2, "type" : "hotel"}.
We need to be able to group items based on their parameters (e.g. get the lowest price from items in inventory 1 where brand is 'Samsung').
Again, super easy: db.items.find({"inventory_id" : 1, "brand" : "Samsung"}).sort({"price" : 1}).limit(1)
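And if you wanted the grouping done server-side across several brands at once, the aggregation framework (available in later MongoDB releases) could sketch it like this, again assuming an items collection:

db.items.aggregate([
  { $match: { inventory_id: 1 } },
  { $group: { _id: "$brand", lowest_price: { $min: "$price" } } }
])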
We need to (potentially) be able to retrieve thousands of items at a time.
No problem.
Rapid bulk insertion is desired, though not required.
MongoDB has much faster bulk inserts than CouchDB.
Also, there's a REST interface for MongoDB: http://github.com/kchodorow/sleepy.mongoose
You might want to read http://chemeo.com/doc/technology, which describes how they dealt with the arbitrary property search problem with MongoDB.

Related

Nosql database design - MongoDB

I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way of designing the database structure (documents). At first I thought of giving everything its own schema, as above: 3 documents. But since this is a NoSQL db, I should actually try to eliminate the need for a join. And now I am really thinking of putting everything into a single document, which also sounds crazy.
These are my actually queries from ui:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any join, but I would also be saving, for example, the title of the topic multiple times: once in every comment.
I just could not decide.
I appreciate any help.
You can do the second design you suggested, but it all comes down to how you want to use the data. I assume you're going to be using it for a website.
If you want the comments to be clickable, such that clicking on the topic name redirects to the topic's page or clicking the username redirects to the user's page where you can see all their comments, I suggest you keep them as IDs. You can later use .populate("field1 field2") to select the fields you would like to get from that ID.
Alternatively you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
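For illustration, a rough Mongoose sketch of the ID-based variant; the schema and field names here are hypothetical:

// Hypothetical Mongoose schema: each comment stores ObjectId references.
var mongoose = require('mongoose');

var commentSchema = new mongoose.Schema({
  text: String,
  createdDate: { type: Date, default: Date.now },
  author: { type: mongoose.Schema.Types.ObjectId, ref: 'Author' },
  topic: { type: mongoose.Schema.Types.ObjectId, ref: 'Topic' }
});
var Comment = mongoose.model('Comment', commentSchema);

// Resolve the references when rendering a topic page:
function commentsForTopic(topicId, done) {
  Comment.find({ topic: topicId })
    .populate('author', 'username') // pull only the username field
    .exec(done);                    // comments[i].author.username is available
}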
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections with one field each, are optional, to maintain uniqueness.
Note the group-by queries will be aggregation queries, for example, the main query will be like this:
db.comments.aggregate( [
  { $match: { date: ISODate("2019-11-15") } },
  { $group: { _id: "$topic", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
] )
This will give you all the topic names for today, with the highest-counted topics first.
You could also take a somewhat different approach. Storing information redundantly is not a bad thing in all cases.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this as two extra fields on your Topic entity: one recording the last date a comment was added, and the other counting the comments added that day. By doing so you do not need a join, but can write a query that only looks at the Topic collection, as sketched below.
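A minimal sketch of maintaining those two fields on each new comment (the field names commentsToday and lastCommentDate are made up; topicId refers to the topic the comment belongs to). Exactly one of the two updates matches:

// Same-day comment: just bump the counter.
db.topics.update(
  { _id: topicId, lastCommentDate: ISODate("2019-11-15") },
  { $inc: { commentsToday: 1 } }
);
// First comment of a new day: reset the counter.
db.topics.update(
  { _id: topicId, lastCommentDate: { $ne: ISODate("2019-11-15") } },
  { $set: { lastCommentDate: ISODate("2019-11-15"), commentsToday: 1 } }
);

The homepage query then becomes a find on topics alone, e.g. db.topics.find({ lastCommentDate: ISODate("2019-11-15") }).sort({ commentsToday: -1 }).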
You could also store these statistics independently of the other data and update it when required. Think of this as having a document that describes your database its current state (at least those parts relevant to you).
This adds a time penalty when storing information, but it improves read times.
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
As far as I understand this one, you only need the topic title, meaning you can query the database once and retrieve all titles. If the collection grows so big that this becomes slow, you could change the retrieval query to return only a subset, as sketched below (a user is not likely to go through 100 possible topics).
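For example, a sketch of such a title-only subset query, assuming a topics collection (a case-insensitive substring match; an anchored prefix match could additionally use an index):

db.topics.find(
  { title: { $regex: "X", $options: "i" } },
  { title: 1 }
).limit(10)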
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what you want to do, then you are most likely best off storing all the data in one document. However, I would ask you: what is the problem with making more than one query? I doubt you will be showing all comments at once when there are thousands (as you say). Instead of storing each in a separate document, or throwing everything into one document, you could also bucket them and retrieve only the 20 most recent ones (if you create buckets of size 20), as sketched below. Read more about the bucket pattern here, and update the ones shown when required.
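A rough sketch of that bucket idea (collection and field names are made up): each bucket document holds at most 20 comments for one topic, and the upsert rolls over to a fresh bucket once the current one fills up:

db.commentBuckets.update(
  { topicId: topicId, count: { $lt: 20 } }, // find a bucket with room left
  {
    $push: { comments: { author: authorId, text: text, createdDate: new Date() } },
    $inc: { count: 1 } // when every bucket is full, nothing matches and the upsert starts a new one
  },
  { upsert: true }
);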
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I"ll make an argument from a 'domain driven design' point of view.
Given that all your data exists within the same bounded context (business domain). Then it is acceptable to encapsulate it all within the same document!

Many-to-many relationship in a MongoDB-based e-learning web app?

I am relatively new to NoSQL databases. I am designing a data structure for an e-learning web app. There will be X courses and Y users.
Every user will be able to take any number of courses.
Every course will be compound of many sections (each section may be a video or a quiz).
I will need to keep track of every section a user takes, so I think the whole course should be part of each user's document, like so:
{
  _id: "ed",
  name: "Eduardo Ibarra",
  courses: [
    {
      name: "Node JS",
      progress: "100%",
      section: [
        { name: "Introduction", passed: "100%", field3: "x", field4: "" },
        { name: "Quiz 1", passed: "75%", questions: [...], field3: "x", field4: "" }
      ]
    },
    {
      name: "MongoDB",
      progress: "65%",
      ...
    }
  ]
}
Is this the best way to do it?
I would say: design your database depending on your queries. One thing is for sure: you will have to do some embedding.
If you are going to perform more queries on what a user is doing, then make the user the primary entity and embed the courses within it. You don't need to embed the entire course info, because the info about a course is static. For example, the data about the Node JS course (i.e. the content, the author of the course, exercise files, etc.) will not change. So you can keep the courses' info separately in another collection. But how much of a course a user has completed depends on the individual user, so you should only keep the id of the course (which is stored in the separate 'courses' collection), and for each user store the information related to that (User, Course) pair embedded in the user document itself, as sketched below.
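To make that split concrete, a sketch of the two collections (all field names are illustrative):

// courses collection: static, shared course content
{
  _id: "node-js",
  name: "Node JS",
  author: "...",
  sections: [ { name: "Introduction" }, { name: "Quiz 1", questions: [] } ]
}

// users collection: only the per-user (User, Course) state, keyed by course id
{
  _id: "ed",
  name: "Eduardo Ibarra",
  courses: [
    { course_id: "node-js", progress: "100%",
      section: [ { name: "Introduction", passed: "100%" } ] }
  ]
}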
Now the most important question: what to do if you have to perform queries which require a 'join' of the user and course collections? For this you can use JavaScript to first get the courses (maybe storing them in an array or list) and then fetch the users for those courses, or vice versa, merging the results in your application. There are a few drivers available online to help you accomplish this; one is UnityJDBC.
From my experience, knowing what you are going to query from MongoDB is very helpful in designing your database, because the NoSQL nature of MongoDB means there is no single correct design. Every design is incorrect if it does not let you accomplish your task. So clearly, knowing beforehand what you will do with the database (i.e. what you will query) is the only guide.

Structuring a Cassandra database

I don't understand one thing about Cassandra. Say I have a website similar to Facebook, where people can share, like, comment, upload images, and so on.
Now, let's say, I want to get all of the things my friends did:
Username1 liked your comment
Username2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new column family for every single thing, for example: user_likes, user_comments, user_shares. Basically, anything you can think of. And even after I do that, I would still need to create secondary indexes for most of the columns just so I could search the data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' ids and then search through all of those column families for each user id?
EDIT
OK, so I did some more reading and now I understand things a little bit better, but I still can't really figure out how to structure my tables. So I will set a bounty, and I want to get a clear example of how my tables should look if I want to store and retrieve data in this kind of order:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the ten last uploaded files of all my friends or the people I follow; this is how it would look:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything like comments and shares would be similar to that...
Now probably the biggest challenge would be to retrieve the last 10 things across all categories together, so the list would be a mix of all of them...
Now I don't need an answer with fully detailed tables; I just need some really clear examples of how I would structure and retrieve the data like I would do in MySQL with joins.
With SQL, you structure your tables to normalize your data, and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query items which your friends uploaded. One way to do this is to have a single table per user, and write to this table whenever a friend of that user uploads something.
friendUploads { #column family
  userid { #row key
    timestamp-upload-id : null #column name : no value
  }
}
as an example,
friendUploads {
  userA {
    12313-upload5 : null
    12512-upload6 : null
    13512-upload8 : null
  }
}
friendUploads {
  userB {
    11313-upload3 : null
    12512-upload6 : null
  }
}
Note that upload6 is duplicated under both userA and userB, as whoever did upload6 is a friend of both users.
Now to query the friend-uploads display for a user, do a getSlice with a limit of 10 on that user's row. This will return the first 10 entries, sorted by column name.
To put newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
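In modern CQL terms (rather than the Thrift-era getSlice API described above), the same wide-row design with reverse ordering could be sketched like this, here through the DataStax cassandra-driver for Node.js; table and column names are made up:

var cassandra = require('cassandra-driver');
var client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'app'
});

// Table sketch: one partition per user, newest uploads first.
//
// CREATE TABLE friend_uploads (
//   userid      text,
//   upload_time timestamp,
//   upload_id   text,
//   PRIMARY KEY (userid, upload_time, upload_id)
// ) WITH CLUSTERING ORDER BY (upload_time DESC, upload_id ASC);

// The "getSlice with a limit of 10" becomes a plain SELECT with LIMIT;
// the clustering order plays the role of the reverse comparator.
client.execute(
  'SELECT upload_time, upload_id FROM friend_uploads WHERE userid = ? LIMIT 10',
  ['userA'],
  { prepare: true },
  function (err, result) {
    if (err) throw err;
    result.rows.forEach(function (row) {
      console.log(row.upload_time, row.upload_id);
    });
  }
);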
The drawback to this approach is that when User A uploads a song, you have to do N writes to update the friendUploads rows, where N is the number of people who are friends of User A.
For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a JSON blob), or you can store nothing and fetch the upload information using the upload id.
To avoid duplicating writes, you can use a structure like:
userUploads { #column family
  userid { #row key
    timestamp-upload-id : null #column name : no value
  }
}
This stores the uploads for a particular user. Now when you want to display the uploads of User B's friends, you have to do N queries, one for each friend of User B, and merge the results in your application. This is slower to query, but faster to write.
Most likely, if users can have thousands of friends, you would use the first scheme, and do more writes rather than more queries, as you can do the writes in the background after the user uploads, but the queries have to happen while the user is waiting.
As an example of denormalization, look at how many writes twitter rainbird does when a single click occurs. Each write is used to support a single query.
In some regards, you "can" treat noSQL as a relational store. In others, you can denormalize to make things faster. For instance, PlayOrm's @OneToMany stores the many side like so:
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide-row approach, so when you find your user, you have all the foreign keys to his friends. Each row can be a different length. You may also have a reverse reference stored as well, so the user might have references to the people that marked him as a friend but whom he did not mark back (let's call it buddy), so you might have
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so it can scale to trillions of rows).
A row can have millions of columns in it, or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the noSQL patterns this month, so if you keep an eye on that, you can also learn more about general noSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests (because each request requires a network round trip).
Here comes the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra, joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose (to fetch all related data with a single query).
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
And look into the twissandra sample project: https://github.com/twissandra/twissandra
It is a nice collection of optimization techniques for the kind of project you described.

Ensure data coherence across documents in MongoDB

I'm working on a Rails app that implements some social network features such as relationships, following, etc. So far everything was fine until I ran into a problem with many-to-many relations. As you know, Mongo lacks joins, so the recommended workaround is to store the relation as an array of ids on both related documents. OK, it's a bit redundant, but it should work. Let's say:
field :followers, type: Array, default: []
field :following, type: Array, default: []

def follow!(who)
  # self starts following who: record both sides of the relation
  self.following << who.id
  who.followers << self.id
  self.save
  who.save
end
That works pretty well, but this is one of those cases where we would need a transaction. Uh, but Mongo doesn't support transactions. What if the id is added to the followed user's followers list but not to the follower's following list? I mean, what if the first document is modified properly but the second for some reason can't be updated?
Maybe I'm too pessimistic, but isn't there a better solution?
I would recommend storing relationships in one direction only, keeping the users someone follows in their user document as "following". Then if you need to query for all followers of user U1, you can query for { following : "U1" } on the users collection. Since you can have a multikey index on an array, this query will be fast if you index this field.
The other reason to go in that direction only is that a single user has a practical limit to how many different users they may be following, but the number of followers that a really popular user may have could be close to the total number of users in your system. You want to avoid creating an array in a document that could be that large.
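A small shell sketch of that one-direction design (variable names are illustrative; the ensureIndex call reflects the shell syntax of the era):

// Each user document stores only the users they follow.
db.users.update(
  { _id: followerId },
  { $addToSet: { following: followedId } }
);

// A multikey index on the array makes the reverse lookup fast:
db.users.ensureIndex({ following: 1 });

// "Who follows U1?" is then a single indexed query:
db.users.find({ following: "U1" });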

How can I efficiently use MongoDB to create real-time analytics with pivots?

So I'm continuously getting a ton of data that's being put into a processedData collection. The data looks like:
{
  date: "2011-12-4",
  time: 2243,
  gender: {
    males: 1231,
    females: 322
  },
  age: 32
}
So I'll get lots and lots of data objects like this, continually. I want to be able to see all "males" that are above 40 years old. This does not seem like an efficient query, because of the sheer size of the data.
Any tips?
Generally speaking, you can't.
However, there may be some shortcuts, depending on the actual requirements. Do you want to count 'males above 40' across the whole dataset, or just for one day?
1 day: split your data into daily collections (processedData-20111121, ...); this will help your queries. You can also cache the results of such a query.
whole dataset: pre-aggregate the data. That is, upon insertion of a new data entry, do something like this:
db.preaggregated.update(
  { _id : 'male_40' },
  { $set : { gender : 'm', age : 40 }, $inc : { count : 1231 } },
  true
);
Similarly, if you know all your queries beforehand, you can just precalculate them (and not keep raw data).
It also depends on how you define "real-time" and how big a query load you will have. In some cases it is OK to just fire ad hoc map-reduces.
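For instance, the 'males above 40' count could be an ad hoc aggregation over the raw documents (a sketch against the document shape shown in the question, using the aggregation framework rather than map-reduce):

db.processedData.aggregate([
  { $match: { age: { $gt: 40 } } },
  { $group: { _id: null, males: { $sum: "$gender.males" } } }
])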
My guess is that your target GUI is a website? In that case you are looking for something called comet. You should build a layer which processes all the data and broadcasts new mutations to your clients or to an event bus (more on that below). Mongo doesn't enable real-time data out of the box, as it doesn't emit anything on a mutation, so you can use any data store which suits you.
Depending on the language you'll use, you have different options (for comet):
Socket.io (nodejs) - Javascript
Cometd - Java
SignalR - C#
Libwebsocket - C++
Most of the time you'll need an event bus or message queue to put the mutation events on. Take a look at JMS, Redis, or NServiceBus (depending on what you'll use).
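As a minimal illustration of such a comet layer, a Socket.io sketch that broadcasts each mutation to connected browsers (the event and function names are made up):

var io = require('socket.io')(3000);

// Called by the processing layer whenever a new data object has been stored.
function onNewData(doc) {
  io.emit('mutation', doc); // push to every connected client
}

// Browser side (sketch):
// var socket = io('http://example.com:3000');
// socket.on('mutation', function (doc) { updateCharts(doc); });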