I work in the promotional products industry. We sell pretty much anything that you can print, embroider, engrave, or any other method to customize. Popular products are pens, mugs, shirts, caps, etc. Because we have such a large variety of products, storing information about these products including all the possible product options, decoration options, and all associated extra charges gets extremely complicated. So much so, that although many have tried, no one has been able to provide industry product data in such a way that you could algorithmically turn the data into an eCommerce store without some degree of data massaging. It seems near impossible to store this information to properly in a relational database. I am curious if MongoDB, or any other NoSQL option, would allow me to model the information in such a way that makes it easier to store and manipulate our product information better than an RDBMS like MySQL. The company I work for is over 100 years old and has been using DB2 on an AS400 for many years. I'll need some good reasons to convince them to go with a non relational DB solution.
A common example product used in our industry is the Bic Clic Stic Pen which has over 20 color options each for barrel and trim colors. Even more colors to choose for silkscreen decoration. Then you can choose additional options for what type of ink to use. There are multiple options for packaging. After all that is selected, you have additional option for rush processing. All of these options may or may not have additional charges that can be based on how many pens you order or how many colors in your decoration. Pricing is usually based on quantity, so ordering 250 pens would be cost more per pen than ordering 1000. Similarly, the extra charge for getting special ink would be cheaper per pen ordered when you order 1000 than 250.
Without wanting to sound harsh, this has the ring of a silver bullet question.
You have an inherently complex business domain. It's not clear to me that a different way of storing your data will have any impact on that complexity - storing documents rather than relational data probably doesn't make it easier to price your pen at $0.02 less if the customer orders more than 250.
I'd recommend focussing on the business domain, and not worrying too much about the storage mechanism. I'm a big fan of Domain Driven Design - this sounds like a perfect case for that approach.
Using a document database won't solve your problem completely, but it probably can help.
If your documents represent the options available on a product and an order for that product, in most cases you will be accessing the document as a whole - it's nothing you can't do with SQL, but a good fit for a document database. Since the structure of the documents is flexible, it is relatively easy to define an object within the document as a complex type to define a particular option or rule without changing the database.
However, that only helps with the data - your real problem is on the UI side. The two documents together map directly to the order form, but whatever method you use to define the options/rules some of the products are going to end up with extremely complex settings pages.
Yes, MongoDB is what you need. It doesn't have a strict documents structure, so you'll be able to create set of models you need and embed them into your product page in any order and combinations you need. Actually its possible to work with this data without describing the real model fields directly, so I (for example) can use fields my Rails application doesn't know about at all.
Also MongoDB is extremely easy to set up for replication and sharding. Also it supports GridFS virtual filesystem, so you can store images for your products with documents which describe them and manipulate them as a single object easily.
You should definitely give it a try.
UPD: Anyway it would be good to keep your RDBMS for financial data and crunching numbers, like grouping reports for the sales analysys and so on. NoSQL bases aren't very good at this.
Related
I work for a financial organization where user can place 20 different type of investment requets like following.
lumpsum investment
setting up a regular saving plan
switch order.
Etc...
Each order has specific attributes and different from each other.
Traditionally I have been using relational database for 5 years but I think this what I don't want as I have to have a common schema which at times difficult to manage .
Apart from flexible storage (document style) I have requirements to Search all the orders based on certain attributes like customer name or order type.
Few NFRs
I need strong consistency
Volumes per day are limited so basically scaling is not a big problem.
Can someone suggest which NoSQL will meet my needs
One of the reasons NoSQL databases gained popularity was because of the need to scale effectively. If your primary requirement is to not manage a "common schema" , I am not sure if NoSQL is the best option for you.
Any way, with the requirements you have, you might want to look at CouchDB which is a document based NoSQL database that has document-level ACID semantics.
I guess this might be a good read. https://www.quora.com/What-are-the-differences-between-Cassandra-and-MongoDB . The question might not be stack overflow material that much. There is just so much other info one would need to make a decision.
Basically if your access pattern is not clear right away cassandra might not be the best choice. But once you know how you access your data etc. you might port to cassandra.
There are a lot of success stories on this subject.
If scaling is NOT the problem, I suggest you just to use PostgreSQL and store all specific attributes in one column in JSON format. Postgres have a good performance for storing and indexing such kind of data.
Over the last few days I've been working on a very simple web service for myself (and a few others) that allows me to keep track of books that I've read and when I've read them. Whilst storing users and books (titles + authors + maybe more data in the future) is relatively simple because they can just be stored as hashes with keys user:username and book:uniqueID respectively storing which users read which books and when is proving to be a bit more challenge.
My original plan was to have a sorted set for a user (user:username:readbooks) that used the timestamp as a score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't store that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track readers of a book I have to add them to a second set readersof:bookID.
My current approach that is rather than directly storing book IDs in the set user:username:readbooksto instead store a value in the form uniqueReadingEventId.bookId, however the problem with this is that if I delete a book (rather than the unique reading event) I have to iterate through every user in the set readersof:bookID, iterate through every value in user:username:readbooks and deleting values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
My question is therefore two fold: is there a simpler way to structure my data in Redis or is my data better structured to a different NoSQL system? I would really like to continue working with Redis because I like its API, however because it is a personal project it doesn't really matter what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document level information, and neither high-throughput nor data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schematic-- and from what you've said, there's really no reason SQL wouldn't best and most simply fit your use case. If you're married to the idea of using NoSQL, one of the more general use-case databases like Mongo would also serve well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of much less expensive HD space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.
I am developing a system which has multiple modules,
Social Media User Demography - (Document) - Name, loc, interests, work, education
Social Media User Connections - (Graph) - friends
CRM - (Rows and Columns) - telecom + banking etc
to name a few. I'm pretty sure that I have already crossed millions of records in each one of them.
When I look for a NoSql database to choose from I have at least 10 in each category. For Document database I have a arraylist right from MongoDB to DjonDB. Its the same case when I look for a graph database, so on and so forth. And also I have seen other key value store databases, columnar databases etc to name a few at http://nosql-database.org/.
So I wanted to know are there any generic thumb rules that I should follow to choose among these databases, when a columnar DB is optimized, to what type of data does a key value store suits best etc..
What are the best suited databases for what type of data and why? and most importantly
What are the worst suited databases for what type of data and why?
Thanks
This is a very open question, but I'll give it a shot.
Some things to keep in mind:
Pick a database that has a great (not good, great) community around it. You might find FooDB and it claims to do everything you want, but it no one is contributing to it, then you'll have built your application around a dead technology. You want active contributions, lots of customers with production deployments and ideally something not in its first version.
Try to find technologies that play well together. For example, Elastic Search, MongoDB, CouchDB, Couchbase all more or less work with JSON. That should help you narrow your choices of technologies.
I wouldn't try and spread myself too thin. Each type of database (graph, document, row/column, key value pair) has it's own learning curve. It takes quite a while to learn how to model data in a denormalized fashion. The more variety you have, the harder it will be to maintain all those different databases.
I don't know why I rarely see this as advice, but I would pick something you actually like developing in. Does the query syntax seem intuitive and fun? If not, you're going to hate developing in it. This isn't the most important factor, but I think it should be considered.
I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced about the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand NoSQL databases essentially miss ACID properties.
Can someone give an example of some real world operation (for example an e-commerce site, or a scientific application, or...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically with some kind of race condition or because of a power outage, etc ?
The perfect example will be something where there can't be any workaround without modifying the database engine. Examples where a NoSQL database just performs poorly will eventually be another question, but here I would like to see when theoretically we just can't use such technology.
Maybe finding such an example is database specific. If this is the case, let's take MongoDB to represent the NoSQL world.
Edit:
to clarify this question I don't want a debate about which kind of database is better for certain cases. I want to know if this technology can be an absolute dead-end in some cases because no matter how hard we try some kind of features that a SQL database provide cannot be implemented on top of nosql stores.
Since there are many nosql stores available I can accept to pick an existing nosql store as a support but what interest me most is the minimum subset of features a store should provide to be able to implement higher level features (like can transactions be implemented with a store that don't provide X...).
This question is a bit like asking what kind of program cannot be written in an imperative/functional language. Any Turing-complete language and express every program that can be solved by a Turing Maching. The question is do you as a programmer really want to write a accounting system for a fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL based engines can, the difference is you as a programmer may be responsible for logic in something Like Redis that MySQL gives you for free. SQL databases take a very conservative view of data integrity. The NoSQL movement relaxes those standards to gain better scalability, and to make tasks that are common to Web Applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, inserts very fast and drops the requirement for a strict scheme. In exchange users of MongoDB must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three phase commits), and we take a hit on storage efficiency.
CouchDB has similar trade-offs but also sacrifices ad-hoc queries for the ability to work with data off-line then sync with a server.
Redis and other key value stores require the programmer to write much of the index and join logic that is built in to SQL databases. In exchange an application can leverage domain knowledge about its data to make indexes and joins more efficient then the general solution the SQL would require. Redis also require all data to fit in RAM but in exchange gives performance on par with Memcache.
In the end you really can do everything MySQL or Postgres do with nothing more then the OS file system commands (after all that is how the people that wrote these database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First a clarification. While the field of relational stores is held together by a rather solid foundation of principles, with each vendor choosing to add value in features or pricing, the non-relational (nosql) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) which are great for content management and similar situations where you have a flat set of variable attributes that you want to build around a topic. Take site-customization. Using a document store to manage custom attributes that define the way a user wants to see his/her page is well suited to the platform. Despite their marketing hype, these stores don't tend to scale into terabytes that well. It can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase...) that are great for highly-distributed storage. Cassandra for low-latency, HBase for higher-latency. The trick with these is that you have to define your query needs before you start putting data in. They're not efficient for dynamic queries against any attribute. For instance, if you are building a customer event logging service, you'd want to set your key on the customer's unique attribute. From there, you could push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to try to go through the logs looking for log events where the type was "failure" unless you decided to make that your secondary key. One other thing: The last time I looked at Cassandra, you couldn't run regexp inside the M/R query. Means that, if you wanted to look for patterns in a field, you'd have to pull all instances of that field and then run it through a regexp to find the tuples you wanted.
Graph databases are very different from the two above. Relations between items(objects, tuples, elements) are fluid. They don't scale into terabytes, but that's not what they are designed for. They are great for asking questions like "hey, how many of my users lik the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects. You connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product. You want guaranteed integrity (or at least the best chance of guaranteed integrity). If a user loses his/her site-customization settings, no big deal. If you lose a commerce transation, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at-scale. And, that's okay because it's not the way they're supposed to work. Where you might put an identity for address_type into a customer_address table in a relational system, you would want to embed the address_type information in a customer tuple stored in a document or key/value. Data efficiency is not the domain of the document or key/value store. The point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled as "nosql" that I haven't covered here. There are a ton (122 at last count) different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the trick. The big-dollar relational vendors have been watching and chances are, they're all building or planning to build their own non-relational solutions to tie in with their products. Over the next couple years, if not sooner, we'll see the movement mature, large companies buy up the best of breed and relational vendors start offering integrated solutions, for those that haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes. HBase is a bit harder.
In any case, I hope I've informed without confusing, that I have enlightened without significant bias or error.
RDBMSes are good at joins, NoSQL engines usually aren't.
NoSQL engines is good at distributed scalability, RDBMSes usually aren't.
RDBMSes are good at data validation coinstraints, NoSQL engines usually aren't.
NoSQL engines are good at flexible and schema-less approaches, RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably answer to your question is that mongodb can handle any task (and sql too). But in some cases better to choose mongodb, in others sql database. About advantages and disadvantages you can read here.
Also as #Dmitry said mongodb open door for easy horizontal and vertical scaling with replication & sharding.
RDBMS enforce strong consistency while most no-sql are eventual consistent. So at a given point in time when data is read from a no-sql DB it might not represent the most up-to-date copy of that data.
A common example is a bank transaction, when a user withdraw money, node A is updated with this event, if at the same time node B is queried for this user's balance, it can return an outdated balance. This can't happen in RDBMS as the consistency attribute guarantees that data is updated before it can be read.
RDBMs are really good for quickly aggregating sums, averages, etc. from tables. e.g. SELECT SUM(x) FROM y WHERE z. It's something that is surprisingly hard to do in most NoSQL databases, if you want an answer at once. Some NoSQL stores provide map/reduce as a way of solving the same thing, but it is not real time in the same way it is in the SQL world.
I'm new to this whole NOSQL stuff and have recently been intrigued with mongoDB. I'm creating a new website from scratch and decided to go with MONGODB/NORM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense. I.E. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so mongodb has this concept of a DbReference which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document, Users attend events, Events have users attendees. I decided to embed a list of Events with limited data into the User objects. I embedded a list of Users also into the Event objects as their "attendees". The problem here is now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NOSQL way to do things. Retrieval is fast, but the fall-back is when I update the main User document, I need to also go into the Event objects, possibly find all references to that user and update that as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well that is the trade off with document stores. You can store in a normalized fashion like any standard RDMS, and you should strive for normalization as much as possible. It's only where its a performance hit that you should break normalization and flatten your data structures. The trade off is read efficiency vs update cost.
Mongo has really efficient indexes which can make normalizing easier like a traditional RDMS (most document stores do not give you this for free which is why Mongo is more of a hybrid instead of a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick and will help you normalize your data better.
I like to plot the efficiency of flatting a structure vs keeping it normalized when it comes to the time it takes me to update a records data vs reading out what I need in a query. You can do it in terms of big O notation but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much works is required.
Basically what I do is first try to predict the probability of how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read when it's both normalized or flattened (or maybe partially combination of the two I can conceive... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I plotted all the variables, if the savings of keeping it flat saves me a bunch, then I will keep it flat.
A few tips:
If you require fast lookups to be quick and atomic (perfectly up to date) you may want a favor a solution where you favor flattening over normalization and taking the hit on the update.
If you require update to be quick, and access immediately then favor normalization.
If you require fast lookups but don't require perfectly up to date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, and updates are rare, and do not necessarily require your update to be accessible immediately or require transaction level locking that it went through 100% of the time (to guarantee your update was written to disk), you can consider writing your updates to a queue processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later).
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are lot of other ideas that you can employ. There a lot of great blogs on line that go into it like highscalabilty.org and make sure you understand CAP theorem.
Also consider a caching layer, like Redis or memcache. I will put one of those products in front my data layer. When I query mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I will invalidate any data in the cache that references what I'm updating. (Although you have to take the time it takes to invalidate data and tracking data in the cache that is getting updated into consideration of your scaling factors). Someone once said "The two hardest things in Computer Science are naming things and cache invalidation."
Try adding an IList of type UserEvent property to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics
for examples.