Which NoSQL database to choose for a network daemon? - mongodb

I am writing a custom server, which should be very performant.
It has 100,000-600,000 clients connected and around 10 million records stored.
The database will run on a single server.
The server code is written with the Twisted framework (in Python).
Right now it uses MySQL, but I think a NoSQL database would be much more efficient (no complex queries, many simple writes / timestamp changes, and many simple reads).
Which NoSQL database should I go for? Easy indexing would be a plus: I want the option to search the database from an administration system, create groups from logs containing a specific keyword, and so on.
I had a look at Cassandra and MongoDB; MongoDB seemed easier to get into and use.
Thanks for the help!

As far as pure learning curve goes, MongoDB has positioned itself to be a very friendly alternative to MySQL. Cassandra is a very different beast and will have a higher learning curve. That said, both have the potential to solve your problem based upon what you describe.

You have pretty simple requirements: easy indexing, arbitrary searches, grouping on keyword, etc -- pretty much every NoSQL system would work. It really comes down to the technologies with which you're comfortable. Like C#? Then go with RavenDB -- it can even automatically add indices as you execute queries. Like Erlang? Then you're a freak, but you should go with CouchDB. Like Javascript and JSON? Go with MongoDB.
Personally I really like Mongo, as it feels like a lovely hybrid of SQL and NoSQL databases. You can index the hell out of it (and get amazing performance!), which makes it almost like an RDBMS. You can also use it as a pure key/value store - a "giant hashtable in the sky". Still, YMMV. Play with them and see what works for you.
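As a rough sketch of the indexing and keyword-search use case from the question, here is what that might look like with pymongo; the database, collection, and field names are made up for illustration, not a recommended design:

```python
from datetime import datetime, timezone

from pymongo import ASCENDING, DESCENDING, MongoClient

# Hypothetical database/collection for the daemon's log store.
logs = MongoClient()["daemon"]["logs"]

# Index the fields the admin system would filter and sort on.
logs.create_index([("client_id", ASCENDING)])
logs.create_index([("timestamp", DESCENDING)])
logs.create_index([("keywords", ASCENDING)])  # multikey index over an array field

# A simple write, as the daemon would do on each event.
logs.insert_one({
    "client_id": 42,
    "timestamp": datetime.now(timezone.utc),
    "message": "login failed",
    "keywords": ["login", "error"],
})

# "Group logs containing a specific keyword" from the admin side.
recent_errors = logs.find({"keywords": "error"}).sort("timestamp", DESCENDING)
```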

Cassandra is really designed for multiple server nodes, providing transparent replication, so you won't get the best value out of it on a single host. Cassandra is also designed primarily for large-scale data (and sacrifices indexing and flexible queries as a result). 10 million records isn't really very big, so you can afford to try something more flexible but less scalable.

Related

NoSQL in a single machine

As part of my university curriculum I ended up with a real project, which consists of helping a company shift from their relational data warehouse to a NoSQL data warehouse. They are looking for better performance in large jobs, but so far they have used a single machine, and if they do migrate to NoSQL they still wish to keep using a single machine for cost reasons.
As far as I know, the whole point of NoSQL is to run it on a large distributed system of several machines. So I don't see the point of this migration, especially since I am pretty sure (but not entirely) that if they do install NoSQL, they will probably end up with even worse performance.
Still, I am not comfortable telling them this since I am new to this area (less than a month), so I wonder: is there any situation where using NoSQL on a single machine for a data warehouse would be justifiable performance-wise? Or is it just a plain bad idea?
The answer to your question, like the answer to so many questions, is "it depends."
Ignoring the commentary on the question, I think there may be legitimacy to your client's request. Both relational and non-relational databases ultimately hold data in key-value tuples, with indexes and such to ensure quick access to the data. The difference is that SQL/relational databases carry an incredible amount of overhead trying to find the optimal way to retrieve results for an unknown set of queries, as well as ensuring stable concurrency. This overhead is computationally expensive and rarely yields the optimal solution. As a result, SQL databases often perform significantly slower on simple, repetitive queries.
NoSQL databases, on the other hand, are more of a "bare-bones" database, relying on programmers and intelligent design to achieve success. They are optimized to retrieve a value for a given key very quickly, often in under a millisecond. As a result, an increased up-front investment in the design yields superior, near-optimal performance. It will be necessary to weigh the cost-benefit of doing this up-front design, but it is all but guaranteed that the NoSQL approach will perform better regardless of the number of machines involved (in fact, SQL databases are very difficult or impossible to cluster, which is one of the main reasons NoSQL was developed).
Eventually we will see relational-like solutions implemented on NoSQL platforms. In fact, Mongo, Elasticsearch, and Couchbase (probably others) already have SQL-like query functionality. But right now, you are faced with this dilemma.
For a single machine, if the load is write-heavy - e.g. you're logging a lot of events - you could go for Cassandra. HBase is another alternative, but it's heavyweight and not recommended on a single node. If they expose an API in JSON, you could look into document-based DBs such as Couchbase or MongoDB. If you have an idea of the load, selecting a NoSQL data store is much easier.
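For instance, a write-heavy event log in Cassandra might look roughly like the sketch below (using the DataStax cassandra-driver for Python; the keyspace, table, and column names are invented for illustration):

```python
from datetime import datetime, timezone

from cassandra.cluster import Cluster  # DataStax cassandra-driver

# Assumed single local node; the schema below is only an example.
session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.events (
        source  text,
        ts      timestamp,
        message text,
        PRIMARY KEY (source, ts)        -- partition by source, cluster by time
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Appending events is the cheap, write-optimized path Cassandra is built for.
session.execute(
    "INSERT INTO logs.events (source, ts, message) VALUES (%s, %s, %s)",
    ("daemon-1", datetime.now(timezone.utc), "client connected"),
)
```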
If you're in a position where you need to pick one, I think you should look first at MongoDB. If you've never tried it, I really recommend you visit their live demo with tutorial and give it a try. If you like, download and follow the installation guide on their site. It's free, runs well on a single machine, and is incredibly easy to use.
In addition to MongoDB, I've used Oracle, SQL Server, MySQL, SQLite, and HBase. I understand Cassandra should be in the list but I've not tried it. With MongoDB, I was fully deployed and executing reads and writes from an application in like two hours. I attribute most of that to their website's clear and concise instructional content. The biggest learning curve was figuring out how the queries work for things like updating a record or deleting a record without deleting the entire set of similar records.
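To illustrate that last point about touching one record versus the whole set of similar records, here is a hedged pymongo sketch (current driver method names; the collection and fields are made up):

```python
from pymongo import MongoClient

# Hypothetical collection; the point is the one-vs-many distinction.
users = MongoClient()["app"]["users"]

# Update a single matching document ...
users.update_one({"email": "a@example.com"}, {"$set": {"active": False}})

# ... versus updating every document that matches the filter.
users.update_many({"plan": "trial"}, {"$set": {"active": False}})

# The same distinction applies to deletes: one document vs. the whole matching set.
users.delete_one({"email": "a@example.com"})
users.delete_many({"active": False})
```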
Regarding NoSQL vs RDBMS, some points to consider:
Adding a new column to an RDBMS table can lock the database in some cases or degrade performance in others
MongoDB is schema-less, so adding a new field does not affect old documents and is instant (think how flexible that really is - you can throw any dimension of data into the system without maintenance overhead; see the sketch after this list)
You're less likely to require a DBA to solve your schema problems when an application changes
I think problems related to table size are irrelevant, so you won't run into a scaling problem - just a disk space problem on a single machine
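A quick sketch of what "adding a field with no migration" looks like in practice, as referenced above (pymongo, with invented collection and field names):

```python
from pymongo import MongoClient

# Hypothetical collection used to show the "no migration" point.
events = MongoClient()["app"]["events"]

# Older documents were written without a "severity" field.
events.insert_one({"msg": "disk almost full"})

# Newer documents can simply include it - no ALTER TABLE, no downtime.
events.insert_one({"msg": "disk full", "severity": "critical"})

# Old documents are untouched; queries just treat the field as absent.
print(events.count_documents({"severity": {"$exists": False}}))
```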

Neo4j instead of relational database

I am implementing a Sinatra/Rails-based web portal that might eventually have a few many-to-many relationships between tables/models. This is a one-man, part-time team, but a real-world app.
I discussed my entity model with someone and was advised to try Neo4j. Coming from the 'non-sexy' enterprise world, my inclination is to use a relational DB until it stops scaling or becomes a nightmare because of sharding etc., and only then think about anything else.
HOWEVER,
I am using Postgres for the first time in this project, along with DataMapper, and it's taking me a while to get up to speed.
I am just trying out a few things and building more use cases, so I constantly have to update my schema (prototyping ideas and feedback from the beta). I won't have to do this in Neo4j (except for changing my queries).
It seems very easy to set up search using Neo4j, but Postgres can do full-text search as well.
Postgres recently announced support for JSON and JavaScript. I'm wondering if I should just stick with PG and invest more time learning it (it has a good community) instead of Neo4j.
I'm looking for use cases where Neo4j is better, especially in the prototyping/initial phase of a project. I understand that if the website grows, I might end up with multiple persistence technologies like S3, relational (PG), Mongo, etc.
It would also be good to know how it plays with the Rails/Ruby ecosystem.
Update1:
I got a lot of good answers, and it seems the right thing to do is stick with Postgres for now (especially since I deploy to Heroku).
However, the idea of being schema-less is tempting. Basically, I am thinking of an approach where you don't define a data model until you have, say, 100-150 users and have figured out a good schema (business use cases) for your product, while you are just demoing the concept and getting feedback from limited signups. Then one can decide on a schema and move to relational.
It would be nice to know if there are easy-to-use schemaless persistence options (judged on ease of use/setup for a new user), even if they give up, say, scaling.
Graph databases should be considered if you have a really chaotic data model. They were designed to express highly complex relationships between entities. To do that, they store relationships at the data level, whereas an RDBMS uses a declarative approach. Storing relationships only makes sense if those relationships are very varied; otherwise you'll just end up duplicating data over and over, taking up a lot of space for nothing.
To need such variety in relationships, you'd have to handle a huge amount of data. This is where graph databases shine, because instead of doing tons of joins, they just pick a record and follow its relationships. To support my statement: you'll notice that every use case on Neo4j's website deals with very complex data.
In brief, if what I said above doesn't apply to you, I think you should use another technology. If this is just about scaling, schemalessness, or starting a project fast, then look at other NoSQL solutions (more specifically, either column- or document-oriented databases). Otherwise, you should stick with PostgreSQL. You could also, as you said, consider polyglot persistence.
About your update, you might consider hstore. I think it fits your requirements. It's a PostgreSQL module that also works on Heroku.
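For reference, a minimal sketch of what hstore usage can look like from Python with psycopg2; the connection string, table, and keys are made up for illustration:

```python
import psycopg2
import psycopg2.extras

# Assumed connection string; adjust for your environment.
conn = psycopg2.connect("dbname=myapp user=myapp")

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS profiles (
            id    serial PRIMARY KEY,
            name  text NOT NULL,
            attrs hstore              -- schemaless key/value bag per row
        )
    """)

# Map hstore columns to/from Python dicts.
psycopg2.extras.register_hstore(conn)

with conn, conn.cursor() as cur:
    # "Adding a field" needs no migration: just use a new key.
    cur.execute(
        "INSERT INTO profiles (name, attrs) VALUES (%s, %s)",
        ("alice", {"beta_tester": "true", "referrer": "newsletter"}),
    )
    cur.execute("SELECT name FROM profiles WHERE attrs -> 'beta_tester' = 'true'")
    print(cur.fetchall())
```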
I don't think I agree that you should only use a graph database when your data model is very complex. I'm sure they can handle simple data models/relationships as well.
If you have no prior experience with Neo4j or Postgres, then most likely both will take quite a bit of time to learn well.
Some things to keep in mind when picking:
It's not just about development against a database technology. You should consider deployment as well. How easy is it to deploy and scale Postgres/Neo4j?
Consider the community and tools around each technology. Is there a data mapper for Neo4j like there is for Postgres?
Consider that the data models are considerably different between the two. If you can already think relationally, then I'd probably stick with Postgres. If you go with Neo4j you're going to be making a lot of mistakes for several months with your data models.
Over time I've learned to keep it simple when I can. Postgres might be the boring choice compared to Neo4j, but boring doesn't keep you up at night. =)
Also I never see anyone mention it, but you should look at Riak (http://basho.com/riak/) too. It's a document database that also provides relationships (links) between objects. Not as mature as a graph database, but it can connect a few entities quickly.
The most appropriate choice depends on what problem you are trying to solve.
If you just have a few many to many tables, a relational database can be fine. In general, there is better OR-mapper support for relational databases, as they are much older and have a standardized interface and row-column structure. They also have been improved on for a long time, so they are stable and optimized for what they are doing.
A graph database is better if, e.g., your problem is more about the connections between entities, especially if you need higher-distance connections, like "detect cycles (of unspecified length)" or "what do friends-of-a-friend like?". Things like that get unwieldy when restricted to SQL joins. A problem-specific language like Cypher, in Neo4j's case, makes that much more concise. On the downside, there are mappers between graph DBs and objects, but not for every framework and language under the sun.
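As an illustration of how concise such queries can be, here is a hedged friend-of-a-friend sketch using the official neo4j Python driver; the connection details and the :Person/:FRIEND schema are assumptions, not part of the original discussion:

```python
from neo4j import GraphDatabase

# Assumed local instance and credentials; adjust for your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# "People my friends know whom I don't know yet" in one Cypher pattern.
FOAF_QUERY = """
MATCH (me:Person {name: $name})-[:FRIEND]->(:Person)-[:FRIEND]->(fof:Person)
WHERE fof <> me AND NOT (me)-[:FRIEND]->(fof)
RETURN DISTINCT fof.name AS suggestion
"""

with driver.session() as session:
    for record in session.run(FOAF_QUERY, name="Alice"):
        print(record["suggestion"])

driver.close()
```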
I recently implemented a system prototype using Neo4j, and it was very useful to be able to talk about the structure and connections of our data and model that one-to-one in the data store. Also, adding other connections between data points was easy, Neo4j being schemaless storage. We ended up switching to MongoDB due to trouble with write performance, but I don't think we could have finished the prototype in the same amount of time with it.
Other NoSQL datastores (document-based, column, key-value) also cover specific use cases. Polyglot persistence is definitely something to look at, so keep your choice of backend reasonably separated from your business logic, so you can change your technology later if you learn something new.

rewrite website architecture using a nosql database

I want to rewrite an existing website for a client that has 100,000+ visitors a day, and I am considering using Cassandra, CouchDB or MongoDB instead of MySQL, and coupling it with Solr.
What I want to ask is whether it is a good idea to switch to NoSQL for a website that sits on a single server (I would not use multiple nodes for now).
What problems may arise in the long term? I am a little afraid of using NoSQL because these DBs are relatively young, but the speed gain for queries makes it really attractive.
I am using PHP as the backend programming language.
Thanks
Although the platforms you mention are very young compared to SQL, they have now been around long enough that they are somewhat mature and you don't risk much by using them instead of SQL if they fit what you are trying to do.
However, in this case it may be better to stick with SQL - you already have all the code working well with SQL, and you can get most of the performance improvements you need by adding a search engine or cache component rather than rewriting the entire system.
If the rewrite is something you were planning to do anyway, you can use any datastore you want - just pick the one where the standard datamodel is closest to your data and the queries you need to support.
I suspect the most difficult thing will be transforming your data model for a NoSQL DB. There will be no JOINs, and 'workarounds' for joins are not that straightforward in NoSQL databases.
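For example, one common workaround is to either embed the related data in a single document or do the 'join' in application code with two queries and an in-memory merge; a hedged pymongo sketch with invented collection and field names:

```python
from pymongo import MongoClient

# Hypothetical collections standing in for two formerly joined tables.
db = MongoClient()["shop"]

# SQL: SELECT ... FROM orders JOIN customers ON orders.customer_id = customers.id
# NoSQL workaround: two round trips plus an in-memory merge.
orders = list(db.orders.find({"status": "open"}))
customer_ids = list({o["customer_id"] for o in orders})
customers = {c["_id"]: c for c in db.customers.find({"_id": {"$in": customer_ids}})}

for order in orders:
    order["customer"] = customers.get(order["customer_id"])
```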
Also, performance is not guaranteed out of the box; you will have to work for it. NoSQL databases relax constraints on your data, which gives developers more options for working with that data, which in turn enables higher-performance solutions.
Many NoSQL DBs are still quite young. They may be used in many successful projects, but in general they are not yet as reliable as popular relational DBs. Of course, it is unlikely for them to fail in a big way, but the likelihood of small bugs here and there is higher.
Perhaps the most well-known failure associated with NoSQL was Foursquare's MongoDB outage, but it doesn't look like that big of a deal to me.

MongoDB for personal non-distributed work

This might be answered here (or elsewhere) before but I keep getting mixed/no views on the internet.
I have never used anything except SQL-like databases, and then I came across NoSQL DBs (MongoDB, specifically). I tried my hand at it just for fun, but everywhere the talk is that it is really great when used across distributed servers. So I wonder: is it helpful (in a non-trivial way) for small projects run mainly on a personal computer? Are there real advantages when there is just one server?
Although it would be cool to use MapReduce (and talk about it to peers :D), won't it be overkill for small projects run on a single server? Or are there other advantages? I need some clear thinking here. Sorry if I sound naive.
Optional: some examples of where/how you have used it would be great.
Thanks.
IMHO, MongoDB is perfectly valid for single-server/small projects; it's not a prerequisite that you only use it for "big data" or multi-server projects.
If MongoDB solves a particular requirement, the scale of the project doesn't matter, so don't let that aspect sway you. Using MapReduce may be overkill/not the best approach if you truly have low-volume data and just want to do some basic aggregations - these can be done using the group operator (which currently has some limitations on how much data it can return).
So I guess what I'm saying in general is, use the right tool for the job. There's nothing wrong with using MongoDB on small projects/single PC. If a RDBMS like SQL Server provides a better fit for your project then use that. If a NoSQL technology like MongoDB fits, then use that.
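For the kind of low-volume aggregation mentioned above, the aggregation pipeline (which in current MongoDB versions has superseded the old group command) is usually enough; a hedged pymongo sketch with an invented collection:

```python
from pymongo import MongoClient

# Hypothetical collection; the pipeline covers a simple group-by.
sales = MongoClient()["demo"]["sales"]

# Roughly: SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC
pipeline = [
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in sales.aggregate(pipeline):
    print(row["_id"], row["total"])
```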
+1 on AdaTheDev - but there are 3 more things to note here:
Durability: From version 1.8 onwards, MongoDB has single-server durability when started with --journal, so now it's more applicable to single-server scenarios
Choosing a NoSQL DB over, say, an RDBMS shouldn't be decided by the single- or multi-server setting, but by the modelling of the data. See for example 1 and 2 - it's easy to store comment-like structures in MongoDB (a sketch follows this list).
MapReduce: again, it depends on the data modelling and the operation/calculation that needs to occur. Depending on the way you model your data you may or may not need to use MapReduce.
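As a concrete illustration of the comment-like structures mentioned in the second point, a blog post and its comments can live in one document; a hedged pymongo sketch with invented field names:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Hypothetical blog collection showing an embedded, comment-like structure.
posts = MongoClient()["blog"]["posts"]

posts.insert_one({
    "title": "Why we tried MongoDB",
    "body": "...",
    "comments": [  # embedded: one read returns the post and its comments
        {"author": "alice", "text": "Nice write-up", "at": datetime.now(timezone.utc)},
    ],
})

# Adding a comment is a single in-place update - no join table involved.
posts.update_one(
    {"title": "Why we tried MongoDB"},
    {"$push": {"comments": {"author": "bob", "text": "+1", "at": datetime.now(timezone.utc)}}},
)
```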

Are document-oriented databases meant to replace relational databases?

Recently I've been working a little with MongoDB and I have to say I really like it. However, it is a completely different type of database than I am used to. I've noticed that it is most definitely better for certain types of data; however, for heavily normalized databases it might not be the best choice.
It appears to me however that it can completely take the place of just about any relational database you may have and in most cases perform better, which is mind boggling. This leads me to ask a few questions:
Are document-oriented databases being developed to be the next generation of databases and basically replace relational databases completely?
Is it possible that projects would be better off using both a document-oriented database and a relational database side by side for various data which is better suited for one or the other?
If document-oriented databases are not meant to replace relational databases, then does anyone have an example of a database structure which would absolutely be better off in a relational database (or vice-versa)?
Are document-oriented databases being developed to be the next generation of databases and basically replace relational databases completely?
No. Document-oriented databases (like MongoDB) are very good at the type of tasks that we typically see in modern web sites (fast look-ups of individual items or small sets of items).
But they make some big trade-offs with relational systems. Without things like ACID compliance they're not going to be able to replace certain RDBMS. And if you look at systems like MongoDB, the lack of ACID compliance is a big reason it's so fast.
Is it possible that projects would be better off using both a document-oriented database and a relational database side by side for various data which is better suited for one or the other?
Yes. In fact, I'm running a very large production website that uses both. The system started on MySQL, but we've migrated part of it over to MongoDB, because we need a key-value store and MySQL just isn't very good at finding one item among 150M records.
If document-oriented databases are not meant to replace relational databases, then does anyone have an example of a database structure which would absolutely be better off in a relational database (or vice-versa)?
Document-oriented databases are great at storing data that fits easily into "key-value" and simple, linear "parent-child" relationships. Simple examples here are things like blogs and wikis.
However, relational databases still have a strong leg up on things like reporting, which tends to be "set-based".
Honestly, I can see a world where most data is "handled" by Document-oriented database, but where the reporting is done in a relational database that is updated by Map-reduce jobs.
This is really a question of fitness for purpose.
If you want to be able to join some tables together and return a filtered set of results, you can only do that with a relational database. If you want mind-bending performance and have incredible volumes of data, that's when column-family or document-oriented databases come into their own.
This is a classic trade-off. Relational databases offer you a whole suite of features, which comes with a performance cost. If you give up joins, ad-hoc indexing, scans, and a whole list of other features, you remove the need to have any view over ALL the data, which gives you the performance and distribution you need to crunch serious data.
Also, I recommend you follow the blogs of Ayende Rahien on this topic.
http://ayende.com/blog/
#Sohnee is spot on. I might add that relational databases
are excellent for retrieving information in unexpected combinations -- even if that occasionally leads to the Bad Idea of extensive reports being run on time-sensitive production systems rather than on a separate data warehouse.
are a mature technology where you can easily find staff and well tested solutions to any number of problems (including the limitations of the relational model, as well as the imperfect implementation that is SQL).
Ask yourself what you want to do, and what qualities are important to you. You can do everything programming related in shell scripts. Do you want to?
I keep asking the same question, which is what landed me here. I use both MySQL and MongoDB (not in tandem currently, though it's an idea). I have to honestly say I'm very happy to never touch MySQL again. Sure, there's "ACID" compliance, but have you ever run into the need to repair your tables with MySQL? Have you ever had a corrupted database? It happens. Have you ever had any other issues with MySQL? Any lock contention or deadlocks? Any problems with clustering? How easy was it to set up and configure?
MongoDB... you turn it on and it's done. Then there's auto-sharding. It's incredibly simple and it's also incredibly fast. So think about that: your time.
No, they don't have JOINs, but it's a completely incorrect statement to say that this rules out more than 99% of data management needs. I often get opposition when trying to explain MongoDB, people even snickering. Let's just face it: people don't want to learn new things, and they think that what they know is all they need. Sure, you can get away with using MySQL for the rest of your life and build your websites. It works, we know it works. We also know it fails. If it didn't, you'd never ask the question and we probably wouldn't see so many document-oriented databases. We know that, yes, it does scale, but it's a pain in the rear to scale it.
Also let's eliminate traffic and scaling from the picture. Take out setup. Now let's focus on use. What is your experience when using MySQL? How good are you with MySQL architecture and making efficient queries? How much time do you spend looking over queries with EXPLAIN? How much time do you spend making schema diagrams? ... I say take that time back. It's better spent elsewhere.
That's my two cents. I really do love MongoDB and hope to never use MySQL again and for the type of web sites I build, it's very possible that I won't need to. Though I'm still trying to find out WHEN I would want to use MySQL over MongoDB, not when I CAN (let's face it, it stores data, congratulations, I could write a ton of XML files too but it's not a good idea), but when it would BENEFIT to use one or the other. In the meantime, I'm going to go do my job with MongoDB and have less headaches.
As long as you don't need multi-object transactions, MongoDB can be a favorable replacement for an RDBMS, especially in a web application context. Speed, schemalessness, and document modeling are all helpful in this domain.
In my opinion, document-oriented databases are only good for:
Databases whose data is better represented using a hierarchical (tree) model. This is not common for website databases.
Databases with huge amounts of data, like the Facebook and Amazon databases. In this case you have to sacrifice the benefits of the relational model.
MongoDB's main characteristics are:
document model (JSON)
high level (close to real-world objects), fewer collections
sharding (optional)
programmer-friendly
drivers that use the same data structures (arrays/hash maps)
Document databases
A document is more general than a table; it's far easier to represent a table row as a JSON document than to store JSON in a table.
So yes, document databases could replace table databases.
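A hypothetical illustration of that asymmetry (plain Python, no particular database):

```python
# A row from a table such as users(id, name, email) ...
row = (7, "alice", "alice@example.com")

# ... maps trivially onto a document:
doc = {"_id": 7, "name": "alice", "email": "alice@example.com"}

# The reverse is harder: a nested document has no single-table shape.
nested = {
    "_id": 7,
    "name": "alice",
    "addresses": [{"city": "Oslo"}, {"city": "Bergen"}],
}
# Storing `nested` relationally needs a second table (addresses) and a join to read it back.
```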
Sharding
Joins in sharded collections are expensive for any database.
MongoDB added $lookup years ago, and in MongoDB 5.1+ it can be used even when both collections are sharded (a sketch follows below).
But joins in distributed databases appear to be slow and should be avoided, so the relational way of modelling should be avoided as well.
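For reference, a $lookup-based join in the aggregation pipeline looks roughly like this (pymongo; the orders/customers collections and fields are invented for illustration):

```python
from pymongo import MongoClient

# Hypothetical orders/customers collections to show a $lookup join.
db = MongoClient()["shop"]

pipeline = [
    {"$match": {"status": "open"}},
    {"$lookup": {                       # left outer join onto "customers"
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }},
    {"$unwind": "$customer"},           # one joined document per order
]
for order in db.orders.aggregate(pipeline):
    print(order["_id"], order["customer"]["name"])
```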
No sharding
I think when sharding isn't used, MongoDB will co-exist and overlap with relational databases (especially after ACID and $lookup support); replacing them is hard and doesn't look like MongoDB's goal right now.
So overall, it looks like MongoDB could do what relational databases do, but for now it's not a replacement.
The opposite isn't true: relational databases have much bigger problems if they try to behave like MongoDB.
AFAIK, document databases don't have JOIN. That's pretty much a show-stopper for > 99% of data management needs.
As Matthew Flaschen points out in the comments, even on the desktop, databases such as SQLite are introducing SQL semantics to areas that have traditionally used proprietary file formats or XML.