Up to this point, I have been using MongoDB (Node.js + Mongoose) to save posts which belong to a user, so that I can later retrieve them to display in a stream (just like Facebook, Twitter, etc.)
It recently became necessary to allow the user to deeply search his stream; MongoDB's search was insufficient, so I implemented ElasticSearch on my servers (Amazon EC2 m1.large instances running CentOS, FWIW).
My question: I'm now in a position where I'm duplicating the data between MongoDB (where the user's stream is cached) and ElasticSearch (where it is searched).
Is there any disadvantage to moving my cache ENTIRELY into ElasticSearch and getting rid of MongoDB altogether? It seems a waste to double the storage, and there's no other place where I access this data (it is only used when presenting/searching the stream of posts).
Specifically, I want to make sure I'm not overlooking anything re: performance. I like the idea of removing MongoDB as a bottleneck, yet I worry about the memory overhead of ElasticSearch. MongoDB runs on its own server in my cloud setup, whereas ElasticSearch runs on the same instances as node.js. This means I would have MORE ElasticSearch servers (the node.js servers are in an auto-scaling array), but none of them would be a DEDICATED server (unlike MongoDB).
The only big obstacle to using ES as a "primary datasource" is that there isn't a good backup mechanism right now. The ES team is working on it and expects it to be out by the end of the year, but in the meantime you'll have to implement your own backup scripts.
As far as performance, it's really hard to say because almost every situation is unique. ES benefits from memory - so more is always better. In particular, sorts/filters/facets/geo all like to eat memory. If you aren't doing much in the way of faceting, for example, you may be fine with less memory.
ES doesn't need to run on a dedicated node...but it will happily use as many resources as you give it.
Another option is to use just the Elasticsearch indexes. You can choose not to store the source data in ES in retrievable form, so you search in ES and then retrieve the full documents from MongoDB as needed.
The question below covers exactly that:
Storing only selected fields and not storing _all in pyes/elasticsearch
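For illustration, here's a minimal Node.js sketch of that pattern, assuming an ES index named posts (with _source disabled in its mapping) and a MongoDB posts collection. The names, and the exact call shape of the official client (which varies between 7.x and 8.x), are assumptions:

```js
// Hypothetical sketch: keep full posts in MongoDB, index only searchable
// fields in ES with "_source": { "enabled": false } in the mapping, then
// hydrate the hits from MongoDB. `db` is an already-connected Db instance
// from the official mongodb driver.
const { Client } = require('@elastic/elasticsearch');
const { ObjectId } = require('mongodb');

const es = new Client({ node: 'http://localhost:9200' });

async function searchPosts(db, userId, text) {
  // Only IDs and scores come back from ES, since _source is disabled.
  // (On 7.x clients the query goes under `body` and hits live on `result.body`.)
  const result = await es.search({
    index: 'posts',
    query: {
      bool: {
        filter: [{ term: { userId } }],
        must: [{ match: { body: text } }],
      },
    },
  });

  // Hydrate the full documents from MongoDB, preserving ES's ranking order.
  const ids = result.hits.hits.map((hit) => new ObjectId(hit._id));
  const docs = await db.collection('posts').find({ _id: { $in: ids } }).toArray();
  const byId = new Map(docs.map((d) => [d._id.toString(), d]));
  return ids.map((id) => byId.get(id.toString())).filter(Boolean);
}
```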
Related
We have a new project that needs to index a large amount of data and serve it in real time. I also have complex searches with facets, full text, geospatial...
The first prototype indexes into MongoDB and then into Elasticsearch, because I had read that Elasticsearch does not apply a checksum to stored files and so the index can't be fully trusted.
But since recent versions (1.5), there is now a checksum, and I'm wondering whether we can use Elasticsearch as the primary data store. And what is the benefit of using MongoDB in addition to Elasticsearch?
I can't find an up-to-date answer about these features in Elasticsearch.
Thanks a lot
Talking about arguments to use Mongo instead of/together with ES:
User/role management.
Built into MongoDB. It may not fit all your needs and may be clumsy in places, but it exists and was implemented a long time ago.
The only security option for ES is Shield, but it ships only with a Gold/Platinum subscription for production use.
Schema
ES is schemaless, but it's built on top of Lucene and written in Java. The core idea of this tool is to index and search documents, and working this way requires index consistency. On the back end, all documents have to fit into a flat Lucene index, which requires some understanding of how ES should deal with your nested documents and values, and of how you should organize your indexes to balance speed against data completeness/consistency. Working with ES requires you to keep some schema considerations in mind constantly. For example: you can index almost anything into ES without putting a corresponding mapping in place first, and ES will "guess" the mapping on the fly, but sometimes it guesses wrong, and sometimes an implicit mapping is evil, because once it is put in place it can't be changed without reindexing the whole index. So it's better not to treat ES as a schemaless store, because you will step on a rake at some point (and it will be painful :) ), but rather treat it as schema-intensive, at least when you work with documents that can be sliced into concrete fields.
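To make that concrete, here's a hedged sketch of pinning an explicit mapping up front instead of relying on dynamic guessing. The index and field names are made up, and the exact request shape differs slightly between client versions:

```js
// Hypothetical mapping: declare field types yourself so ES never has to guess,
// and use dynamic: 'strict' so undeclared fields are rejected instead of
// silently getting a wrong implicit mapping.
const { Client } = require('@elastic/elasticsearch');

async function createPostsIndex() {
  const es = new Client({ node: 'http://localhost:9200' });
  await es.indices.create({
    index: 'posts',
    mappings: {
      dynamic: 'strict',
      properties: {
        userId:    { type: 'keyword' },
        body:      { type: 'text' },
        createdAt: { type: 'date' },
        location:  { type: 'geo_point' },
      },
    },
  });
}
```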
Mongo, on the other hand, can "chew and leave no crumbs" of almost anything you put into it. And most of your queries will work fine, as long as you remember how Mongo will treat your data from a JavaScript perspective. And since JS is weakly typed, you can work with a truly schemaless workflow (if that's what you really need).
Handling non-table-like data.
ES is limited when it comes to handling data without putting it into the search index. That is good enough when you only need to store and retrieve some extra data alongside the data you want to search against.
MongoDB supports GridFS. This gives you the ability to handle large chunks of data behind the same interface, i.e. you can store binary data in Mongo and retrieve it through the same interface, from your code's perspective.
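As a rough sketch of what that looks like from Node.js (the bucket and file names are just examples):

```js
// Hypothetical sketch: storing and retrieving a binary file through GridFS
// with the official Node.js driver.
const fs = require('fs');
const { MongoClient, GridFSBucket } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('app');
  const bucket = new GridFSBucket(db, { bucketName: 'attachments' });

  // Upload: stream a local file into GridFS.
  fs.createReadStream('./avatar.png')
    .pipe(bucket.openUploadStream('avatar.png'))
    .on('finish', () => {
      // Download: stream it back out through the same interface.
      bucket.openDownloadStreamByName('avatar.png')
        .pipe(fs.createWriteStream('./avatar-copy.png'))
        .on('finish', () => client.close());
    });
}
main();
```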
Well, choose the right tool for the right job. If you require search capabilities such as full-text search, faceting, etc., then nothing can beat a full-fledged search engine. ElasticSearch (ES) vs. Solr is just a matter of choice.
You can actually feed(index) documents into ES for searching and then fetch the complete details for a particular entry from MongoDB or any other database.
To make your task easier, take a look at my open-source work that uses MongoDB, ES, Redis and RabbitMQ, all integrated in one place, here on GitHub.
Please note that the application is built in .Net C#.
After having used Elasticsearch in production, I can add a few notes to this thread:
We secured our Elasticsearch cluster via a reverse proxy that checks client certificate authenticity at request time before letting the query through, which shows there are multiple ways to add authentication anyway. (If you need finer-grained security, such as roles, there are a few plugins that can be added to manage permissions.)
Elasticsearch mapping and settings (tuning) are really important concepts to fully understand before going to production with it, and it's not that easy to grasp quickly how everything works.
Clustering and horizontal scaling are very flexible and easy to set up.
The suite of tools (Kibana, Beats, etc.) is a very convenient way to gather logs, expose key data, etc.
Search features are extremely advanced; you can really do amazing things once you understand a bit of how full-text search works (fuzziness, boosting, scoring, stemming, tokenizers, analyzers, and so on).
APIs are a bit scattered and there is no single way to achieve something. And some APIs are really WTF to use, like the bulk insert API: you need to pass newline-delimited JSON as a binary payload (and don't forget the trailing newline character) while repeating some fields multiple times. This is very verbose, and I guess it's legacy code like we all have in our projects ;).
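To illustrate the action-line/source-line pairing being complained about, here's a hedged sketch using the official JS client, which flattens the pairs into newline-delimited JSON for you. Index and field names are invented, and the option is named `body` instead of `operations` on older client versions:

```js
// Hypothetical sketch of the bulk format: one action line per document,
// followed by its source line, flattened into a single array that the client
// serializes as newline-delimited JSON.
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

async function bulkIndexPosts(posts) {
  const operations = posts.flatMap((p) => [
    { index: { _index: 'posts', _id: p.id } },   // action line (repeated per doc)
    { userId: p.userId, body: p.body },          // source line
  ]);

  const response = await es.bulk({ operations, refresh: true });
  if (response.errors) {
    console.error('some bulk items failed', response.items);
  }
}
```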
Last thing: if you develop a Java project, do not use Hibernate Search to duplicate data from a datasource to your ES cluster. We had so many issues with Hibernate Search that, if we had to do it again, we'd do it manually.
Now, about the real question:
To my mind, using only Elasticsearch is sufficient and may reduce the complexity of running multiple NoSQL storage systems.
I think the duo is worth it when you pair a relational/transactional database with a NoSQL search engine, but having two systems that roughly serve the same purpose is a bit of overkill.
I recently developed a feature at my company:
we wanted to perform some searches and rank the results by relevance based on multiple factors and conditions.
In my application we were already using MongoDB as the DB,
so into an ElasticSearch index I exported only those fields from MongoDB that I wanted to search and filter on.
According to the required conditions I prepared both my Mongo query and my Elasticsearch query, performed the search, and then filtered and sorted the results as needed.
The whole flow was designed in such a way that,
even if there is an error from ES, Mongo will still fetch the records.
If I do get a result from ES, the Mongo query depends on the ES result.
This is how I used mongo and ES in combination.
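A minimal sketch of that fallback flow, assuming a Mongoose Post model and an ES index named posts (both names, and the require path, are assumptions):

```js
// Hypothetical fallback flow: rank in Elasticsearch first, hydrate from
// MongoDB; if ES errors out, serve the records from Mongo alone.
const { Client } = require('@elastic/elasticsearch');
const Post = require('./models/post');            // hypothetical Mongoose model

const es = new Client({ node: 'http://localhost:9200' });

async function searchPosts(userId, text) {
  try {
    const result = await es.search({
      index: 'posts',
      query: {
        bool: { filter: [{ term: { userId } }], must: [{ match: { body: text } }] },
      },
    });
    const ids = result.hits.hits.map((h) => h._id);
    // The Mongo query depends on the ES result: fetch full docs, keep ES order.
    const docs = await Post.find({ _id: { $in: ids } });
    const byId = new Map(docs.map((d) => [String(d._id), d]));
    return ids.map((id) => byId.get(String(id))).filter(Boolean);
  } catch (err) {
    // Fallback path: Mongo alone still fetches the records.
    // (Escape `text` before building a regex in real code.)
    return Post.find({ user: userId, body: new RegExp(text, 'i') }).limit(50);
  }
}
```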
Also, don't forget to properly handle all updates, deletes and new record insertions.
And just so you know, the results for me were really good.
I am working on the high-level design of a web application with the following characteristics:
Millions of records
Heavily indexed/searchable by various criteria
Variable document schema
Regular updates in blocks of 10K - 200K records at a time
Data needs to remain highly available during updates
Must scale horizontally effectively
Today, this application exists in MySQL and we suffer from a few huge problems, particularly that it is challenging to adapt to flexible schema, and that large bulk updates lock the data for 10 - 15 seconds at a time, which is unacceptable. Some of these things can be tackled by better database design within the context of MySQL, however, I am looking for a better "next generation" solution.
I have never used MongoDB, but its feature set seemed to most closely match what I am looking for, so that was my first area of interest. It has some things I am excited about, such as data sharding, the ability to find-update-return in a single statement, and of course the schema flexibility of NoSQL.
There are two things I am not sure about, though, with MongoDB:
I can't seem to find solid information about the concurrency of updates with large data sets (see my use case above), so I have a hard time understanding how it might perform.
I do need open text search
That second requirement brought me to Lucene (or possibly to Solr if I kept it external) as a search store. I did read a few cases where Lucene was being used in place of a NoSQL database like MongoDB entirely, which made me wonder if I am over-complicating things by trying to use both in a single app -- perhaps I should just store everything directly in Lucene and run it like that?
Given the requirements above, does it seem like a combination of MongoDB and Lucene would make this work effectively? If not, might it be better to attempt to tackle it entirely in Lucene?
Currently with MongoDB, updates are locking at the server-level. There are a few JIRAs open that address this, planned for v1.9-2.0. I believe the current plan is to yield writes to allow reads to perform better.
With that said, there are plenty of great ways to scale MongoDB for super high concurrency - many of which are similar to MySQL. One such example is to use RAID 10. Another is to use master-slave where you write to the master and read from the slave.
You also need to consider if your "written" data needs to be 1) durable and 2) accessible via slaves immediately. The mongodb drivers allow you to specify if you want the data to be written to disk immediately (or hang in memory for the next fsync) and allow you to specify how many slaves the data should be written to. Both of these will slow down MongoDB writing, which as noted above can affect read performance.
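For reference, here's a hedged sketch of those knobs with the modern Node.js driver (older drivers spelled them differently, e.g. safe/fsync, and "slaves" are now called secondaries):

```js
// Hypothetical sketch of the durability trade-off: acknowledge fast from the
// primary only, or wait for journaling plus a majority of replicas.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const events = client.db('app').collection('events');

  // Fast path: acknowledged by the primary only.
  await events.insertOne({ type: 'view', ts: new Date() }, { writeConcern: { w: 1 } });

  // Durable path: journaled and replicated to a majority before acknowledging.
  // This is the slower option the answer warns about.
  await events.insertOne(
    { type: 'purchase', ts: new Date() },
    { writeConcern: { w: 'majority', j: true } }
  );

  await client.close();
}
main();
```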
MongoDB also does not have nearly the full-text search capability that Solr/Lucene has, and you will likely want to use both together. I am currently using both Solr and MongoDB together and am happy with it.
I would like all of my users to be able to read and write to the datastore very quickly. It seems like MongoDb has blazing reads, but the writes seem like they could be very very slow if the one master db needs to be located very far away from the client.
CouchDB seems to have slow reads, but what about writes in the case where the client is very far away from the master?
With couchdb, we can have multiple masters, meaning we can always have a write node close to the client. Could couchdb actually be faster for writes than mongodb in the case when our user base is spread very far out geographically?
I would love to use MongoDB due to its blazing fast speed, but some of my users who are very far away from the only master will have a horrible experience.
For worldwide types of systems, wouldn't CouchDB be better? Isn't MongoDB completely ruled out in the case where you have users all around the world?
MongoDb, if you're listening, why don't you do some simple multi-master setups, where conflict resolution can be part of the update semantic?
This seems to be the only thing standing between MongoDB and completely dominating the NoSQL market share. Everything else is very impressive.
Disclosure: I am a MongoDB fan and user; I have zero experience with CouchDB.
I have a heavy-duty app that is very read/write intensive. I'd say reads outnumber writes by a factor of around 30:1. The way Mongo is designed, reads are always going to be much faster than writes; the trick (in my experience) is to make your writes so efficient that you can dedicate a higher percentage of your system resources to the writes.
When building a product on top of Mongo, the key thing to remember is the _id field. This field is automatically generated and added to all of your JSON objects, and it will look something like 47cc67093475061e3d95369d. When you design your queries (finds), try to query on this field wherever possible: it is the automatically indexed primary key (the default ObjectId encodes a timestamp plus machine and process identifiers, not a disk location), so a find or update keyed on it is about the cheapest lookup Mongo can do. Consider this in the design of your system.
Example:
Two of the collections in my database are "users" and "posts". A user can create multiple posts. These two collections have to reference each other a lot in the implementation of my app.
In each post object I store the _id of the parent user.
In each user object I store an array of the _ids of all the posts the user has authored.
Now on each user page I can generate a list of all the authored posts without a resource-stressful query, just a direct lookup by _id. The bigger the Mongo cluster, the bigger the difference this is going to make.
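A small sketch of that layout with the plain Node.js driver (collection and field names are examples):

```js
// Hypothetical users/posts cross-referencing via plain ObjectId references.
const { MongoClient, ObjectId } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('app');

  // Each post stores the parent user's _id; each user stores an array of
  // the _ids of the posts they authored.
  const userId = new ObjectId();
  const postId = new ObjectId();
  await db.collection('users').insertOne({ _id: userId, name: 'alex', posts: [postId] });
  await db.collection('posts').insertOne({ _id: postId, author: userId, body: 'hello' });

  // Rendering a user page is then a direct lookup by _id plus one $in query,
  // both of which hit the default _id index.
  const user = await db.collection('users').findOne({ _id: userId });
  const posts = await db.collection('posts').find({ _id: { $in: user.posts } }).toArray();
  console.log(user.name, posts.length);

  await client.close();
}
main();
```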
If you're at all familiar with Oracle's physical-location rowids you may understand this concept, except that in Mongo it is much more awesome and powerful.
I was scared last year when we decided to finally ditch MySQL for mongo but I can tell you the following about my experience:
- Data porting is always horrible but it went as well as I could have imagined.
- Mongo is probably the best documented NoSQL DB out there and the Open Source community is fantastic.
- When they say fast and scalable they're not kidding; it flies.
- Schema design is very easy and much more natural and orderly than key/value type db's in my opinion.
- The whole system seems designed for minimal user complexity, adding nodes etc is a breeze.
Ok, seriously I swear mongo didn't pay me to write this (I wish) but apologies for the love fest.
Whatever your choice, best of luck.
Here is a detailed article that 10gen created; it gives examples of when you should choose MongoDB or CouchDB, with reasons as well.
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
Edit
The above link was removed, but can be viewed here: http://web.archive.org/web/20120614072025/http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
Your question, as it stands, is full of speculation and guessing.
...why can't we opt out of consistency for certain writes, so long as we're sure that the person that wrote the data will be able to read it consistently, whereas others will observe eventual consistency
What if those writes affect other writes? What if those writes would prevent other people from doing things? It's hard to tell the possible side effects, since you didn't tell us any specifics.
My main suggestion to you is that you do some testing. Unless you've tested it, speculation about bottlenecks is a complete waste of time. You don't need to test it via remote machines; set up some local DBs, add some artificial lag, then run your tests.
This way you can test the different options you've got and see where MongoDB is better or where CouchDB excels. Then you can either take one of them and live with its downsides, or you can try to tweak your database model itself and do more tests.
Nobody here will be able to give you a general solution to your specific problem (well, unless you give us all your code and pay us to work on it :P ). Databases aren't easy, especially if you need to scale them under certain requirements.
For worldwide types of systems, wouldn't CouchDB be better? Isn't MongoDB completely ruled out in the case where you have users all around the world?
MongoDB supports sharding. So you don't need a single master. In fact, it looks like you have a ready shard key (region).
MongoDB also supports replica sets along with sharding. So if you need to run in multiple data centers (DCs) you put a master and one of the replicas in the same DC. In fact, they also suggest adding a 3rd node to a separate DC as a hot backup failover.
You will have to drill into the more detailed configuration of MongoDB, but you can definitely control where data is stored, and you can prioritize which replicas in a DC are "next in line" for promotion to primary.
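As a rough mongo-shell sketch of that kind of setup (shard, tag and member names are invented, and newer releases rename the tag commands to zones):

```js
// Hypothetical sketch: shard on a region key and prefer in-DC replicas for
// promotion, so each geography's data lives near its users.
sh.enableSharding('app');
sh.shardCollection('app.users', { region: 1, _id: 1 });

// Pin a region's range to a shard located in the matching data center.
sh.addShardTag('shardEU', 'EU');
sh.addTagRange('app.users',
  { region: 'eu', _id: MinKey },
  { region: 'eu', _id: MaxKey },
  'EU');

// In each replica set, give the in-DC secondary a higher priority than the
// remote hot-backup node, so failover stays local when possible.
cfg = rs.conf();
cfg.members[1].priority = 2;    // secondary in the same DC
cfg.members[2].priority = 0.5;  // backup node in the remote DC
rs.reconfig(cfg);
```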
At this point however, you're well into the details of MongoDB and you'll need to dig around and "play" quite a bit. However, you'll need lots of "play time" for any solution that's really going to handle masters across data centers.
What scenario makes more sense - host several EC2 instances with MongoDB installed, or much rather use the Amazon SimpleDB webservice?
When having several EC2 instances with MongoDB I have the problem of setting the instance up by myself.
When using SimpleDB I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers, to either write to MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding and it has higher latency than mongodb or cassandra, it has a throughput limit and it is priced higher than other options. Scalability is manual (you have to shard).
If you need wider query options and you have a high read rate and you don't have so much data mongodb is better. But for durability, you need to use at least 2 mongodb server instances as master/slave. Otherwise you can lose the last minute of your data. Scalability is manual. It's much faster than simpledb. Autosharding is implemented in 1.6 version.
Cassandra has weak query options but is as durable as postgresql. It is as fast as mongo and faster on higher data size. Write operations are faster than read operations on cassandra. It can scale automatically by firing ec2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data cassandra is your best bet. No need to shard your data, it was designed distributed from the 1st day. You can have any number of copies for all your data and if some servers are dead it will automatically return the results from live ones and distribute the dead server's data to others. It's highly fault tolerant. You can include any number of instances, it's much easier to scale than other options. It has strong .net and java client options. They have connection pooling, load balancing, marking of dead servers,...
Another option is Hadoop for big data, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is bad and its price is high. If you want to use databases or SimpleDB you may also need data caching (e.g. memcached).
For web apps, if your data is small I recommend mongo, if it is large cassandra is better. You don't need a caching layer with mongo or cassandra, they are already fast. I don't recommend simpledb, it also locks you to Amazon as you said.
If you are using C#, Java or Scala you can write an interface and implement it for Mongo, MySQL, Cassandra or anything else as your data-access layer; it's even simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write a provider for two of them if you want and switch the storage, maybe even at runtime, with only a configuration change; it's all possible. Development with Mongo, Cassandra and SimpleDB is easier than with a relational database, and they are schema-free; it also depends on the client library/connector you're using. The simplest one is Mongo. There's only one index per table in Cassandra, so you have to manage other indexes yourself, but with the 0.7 release of Cassandra secondary indexes will be possible, as far as I know. You can also start with any of them and replace it later if you have to.
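A tiny JavaScript sketch of that interface-and-provider idea (the store contract and the in-memory fallback are purely illustrative):

```js
// Hypothetical data-access contract: code against PostStore and swap
// implementations via configuration.
class PostStore {
  async save(post) { throw new Error('not implemented'); }
  async findByUser(userId) { throw new Error('not implemented'); }
}

class MongoPostStore extends PostStore {
  constructor(collection) { super(); this.collection = collection; }
  async save(post) { await this.collection.insertOne(post); }
  async findByUser(userId) { return this.collection.find({ userId }).toArray(); }
}

class InMemoryPostStore extends PostStore {
  constructor() { super(); this.posts = []; }
  async save(post) { this.posts.push(post); }
  async findByUser(userId) { return this.posts.filter((p) => p.userId === userId); }
}

// The rest of the app only sees PostStore, so switching stores is a config change.
function makeStore(config, deps) {
  return config.store === 'mongo' ? new MongoPostStore(deps.collection) : new InMemoryPostStore();
}
```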
I think you have both a question of time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / setup server instances for all them and figure out how they work.
On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and installed everywhere; S is proprietary and can run only on Amazon AWS. You also have to pay for a whole bunch of stuff with S.
S has a whole bunch of strange limitations; M's limitations are way more reasonable. The strangest ones are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return you) - 1Mb
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (though it is straightforward to learn the basics)
with M you have to learn how to architect your database. Because many people think that schemaless means you can throw any junk into the database and extract it with ease, they might be surprised that the "junk in, junk out" maxim still applies. I assume the same is true of S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to work around this limitation somehow (ugly, and it can't use an index) without introducing an additional lowercased field or extra application logic; see the sketch after this list.
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely. If 5 seconds pass and the query has not finished, you end up with a partial number and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers and dates) in S. M's type support is way richer.
both do not have transactions and joins
M supports compression, which is really helpful for NoSQL stores where the same field names are stored over and over again.
S supports just a single index; M has single-field, compound, multikey, geospatial, etc.
both support replication and sharding
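Here is the case-insensitive regex workaround mentioned in the list above, sketched with the Node.js driver (collection and field names are examples); note that a case-insensitive regex generally can't use an index, which is why a lowercased shadow field is the usual alternative:

```js
// Hypothetical sketch: case-insensitive exact-match lookup via $regex.
const { MongoClient } = require('mongodb');

async function findUserCaseInsensitive(db, name) {
  // Escape the input so user-supplied text is treated literally.
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return db.collection('users').findOne({ name: { $regex: `^${escaped}$`, $options: 'i' } });
}
```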
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like GROUP BY, SUM, AVG and DISTINCT, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. Mongo, on the other hand, supports a rich query language.
I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real-time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. As long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate on data sets larger than the amount of RAM you have (it writes to disk only for persistence; reads from disk happen only when restarting the server or after a crash), and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example, we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure: a link click may have a user token, link name and URL, while a page view may only have the user token and URL. Your first concern is just recording the fact that it happened and pushing whatever absolutely necessary data you need.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
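A hypothetical worker for that "stage one" pipeline might look like this in Node.js (list, database and collection names are examples), with producers doing an RPUSH and workers blocking on the other end:

```js
// Hypothetical worker: producers RPUSH raw events onto a Redis list, workers
// BRPOP them off and file them into permanent storage (MongoDB here).
const Redis = require('ioredis');
const { MongoClient } = require('mongodb');

async function runWorker() {
  const redis = new Redis('redis://localhost:6379');
  const mongo = await MongoClient.connect('mongodb://localhost:27017');
  const events = mongo.db('analytics').collection('events');

  for (;;) {
    // Block until an event is available, then take it off the end of the list.
    const [, raw] = await redis.brpop('events:page_viewed', 0);
    const event = JSON.parse(raw);

    // Any dedup/ID lookups would happen here before permanent storage.
    await events.insertOne({ ...event, processedAt: new Date() });
  }
}

runWorker().catch(console.error);

// A producer elsewhere just does:
//   redis.rpush('events:page_viewed', JSON.stringify({ userToken, url, ts: Date.now() }));
```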
MongoDB is good at document storage. Unlike Redis, it is able to deal with databases larger than RAM, and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema; you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 inserts per second is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary index on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attributes, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers whose first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out of each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries MongoDB $push is faster even when compared to Redis RPUSH, then it becomes incredibly slow; probably the MongoDB array type has linear insertion time and so it becomes slower and slower. MongoDB might gain a bit of performance by exposing a constant-time insertion list type, but even with the linear-time array type (which can guarantee constant-time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. The best advice probably would be to benchmark on your typical dataset and see for yourself.
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat files) I would go with Redis. It's blazing fast, will comfortably handle the load you're talking about, and, more importantly, you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
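A hedged sketch of that batching idea with the Node.js driver (the buffer size and collection name are arbitrary):

```js
// Hypothetical batching: buffer impressions in memory and flush them in one
// insertMany every N events, so each disk write covers many inserts.
const { MongoClient } = require('mongodb');

const BATCH_SIZE = 100;
let buffer = [];

async function recordImpression(collection, impression) {
  buffer.push(impression);
  if (buffer.length >= BATCH_SIZE) {
    const batch = buffer;
    buffer = [];
    // Unordered inserts let the server keep going past individual failures.
    await collection.insertMany(batch, { ordered: false });
  }
}
```

In practice you would also flush on a timer so a quiet period doesn't strand a partial batch in memory.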
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hands-on experience with MongoDB, CouchDB and Cassandra. I converted a lot of files to base64 strings and inserted these strings into each NoSQL store.
MongoDB was the fastest, Cassandra the slowest; CouchDB was slow too.
I think MySQL would be much faster than all of them, but I haven't tried MySQL for my test case yet.