NoSQL database and many semi-large blobs

Is there a NoSQL (or other type of) database suitable for storing a large number (i.e. >1 billion) of "medium-sized" blobs (i.e. 20 KB to 2 MB)? All I need is a mapping from A (an identifier) to B (a blob), the ability to retrieve B given A, a consistent external API for access, and the ability to "just add another computer" to scale the system.
Something simpler than a database, e.g. a distributed key-value system, may be just fine, and I'd appreciate any thoughts along those lines as well.
Thank you for reading.
Brian

If your API requirements are purely along the lines of "Get(key), Put(key,blob), Remove(key)" then a key-value store (or more accurately a "Persistent distributed hash table") is exactly what you are looking for.
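For illustration, here is a minimal sketch of the kind of interface such a store exposes from the client's point of view. The names are hypothetical, not any particular product's API:

```java
// Hypothetical client-side view of a persistent distributed hash table.
// get/put/remove map directly onto the requirements in the question; the
// store decides which node a given key lives on (e.g. via consistent hashing),
// so "just add another computer" is handled behind this interface.
public interface BlobStore {
    byte[] get(String key);             // returns null (or throws) if the key is absent
    void put(String key, byte[] blob);  // 20 KB - 2 MB payloads
    void remove(String key);
}
```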
There are quite a few of these available, but without additional information it is hard to make a solid recommendation: What OS are you targeting? Which language(s) are you developing in? What are the I/O characteristics of your app (cold/immutable data such as images? high write loads, like tweets?)
Some of the KV systems worth looking into:
- MemcacheDB
- Berkeley DB
- Voldemort
You may also want to look into document stores such as CouchDB or RavenDB. Document stores are similar to KV stores, but they understand the persistence format (usually JSON), so they can provide additional services such as indexing.
If you are developing in .NET, then skip directly to RavenDB (you'll thank me later).

What about Jackrabbit?
Apache Jackrabbit™ is a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and 283).
A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
I got to know Jackrabbit when I worked with the Liferay CMS. Liferay uses Jackrabbit to implement its Document Library; it stores user files in the server's file system.

You'll also want to take a look at Riak. Riak is very focused on doing exactly what you're asking (just add a node, easy to access).

Related

What are the pros and cons of a Relational DB vs Mongo vs Flat file behind a CDN

Let's say I have an e-commerce website with millions of products that gets millions of pageviews a day, mostly on product detail pages.
Let's say that I currently have all my data in a relational DB, the good old way.
What would be the pros and cons of keeping the data in the relational DB for doing queries, aggregating and filtering products and all that... but using flat JSON files for the product details?
So, one file per product, with all details serialized to JSON. These files would be placed behind a high-performance CDN, geographically distributed and all that. When the user goes to
www.mysite.com/prods/00123
the server (or even the client) would load a template file for the layout, and then fill it with the data it reads from something like cdn.mysite.com/prods/00123.json
So I basically don't need to do queries in this case - I jump straight to the file named after the product id. I guess it should be very fast, and yet I would delegate the scalability / caching / geographic distribution to a strong external partner (a CDN like Akamai, Amazon, etc.) instead of building my own (expensive and hard to maintain) distributed DB server?
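For concreteness, a minimal sketch of what that lookup might look like from a Java service; the CDN hostname and URL pattern are the hypothetical ones from the question:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProductJsonFetcher {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Builds the CDN URL straight from the product id - no query, no join,
    // just a lookup of a pre-rendered file such as cdn.mysite.com/prods/00123.json.
    public static String fetchProductJson(String productId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://cdn.mysite.com/prods/" + productId + ".json"))
                .GET()
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // the template layer merges this JSON into the page layout
    }
}
```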
I look forward to your suggestions / feedback...especially if it comes to real world experience :)
Thanks!
As per your requirements,
It is better to store product descriptions in a schema-free database like MongoDB, since your products can have very different fields with wide variation in the number of attributes (and corresponding fields). Also, such information is written far less often than it is read. MongoDB has collection-level write locks, which can deter write-heavy applications if you want to do consistent writes. However, reads in MongoDB are very fast because you don't have to do joins or fetch field values from an EAV schema table. Needless to say, based on your data volume, sharding and replication need to be done in a production environment.
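As a rough illustration of the schema-free approach, here is what that might look like with the current MongoDB Java driver; the database, collection, and field names are made up for this example:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class ProductCatalog {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products");

            // Schema-free: each product document carries only the attributes it has.
            products.insertOne(new Document("_id", "00123")
                    .append("name", "Example camera")
                    .append("attributes", new Document("megapixels", 24)
                            .append("lensMount", "EF")));

            // Single read by id - no joins, no EAV lookups.
            Document doc = products.find(eq("_id", "00123")).first();
            System.out.println(doc.toJson());
        }
    }
}
```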
It is better than storing in a flat file, since MongoDB's read performance is very good thanks to memory-mapped files, and you get replication/sharding as well.
However, if the filesystem (or the filesystem network) provides the security, speed and accessibility provided by the database then storing the data in filesystem is not a bad idea. The traditional db vs flat-file argument does not hold true if the flat files are configured to be served in an efficient manner.
However, you should not store information like shopping carts, checkout transactions, etc. in MongoDB, since you don't have ACID transactions, and frequent writes and updates "with consistency" are not MongoDB's cup of tea.

MongoDB/CouchDB for storing files + replication?

If I want to store a lot of files and replicate the DB, which NoSQL database would be best for this kind of job?
I was testing MongoDB and CouchDB, and these DBs are really nice and easy to use. If possible, I would use one of them for storing files. Now I see the differences between Mongo and Couch, but I cannot say which one is better for storing files. And when I'm talking about storing files, I mean files of 10-50 MB, but maybe also files of 50-500 MB - and maybe a lot of updates.
I found here a nice table:
http://weblogs.asp.net/britchie/archive/2010/08/17/document-databases-compared-mongodb-couchdb-and-ravendb.aspx
Still not sure which of these properties are best for file storage and replication. But maybe I should choose another NoSQL DB?
That table is way out of date:
- Master-slave replication has been deprecated in favour of replica sets, for starters, and the consistency column is wrong there as well. You will want to completely re-read that section of the MongoDB docs.
- Map/Reduce is JavaScript only; there are no other languages.
- I have no idea what that table means by attachments, but GridFS is a storage standard built into the drivers to help make storing large files in MongoDB easier (a minimal upload/download sketch follows this list). Metadata is also supported through this method.
- MongoDB is on version 2.2, so anything it mentions about earlier versions is now obsolete (e.g. sharding and single-server durability).
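For reference, a minimal sketch of storing and retrieving a large file through GridFS with the MongoDB Java driver; the connection string and file names are placeholders, and GridFS chunking is what lets files exceed the 16 MB document limit:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import org.bson.types.ObjectId;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class GridFsExample {
    public static void main(String[] args) throws Exception {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("files");
            GridFSBucket bucket = GridFSBuckets.create(db);

            // Upload: GridFS splits the file into chunks behind the scenes.
            ObjectId fileId;
            try (InputStream in = new FileInputStream("video.mp4")) {
                fileId = bucket.uploadFromStream("video.mp4", in);
            }

            // Download by id.
            try (OutputStream out = new FileOutputStream("video-copy.mp4")) {
                bucket.downloadToStream(fileId, out);
            }
        }
    }
}
```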
I do not have personal experience with CouchDB's interface for storing files, but I wouldn't be surprised if there were hardly any differences between the two. I think this part is too subjective for us to answer, and you will just need to go with whichever one suits you better.
It is actually possible to build multi-regional MongoDB clusters (which S3 buckets are not, and cannot be replicated as such without extra work) and replicate the most-accessed files to clusters in a specific part of the world through MongoDB.
The main upshot I have found is that MongoDB can act like S3 and CloudFront put together, which is great since you get redundant storage plus the ability to distribute your data.
That being said, S3 is a very valid option here, and I would seriously give it a try; you might not be looking for the same things I am in a content network.
Database storage of files does not come without serious downsides, but speed shouldn't be a huge problem here, since you should get roughly the same speed from S3 without CloudFront in front of it as you would from MongoDB (remember, S3 is a redundant storage network, not a CDN).
If you were to use S3, you would then store a row in your database that points to the file and holds metadata about it.
There is a project called CBFS by Dustin Sallings (one of the Couchbase founders, and creator of spymemcached and core contributor of memcached) and Marty Schoch that uses Couchbase and Go.
It's an infinite-node file store with redundancy and replication - basically your very own S3 that supports lots of different hardware and sizes. It uses REST HTTP PUT/GET/DELETE, etc., so it's very easy to use. Very fast, very powerful.
CBFS on Github: https://github.com/couchbaselabs/cbfs
Protocol: https://github.com/couchbaselabs/cbfs/wiki/Protocol
Blog Post: http://dustin.github.com/2012/09/27/cbfs.html
Diverse Hardware: https://plus.google.com/105229686595945792364/posts/9joBgjEt5PB
Other Cool Visuals:
http://www.youtube.com/watch?v=GiFMVfrNma8
http://www.youtube.com/watch?v=033iKVvrmcQ
Contact me if you have questions and I can put you in touch.
Have you considered Amazon S3 as an option? It's highly available, proven, and has redundant storage, etc.
CouchDB, even though I personally like it a lot (it works very well with node.js), has the disadvantage that you need to compact it regularly if you don't want to waste too much disk space. In your case, if you are going to be doing a lot of updates to the same documents, that might be an issue.
I can't really comment on MongoDB as I haven't used it, but again, if file storage is your main concern, then have a look at S3 and similar services, as they are completely focused on file storage.
You could combine the two, storing your metadata in a NoSQL or SQL datastore and your actual files in a separate file store, but keeping those two stores in sync and replicated might be tricky.

Using Multiple Database Types to Model Data in a single application

Does it make sense to break up the data model of an application into different database systems? For example, the application stores all user data and relationships in a graph database (ideal for storing relationships), while storing other data in a document database, such as CouchDB or MongoDB? This would require the user graph database to reference unique ids in the document databases and vice versa.
Is this over complicating the data model and application? Or is this using the best uses of both types of database systems for scaling your application?
It definitely can make sense, and it depends fully on the requirements of your application: use other database systems for the things they are really good at.
Take full-text search, for example. Of course you can do more or less complex full-text searches with a relational database like MySQL. But there are systems such as Lucene/Solr which are optimized for exactly that and can search millions of documents fast. So you could use those systems for their special task (here: a nifty full-text search), have them return the identifiers, and then load the structured relational data from the RDBMS.
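A rough sketch of that pattern with a recent Lucene version: the index field names, the articles table, and the JDBC connection are all hypothetical. Lucene answers "which documents match?", and the rows are then loaded from the RDBMS by primary key:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SearchThenLoad {
    public static void search(Connection rdbms, String userQuery) throws Exception {
        try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("lucene-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", new StandardAnalyzer()).parse(userQuery);

            for (ScoreDoc hit : searcher.search(query, 20).scoreDocs) {
                Document doc = searcher.doc(hit.doc);      // stored fields: here just the id
                long id = Long.parseLong(doc.get("id"));

                // The RDBMS stays the source of record for the structured data.
                try (PreparedStatement stmt = rdbms.prepareStatement(
                        "SELECT * FROM articles WHERE id = ?")) {
                    stmt.setLong(1, id);
                    try (ResultSet row = stmt.executeQuery()) {
                        // map the relational row to your domain object here
                    }
                }
            }
        }
    }
}
```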
Or CouchDB. I use CouchDB in some projects as a caching system, in combination with a relational database. Of course I need to take care of consistency, but it's definitely worth the effort. It boosted performance in those projects a lot and decreased server load, for example, from 2 to 0.2. :)
Something like this is called cross-store persistence. As you mentioned, you would store certain data in your relational database, social relationships in a graph DB, user-generated data (documents) in a document DB, and user-provided multimedia files (pictures, audio, video) in a blob store like S3.
It is mainly about looking at the use cases and making sure that, wherever you need it, you can get at the "primary" or index key of each store (back and forth). You can encapsulate the actual lookup in your domain or DAO layer, as sketched below.
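Here is a hand-rolled sketch of that idea; all the store interfaces, types, and method names are invented purely for illustration:

```java
import java.util.List;

// One DAO hides the fact that four different stores are involved.
// The shared userId is the "primary" key that every store agrees on.
public class UserProfileDao {

    // Minimal stand-ins for the individual stores.
    interface RelationalStore { UserProfile findUser(String userId); }         // RDBMS: account data
    interface GraphStore      { List<String> findFriendIds(String userId); }   // graph DB: relationships
    interface DocumentStore   { List<String> findDocumentIds(String userId); } // document DB: user content
    interface BlobStore       { String urlFor(String key); }                   // S3-like store: media

    public static class UserProfile {
        public String userId;
        public List<String> friendIds;
        public List<String> documentIds;
        public String avatarUrl;
    }

    private final RelationalStore relational;
    private final GraphStore graph;
    private final DocumentStore documents;
    private final BlobStore blobs;

    public UserProfileDao(RelationalStore r, GraphStore g, DocumentStore d, BlobStore b) {
        this.relational = r; this.graph = g; this.documents = d; this.blobs = b;
    }

    public UserProfile load(String userId) {
        UserProfile profile = relational.findUser(userId);
        profile.friendIds = graph.findFriendIds(userId);
        profile.documentIds = documents.findDocumentIds(userId);
        profile.avatarUrl = blobs.urlFor("avatars/" + userId);
        return profile;
    }
}
```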
Some frameworks, like the Spring Data projects, provide an initial kind of cross-store persistence out of the box, mostly integrating JPA with a different NoSQL datastore. For instance, Spring Data Graph allows you to store your entities in JPA and add social graphs or other highly interconnected data as a secondary concern, leveraging a graph DB for the typical traversals and other graph operations (e.g. ranking, suggestions, etc.).
Another term for this is polyglot persistence.
Here are two contrary positions on the question:
Pro:
"Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your usecases. For example file storages, SQL, graph databases, data ware houses, in-memory databases, network caches, NoSQL. Today there are mostly two storages used, files and SQL databases. Both are not optimal for every usecase."
http://codemonkeyism.com/nosql-polyglott-persistence/
Con:
"I don’t think I need to say that I’m a proponent of polyglot persistence. And that I believe in Unix tools philosophy. But while adding more components to your system, you should realize that such a system complexity is “exploding” and so will operational costs grow too (nb: do you remember why Twitter started to into using Cassandra?) . Not to mention that the more components your system has the more attention and care must be invested figuring out critical aspects like overall system availability, latency, throughput, and consistency."
http://nosql.mypopescu.com/post/1529816758/why-redis-and-memcached-cassandra-lucene

Embedded nosql open source java database

I'm developing an open source product and need an embedded dbms.
Can you recommend an embedded open source database that ...
Can handle objects over 10 GB each
Has a license friendly to embedding (LGPL, not GPL).
Is pure Java
Is (preferably) NoSQL; SQL might work, but NoSQL is preferred
I've looked over some of the document DBMSs, like MongoDB, but they seem to be limited to 4 or 16 MB documents.
Berkeley DB looked attractive but has a GPL-like license.
SQLite3 is attractive: good license, and you can compile it with whatever max blob size you like. But it's not Java. I know JDBC drivers exist, but we need a pure Java system.
Any suggestions?
Thanks
Steve
Although it's an old question, I've been looking into this recently and have come across the following (at least two of which were released after this question was asked). I'm not sure how any of them handle very large objects - at 10 GB you would probably have to do some serious testing, as I presume few database developers have objects of that size in mind for their products (just a guess). I would definitely consider storing them directly on disk, with just a reference to the file location in your database.
(Opinions below are all pretty superficial, by the way, as I haven't used them in earnest yet).
OrientDB looks like the most mature of the three I found. It is a document and/or graph database and claims to be super fast and light (making use of an "RB+Tree" data structure - a combination of B+ and red-black trees), with no external dependencies. There seems to be an active community developing it, with lots of commits over the last few days, for example. It's also compliant with the TinkerPop graph database standard, which adds another layer of features (such as the Gremlin graph query language). It's ACID compliant, has REST and other external APIs, and even a web-based management app (which presumably could be deployed with your embedded DB, but I'm not sure).
The next two fall more into the simple key-value store camp of the N(ot)O(nly)SQL world.
JDBM3 is an extremely minimal data store: it has a hash map, tree map, tree set and linked list which are written to disk through memory mapped files. It claims to be very light and fast, is fully transactional and is being actively developed.
HawtDB looks similarly simple and fast - a BTree- or hash-based index persisted to disk with memory-mapped files. It's (optionally) fully transactional. However, there have been no commits in the past seven months (as of the end of March 2012), and there's not much activity on the mailing list. That's not to say it's not a good library, but it's worth mentioning.
JDBM3 and HawtDB are pretty minimal, so you're not going to get any fancy GUIs. But I think they both look very attractive for their speed and simplicity.
Those are all I've found matching your requirements. In addition, Neo4j is great - a graph database which is now pretty mature and works very well in embedded mode. It's GPL/AGPL licensed, though, so it may require a paid license unless you can open-source your code too:
http://neotechnology.com/products/price-list/
Of course, you could also use the H2 SQL database with one big table and no indices!
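For completeness, a minimal sketch of that H2 approach over plain JDBC; the file path, table, and column names are arbitrary, and whether H2's BLOB handling is comfortable at 10 GB is exactly the kind of thing you would need to test:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class H2BlobStore {
    public static void main(String[] args) throws Exception {
        // Embedded, pure-Java, file-backed database; H2 is MPL/EPL licensed.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./blobstore", "sa", "")) {
            try (PreparedStatement ddl = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS blobs(id VARCHAR(64) PRIMARY KEY, data BLOB)")) {
                ddl.execute();
            }

            // Stream the payload instead of materialising it in memory.
            try (InputStream in = new FileInputStream("big-object.bin");
                 PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO blobs(id, data) VALUES (?, ?)")) {
                insert.setString(1, "object-42");
                insert.setBinaryStream(2, in);
                insert.executeUpdate();
            }
        }
    }
}
```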

Storing millions of log files - Approx 25 TB a year

As part of my work we get approximately 25 TB worth of log files annually, and currently they are saved on an NFS-based filesystem. Some are archived (zipped/tar.gz) while others reside in plain text format.
I am looking for alternatives to using an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB - something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put a small REST API in front and allow people to get a file listing, the latest files, and the ability to fetch a file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety" redundancy, so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at Gluster? It is scalable, provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern will be linear scans. One problem you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will become very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
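A rough sketch of that ingestion step with the HDFS Java API; the namenode address and paths are placeholders, and the exact configuration property names vary between Hadoop versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        conf.set("dfs.replication", "3");       // 3-way replication
        conf.set("dfs.blocksize", "67108864");  // 64 MB blocks

        try (FileSystem fs = FileSystem.get(conf)) {
            // Files go in unchanged - gzip/tar.gz or plain text, exactly as today.
            fs.copyFromLocalFile(new Path("/var/log/app/2012-03-30.log.gz"),
                                 new Path("/logs/2012/03/30/app.log.gz"));
        }
    }
}
```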
If you do choose a document database:
On CouchDB you can use the _attachments API to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. You then get a REST API for both the documents and the attachments.
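For example, a minimal sketch of uploading a log file as a CouchDB attachment over plain HTTP; the database name, document id, revision, and file path are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class CouchAttachmentUpload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // PUT /{db}/{docid}/{attachment-name}?rev={current-rev}
        // The document holds only metadata; the log file is stored as-is.
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/logs/2012-03-30/app.log"
                        + "?rev=1-967a00dff5e02add41819138abb3284d"))
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("/var/log/app/2012-03-30.log")))
                .build();

        HttpResponse<String> response = client.send(put, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```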
A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
Also HDFS is a very nice choice.