OpenStack (Swift) or CEPH deduplication feature? or any deduplication HA storage cluster solutions? - owncloud

For an ownCloud (or Nextcloud) project we need to add a large amount of storage, and I have been evaluating options such as Ceph, OpenStack Swift/Cinder, GlusterFS, SDFS and Tahoe-LAFS.
With this service we expect many users to upload copies of the same files, which is why deduplication is quite important for us. So far the only clustered-storage solutions I have found that deduplicate data are SDFS and Tahoe-LAFS. However, our concern is that these two are written in Java and Python respectively and will cost too much CPU. (Yes, deduplication will likely mean more RAM and CPU anyway.)
Perhaps one of you has a better solution?
*A deduplicating filesystem (e.g. ZFS) will not work, as the data is stored across multiple machines (HA cluster).

This is not the complete solution I think you are looking for, but rather an open-source deduplication library for Node.js, with a native binding written in C++ and a reference implementation written in JavaScript:
https://github.com/ronomon/deduplication
It should be fast enough if you implement the indexing yourself using an LSM-tree-backed KV store.
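To give a feel for the overall approach, independent of that particular library, here is a minimal sketch of deduplication indexing in Python; it uses fixed-size chunking and plain dicts standing in for the chunk store and the LSM-tree-backed index, so treat it as an illustration only:

    import hashlib

    CHUNK_SIZE = 64 * 1024   # fixed-size chunking for simplicity; the library above
                             # uses content-defined chunking, which dedupes better

    chunk_index = {}   # digest -> True; stand-in for an LSM-tree-backed KV store
    chunk_store = {}   # digest -> bytes; stand-in for the actual chunk storage

    def store_file(path):
        """Split a file into chunks, store only chunks never seen before,
        and return the file's recipe (ordered list of chunk digests)."""
        recipe = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha256(chunk).hexdigest()
                if digest not in chunk_index:   # new chunk: index and store it
                    chunk_index[digest] = True
                    chunk_store[digest] = chunk
                recipe.append(digest)           # duplicates only add a reference
        return recipe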

Related

Social-networking: Hadoop, HBase, Spark over MongoDB or Postgres?

I am architecting a social network incorporating various features, many of them powered by big-data-intensive workloads such as machine learning, e.g. recommender systems, search engines and time-series sequence matchers.
Given that I currently have very few users (around 5) but foresee significant growth, what metrics should I use to decide between:
Spark (with/without HBase over Hadoop)
MongoDB or Postgres
I am looking at Postgres as a means of reducing porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I have found its scaling and map-reduce features quite limiting.
I think you are on the right track in searching for a software stack/architecture that can:
handle different types of load: batch, real-time computing, etc.
scale in size and speed along with business growth
be a live software stack that is well maintained and supported
have common library support for domain-specific computing such as machine learning
On those merits, Hadoop + Spark can give you the edge you need. Hadoop is relatively mature by now for handling large-scale data in a batch manner. It provides reliable and scalable storage (HDFS) and computation (MapReduce/YARN). With the addition of Spark, you keep the HDFS storage and gain the real-time computing performance that Spark adds.
In terms of development, both systems are natively supported by Java/Scala. Library support and performance-tuning advice for both are abundant here on Stack Overflow and everywhere else. There are at least a few machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
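To make that concrete, here is a hypothetical PySpark/MLlib sketch of a recommender working on data sitting in HDFS (the path, schema and column names are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("recommender").getOrCreate()

    # ratings stored on HDFS as CSV: userId,itemId,rating (made-up path/schema)
    ratings = spark.read.csv("hdfs:///data/ratings.csv",
                             header=True, inferSchema=True)

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              coldStartStrategy="drop")
    model = als.fit(ratings)

    # top-10 item recommendations per user
    model.recommendForAllUsers(10).show()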
For deployment, AWS and other cloud providers offer hosted Hadoop/Spark solutions, so that is not an issue either.
I think you should separate data storage from data processing. In particular, "Spark or MongoDB?" is not quite the right question to ask; better questions are "Spark or Hadoop or Storm?" and, separately, "MongoDB or Postgres or HDFS?"
In any case, I would refrain from having the database do processing.
I have to admit that I am a little biased, but if you want to learn something new, have serious spare time, are willing to read a lot, and have the resources (in terms of infrastructure), go for HBase*; you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of atomic counters in real time.
*Alongside Hadoop, Hive, Spark...
In my opinion, it depends more on your requirements and on the data volume you will have than on the number of users (which is also a requirement). Hadoop (meaning the ecosystem: Hive/Impala, HBase, MapReduce, Spark, etc.) works fine with large amounts of data (GB/TB per day) and scales horizontally very well.
In the big-data environments I have worked with, I have always used Hadoop HDFS to store the raw data and leveraged the distributed file system to analyse the data with Apache Spark. The results were stored in a database system like MongoDB to get low-latency queries or fast aggregates with many concurrent users, and then we used Impala for on-demand analytics. The main question when using so many technologies is how to scale the infrastructure and the resources given to each one. For example, Spark and Impala consume a lot of memory (they are in-memory engines), so it is a bad idea to put a MongoDB instance on the same machine.
I would also suggest a graph database, since you are building a social network architecture, but I don't have any experience with those...
Are you looking to stay purely open source? If you are going to go enterprise at some point, a lot of the enterprise Hadoop distributions include Spark analytics bundled in.
I have a bias, but there is also the DataStax Enterprise product, which bundles Cassandra, Hadoop, Spark, Apache Solr and other components together. It is in use at many of the major internet entities, specifically for the applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
You want to think about how you will be hosting this as well.
If you are staying in the cloud, you will not have to choose; you will be able to (depending on your cloud environment, but with AWS, for example) use Spark for continuous/batch processing, Hadoop MapReduce for long-timeline analytics (analyzing data accumulated over a long time), and so on, because the storage is decoupled from the collection and processing. Put the data in S3 and process it later with whatever engine you need.
If you will be hosting the hardware, building a Hadoop cluster will give you the ability to mix hardware (heterogeneous hardware is supported by the framework), a robust and flexible storage platform, and a mix of tools for analysis, including HBase and Hive, and it has ports for most of the other things you have mentioned, such as Spark on Hadoop (not a port, actually the original design of Spark). It is probably the most versatile platform, and it can be deployed and expanded cheaply, since the hardware does not need to be the same for every node.
If you are self-hosting, going with other cluster options will force hardware requirements on you that may be difficult to scale later.
We use Spark + HBase + Apache Phoenix + Kafka + Elasticsearch, and scaling has been easy so far.
*Phoenix is a JDBC driver for HBase; it lets you use java.sql with HBase, Spark (via JdbcRDD) and Elasticsearch (via the JDBC river), which really simplifies integration.
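If you prefer Python over java.sql, the same SQL-over-HBase idea is also reachable through the phoenixdb adapter; a rough, illustrative sketch, assuming a Phoenix Query Server is running (the host, port and table name are made up):

    import phoenixdb

    conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT PRIMARY KEY, payload VARCHAR)")
    cur.execute("UPSERT INTO events VALUES (?, ?)", (1, 'hello'))  # Phoenix uses UPSERT, not INSERT
    cur.execute("SELECT id, payload FROM events")
    print(cur.fetchall())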

Implementing blob storage

I'm looking for a way to implement (provide) blob storage for an application I'm building.
What I need is the following:
Access is done using simple keys (like primary keys; I don't need a hierarchy);
Blob sizes will range from 1 KiB to 1 GiB. Both ends of that range must be fast and well supported (so systems that work on large blocks, as I believe Hadoop does, are out);
Streaming access to blobs (i.e. the ability to read random parts of a blob);
Access over REST;
No eventual consistency.
My infrastructure requirements are as follows:
Horizontally scalable, but sharding is OK (so the system does not need to support horizontal scaling natively);
High availability (so replication and automatic failover);
I can't use Azure or Google blob storage; this is a private cloud application.
I am prepared to implement such a system myself, but I would prefer an out-of-the-box system that implements this, or at least parts of it.
I have looked at Hadoop, for example, but it offers eventual consistency, so it is out. There seem to be a number of Linux DFS implementations, but these all work via mounting and I just need REST access. The range of blob sizes also looks like it makes things difficult.
What system could I use for this?
It's a pretty old post, but I am looking for pretty much the same thing. I have found the combination of GridFS and an nginx-based HTTP access module.
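For reference, the GridFS side of that is only a few lines; a rough sketch with pymongo, assuming a local MongoDB (the key, filename and offsets are made up), with the nginx module or any thin HTTP layer handling the REST part in front:

    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["blobstore"]
    fs = gridfs.GridFS(db)

    # store a blob under a simple key
    with open("backup.tar", "rb") as f:
        fs.put(f, _id="backup-2013-01")

    # stream a random part of it (GridOut objects support seek/read)
    blob = fs.get("backup-2013-01")
    blob.seek(1024 * 1024)          # skip the first MiB
    part = blob.read(64 * 1024)     # read 64 KiB from that offset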

MongoDB/CouchDB for storing files + replication?

If I wanted to store a lot of files and replicate the DB, which NoSQL database would be best for this kind of job?
I have been testing MongoDB and CouchDB, and these DBs are really nice and easy to use. If possible I would use one of them for storing files. I see the differences between Mongo and Couch, but I cannot say which one is better for storing files. When I talk about storing files I mean files of 10-50 MB, but possibly also files of 50-500 MB, and maybe a lot of updates.
I found here a nice table:
http://weblogs.asp.net/britchie/archive/2010/08/17/document-databases-compared-mongodb-couchdb-and-ravendb.aspx
I am still not sure which of these properties matter most for file storage and replication. But maybe I should choose another NoSQL DB?
That table is way out of date:
Master-slave replication has been deprecated in favour of replica sets, for starters, and the consistency column is wrong as well. You will want to completely re-read the relevant section of the MongoDB docs.
Map/Reduce is JavaScript only; there are no other languages.
I have no idea what that table means by attachments, but GridFS is a storage standard built into the drivers to make storing large files in MongoDB easier. Metadata is also supported through this mechanism.
MongoDB is on version 2.2, so anything the table says about earlier versions is now obsolete (e.g. sharding and single-server durability).
I do not have personal experience with CouchDB's interface for storing files; however, I would not be surprised if there were hardly any differences between the two. I think this part is too subjective for us to answer, and you will need to just go with whichever suits you better.
It is actually possible to build multi-regional MongoDB clusters (which S3 buckets are not, and cannot be replicated as such without extra work) and use MongoDB to replicate the most-accessed files to the clusters in a specific part of the world.
The main upshot I have found is that MongoDB can act like S3 and CloudFront put together, which is great since you get both redundant storage and the ability to distribute your data.
That being said, S3 is a very valid option here and I would seriously give it a try; you might not be looking for the same things I am in a content network.
Database storage of files does not come without serious downsides; however, speed should not be a huge problem here, since you should get roughly the same speed from S3 without CloudFront in front of it as you would from MongoDB (remember, S3 is a redundant storage network, not a CDN).
If you were to use S3, you would then store a row in your database that points to the file and holds metadata about it.
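A rough sketch of that pattern (the bucket name, table and keys are made up; boto3 here, but any S3 client would do):

    import boto3, sqlite3, hashlib

    s3 = boto3.client("s3")
    db = sqlite3.connect("files.db")
    db.execute("CREATE TABLE IF NOT EXISTS files (key TEXT PRIMARY KEY, size INTEGER, sha256 TEXT)")

    def store(path, key, bucket="my-file-bucket"):
        with open(path, "rb") as f:
            data = f.read()
        s3.put_object(Bucket=bucket, Key=key, Body=data)             # the blob goes to S3
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",  # the row just points to it
                   (key, len(data), hashlib.sha256(data).hexdigest()))
        db.commit()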
There is a project called CBFS by Dustin Sallings (one of the Couchbase founders, creator of spymemcached and a core contributor to memcached) and Marty Schoch that uses Couchbase and Go.
It is an infinite-node file store with redundancy and replication - basically your very own S3 that supports lots of different hardware and sizes. It uses REST HTTP PUT/GET/DELETE, etc., so it is very easy to use. Very fast, very powerful.
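To give an idea of how simple the REST interface is, here is a quick sketch with Python's requests library (the host, port and paths are placeholders; see the protocol wiki below for the real details):

    import requests

    base = "http://cbfs-node.example.com:8484"   # hypothetical CBFS node

    # upload a blob
    with open("db.tar.gz", "rb") as f:
        requests.put(base + "/backups/db.tar.gz", data=f)

    # fetch it back
    resp = requests.get(base + "/backups/db.tar.gz")
    with open("restored.tar.gz", "wb") as out:
        out.write(resp.content)

    # delete it
    requests.delete(base + "/backups/db.tar.gz")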
CBFS on Github: https://github.com/couchbaselabs/cbfs
Protocol: https://github.com/couchbaselabs/cbfs/wiki/Protocol
Blog Post: http://dustin.github.com/2012/09/27/cbfs.html
Diverse Hardware: https://plus.google.com/105229686595945792364/posts/9joBgjEt5PB
Other Cool Visuals:
http://www.youtube.com/watch?v=GiFMVfrNma8
http://www.youtube.com/watch?v=033iKVvrmcQ
Contact me if you have questions and I can put you in touch.
Have you considered Amazon S3 as an option? It is highly available and proven, and has redundant storage, etc.
CouchDB, even though I personally like it a lot as it works very well with Node.js, has the disadvantage that you need to compact it regularly if you do not want to waste too much disk space. In your case, if you are going to be doing a lot of updates to the same documents, that might be an issue.
I cannot really comment on MongoDB as I have not used it, but again, if file storage is your main concern, then have a look at S3 and similar services, as they are completely focused on file storage.
You could combine the two, storing your metadata in a NoSQL or SQL datastore and your actual files in a separate file store, but keeping those two stores in sync and replicated might be tricky.

Difference between Memcached and Hadoop?

What is the basic difference between Memcached and Hadoop? Microsoft seems to offer something like memcached with Windows Server AppFabric.
I know memcached is a giant key-value hash table spread across multiple servers. What is Hadoop, and how is it different from memcached? Is it used to store data? Objects? I need to save giant in-memory objects, but it seems I need some way of splitting these giant objects into "chunks", as people are suggesting. When I look into splitting the object into bytes, Hadoop keeps popping up.
I have a giant class in memory, upwards of 100 MB. I need to replicate this object and cache it in some fashion. When I look into caching this monster object, it seems I need to split it the way Google does. How does Google do this? How can Hadoop help me in this regard? My objects are not simple structured data; they have references up and down the classes inside, etc.
Any idea, pointers, thoughts, guesses are helpful.
Thanks.
memcached [ http://en.wikipedia.org/wiki/Memcached ] is a single-focus distributed caching technology.
Apache Hadoop [ http://hadoop.apache.org/ ] is a framework for distributed data processing, targeted at Google/Amazon scale: many terabytes of data. It includes sub-projects for the different areas of this problem: a distributed database, an algorithm for distributed processing, reporting/querying and a data-flow language.
The two technologies tackle different problems. One is for caching (small or large items) across a cluster. The other is for processing large volumes of data across a cluster. From your question, it sounds like memcached is more suited to your problem.
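Since the object is around 100 MB and memcached caps item sizes (1 MB by default), you would have to do the splitting yourself; a rough sketch of that idea, assuming the pymemcache client (key names are made up):

    import pickle
    from pymemcache.client.base import Client

    mc = Client(("localhost", 11211))
    CHUNK = 900 * 1024   # stay safely under memcached's 1 MB default item limit

    def cache_object(key, obj):
        blob = pickle.dumps(obj)                       # serialize the whole object graph
        chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
        mc.set(key + ":n", str(len(chunks)).encode())  # remember how many chunks there are
        for i, chunk in enumerate(chunks):
            mc.set("%s:%d" % (key, i), chunk)

    def load_object(key):
        n = int(mc.get(key + ":n"))
        blob = b"".join(mc.get("%s:%d" % (key, i)) for i in range(n))
        return pickle.loads(blob)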
Memcache won't work due to its limit on the size of stored values.
memcache FAQ. I read somewhere that this limit can be increased to 10 MB, but I am unable to find the link.
For your use case I suggest giving MongoDB a try.
MongoDB FAQ. MongoDB can be used as an alternative to memcache. It provides GridFS for storing large files in the DB.
You need pure Hadoop for what you need (no HBase, Hive, etc.). The MapReduce mechanism will split your object into many chunks and store them in Hadoop. The tutorial for MapReduce is here. However, do not forget that Hadoop is, first and foremost, a solution for massive compute and storage. In your case I would also recommend looking at Membase, which is an implementation of Memcached with additional storage capabilities. You will not be able to map-reduce with memcached/Membase, but they are still distributed, and your object can be cached in a cloud fashion.
Picking a good solution depends on the requirements of the intended use; consider the difference between storing legal documents forever and running a free music service. For example, can the objects be recreated, or are they uniquely special? Will they require further processing steps (i.e., MapReduce)? How quickly does an object (or a slice of it) need to be retrieved? The answers to these questions affect the solution set widely.
If objects can be recreated quickly enough, a simple solution might be to use Memcached, as you mentioned, across many machines totaling sufficient RAM. For adding persistence later, Couchbase (formerly Membase) is worth a look and is used in production for very large game platforms.
If objects CANNOT be recreated, determine whether S3 and other cloud file providers would meet your requirements for now. For high-throughput access, consider one of the several distributed, parallel, fault-tolerant filesystem solutions: DDN (has GPFS and Lustre gear) or Panasas (pNFS). I have used DDN gear and it had a better price point than Panasas. Both provide good solutions that are much more supportable than a DIY Backblaze.
There are some mostly free implementations of distributed, parallel filesystems, such as GlusterFS and Ceph, that are gaining traction. Ceph touts an S3-compatible gateway and can use Btrfs (a future replacement for Lustre; getting closer to production-ready). See the Ceph architecture and presentations. Gluster's advantage is the option of commercial support, although there may be vendors supporting Ceph deployments as well. Hadoop's HDFS may be comparable, but I have not evaluated it recently.
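As a side note on Ceph's S3-compatible gateway: any S3 client can talk to it by pointing at the gateway endpoint; an illustrative sketch with boto3 (the endpoint and credentials are placeholders):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://ceph-rgw.example.com:7480",   # hypothetical RADOS Gateway endpoint
        aws_access_key_id="RGW_ACCESS_KEY",
        aws_secret_access_key="RGW_SECRET_KEY",
    )

    s3.create_bucket(Bucket="objects")
    s3.put_object(Bucket="objects", Key="hello.txt", Body=b"hello ceph")
    print(s3.get_object(Bucket="objects", Key="hello.txt")["Body"].read())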

What is the best database/storage to store statistic data?

I have a system that collects real-time Apache log data from about 90-100 web servers. I have also defined some URL patterns.
Now I want to build another system that updates the number of occurrences of each pattern based on those logs.
I had thought about using MySQL to store the statistics and updating them with a statement like:
"UPDATE table SET count = count + 1 WHERE ...",
but I am afraid MySQL will be too slow for data from that many servers. Moreover, I am looking for a database/storage solution that is more scalable and simpler (as an RDBMS, MySQL supports too many things that I do not need in this situation). Do you have any ideas?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but it will have no problem handling large amounts of data.
A simpler solution would be a key-value store like Redis. It is easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes; it has no single point of failure and no bottlenecks, so it is easier to scale out.
Key-value storage seems to be an appropriate solution for my system. After taking a quick look at those stores, I am concerned about race conditions, as there will be a lot of clients trying to do these steps on the same key:
count = storage.get(key)
storage.set(key,count+1)
I had worked with Tokyo Cabinet before, and it has an 'addint' method that perfectly matches my case; I wonder if other stores have a similar feature? I did not choose Tokyo Cabinet/Tyrant because I experienced some issues with its scalability and data stability (e.g. repairing corrupted data, ...).
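For what it's worth, most of the stores mentioned above do have an atomic increment (Redis INCR/INCRBY, memcached incr, MongoDB's $inc, Cassandra counters), which avoids the get-then-set race entirely. A minimal sketch with Redis and the redis-py client (local instance, made-up key names):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def record_hit(pattern):
        # a single atomic server-side operation per occurrence; no read-modify-write race
        r.incr("count:" + pattern)

    record_hit("/api/login")
    print(r.get("count:/api/login"))   # b'1'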