Can I Put Large Binary Files in a RabbitMQ Queue? - deployment

I'm trying to design a multi-server update deployment system, and I was wondering whether there is any limit on large binary strings. For example, what happens if I put the contents of a 100MB file into the queue as a single string?
Thanks,
Pedro

I've done it and I would not necessarily recommend it. It's probably better to store the file in something like GridFS (MongoDB) and then reference the _id in the RabbitMQ message. You can then pull the file on the consumer using Mongo's interface and delete it once done.
I have this running with about 20M objects in GridFS and it's been rock solid.

Searching for "RabbitMQ Large Files" turned up a significant amount of advice on the subject.
The standard response seems to be that it should, in theory, be able to handle it, but you may find that your broker becomes unresponsive.
If you own both sides of the queue (sender/receiver), then you may consider splitting the data into more manageable chunks, e.g. 100KB each. This will be nicer to your broker. One of the search hits from above had a link to a 'streaming' sender written in Ruby, which did chunking.
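As a rough sketch of the chunking idea, assuming you control both ends and can agree on a small header: each message body carries a transfer id, a chunk index and the total chunk count, so the receiver can reassemble chunks even if they arrive out of order. The header layout and the function names are made up for illustration, and no RabbitMQ client is involved; publishing and consuming the bodies is left to whatever client you use.

```python
import struct

CHUNK_SIZE = 100 * 1024  # 100KB, the size suggested above

# Hypothetical header: 16-byte transfer id, chunk index, total chunk count.
HEADER = struct.Struct("!16sII")


def split_into_messages(payload: bytes, transfer_id: bytes,
                        chunk_size: int = CHUNK_SIZE) -> list:
    """Return message bodies, each a header followed by one chunk."""
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division
    return [
        HEADER.pack(transfer_id, index, total)
        + payload[index * chunk_size:(index + 1) * chunk_size]
        for index in range(total)
    ]


def reassemble(messages) -> bytes:
    """Rebuild the payload from message bodies received in any order."""
    chunks = {}
    expected = None
    for body in messages:
        _transfer_id, index, total = HEADER.unpack(body[:HEADER.size])
        expected = total
        chunks[index] = body[HEADER.size:]
    if expected is None or len(chunks) != expected:
        raise ValueError("missing chunks")
    return b"".join(chunks[i] for i in range(expected))
```

A real receiver would also group incoming bodies by transfer id and time out incomplete transfers; that part is omitted here.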
If you do not own both sides of the queue, then consider using a form of 'claim check', where your message contains only the location of the large blob/file/data, which itself lives in a storage location better suited to it.
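A minimal sketch of the claim check idea, assuming a shared directory stands in for the real blob storage (GridFS, S3 or similar). The JSON field names and the `check_in`/`redeem` helpers are invented for this example; only the small JSON body would travel through the broker.

```python
import hashlib
import json
from pathlib import Path


def check_in(blob: bytes, store_dir: Path) -> bytes:
    """Store the blob and return a small message body that points at it."""
    digest = hashlib.sha256(blob).hexdigest()
    path = store_dir / digest
    path.write_bytes(blob)
    claim = {"location": str(path), "sha256": digest, "size": len(blob)}
    return json.dumps(claim).encode("utf-8")


def redeem(message_body: bytes, delete_after: bool = True) -> bytes:
    """Fetch the blob named in the message, verify it, optionally delete it."""
    claim = json.loads(message_body.decode("utf-8"))
    path = Path(claim["location"])
    blob = path.read_bytes()
    if hashlib.sha256(blob).hexdigest() != claim["sha256"]:
        raise ValueError("stored blob does not match the claim")
    if delete_after:
        path.unlink()
    return blob
```

The message stays a few hundred bytes no matter how large the file is, which is the point of the pattern.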
This could be pretty interesting background information: http://rabbitmq.1065348.n5.nabble.com/Can-RabbitMQ-handle-big-messages-tt566.html#a569

Related

MongoDB as file storage

I'm trying to find the best solution for scalable storage of big files. File sizes can range from 1-2 megabytes up to 500-600 gigabytes.
I have found some information about Hadoop and its HDFS, but it looks a little complicated, because I don't need Map/Reduce jobs or many of its other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution.
And now the questions:
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
Will files from GridFS be cached in RAM, and how will that affect read/write performance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFS implementation is totally client side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as for any other query: MongoDB loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ), which is the set of data (or all loaded data at that time) MongoDB requires within a given time frame to maintain optimal performance. It does this by paging it into RAM (well, technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary between drivers; however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection, which only houses the files' metadata, allowing you to later serve the file itself from the chunks collection with a single query.
However that is not the important thing, you want to serve the file itself, including its data; this means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from GridFS be cached in RAM, and how will that affect read/write performance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers will not have 600 GB of RAM, yet it is quite normal to house a single 600 GB file on a single mongod instance. This creates a problem: in order to be served, that file needs to fit into your working set, but it is far bigger than your RAM. At this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ), whereby the server is page faulting 24/7 trying to load the file. Writes here are no better.
The only way around this is to start spreading a single file across many shards :\.
Note: one more thing to consider is that the default size of a "chunk" in the chunks collection is 256KB, so that's a lot of documents for a 600GB file. This setting is configurable in most drivers.
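To put a number on "a lot of documents", here is a quick back-of-the-envelope calculation. It assumes 1GB means 1024**3 bytes and uses the 256KB default quoted above; the 4MB alternative is just an arbitrary larger setting for comparison.

```python
CHUNK_SIZE = 256 * 1024        # 256KB per chunk, the default quoted above
FILE_SIZE = 600 * 1024 ** 3    # 600GB, taking 1GB as 1024**3 bytes


def chunk_count(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunk documents needed to hold file_size bytes."""
    return (file_size + chunk_size - 1) // chunk_size  # ceiling division


print(chunk_count(FILE_SIZE))                    # 2457600
print(chunk_count(FILE_SIZE, 4 * 1024 * 1024))   # 153600
```

So a single 600GB file is roughly 2.5 million chunk documents at the default size, and about 150 thousand with 4MB chunks.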
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
GridFS, being only a specification, uses the same locks as any other collection: both read and write locks at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?
That being said the possibility for contention exists based on your scenario specifics, traffic, number of concurrent writes/reads and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I personally have found that S3 (as @mluggy said) in reduced redundancy format works best, storing only a small amount of metadata about the file within MongoDB. It is much like using GridFS but without the chunks collection: let S3 handle all the distribution, backup and other stuff for you.
Hopefully I have been clear, hope it helps.
Edit: Unlike what I accidentally said, MongoDB does not have a collection level lock, it is a database level lock.
Have you considered saving meta data onto MongoDB and writing actual files to Amazon S3? Both have excellent drivers and the latter is highly redundant, cloud/cdn-ready file storage. I would give it a shot.
I'll start by answering the first two:
There is a write lock when writing in to GridFS, yes. No lock for reads.
The files won't be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100MB. Even so, it is generally a best practice to do some level of chunking for huge files. There are a lot of bad things that can happen when transferring files into DBs, from network timeouts to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

messaging service: redis or mongodb?

I am working on a messaging system that is a bit more advanced than simply sending and receiving messages; it is something that looks like Facebook chat/messaging: it has chat aspects but also messaging ones, like group messages, read/unread messages, and others.
On redis, I would simply use lists to store received messages, for example like this:
myID = [ "amy|how are you?", "frank|long time no see!" ]
amyID = [ "john|I'm good! you?" ]
(I have simplified it all a lot for easier reading.)
But this way I would not be able to keep track of single conversations, as the messages are always flushed once they have been received (so basically no "inbox" feature).
On the other hand, if I use mongodb, I could use something like this: How to keep track of a private messaging system using MongoDB?
I thought of the following benefits/disadvantages:
MONGODB
advantages:
can see inbox view
can check read/unread messages on each conversation
disadvantages:
not as fast as redis
storage size increases a lot
REDIS
advantages:
easy to pick up new messages
no storage problems (messages are flushed)
disadvantages:
once messages are sent to the client they are lost, so no read/unread features and no inbox
Any ideas?
Thanks in advance.
I cannot answer for Redis because I don't use it and never have so I won't pretend I have.
However, if for some reason you are not using something like XMPP (aka Jabber) for chat, as Facebook does: http://www.ibm.com/developerworks/xml/tutorials/x-realtimeXMPPtut/section3.html then I will describe a pure MongoDB solution for this situation.
MongoDB relies on the OS's LRU paging to cache documents and queries. Fair enough, it provides no direct query cache; however, if you are smart you will not need one, and you will instead serve all your queries directly from RAM. With this in mind MongoDB can be just as fast as Redis, since Redis uses the computer's RAM too.
The speed difference between the two on an optimised query is negligible, I would think. The true measure of speed comes from your schema, indexes, cluster setup and the queries you perform.
A note about storage size here, taking your comment into consideration:
the problem with flushing mongodb is bigger than I initially thought: apparently when you delete something on mongo you only delete its reference, so if you delete 4MB of documents, it won't free up that much space. the only way to actually free up that memory is to run a dbRepair (or something along these lines) that basically blocks the db while running....
You seem to have some misconceptions about exactly how MongoDB works.
This link will be of help to you: http://www.10gen.com/presentations/storage-engine-internals It describes some of the reasons why excessive disk space is used and also explains some of the misconceptions you have about how a computer works and how MongoDB frees space and reuses it.
MongoDB does not free space on a record level. Instead it takes that "empty" record (record and document are two different things, as the presentation will tell you), puts it into a deleted bucket list and then reuses that space when a new document (or an updated document that has been moved) comes along and fits in that space.
It is true that if you are not careful and do not understand how MongoDB works at this level, you will probably be forced to run repairDatabase fairly regularly to maintain any sort of performance after fragmentation.
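To make the deleted-list behaviour a bit more concrete, here is a toy model in Python. It is only an illustration of the idea described above (deleted records keep their space and are reused by later documents that fit); the class and its first-fit strategy are assumptions for the example and are not how MongoDB's storage engine is actually implemented.

```python
class ToyDataFile:
    """Toy model of record reuse; not MongoDB's actual storage engine."""

    def __init__(self):
        self.records = []  # each entry is [record_size, document or None]
        self.free = []     # indexes of deleted records available for reuse

    def insert(self, doc: bytes) -> int:
        # First fit: reuse the first deleted record that is large enough.
        for pos, idx in enumerate(self.free):
            if self.records[idx][0] >= len(doc):
                self.free.pop(pos)
                self.records[idx][1] = doc
                return idx
        # Nothing fits, so the file has to grow.
        self.records.append([len(doc), doc])
        return len(self.records) - 1

    def delete(self, idx: int) -> None:
        # The space is not returned; the record only goes on the free list.
        self.records[idx][1] = None
        self.free.append(idx)

    def allocated_bytes(self) -> int:
        return sum(size for size, _ in self.records)
```

In this model, deleting a 100-byte document leaves `allocated_bytes()` unchanged, an 80-byte insert afterwards reuses that slot (wasting 20 bytes), and a 200-byte insert has to grow the file. That leftover and unusable space is the fragmentation mentioned above.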
As for memory handling. The OS handles this as I said. A good explanation of when the OS will free memory is on Wikipedia: http://en.wikipedia.org/wiki/Paging
Until there is not enough RAM to store all the data needed, the process of obtaining an empty page frame does not involve removing another page from RAM.
So the OS will handle removing pages for you and you shouldn't concern yourself with that part, instead you should be concerned with making your working set fit into RAM.
If you are worried about storing messages and don't really want to, i.e. you want them to be "flushed", you can use the TTL feature that comes with newer MongoDB versions: http://docs.mongodb.org/manual/tutorial/expire-data/ It basically allows you to set a time-out after which a message is deleted from the collection.
So personally I think that, if set up right, MongoDB could do messaging and chat like Facebook does. Of course they use the XMPP protocol and then archive messages into Cassandra for search, but you don't have to do it like they do; that is just one way to achieve the same goal.
Hope this makes sense and I haven't gone round in circles, it is a bit of a long answer.
I think the big point here is the storage problem. You would need a lot of machines or a good system for flushing some conversations in order to use MongoDB. Despite you wanting a sort of "inbox" system... I think Redis would be more conducive to a well-working chat system - you just need to come up with some very creative workaround... or give up that design goal.
We use a mixed design, so when we need snappy performance, as in messages, queues and caches, it's on Redis, and when we need to search on secondary indexes or update whole documents, we use MongoDB.
You can also try Riak, which can grow more linearly and smoothly than MongoDB.

Is Cassandra good for storing files?

I'm developing a PHP platform that will make heavy use of images, documents and any other file format that comes to mind, so I was wondering if Cassandra is a good choice for my needs.
If not, can you tell me how I should store files? I'd like to keep using Cassandra because it's fault-tolerant and uses auto-replication among nodes.
Thanks for help.
From the cassandra wiki,
Cassandra's public API is based on Thrift, which offers no streaming abilities;
any value written or fetched has to fit in memory. This is inherent to Thrift's
design and is therefore unlikely to change. So adding large object support to
Cassandra would need a special API that manually split the large objects up
into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever
size you are comfortable with -- at least one person is using 64MB -- and making a file correspond
to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
You should be OK with files of 10MB. In fact, DataStax Brisk puts a filesystem on top of Cassandra if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way- this isn't an ad)
As more recent information, Netflix provides utilities for their Cassandra client, Astyanax, for storing files as handled object stores. Description and examples can be found here. Writing some tests using Astyanax can be a good starting point for evaluating Cassandra as file storage.

What's a suitable storage (RDBMS, NoSQL) for caching web site responses?

We're in the process of building an internal, Java-based RESTful web services application that exposes domain-specific data in XML format. We want to supplement the architecture and improve performance by leveraging a cache store. We expect to host the cache on separate but collocated servers, and since the web services are Java/Grails, a Java or HTTP API to the cache would be ideal.
As requests come in, unique URI's and their responses would be cached using a simple key/value convention, for example...
KEY VALUE
http://prod1/financials/reports/JAN/2007 --> XML response of 50MB
http://prod1/legal/sow/9004 --> XML response of 250KB
Response values for a single request can be quite large, perhaps up to 200MB, but could be as small as 1KB. The number of requests per day is small: not more than 1000, averaging 250. We don't have a large number of consumers; again, it's an internal app.
We started looking at MongoDB as a potential cache store, but given that MongoDB has a max document size of 8 or 16MB, we did not feel it was the best fit.
Based on the limited details I provided, any suggestions on other types of stores that could be suitable in this situation?
The way I understand your question, you basically want to cache the files, i.e. you don't need to understand the files' contents, right?
In that case, you can use MongoDB's GridFS to cache the xml as a file. This way, you can smoothly stream the file in and out of the database. You could use the URI as a 'file name' and, well, that should do the job.
There are no (reasonable) file size limits and it is supported by most, if not all, of the drivers.
Twitter's engineering team just blogged about their SpiderDuck project that does something like what you're describing. They use Cassandra and Scribe+HDFS for their backends.
http://engineering.twitter.com/2011/11/spiderduck-twitters-real-time-url.html
The simplest solution here is just caching these pieces of data in a file system. You can use tmpfs to ensure everything is in main memory, or any normal file system if you want your cache to be larger than the memory you have. Don't worry, even in the latter case the OS kernel will efficiently keep everything that is used frequently in main memory. You still have to delete the old files yourself, e.g. via cron if you're using Linux.
It may seem like an old-school solution, but it could be simpler to implement and less error prone than many others.
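A small sketch of what such a file-system cache could look like, assuming the URI is the key as in the question. The `FileCache` class, its method names and the choice of a SHA-256 hash as the file name are all made up for this example; point the directory at a tmpfs mount if you want everything in memory.

```python
import hashlib
import os
import tempfile
import time
from pathlib import Path


class FileCache:
    """Cache responses as files named after a hash of the request URI."""

    def __init__(self, directory: Path):
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, uri: str) -> Path:
        # Hashing avoids slashes and length limits in file names.
        return self.directory / hashlib.sha256(uri.encode("utf-8")).hexdigest()

    def put(self, uri: str, response: bytes) -> None:
        # Write to a temporary file, then rename it into place, so a
        # reader never sees a half-written response.
        fd, tmp_name = tempfile.mkstemp(dir=str(self.directory))
        with os.fdopen(fd, "wb") as handle:
            handle.write(response)
        os.replace(tmp_name, self._path(uri))

    def get(self, uri: str):
        try:
            return self._path(uri).read_bytes()
        except FileNotFoundError:
            return None

    def purge_older_than(self, max_age_seconds: float) -> int:
        """Delete entries older than max_age_seconds; return how many."""
        cutoff = time.time() - max_age_seconds
        removed = 0
        for entry in self.directory.iterdir():
            if entry.is_file() and entry.stat().st_mtime < cutoff:
                entry.unlink()
                removed += 1
        return removed
```

`purge_older_than` plays the role of the cron job mentioned above; a scheduled call to it, or a plain `find ... -delete` from cron, would do the same thing.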

Storing millions of log files - Approx 25 TB a year

As part of my work we get approx 25TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived (zipped/tar.gz) while others reside in plain text format.
I am looking for alternatives to using an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log files' content would need to be changed to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log files' content as is.
As for usage we intend to put a small REST API and allow people to get file listing, latest files, and ability to get the file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at application level where one can store log files and can scale horizontally effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can find a lot of huge storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention. Most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them: your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
If you are to choose a document database:
On CouchDB you can use the attachments API (_attachments) to attach the file as is to a document; the document itself could contain only metadata (timestamp, locality, etc.) for indexing. You will then have a REST API for both the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would have to build the API yourself.
Also HDFS is a very nice choice.