Storing lots of files - postgresql

I need to store a lot of files (like millions per day). On average, a file is 20 KB. I also need to store some meta-data for these files (date, source, classification etc.) I need to be able to access and retrieve the files according to queries on metadata (No joins, only filtering with WHERE clauses). Writes must be fast, read times are not as important.
As far as I understand, I have 3 possible ways of storing data:
Use an RDBMS (e.g. PostgreSQL) to store meta-data and store file paths. Execute queries then read matching files from file system
Use only Cassandra (my company uses Cassandra). Store meta-data and file content on Cassandra.
Use Postgres + Cassandra together. Store meta-data and Cassandra keys on Postgres, query Postgres and retrieve Cassandra keys,then get actual file content from Cassandra
What are the advantages, disadvantages for these options? I am thinking I should go with option 2 but cannot be sure.
Thanks

It really depends on the size of your files. Storing large files in Cassandra generally isn't the best solution. You'd have to chunk your files at some point to store the content in separate columns using wide rows. In this case it would be better to use a distributed file system like ceph.
But if files are only 20k, the overhead using a distributed FS will not be worth it and Cassandra will do a good job storing this amount of data in a single column as a blob. You just need to be aware of the memory footprint while working with those rows. Each time you retrieve such a file from Cassandra, the whole content will be loaded into the heap, if you don't use chunking using a clustering key.

Related

Is it efficient to store images inside MongoDB using GridFS?

I know how to do it, but I wonder if it's effective. As I know MongoDB has very efficient clusters and I can flexibly control the collections and the servers they reside on. The only problem is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop or if I intelligently cluster MongoDB, will I get similar access speed results?
GridFS is provided for convenience, it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then, the actual blob is split into 16 MB chunks and uploaded as a sequence of documents into the blob collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata.
Thus, on first glance, the problem is solved - the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such they compete for cache space with the actual documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transfered together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
The good practice is to upload image somewhere (your server or cloud), and then only store image url in MongoDB.
Anyway, I did a little investigating. The short conclusion is: if you need to store user avatars you can use MongoDB, but only if it's a single avatar (You can't store many blobs inside MongoDB) and if you need to store videos or just many and heavy files, then you need something like CephFS.
Why do I think so? The thing is, when I was testing with MongoDB and media files on a slow instance, files weighing up to 10mb(Usually about 1 megabyte) were coming back at up to 3000 milliseconds. That's an unacceptably long time. When there were a lot of files (100+), it could turn into a pain. A real pain.
Ceph is designed just for storing files. To store petabytes of information. That's what's needed.
How do you implement this in a real project? If you use the OOP implementation of MongoDB(Mongoose), you can just add methods to the database objects that access Ceph and do what you need. You can make methods "load file", "delete file", "count quantity" and so on, and then just use it all together as usual. Don't forget to maintain Ceph, add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly, i.e. the web server should throw a request to Ceph when the user needs to give the file and return the response from Ceph to the user.
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!
GridFS
Ceph File System
More Ceph

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (with GCP) for storing large set of text files of different sizes (1kb~100mb) about different subjects. One fileset could be more than 10GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP, and after pre-processing, as pre-processed data are saved separately, will not be accessed frequently. So saving all files in MongoDB seems unnecessary.
I'm considering
saving all files into some cloud storage,
save file information like name and path to MongoDB as JSON.
The above folders turn to:
{
name: dataset_about_some_subject,
path: path_to_cloud_storage,
files: [
{
name: file1.txt
...
},
...
]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum object size of 16 MB, which is below the maximum size of the files (100 MB). This means that, unless one splits, storing the original files in plaintext JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure, is common, not very expensive, and does not require a lot of maintenance comparing to having your own HDFS cluster. I/O would be best by performing the computations on the machines of the same provider, and making sure the machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend you to store large file using storage service provide by below. It also support Multi-regional access through CDN to ensure the speed of file access.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that for the metadata storage you propose in mongodb, speed will not be a problem.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), datatransfer over public network for every access (might be a performance problem)
Mongodb-Gridfs: already in place, operation cost varies, data transfer is just as fast as from mongo itself
Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer in local network (provided you build it on-premise.) Specialized administration skills needed. Possibility to use the cluster for parrallel calculations (i.e. this is not only storage, this is also computing power.) (As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwile.)
If you are not sure about the amount of data you cover, and just want to get started, I recommend starting out with gridfs, but encapsulate in a way that you can easily exchange the storage.
I have another answer: as you say, 10GB is really not big at all. You may want to also consider the option of storing it on your local computer (or locally on one single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop, Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

Store videos and images dataset in Apache Xindice in XML format

To have a large dataset with images and videos, I would like to use Apache Xindice. There are very few tutorials and guides on WWW for Apache Xindece. How to store image and video files in Apache Xindice? Is Apache Xindice suitable to stroe large set of data? Is there any latest repository which can store large set of data in XML format (Not SQL type of databases. Should save TB size data)? Can I use MongoDB for storing large dataset?
I suggest to store external documents (images/videos, XML files) in MongoDB, using the GridFS file system. GirdFS collection consist of two parts: the chunks collection, where the binary data are stored, and the files collection, holding the information about the files, including customer defined metadata. From the FAQ:
In some situations, storing large files may be more efficient in a
MongoDB database than on a system-level filesystem.
If your filesystem limits the number of files in a directory, you can
use GridFS to store as many files as needed. When you want to keep
your files and metadata automatically synced and deployed across a
number of systems and facilities.
When using geographically
distributed replica sets MongoDB can distribute files and their
metadata automatically to a number of mongod instances and
facilitates.
When you want to access information from portions of
large files without having to load whole files into memory, you can
use GridFS to recall sections of files without reading the entire file
into memory.
For large data sets, GridFS can be sharded (see http://docs.mongodb.org/manual/core/sharded-cluster-internals/#sharding-gridfs-stores).
For fast delivery of GridFS data, there are modules for ngnix (ngnix-gridfs) and Apache (mod_gridfs). See also http://nosql.mypopescu.com/post/28085493064/mongodb-gridfs-over-http-with-mod-gridfs for a quick comparison

Is Cassandra good for storing files?

I'm developing a php platform that will make huge use of images, documents and any file format that will come in my mind so i was wondering if Cassandra is a good choice for my needs.
If not, can you tell me how should i store files? I'd like to keep using cassandra because it's fault-tolerant and uses auto-replication among nodes.
Thanks for help.
From the cassandra wiki,
Cassandra's public API is based on Thrift, which offers no streaming abilities
any value written or fetched has to fit in memory. This is inherent to Thrift's
design and is therefore unlikely to change. So adding large object support to
Cassandra would need a special API that manually split the large objects up
into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever
size you are comfortable with -- at least one person is using 64MB -- and making a file correspond
to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
You should be OK with files of 10MB. In fact, DataStax Brisk puts a filesystem on top of Cassandra if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way- this isn't an ad)
As fresh information, Netflix provides utilities for their cassandra client called astyanax for storing files as handled object stores. Description and examples can be found here. It can be a good starting point to write some tests using astyanax and evaluate Cassandra as a file storage.

Storing millions of log files - Approx 25 TB a year

As part of my work we get approx 25TB worth log files annually, currently it been saved over an NFS based filesystem. Some are archived as in zipped/tar.gz while others reside in pure text format.
I am looking for alternatives of using an NFS based system. I looked at MongoDB, CouchDB. The fact that they are document oriented database seems to make it the right fit. However the log files content needs to be changed to JSON to be store into the DB. Something I am not willing to do. I need to retain the log files content as is.
As for usage we intend to put a small REST API and allow people to get file listing, latest files, and ability to get the file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at application level where one can store log files and can scale horizontally effectively by adding more machines.
Ankur
Since you dont want queriying features, You can use apache hadoop.
I belive HDFS and HBase will be nice fit for this.
You can see lot of huge storage stories inside Hadoop powered by page
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly disrecommend using a key/value or document based store for this data (mongo, cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be linear scan. One thing problem that you will run into is retention. Most of the "NoSQL" storage systems use logical delete, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
If you are to choose a document database:
On CouchDB you can use the _attachement API to attach the file as is to a document, the document itself could contain only metadata (like timestamp, locality and etc) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFs, but you would build the API yourself.
Also HDFS is a very nice choice.