Do any common OS file systems use hashes to avoid storing the same content data more than once? - hash

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.
It seems this would be a sensible thing to do in a OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.

ZFS supports deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.

It would save space, but the time cost is prohibitive. The products you mention are already io bound, so the computational cost of hashing is not a bottleneck. If you hashed at the filesystem level, all io operations which are already slow will get worse.

NTFS has single instance storage.

NetApp has supported deduplication (that's what its called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in the enterprise filesystems today (and NetApp stands out because they support this on their primary storage also as compared to other similar products which support it only on their backup or secondary storage; they are too slow for primary storage).
The amount of data which is duplicate in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% data deduplicated have been seen often, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what the filesystem created in that LUN. Cheers. Find out how it works here and here.

btrfs supports online de-duplication of data at the block level. I'd recommend duperemove as an external tool is needed.

It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy, while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to give COW semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, that the permissions were correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives current running less than $100US/terabyte, I find it hard to believe that this would save most people a whole dollar worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on a basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy of the block to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.

Related

Key Value storage without a file system?

I am working on an application, where we are writing lots and lots of key value pairs. On production the database size will run into hundreds of Terabytes, even multiple Petabytes. The keys are 20 bytes and the value is maximum 128 KB, and very rarely smaller than 4 KB. Right now we are using MongoDB. The performance is not very good, because obviously there is a lot of overhead going on here. MongoDB writes to the file system, which writes to the LVM, which further writes to a RAID 6 array.
Since our requirement is very basic, I think using a general purpose database system is hitting the performance. I was thinking of implementing a simple database system, where we could put the documents (or 'values') directly to the raw drive (actually the RAID array), and store the keys (and a pointer to where the value lives on the raw drive) in a fast in-memory database backed by an SSD. This will also speed-up the reads, as all there would not be no fragmentation (as opposed to using a filesystem.)
Although a document is rarely deleted, we would still have to maintain a pool of free space available on the device (something that the filesystem would have provided).
My question is, will this really provide any significant improvements? Also, are there any document storage systems that do something like this? Or anything similar, that we can use as a starting poing?
Apache Cassandra jumps to mind. It's the current elect NoSQL solution where massive scaling is concerned. It sees production usage at several large companies with massive scaling requirements. Having worked a little with it, I can say that it requires a little bit of time to rethink your data model to fit how it arranges its storage engine. The famously citied article "WTF is a supercolumn" gives a sound introduction to this. Caveat: Cassandra really only makes sense when you plan on storing huge datasets and distribution with no single point of failure is a mission critical requirement. With the way you've explained your data, it sounds like a fit.
Also, have you looked into redis at all, at least for saving key references? Your memory requirements far outstrip what a single instance would be able to handle but Redis can also be configured to shard. It isn't its primary use case but it sees production use at both Craigslist and Groupon
Also, have you done everything possible to optimize mongo, especially investigating how you could improve indexing? Mongo does save out to disk, but should be relatively performant when optimized to keep the hottest portion of the set in memory if able.
Is it possible to cache this data if its not too transient?
I would totally caution you against rolling your own with this. Just a fair warning. That's not a knock at you or anyone else, its just that I've personally had to maintain custom "data indexes" written by in house developers who got in way over their heads before. At my job we have a massive on disk key-value store that is a major performance bottleneck in our system that was written by a developer who has since separated from the company. It's frustrating to be stuck such a solution among the exciting NoSQL opportunities of today. Projects like the ones I cited above take advantage of the whole strength of the open source community to proof and optimize their use. That isn't something you will be able to attain working on your own solution unless you make a massive investment of time, effort and promotion. At the very least I'd encourage you to look at all your nosql options and maybe find a project you can contribute to rather than rolling your own. Writing a database server itself is definitely a nontrivial task that needs a huge team, especially with the requirements you've given (but should you end up doing so, I wish you luck! =) )
Late answer, but for future reference I think Spider does this

MongoDB as file storage

i'm trying to find the best solution to create scalable storage for big files. File size can vary from 1-2 megabytes and up to 500-600 gigabytes.
I have found some information about Hadoop and it's HDFS, but it looks a little bit complicated, because i don't need any Map/Reduce jobs and many other features. Now i'm thinking to use MongoDB and it's GridFS as file storage solution.
And now the questions:
What will happen with gridfs when i try to write few files
concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFs implementation is totally client side within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself, effectively MongoDB itself does not even understand they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary, however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection which only houses the files meta data allowing you to later serve the file itself from the chunks collection with a single query.
However that is not the important thing, you want to serve the file itself, including its data; this means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers will not have 600 GB of RAM and it is likely, quite normal in fact, to house a 600 GB partition of a single file on a single mongod instance. This creates a problem since that file, in order to be served, needs to fit into your working set however it is impossibly bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ) whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better as well.
The only way around this is to starting putting a single file across many shards :\.
Note: one more thing to consider is that the default average size of a chunks "chunk" is 256KB, so that's a lot of documents for a 600GB file. This setting is manipulatable in most drivers.
What will happen with gridfs when i try to write few files concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
GridFS, being only a specification uses the same locks as on any other collection, both read and write locks on a database level (2.2+) or on a global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?
That being said the possibility for contention exists based on your scenario specifics, traffic, number of concurrent writes/reads and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I personally have found that S3 (as #mluggy said) in reduced redundancy format works best storing a mere portion of meta data about the file within MongoDB, much like using GridFS but without the chunks collection, let S3 handle all that distribution, backup and other stuff for you.
Hopefully I have been clear, hope it helps.
Edit: Unlike what I accidently said, MongoDB does not have a collection level lock, it is a database level lock.
Have you considered saving meta data onto MongoDB and writing actual files to Amazon S3? Both have excellent drivers and the latter is highly redundant, cloud/cdn-ready file storage. I would give it a shot.
I'll start by answering the first two:
There is a write lock when writing in to GridFS, yes. No lock for reads.
The files wont be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100mb. Despite that, it generally is a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files in to DBs- From network time outs, to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

PostgreSQL temporary table cache in memory?

Context:
I want to store some temporary results in some temporary tables. These tables may be reused in several queries that may occur close in time, but at some point the evolutionary algorithm I'm using may not need some old tables any more and keep generating new tables. There will be several queries, possibly concurrently, using those tables. Only one user doing all those queries. I don't know if that clarifies everything about sessions and so on, I'm still uncertain about how that works.
Objective:
What I would like to do is to create temporary tables (if they don't exist already), store them on memory as far as that is possible and if at some point there is not enough memory, delete those that would be committed to the HDD (I guess those will be the least recently used).
Examples:
The client will be doing queries for EMAs with different parameters and an aggregation of them with different coefficients, each individual may vary in terms of the coefficients used and so the parameters for the EMAs may repeat as they are still in the gene pool, and may not be needed after a while. There will be similar queries with more parameters and the genetic algorithm will find the right values for the parameters.
Questions:
Is that what "on commit drop" means? I've seen descriptions about
sessions and transactions but I don't really understand those
concepts. Sorry if the question is stupid.
If it is not, do you know about any simple way to get Postgres to do
this?
Workaround:
In the worst case I should be able to make a guesstimation about how many tables I can keep on memory and try to implement the LRU by myself, but it's never going to be as good as what Postgres could do.
Thank you very much.
This is a complicated topic and probably one to discuss in some depth. I think it is worth both explaining why PostgreSQL doesn't support this and also what you can do instead with recent versions to approach what you are trying to do.
PostgreSQL has a pretty good approach to caching diverse data sets across multiple users. In general you don't want to allow a programmer to specify that a temporary table must be kept in memory if it becomes very large. Temporary tables however are managed quite differently from normal tables in that they are:
Buffered by the individual back-end, not the shared buffers
Locally visible only, and
Unlogged.
What this means is that typically you aren't generating a lot of disk I/O for temporary tables. The tables do not normally flush WAL segments, and they are managed by the local back-end so they don't affect shared buffer usage. This means that only occasionally is data going to be written to disk and only when necessary to free memory for other (usually more frequent) tasks. You certainly aren't forcing disk writes and only need disk reads when something else has used up memory.
The end result is that you don't really need to worry about this. PostgreSQL already tries, to a certain extent, to do what you are asking it to do, and temporary tables have much lower disk I/O requirements than standard tables do. It does not force the tables to stay in memory though and if they become large enough, the pages may expire into the OS disk cache, and eventually on to disk. This is an important feature because it ensures that performance gracefully degrades when many people create many large temporary tables.

Reason for monolithic data files

Primarily this seems to be a technique used by games, where they have all the sounds in one file, textures in another etc. With these files commonly reaching the GB size.
What is the reason behind doing this over maintaining it all in subdirectories as small files - one per texture which many small games use this, with the monolithic system being favoured by larger companies?
Is there some file system overhead with lots of small files?
Are they trying to protect their property - although most just seem to be a compressed file with a new extension?
The reasons we use an "archive" system like this where I work (a game development company):
lookup speed: We rarely need to iterate over files in a directory; we're far more often looking them up directly by name. By using a custom "file allocation table" that is essentially just a sequence of hash( normalized_filename ) -> [ offset, size ], we can look up files very quickly. We can also keep this index in RAM, potentially interleave it with other index tables, etc.
(When we do need to iterate, we can either easily iterate over all files in a .arc, or we can store a list of filenames, a list of hash-of-filenames, or just a list of [ offset, size ] pairs somewhere -- maybe even as a file in the archive. This is usually faster than directory-traversal on a FS.)
metadata: It's easy for us to tuck in any file metadata we want. For example, a single bit in the "size" field indicates whether the file is compressed or not (if it is, it has a header with more details about how to decompress it). We can even vary compression on pieces of a file if we know enough about the structure of the file ahead of time (we do this for sprite archives).
size: One of the devices we use has a "file size must be a multiple of X" requirement, where X is large compared to some of our files. For example, some of our lua scripts end up being just a few hundred bytes when compiled; taking extra overhead per .luc file adds up quickly.
alignment: on the other hand, sometimes we want to waste space. To take advantage of faster streaming (e.g. background DMA) from the filesystem, some of our files do want to obey certain alignment/size requirements. We can take care of that right in the tool, and the align/size we're shooting for doesn't necessarily have to line up with the underlying FS, allowing us to waste space only where we need it.
But those are the mundane reasons. The more fun stuff:
Each .arc registers in a list, and attempts to open a file know to look in the arcs. We search already-in-RAM archives first, then archives on the device FS, then the actual device FS. This gives us a ton of flexibility:
dynamic additions to the filesystem: at any time we can stream a new file or archive to the machine in question (over the network or the like) and have it appear as part of the "logical" filesystem; this is great when the actual FS resides in ROM or on a CD, and allows us to iterate much more quickly than we could otherwise.
(Doom's .wad system is a sort of example of the above, which allows modders to more easily override assets and scripts built into the game.)
possibility of no underlying fs: It's possible to use bin2obj to embed an entire arc directly in the executable (.rodata) at link time, at which point you don't ever need to look at the device FS -- we do this for certain small demo builds and the like. We can also send levels across the network or savegame-sneakernet this way. =)
organization and load/unload: since we can load and unload and override virtual "pieces" of our filesystem at any time, we can do some performance tricks with having the number of files in the FS be very small at any given time. We can additionally specify that an entire archive be loaded into memory, index table and data; our file load code is smart enough to know that if the file is already in memory, it doesn't need to do anything to read it other than move a pointer around. Some of the higher level code can actually detect that the file is in ram and just ask for the probably-already-looks-like-a-struct pointer directly.
portability: we only need to figure out how to get a few files on each new device we use, and then the remainder of the FS code is more or less the same. =) We do change the tool output a bit occasionally (for alignment reasons), but most of the processing remains the same.
de-duplication: with smarter archives, such as our sprite archives, we can (and do) de-duplicate data. If "jump" animation's fifth frame and "kick"'s third frame are the same, we can pull apart the file and only store one copy of that frame. We can do the same for whole files.
We ported a PC game to a system with much slower FS access recently. We didn't change the data format, and it turns out iterating through a dir on the raw device FS to load a hundred small XML files was absolutely killing our load times. The solution we used was to take each dir, make it into its own subdir.arc, and stick it in the master game.arc compressed. When the dir was needed (something like opendir was called) we decompressed the entire subdir.arc into RAM, added it to the filesystem, then iterated through it super-quickly.
It's the ability to throw something like this together in a few hours, and to ease the pain of porting across systems, that makes stuff like this worthwhile.
File systems do have an overhead. Usually, a file takes disk space rounded up to some power of 2 (e.g. up to 4 KB), so many small files would waste space. Some modern file systems try to mitigate that, but AFAIK it's not widespread yet. Additionally, file systems are often quite slow when accessing multiple files. E.g. it is usually considerably faster to copy one 400 MB file than 4000 100 KB files.
File systems come in handy when you have to modify files, because they handle changing file sizes much better than any simple home-grown solution. However, that's certainly not the case for constant game data.
On Apple systems, the most common way is to use, as you suggest, directories. They are called Bundles, and are in the Finder represented as just one file, but if you explore them more, they're actually directories. This makes writing code and conserving memory when loading individual items out of this bundle very easy. :-) Also, this makes taking incremental backups of gigantic databases easy, as for instance your iPhoto database is just a bundle, so you just backup changed and new files
On Windows, however, I believe this is much harder to do, it will look like a directory "no matter what" (I'm sure smart people have found a solution that will make Explorer see certain directories as a single file, but it's not common).
From a games developer point of view, you're not dealing with so small files that disk space overhead is something you're very much concerned with, so I doubt #doublep's suggestion, since it makes for such a hassle, but it makes it much easier with a single file if users are to copy an entire game over somewhere, then it's easy to check if the entire set is correct.
And, of course, it's harder to read for people that shouldn't have access to it. But it's also harder to modify, which means harder to patch, and harder to write extensions. Someone that uses extensions a lot, prefers the directory structure: The Sims.
Were I the games developer, I'd love to go for individual files. Then again, I'd be using bundles as I'd be writing for the Mac ;-)
Cheers
Nik
I can think of multiple reasons.
As doublep suggested, files occupy more space on the disc than they require. So an archive saves space. 10k files (of any size) should save you 20MB when packed into an archive. Not exactly a large amount of space nowadays, but still.
The other reason I can think of is disc fragmentation. I suspect a heavily fragmented disc will perform worse when accessing thousands of separate files on a fragmented space. But I'm no expert in this field, so I'd appreciate if someone more experienced verified this.
Finally, I think this may also have something to do with restricting access to separate game files. You can have a bunch of Lua scripts exposed, mess with them and break something. Or you could have the outro cinematic/sound/text/whatever exposed and get spoiled by accessing it. I do that myself as well: I encrypt images with a multipass XOR key, pack text files and config variables into a monolithic file (zipped for extra security) and only leave music freely accessible. This way, the game's secrets will remain undiscovered for a bit longer :).
Or there may be another reason I never thought about :D.
As you know games, especially with larger companies try to squeeze as much performance as they can. One technique is to have all the data in one large file and just DMA it to memory (think of it as a memcpy from CD to RAM). Since all the files are in one large one there will be no disk seeks and you can have a large number of files (which may cause large amount of seeks) all loaded quicky because of the technique.

Long term source code archiving: Is it possible?

I'm curious about keeping source code around reliably and securely for several years. From my research/experience:
Optical media, such as burned DVD-R's lose bits of data over time. After a couple years, I don't get all the files off that I put on them. Read errors, etc.
Hard drives are mechanical and subject to failure/obsolescence with expensive data recovery fees, that hardly keep your data private (you send it away to some company).
Magnetic tape storage: see #2.
Online storage is subject to the whim of some data storage center, the security or lack of security there, and the possibility that the company folds, etc. Plus it's expensive, and you can't guarantee that they aren't peeking in.
I've found over time that I've lost source code to old projects I've done due to these problems. Are there any other solutions?
Summary of answers:
1. Use multiple methods for redundancy.
2. Print out your source code either as text or barcode.
3. RAID arrays are better for local storage.
4. Open sourcing your project will make it last forever.
5. Encryption is the answer to security.
6. Magnetic tape storage is durable.
7. Distributed/guaranteed online storage is cheap and reliable.
8. Use source control to maintain history, and backup the repo.
The best answer is "in multiple places". If I were concerned about keeping my source code for as long as possible I would do:
1) Backup to some optical media on a regular basis, say burn it to DVD once a month and archive it offsite.
2) Back it up to multiple hard drives on my local machines
3) Back it up to Amazon's S3 service. They have guarantees, it's a distributed system so no single points of failure and you can easily encrypt your data so they can't "peek" at it.
With those three steps your chances of losing data are effectively zero. There is no such thing as too many backups for VERY important data.
Based on your level of paranoia, I'd recommend a printer and a safe.
More seriously, a RAID array isn't so expensive anymore, and so long as you continue to use and monitor it, a properly set-up array is virtually guaranteed never to lose data.
Any data you want to keep should be stored in multiple places on multiple formats. While the odds of any one failing may be significant, the odds of all of them failing are pretty small.
I think you'd be surprised how reasonably priced online storage is these days. Amazon S3 (simple storage solution) is $0.10 per gigabyte per month, with upload costs of $0.10 per GB and download costing $0.17 per GB maximum.
Therefore, if you stored 20GB for a month, uploaded 20GB and downloaded 20GB it would cost you $8.40 (slightly more expensive in the European data center at $9).
That's cheap enough to store your data in both US and EU data centers AND on dvd - the chances of losing all three are slim, to say the least.
There are also front-ends available, such as JungleDisk.
http://aws.amazon.com
http://www.jungledisk.com/
http://www.google.co.uk/search?q=amazon%20s3%20clients
The best way to back up your projects is to make them open source and famous. That way there will always be people with a copy of it and able to send it to you.
After that, just care of the magnetic/optical media, continued renewal of it and multiple copies (online as well, remember you can encrypt it) on multiple media (including, why not, RAID sets)
If you want to archive something for a long time, I would go with a tape drive. They may not hold a whole lot, but they are reliable and pretty much the storage medium of choice for data archiving. I've never personally experienced dataloss on a tape drive, however.
Don't forget to use Subversion (http://subversion.tigris.org/). I subversion my whole life (it's awesome).
The best home-usable solution I've seen was printing out the backups using a 2D barcode - the data density was fairly high, it could be re-scanned fairly easily (presuming a sheet-feeding scanner), and it moved the problem from the digital domain back into the physical one - which is fairly easily met by something like a safe deposit box, or a company like Iron Mountain.
The other answer is 'all of the above'. Redundancy always helps.
For my projects, I use a combination of 1, 2, & 4. If it's really important data, you need to have multiple copies in multiple places. My important data is replicated to 3-4 locations every night.
If you want a simpler solution, I recommend you get an online storage account from a well known provider which has an insured reliability guarantee. If you are worried about security, only upload data inside TrueCrypt encrypted archives. As far as cost, it will probably be pricey... But if it's really that important the cost is nothing.
For regulatory mandated archival of electronic data, we keep the data on a RAID and on backup tapes in two separate locations (one of which is Iron Mountain). We also replace the tapes and RAID every few years.
If you need to keep it "forever" probably the safest way is to print out the code and stick that in a plastic envelope to keep it safe from the elements. I can't tell you how much code I've lost to a backup means which are no longer reachable.... I don't have a paper card reader to read my old cobol deck, no drive for my 5 1/4" floppies, or my 3 1/2" floppies. but yet the print out that I made of my first big project still sits readable...even after my once 3 year old decided that it would make a good coloring book.
When you state "back up source code", I hope you include in your meaning the backing up of your version control system too.
Backing your current source code (to multiple places) is definitely critical, but backing up your history of changes as preseved by your VCS is paramount in my opinion. It may seem trivial especially when we are always "living in the present, looking towards the future". However, there have been way too many times when we have wanted to look backward to investigate an issue, review the chain of changes, see who did what, whether we can rollback to a previous build/version. All the more important if you practise heavy branching and merging. Archiving a single trunk will not do.
Your version control system may come with documentation and suggestions on backup strategies.
One way would be to periodically recycle your storage media, i.e. read data off the decaying medium and write it to a fresh one. There exist programs to assist you with this, e.g. dvdisaster. In the end, nothing lasts forever. Just pick the least annoying solution.
As for #2: you can store data in encrypted form to prevent data recovery experts from making sense of it.
I think Option 2 works well enough if you have the write backup mechanisms in place. They need not be expensive ones involving a third-party, either (except for disaster recovery).
A RAID 5 configured server would do the trick. If a hard drive fails, replace it. It is HIGHLY unlikely that all the hard drives will fail at the same time. Even a mirrored RAID 1 drive would be good enough in some cases.
If option 2 still seems like a crappy solution, the only other thing I can think of is to print out hard-copies of the source code, which has many more problems than any of the above solutions.
Online storage is subject to the whim of some data storage center, the security or lack of security there, and the possibility that the company folds, etc. Plus it's expensive,
Not necessarily expensive (see rsync.net for example), nor insecure. You can certainly encrypt your stuff too.
and you can't guarantee that they aren't peeking in.
True, but there's probably much more interesting stuff to peek at than your source-code. ;-)
More seriously, a RAID array isn't so expensive anymore
RAID is not backup.
I was just talking with a guy who is an expert in microfilm. While it is an old technology, for long term storage it is one of the most enduring forms of data storage if properly maintained. It doesn't require sophisticated equipment (magifying lens and a light) to read altough storing it may take some work.
Then again, as was previously mentioned, if you are only talking in the spans of a few years instead of decades printing it off to paper and storing it in a controlled environment is probable the best way. If you want to get really creative you could laminate every sheet!
Drobo for local backup
DVD for short-term local archiving
Amazon S3 for off-site,long-term archiving