I need to back up database files. Need something like GitHub that is happy with 100s of gigabytes of data

I want a source-controlled environment for a fairly large amount of database data, in text, before it's loaded into the DBMS. We've been using GitHub and it's great. But they expect a repository to be less than 1 gigabyte, and we have hundreds.
It could be in CVS or SVN, but tracking versions is important. The data is very static and is accessed only at low rates, say once a week for parts of it, once a month for the rest.
Any suggested places/services that do this? It doesn't have to be free, we'll happily pay a reasonable amount.

I confirm that this amount of data is incompatible with a Version Control System (which is made to record history, i.e. the evolution of mostly text files and small binary files).
It is certainly not compatible with a Distributed VCS, where any clone would clone the entire repo.
You need to look at cloud services for this type of storage.
The OP protests (downvote), stating that:
They would be normal ASCII except that GitHub has such small file size limits that I ran them through ZIP compression.
They rarely change, and when the contents change, it's just a tiny number of lines within the file.
It's exactly what version control is about. Which 0.005% of the ASCII changed? Who changed it? When?
I maintain that:
hundreds of gigabytes are incompatible with most source control repo providers out there (it would even be incompatible with most internal enterprise repos, and I am in a large company)
putting them in a zip file isn't practical, in that a version control system wouldn't be able to record the deltas.
You need to keep separate:
the data (stored "elsewhere", as a large collection of plain text files, certainly not on GitHub)
the metadata you want (author, date of modification), stored in a regular git repo in association with "shell" data (i.e. files which are actually "references", or a kind of "symlink", to the actual files put elsewhere)
The one system, based on Git, that provides this is git-annex, using your own cloud storage with (if implemented) the git-annex assistant: see its roadmap.
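For a rough idea of what that split looks like (the storage path and pointer-file layout below are placeholders for illustration, not what git-annex actually writes on disk), the repo only ever sees a tiny reference file while the bulk content lives elsewhere:

    import hashlib
    import json
    import shutil
    from pathlib import Path

    DATA_STORE = Path("/mnt/bulk-storage")  # hypothetical "elsewhere": a NAS, S3 mount, etc.

    def stash(big_file: Path, repo_dir: Path) -> Path:
        """Copy a large file to external storage and leave a small pointer in the git repo."""
        digest = hashlib.sha256(big_file.read_bytes()).hexdigest()
        target = DATA_STORE / digest
        if not target.exists():
            shutil.copy2(big_file, target)              # the real content goes to bulk storage
        pointer = repo_dir / (big_file.name + ".ref")   # only this tiny file gets committed
        pointer.write_text(json.dumps({"sha256": digest, "store": str(target)}, indent=2))
        return pointer

The pointer file versions cleanly (who changed which dataset, and when), while the hundreds of gigabytes never touch the repo provider.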

Related

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.
So my question is: what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc., for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?
tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.
Network bandwidth: There is very little overhead in checking in or getting files, it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit by having a TFS source control proxy on their local network to speed up downloads.
Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
Server storage: Disk space on the server is fairly straightforward: you'll need enough space to hold the files, and there's little overhead beyond that. TFS will not slow down just because your repository contains large files.
If these files get modified frequently, you will need to account for the disk space used by the revisions as well. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions, as in the typical use case with text files, the storage cost should be small. However, if the entire contents change, as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
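To illustrate the principle (this is a plain unified diff, not TFS's actual delta format), a text file of a hundred thousand lines with a single changed line yields a delta of only a few hundred bytes:

    import difflib

    # Hypothetical revisions of a large text file that differ in one line.
    rev4 = [f"record {i},value {i}\n" for i in range(100_000)]
    rev5 = list(rev4)
    rev5[42_000] = "record 42000,value 99999\n"

    delta = list(difflib.unified_diff(rev4, rev5, "rev4.txt", "rev5.txt"))
    print(f"full file:  {sum(len(line) for line in rev4):,} bytes")
    print(f"text delta: {sum(len(line) for line in delta):,} bytes")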
One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately; a background "deltafication" job runs nightly to compute them and trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with many revisions happening daily, your disk space requirements will need to take this into account.
Client storage: Clients will need to have enough disk space to contain these files also (although only at the revision that they've downloaded.) This can be mitigated in your workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.
Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax your memory.
If those files were constantly changing and their deltas were big, I would eventually expect a penalty in overall TFS performance. You clearly state that this is not the case, so, provided that your SQL server has the capacity to house the storage, I believe you should be able to proceed without any implications. A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will take that much longer. The severity of this greatly depends on your network configuration/stability.
The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or map them out. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd advise keeping everything in one team project).
If you have control of the XML format then also consider a few tweaks to make the files smaller. This will improve the performance of store/get operations and also loading speed... Shorten element and attribute names, reduce the number of decimal places you are outputting for floating point numbers, etc. You will find that simple schemes like this will knock many megabytes off the size of GB-sized files, and it's easy to knock up a quick XSLT transform or some code to convert the files over to the new format.
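As a sketch of that kind of conversion (the tag names in the mapping are made up; adjust them to your schema, and make sure the readers are updated too), a short Python pass does the same job as an XSLT:

    import xml.etree.ElementTree as ET

    RENAME = {"MeasurementValue": "mv", "TimestampUtc": "ts"}  # hypothetical long tag names

    def shrink(in_path: str, out_path: str, decimals: int = 3) -> None:
        """Shorten element names and trim float precision to cut file size."""
        tree = ET.parse(in_path)
        for elem in tree.iter():
            elem.tag = RENAME.get(elem.tag, elem.tag)
            if elem.text:
                try:
                    elem.text = str(round(float(elem.text), decimals))
                except ValueError:
                    pass  # not a number, leave the text alone
        tree.write(out_path)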

Simple version-control systems or versioning file system or versioning database

I am looking for a simple versioning system for a large number of records or files (~50 million, ~100GB unpacked, ~20MB packed). The files are only a few kilobytes each and have unique IDs, so I don't mind whether they are stored in a flat structure (table, directory, ...) or not. On average, each record is changed once a month, but most changes have diffs of less than a kilobyte, so it should be easy to compress versions. However, a naive database with one entry for each version would grow too quickly. I need the following operations:
basic CRUD operations: create, read, update, delete
quick listing of recent changes
quick listing of recent changes of a particular record
query for changes in a given period of time
query for changes by a given user (each edit is associated to some user id and optionally has a commit message as comment)
for write operations there must be a commit hook to validate and reject ill-formed records.
In short, I am looking for a Wiki-like software for simple records or files.
I thought about possible solutions:
Put files in a version control system. This gives me replication and many available access tools, so it is my preferred solution. But the amount of data is too large for distributed systems like git. Is anyone using Subversion for a similar task with success?
Implement my own versioning in a database or in a file system. I would probably need to store only compressed records and diffs; it would be more work, but I would learn something. This would be my preferred solution if it were just for fun.
Use a versioning file system. This would make setup, replication and access more difficult. Probably I would need to implement my own access API above the file system.
Use a versioning database system. Can you suggest some?
Use some other existing data store with versioning (MediaWiki?, Amazon Cloud Drive?, ...)
Obviously there are many paths. Which paths have been used by others with success for similar or larger amounts of data?
If you're not averse to having a raw copy of each file on your client (which I imagine is OK, if you're considering svn) then git is probably quite a good solution to your problem. The underlying repository storage will use binary diffs between files as well as between versions, so you should have close to optimal compression there.
With a bare repo and some scripting, you may even be able to get away with not having the current revision checked out: objects are available from the command line and you can create new commits without a checkout.
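Roughly, that scripting could look like the following sketch, using git's plumbing commands (hash-object, read-tree, update-index, write-tree, commit-tree) so a record is versioned without a working copy ever being checked out. The bare repo path, index location, and branch name are assumptions, and the branch is assumed to already exist:

    import os
    import subprocess

    REPO = "/srv/records.git"                      # hypothetical bare repository
    ENV = {**os.environ, "GIT_DIR": REPO, "GIT_INDEX_FILE": "/tmp/records-index"}

    def git(*args: str, data: bytes = b"") -> str:
        out = subprocess.run(["git", *args], input=data, env=ENV,
                             capture_output=True, check=True)
        return out.stdout.decode().strip()

    def commit_record(record_id: str, content: bytes, message: str) -> str:
        """Store one record as a blob and commit it, with no checkout involved."""
        blob = git("hash-object", "-w", "--stdin", data=content)
        git("read-tree", "refs/heads/master")      # load the current branch tree into the index
        git("update-index", "--add", "--cacheinfo", "100644", blob, record_id)
        tree = git("write-tree")
        parent = git("rev-parse", "refs/heads/master")
        commit = git("commit-tree", tree, "-p", parent, "-m", message)
        git("update-ref", "refs/heads/master", commit)
        return commit

Setting GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL in that environment covers the per-edit user id, and the validation requirement maps onto an ordinary server-side hook such as pre-receive or update.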

Cryptographically Secure Backup

Until now I have been doing backups with rsync from my computer to an external drive. The backup data is made of tens of thousands of small files and hundreds of big ones (Maildir email messages and episodes of my favorite series). The problem with this is that if a single sector of my backup disk fails, perhaps a single message may be corrupted, which I find intolerable.
I have thought of an alternative that works as follows. There are three trees: the file tree consisting of the data I wish to backup, the backup tree containing a copy of the file tree at a given moment in time and a hash tree which contains file hashes and metadata hashes of the backup tree. A hash of the whole hash tree is also kept. Prior to a backup, the hash of the hash tree is checked. A failure here invalidates the whole backed up data. After the check succeeds, the hash tree shape is compared to the backup tree shape and the metadata hashes are verified to ensure the backup tree is metadata and shape consistent. If it is not, individual culprits can be listed. After this, the rsync backup traversal is performed. Whenever rsync updates a file, its new hash and metadata hash are computed and inserted into the hash tree. Whenever rsync deletes a file, that file is removed from the hash tree. In the end, the hash of the hash tree is computed and stored.
This process is very useful because the hashes are computed for correct data, meaning even if a file in the file tree is corrupted after it has been inserted in the hash tree, this inconsistency does not invalidate the backup (or future backups). The most important property, however, is that if an attacker corrupts the backup medium however he likes, the information that lies there will be trusted if and only if it is correct, unless the attacker has broken the hash algorithm. Also, the data sent to the backup or restored from it can be verified incrementally.
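A minimal sketch of that hash tree (SHA-256 over file contents plus a per-file metadata hash, and a single top-level hash over the whole tree that gets checked before each backup); the metadata fields included here are just an example:

    import hashlib
    import json
    from pathlib import Path

    def build_hash_tree(root: Path) -> dict:
        """Map each relative path in the backup tree to a content hash and a metadata hash."""
        tree = {}
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            st = path.stat()
            meta = f"{st.st_mode}:{st.st_size}:{int(st.st_mtime)}".encode()
            tree[str(path.relative_to(root))] = {
                "content": hashlib.sha256(path.read_bytes()).hexdigest(),
                "meta": hashlib.sha256(meta).hexdigest(),
            }
        return tree

    def root_hash(tree: dict) -> str:
        """Hash of the whole hash tree, stored separately and verified before each run."""
        return hashlib.sha256(json.dumps(tree, sort_keys=True).encode()).hexdigest()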
My question is: is there a reasonable implementation of such a backup scheme? My searches tell me that the only backup schemes available either do full or differential backups (tar based, for instance) or fail to provide a cryptographic correctness guarantee (rsync).
If there are no implementations of anything like that, maybe I will write one, but I would like to avoid reinventing the wheel.
What you're talking about sounds a lot like Git. I think it would pretty much do what you're describing. Just implement the process of "backing up" as git commit. You can then restore to any previous version with git checkout.
It is amazingly storage efficient and extremely fast for transferring content, which would probably save you a lot of time on your backups. As a bonus, it's free, portable and already debugged!
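Sketched out, that approach is little more than three git commands wrapped in a script (the directory is a placeholder, and git init has to be run in it once):

    import subprocess

    BACKUP_DIR = "/home/me/Maildir"   # hypothetical tree to protect; run `git init` here once

    def backup(message: str = "backup") -> None:
        subprocess.run(["git", "-C", BACKUP_DIR, "add", "-A"], check=True)
        subprocess.run(["git", "-C", BACKUP_DIR, "commit", "-m", message], check=True)

    def restore(revision: str) -> None:
        # e.g. restore("HEAD~3") brings back the files as they were three backups ago
        subprocess.run(["git", "-C", BACKUP_DIR, "checkout", revision, "--", "."], check=True)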
This sounds almost identical to how the Mercurial storage system works. The 'rsync command' would be implemented using Mercurial's push, which is remarkably network efficient.
If I had to solve the problem, I'd take a RAID array (to prevent corruption) of drives that use built-in AES encryption, and then use whatever backup method I am used to.
Git-Annex is the proper solution to this problem given available tools. It is an extension to git which allows robust support for files which are arbitrarily large, synchronizes between datastores automatically, has an optional graphical user interface, tracks how many backups you have and precisely what files are stored where, and allows you to set rules for how it should manage different content. You can also customize what cryptographic hashes are used to validate the integrity of the content.
For drive backups, git-annex interoperates with bup, which has more features tuned towards those looking for regular backups of entire systems.

How to combine version control with data analysis

I do a lot of solo data analysis, using a combination of tools such as R, Python, PostgreSQL, and whatever I need to get the job done. I use version control software (currently Subversion, though I'm playing around with git on the side) to manage all of my scripts, but the data is perpetually a challenge. My scripts tend to run for a long period of time (hours, or occasionally days) to generate small or large datasets, which I in turn use as input for more scripts.
The challenge I face is in how to "rollback" what I do if I want to check out my scripts from an earlier point in time. Getting the old scripts is easy. Getting the old data would be easy if I put my data into version control, but conventional wisdom seems to be to keep data out of version control because it's so darned big and cumbersome.
My question: how do you combine and/or manage your processed data with a version control system on your code?
Subversion, and maybe other [d]vcs as well, supports symbolic links. The idea is to store the raw data 'well organized' on a filesystem, while tracking the relation between 'script' and 'generated data' with symbolic links under version control.
data -> data-1.2.3
All your scripts will load data through the versioned symbolic link, which points at a given dataset.
Using this approach, code and calculated datasets are tracked within one tool, without bloating your repository with binary data.
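A small sketch of that convention in Python (the dataset directory names are made up): the symlink is the only thing under version control, and every script goes through it:

    from pathlib import Path

    DATA_LINK = Path("data")          # versioned symlink, e.g. data -> data-1.2.3

    def publish(dataset_dir: str) -> None:
        """Repoint the versioned symlink at a newly generated dataset, then commit the link."""
        if DATA_LINK.is_symlink():
            DATA_LINK.unlink()
        DATA_LINK.symlink_to(dataset_dir)

    def load_data(name: str) -> bytes:
        """Checking out an old revision of the link transparently yields the matching dataset."""
        return (DATA_LINK / name).read_bytes()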

How can we manage non-code files in TFS for designers, etc?

Normally projects consist of a set of non-source-code files like interface images (PSDs, JPGs, ...). How can we manage these types of files with TFS, and how can graphic designers check their image files in and out to use them in applications like Photoshop?
You can simply add binary files (PSD, JPG etc.) to your tree, with the following caveats:
Large files take more space on the server. A quote from http://social.msdn.microsoft.com/Forums/en-US/tfsversioncontrol/thread/6f642d0f-5459-4a14-a19d-ede34713bcf4 :
TFS does handle large (> 16mb) files differently. It does not perform Delta storage but instead stores a complete copy of each version. This is an optimization to make check-ins faster for those large files. There is no difference between text files and binary files. Small ones are Delta'd, large ones are Stored.
Large files are slower to download (see the same link above).
If there is a conflict (i.e. two people modify the same binary file at the same time), one of them has to resolve the conflict entirely by hand, e.g. load all three image versions in the image editor, look at the differences, and merge the changes manually.