Split a PlasticSCM repository into multiple repositories

Our default Plastic repository has grown pretty large. It is now bloated with a lot of files that are unrelated, and thus do not need to be versioned together. In hindsight the default repository should have been created as several separate repositories.
I'd like to break the default repository up into smaller chunks, but I don't want to lose all of the file history by moving files to empty new repositories. What's the best way to accomplish this?

You can take either of the following approaches, depending on your needs:
Replicate the original repository once for each repository you want to end up with. Then, in each replica, delete the content you do not want to keep there. This preserves the file history, but the database size will not decrease.
Create new, empty repositories and add the desired content to them. These are completely new repositories, so the databases will be much smaller, but the file history is not carried over.
In short, you need to choose between large databases that preserve the history and fresh databases without file history.

Related

Is there a way for GitHub to "ignore" files when number crunching?

So my friends and I are collaborating on a project, which involves generating and testing out a ton of machine learning models. For convenience's sake, we have our own model directory to store our saved models, and we keep our datasets in the repo too.
I've attached a picture of our statistics below.
As you can see, these stats don't really make much sense because GitHub counts the lines in the binary and dataset files too, so it doesn't really give an accurate estimation of our contributions. Also, gitignoring those models/dataset files isn't an option.
Is there a way for me to tell git to ignore certain filetypes before doing its number crunching? If not, can this be considered scope for further improvement?
Thanks.

I need to back up database files. Need something like GitHub that is happy with 100s of gigabytes of data

I want a source-controlled environment for a fairly large amount of database data, in text form, before it's loaded into the DBMS. We've been using GitHub and it's great. But they expect a repository to be less than 1 gigabyte, and we have hundreds.
It could be in CVS or SVN, but tracking versions is important. The data is very static and is accessed only at low rates, say once a week for parts of it, once a month for more.
Any suggested places/services that do this? It doesn't have to be free, we'll happily pay a reasonable amount.
I confirm that this amount of data is incompatible with a version control system, which is made to record the history (i.e. the evolution) of mostly text files and small binary files.
It is certainly not compatible with a Distributed VCS, where any clone would clone all the repo.
You need to look at cloud services for this type of storage.
The OP protests (downvote), stating that:
They would be normal ASCII except that GitHub has such small file size limits that I ran them through ZIP compression.
They rarely change, and when the contents change, it's just a tiny number of lines within the file.
It's exactly what version control is about. Which 0.005% of the ASCII changed? Who changed it? When?
I maintain that:
hundreds of megabytes is incompatible with most source control repo providers out there (it would even be incompatible with most internal enterprise repos, and I am in a large company)
putting them in a zip file isn't practical, because a version control tool wouldn't be able to record the delta between two versions of a compressed archive.
You need to keep separate:
the data (stored "elsewhere" as a large collection of plain text files, certainly not on GitHub)
the metadata you want (author, date of modification), stored in a regular Git repo alongside "shell" files (i.e. files which are actually "references", or a kind of "symlink", to the actual files stored elsewhere)
One Git-based system that provides this is git-annex, using your own cloud storage with (if implemented) the git-annex assistant: see its roadmap.
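To make the data/metadata split above more concrete, here is a minimal Python sketch. It is not git-annex's actual format: the function name write_pointer, the .ptr.json suffix and the JSON fields are all hypothetical, chosen only for illustration. The large file stays wherever you store it; only the small pointer file (name, size, hash) would be committed to the regular Git repo.

```python
import hashlib
import json
import os


def write_pointer(data_path, pointer_dir):
    """Write a small JSON 'pointer' describing a large file.

    Hypothetical format for illustration only (not git-annex's):
    the pointer records the file name, size and SHA-256 digest,
    so a Git repo can version the pointer instead of the data.
    """
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        # Read in 1 MiB chunks so huge files never sit in memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    pointer = {
        "name": os.path.basename(data_path),
        "size": os.path.getsize(data_path),
        "sha256": digest.hexdigest(),
    }
    pointer_path = os.path.join(pointer_dir, pointer["name"] + ".ptr.json")
    with open(pointer_path, "w", encoding="utf-8") as f:
        # Sorted keys and a fixed indent keep the pointer diff-friendly.
        json.dump(pointer, f, indent=2, sort_keys=True)
        f.write("\n")
    return pointer_path
```

When the data file changes, its pointer changes by one or two lines, so the Git history still answers "who changed it, and when", while the bulk data lives elsewhere.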

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.
So my question is, what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?
tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.
Network bandwidth: There is very little overhead in checking in or getting files; it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit from having a TFS source control proxy on their local network to speed up downloads.
Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, and there's little overhead beyond that. TFS will not slow down just because your repository contains large files.
If these files get modified frequently, you will need to account for the disk space used by the revisions, also. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions as in the typical use case with text files, the storage cost should be inexpensive. However, if the entirety of the contents change as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately. Instead, a background "deltafication" job runs nightly to compute the deltas and trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
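As a rough worked example of the storage arithmetic described above, the following Python sketch estimates the server disk usage of one file's history. The function name, its parameters, and the idea that every delta costs a fixed fraction of the file size are simplifying assumptions made for illustration; they are not TFS internals.

```python
def estimate_storage_mb(file_mb, total_revisions, pending_revisions, delta_fraction):
    """Rough disk estimate (in MB) for one file's history.

    Assumptions (not TFS internals):
      - one revision is always kept in full as the baseline;
      - `pending_revisions` were checked in since the last
        deltafication run and are still stored in full;
      - every other revision costs `delta_fraction * file_mb`.
    """
    if not 0 <= pending_revisions <= total_revisions - 1:
        raise ValueError("pending_revisions must be between 0 and total_revisions - 1")
    deltafied = total_revisions - 1 - pending_revisions
    return file_mb * (1 + pending_revisions + deltafied * delta_fraction)


# A 200 MB text file with 10 revisions, about 1% changed each time:
after_job = estimate_storage_mb(200, 10, 0, 0.01)   # all deltas computed
before_job = estimate_storage_mb(200, 10, 3, 0.01)  # 3 revisions still stored in full
print(after_job, before_job)
```

Under these assumptions the history costs roughly 218 MB once the job has run, but roughly 812 MB while three revisions are still waiting to be deltafied, which is why frequent daily check-ins of very large files need extra headroom.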
Client storage: Clients will need to have enough disk space to contain these files also (although only at the revision that they've downloaded.) This can be mitigated in your workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.
Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax your memory.
If those files were constantly changing and their deltas were big, I would eventually expect a penalty in overall TFS performance. You clearly state that this is not the case, so, provided that your SQL Server has the capacity to house the storage, I believe you should be able to proceed without any implications. A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will take that much longer. The severity of this greatly depends on your network topology and stability.
The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or map them out. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd advise keeping everything in one team project).
If you have control of the XML format, then also consider a few tweaks to make the files smaller. This will improve the performance of store/get operations and also loading speed. Shorten element and attribute names, reduce the number of decimal places you output for floating point numbers, etc. You will find that simple schemes like this can knock many megabytes off the size of GB-sized files, and it's easy to knock up a quick XSLT transform or some code to convert the files over to the new format.

Simple version-control systems or versioning file system or versioning database

I am looking for a simple versioning system for a large number of records or files (~50 million, ~100GB unpacked, ~20MB packed). The files are only a few Kilobytes each, and have unique IDs, so I don't mind whether they are stored in a flat structure (table, directory...) or not. On average, each record is changed once a month, but most changes have diffs less than a Kilobyte so it should be easy to compress versions. However, a naive database with one entry for each version would grow too quickly. I need the following operations:
basic CRUD operations: create, read, update, delete
quick listing of recent changes
quick listing of recent changes of a particular record
query for changes in a given period of time
query for changes by a given user (each edit is associated to some user id and optionally has a commit message as comment)
for write operations there must be a commit hook to validate and reject ill-formed records.
In short, I am looking for a Wiki-like software for simple records or files.
I thought about possible solutions:
Put files in a version control system. This gives me replication and many available access tools, so it is my preferred solution. But the amount of data is too large for distributed systems like git. Is anyone using Subversion for a similar task with success?
Implement my own versioning in a database or in a file system. I would probably need to store only compressed records and diffs; this would mean more work, but I would learn something. This would be my preferred solution if it were just for fun.
Use a versioning file system. This would make setup, replication and access more difficult. Probably I would need to implement my own access API above the file system.
Use a versioning database system. Can you suggest some?
Use some other existing data store with versioning (MediaWiki?, Amazon Cloud Drive?, ...)
Obviously there are many paths. Which paths have been used by others with success for similar or larger amounts of data?
If you're not averse to having a raw copy of each file on your client (which I imagine is OK, if you're considering svn) then git is probably quite a good solution to your problem. The underlying repository storage will use binary diffs between files as well as between versions, so you should have close to optimal compression there.
With a bare repo and some scripting, you may even be able to get away with not having the current revision checked out: objects are available from the command line and you can create new commits without a checkout.
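For comparison, the "store only compressed records and diffs" option from the question can be illustrated with a small Python sketch. This is not how git or Subversion store data; the helper names make_delta and apply_delta and the delta layout are made up for this example. Each new version is recorded as a list of "copy lines i..j from the previous version" and "insert these lines" instructions, and the list is then compressed with zlib.

```python
import difflib
import json
import zlib


def make_delta(old_lines, new_lines):
    """Encode new_lines as copy/insert instructions relative to old_lines."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Reuse old_lines[i1:i2] instead of storing the text again.
            ops.append(["copy", i1, i2])
        elif tag in ("replace", "insert"):
            # Store the new lines literally.
            ops.append(["insert", new_lines[j1:j2]])
        # "delete" needs no instruction: those old lines are just not copied.
    return zlib.compress(json.dumps(ops).encode("utf-8"))


def apply_delta(old_lines, delta):
    """Rebuild the new version from the old version and a delta."""
    new_lines = []
    for op in json.loads(zlib.decompress(delta).decode("utf-8")):
        if op[0] == "copy":
            new_lines.extend(old_lines[op[1]:op[2]])
        else:
            new_lines.extend(op[1])
    return new_lines
```

A real store would still need the rest of the requirements listed in the question (per-record history, queries by user and time, validation on write), which is a large part of why an existing version control system is attractive here.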

How can we manage non-code files in TFS for designers, etc?

Projects normally include a set of non-source-code files such as interface images (PSDs, JPGs, ...). How can we manage these types of files with TFS, and how can graphic designers check their image files in or out to use them in applications like Photoshop?
You can simply add binary files (PSD, JPG etc.) to your tree, with the following caveats:
Large files take more space on the server. A quote from http://social.msdn.microsoft.com/Forums/en-US/tfsversioncontrol/thread/6f642d0f-5459-4a14-a19d-ede34713bcf4 :
TFS does handle large (> 16mb) files differently. It does not perform Delta storage but instead stores a complete copy of each version. This is an optimization to make check-ins faster for those large files. There is no difference between text files and binary files. Small ones are Delta'd, large ones are Stored.
Large files take longer to download (see the same link above).
If there is a conflict (i.e. two people modify the same binary file at the same time), one of them has to resolve the conflict completely manually, e.g. he has to load all 3 image versions in the image editor, look at the differences, and merge the changes manually.