Best practice for backing up a CVS repository? - version-control

Some of our projects are still on CVS. We currently use tar to back up the repository nightly.
Here's the question:
What is the best practice for backing up a CVS repository?
Context: We're combining several servers across the country onto one central server. The combined repository size is 14 GB. (Yes, this is high, most likely due to lots of binary files, many branches, and the age of the repositories.)
A 'straight tar' of the CVS repository yields a ~5 GB .tar.gz file. Restoring files from 5 GB tar files will be unwieldy. Plus, we fill up tapes quickly.
How well does a full-and-incremental backup approach work, i.e. a weekly full backup plus nightly incremental backups? What open source tools solve this problem well (e.g. Amanda, Bacula)?
thanks,
bill

You can use rsync to create a backup copy of your repo on another machine if you don't need a history of backups. rsync works incrementally, so bandwidth is consumed only for sending changed files.
I don't think you need a full history of backups, as the VCS provides its own history management and you need backups ONLY as a failure-protection measure.
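A minimal sketch of such a nightly rsync job (the repository path and backup host are made up for illustration):
$ rsync -az --delete /var/lib/cvsroot/ backuphost:/backups/cvsroot/   # -a preserves permissions/times, -z compresses, --delete mirrors removals
Only files that changed since the last run are transferred, so the nightly window and bandwidth stay small even for a 14 GB repository.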
Moreover, if you worry about the consistency of the backed-up repository, you MAY want to use filesystem snapshots; e.g. LVM can produce them on Linux. As far as I know, ZFS on Solaris also has a snapshot feature.
The only case where you don't need snapshots is when you run the backup procedure late at night, no one touches your repo, and your VCS daemon is stopped during the backup :-)
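If the repository lives on an LVM logical volume, a snapshot-then-copy run could look roughly like this (volume group, logical volume, and mount point names are illustrative; run as root):
$ lvcreate --snapshot --size 1G --name cvs-snap /dev/vg0/cvs   # frozen, consistent view of the volume
$ mount -o ro /dev/vg0/cvs-snap /mnt/cvs-snap
$ rsync -az --delete /mnt/cvs-snap/ backuphost:/backups/cvsroot/
$ umount /mnt/cvs-snap
$ lvremove -f /dev/vg0/cvs-snap
The snapshot only needs enough space (here 1 GB) to hold blocks that change while it exists, so it is cheap for a short backup window.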

As Darkk mentioned, rsync makes for good backups since only changed things are copied. Dirvish is a nice backup system based on rsync. Backups run quickly. Restores are extremely simple since all you have to do is copy things. Multiple versions of the backups are stored efficiently.

Related

Azure private pipeline agent .git folder size

We recently moved from hosted to private agents, for reasons that are not relevant to this question. The problem we're having now is that the private agent runs out of disk space. I've checked why this is the case, and it turns out that for one of the workspaces the agent creates, the .git folder grows to over 20 GB during the day, while the repository is only a few GB. What can explain this excessive growth?
some extra info:
We build from different branches, using the same pipeline (so it re-uses the same workspace)
We do not clean the workspace between runs, since this would require us to re-get the entire repository each build, which slows the build. (I understand adding the clean option would solve our problem, but it would also slow down all builds, which we don't want.)
We used to use fetchdepth: 1 in our pipelines, but we recently removed this because it is no longer necessary on private agents, since the sources are cached between runs.
Edit:
to clarify, I'm looking for a way to avoid running out of disk space on the agents, without losing the ability to cache source files.
When I run the same pipeline with different branches, the .git folder size indeed increases.
I then found that the root cause of this issue could be the pack files in .git/objects/pack.
Git packs the source files into pack files; if your source files are large enough, the pack files will also take up a lot of space.
You could try to use the BFG tool or Git commands to remove the files.
For more detailed information, you could refer to this ticket: Remove large .pack file created by git
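As a rough illustration (not an official part of the pipeline), you could check how much space the packs take and then repack and prune unreachable objects in the agent's workspace between builds; whether this is acceptable depends on how much you rely on the cached objects:
$ git count-objects -vH          # size-pack shows how much the pack files occupy
$ git reflog expire --expire=now --all
$ git gc --prune=now             # repacks and drops objects that are no longer reachable
Note that git gc only removes unreachable objects; objects still referenced by the branches you build will stay, so this is mitigation rather than a guarantee.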

How to work on a large number of remote files with PHPStorm

I have a small Debian VPS-box on which I host and develop a few small, private PHP websites.
I develop on a Windows desktop with PHPStorm.
Most of my projects only have a few dozen source files but also contain a few thousand lib files.
I don't want to run a webserver on my local machine because this creates a whole set of problems I don't want to be bothered with for such small projects (e.g. setting up another webserver; synching files between my desktop and the VPS box; managing different configurations for Windows and Debian (different hosts, paths...); keeping DB schema and data in synch).
I am looking for a good way to work with PHPStorm on a large amount of remote files.
My approaches so far:
Mounting the remote file system in Windows (tried via pptp/smb, ftp, webdav) and working on it with PHPStorm as if it were local files.
=> Indexing, synching, and PHPStorm's VCS support became unusably slow. This is probably due to the high latency of file access.
PHPStorm offers the possibility to automatically copy the remote files to the local machine and then synching them when changes are made.
=> After the initial copying, this is fast. Unfortunately, with this setup, PHPStorm is unable to provide VCS support, which I use heavily.
Any ideas on this are greatly appreciated :)
I use PhpStorm in a setup very similar to your second approach (local copies, automatically synced changes) AND, importantly, with VCS support.
Ideal; easiest: In my experience the easiest solution is to check out/clone your VCS branch on your local machine and use your remote file system as a staging platform that remains ignorant of VCS; a plain file system.
Real world; remote VCS required: If, however (as in my case), it is necessary to have VCS on each system (perhaps your remote environment is the standard for your shop, or your shop's proprietary review/build tools are platform specific), then a slightly different remote setup is required; even so, treating your remote system as staging is still the best approach.
Example: Perforce - centralized VCS (client work-space)
In my experience, workspace-based VCS systems (e.g. Perforce) are best handled by sharing the same client workspace between local and remote systems, which has the benefit that VCS file status changes have to be applied only once. The disadvantage is that file system changes on the remote system typically must be handled manually. In my case I manually chmod (or OS equivalent) my remote files and wash my hands (problem solved). The alternative (dual-workspace) approach requires more moving parts, which I do not advise.
Example: Git - distributed VCS
The easier approach is certainly Git, which has its wonderful magic of detecting file changes without file permissions being directly coupled to the VCS. This makes life easy, as you can simply start with a common working branch and create two separate branches, "my-feature" and "my-feature-remote-proxy" for example. Once you decide to merge your changes upstream, you do so (ideally) from your local environment. The remote proxy branch can be reverted or whatever you want. NOTE: in the case of Git I always have two branches because it's easy. And when your hard drive melts in a freak lightning strike you have extra redundancy :|
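A rough sketch of that two-branch flow (branch and remote names are only illustrative, taken from the example above):
$ git checkout -b my-feature                 # local working branch
$ git checkout -b my-feature-remote-proxy    # branch mirrored onto the remote box
... work and commit on either side, syncing the proxy branch between the two clones ...
$ git checkout my-feature
$ git merge my-feature-remote-proxy          # fold remote-side changes back into the local branch
$ git push origin my-feature                 # publish upstream from the local environment
The proxy branch can then be reset or deleted without touching the history you publish.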
Hope this helps.

Local Source control repository - cross platform

I am looking for 'local' source control software; I don't need it to be available over a network. It's meant to be only for personal use.
What I am looking for is something like:
Need it to be cross-platform. The biggest problem is, I need the same local repository to be available on both Windows and Linux! (Is this even possible? :s ) I dual boot Windows 7 and Ubuntu and have managed to set up a workspace that works in both OSes without changes; now I need source control software!
Easy installation, I have never installed one before! :)
And has an Eclipse plugin.
I have used VSS for this purpose before, but that is only on Windows!
I looked at Mercurial, but I am not sure if I can use the same repository on both OSes!
Any suggestions are appreciated!
UPDATE: Thanks for your replies. Yes, I do want the same repository to be accessed from different operating systems. Everyone has suggested an online repository, but I 'need it to be local'. Internet is not something I can depend on (I now know Git takes care of this! :)), and I would not want versions of, say, my personal recordings of some home functions tweaked in Audacity, to be hosted online! Right now, I am trying out Git as a local repository solution.
If you definitely want a repository that's always available on a local filesystem, I'd probably go for Mercurial or Git. Most likely Mercurial, as it has the best windows support (including the TortoiseHg gui), but Git works similarly.
But there's two other issues:
Do you make frequent backups?
What file system type will you use for the shared repository?
In this particular case, I would not trust a single shared filesystem as the best basket to put your eggs in; in each boot environment, I would maintain working repositories separate from the shared one. This would give you some redundancy.
Here's how this would work:
Two repositories, U and W, for Ubuntu and Windows respectively, and one shared repository S, accessible from either boot environment.
Assuming a stable situation, with all three repositories in sync:
Commit any new code to repository U in Ubuntu.
$ hg commit -m 'changes from linux'
Push the changes to S.
$ hg push
Reboot into windows.
...
Pull the latest changesets from S into W
W> hg fetch
Update your code, commit frequently
Push prior to rebooting into linux
W> hg push
Reboot
And repeat step 4, but now from linux
$ hg fetch # performs an hg pull, followed by an update.
Rinse, lather, repeat.
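As a hedged sketch of the initial setup (the mount points and paths are made up; any partition both OSes can read will do):
$ hg init /mnt/shared/S                   # shared repository on the common partition
$ hg clone /mnt/shared/S ~/projects/U     # Ubuntu working repository
W> hg clone E:\shared\S C:\projects\W     # Windows working repository, same shared repo seen under a drive letter
Because each clone records its source as the default path in .hg/hgrc, the plain hg push / hg fetch commands in the steps above automatically talk to S.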
That said, with both Mercurial and Git you can synchronise your repositories across the net any time, so I would surely recommend you try that out some time.
And note: the best backup is having a copy of your data on a live file system on another computer, preferably at another location.
I'm pretty sure you can use Mercurial, since the whole repository is in the .hg folder.
Try TortoiseHG - it's easy to install and use.
Why do you want it to be local? The benefit of source control is that you can have multiple clients working on the same source without worrying too much about conflicts etc.
Even though it doesn't really answer your question, this advice might solve your problem:
Just create a project for yourself at https://github.com/ or http://sourceforge.net/ or any other free online repository hosting provider. SVN, CVS, and Git all come with excellent IDE integration, and clients run on almost all operating systems.
Hope this helps. Regards.
Do you really want to have a duplicate repository on different operating systems? That doesn't make sense to me. What would be the purpose of doing that?
I think you instead want to have a single repository that you can access from any operating system.
In this case, you can just install Subversion (or whatever source control system you prefer) on a server and access it from the operating systems you use. There are plenty of client tools for Mac/Windows/Linux that can talk to Subversion repositories, RapidSVN being free and cross-platform, for one.
If you don't have your own server, there are plenty of places online that will host Subversion for you.
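For reference, a minimal self-hosted Subversion setup might look like this (server name and paths are illustrative):
$ svnadmin create /srv/svn/myrepo                    # create the repository on the server
$ svnserve -d -r /srv/svn                            # serve everything under /srv/svn over svn://
$ svn checkout svn://myserver/myrepo ~/work/myrepo   # from any client OS
Hosted providers just take the first two steps off your hands.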

Version control with MVFS

Is there any version control system available with an MVFS-like virtual file system, other than ClearCase?
I can't find any.
Thanks,
Mart
No (not with read/write remote access).
MVFS (MultiVersion Filesystem) is about encapsulating the native filesystem to combine:
network access
with versioned files through dynamic views
To my knowledge, only ClearCase offers that (especially on that many platforms: Unix, Linux, Windows, HP).
Other VCSes offer read-only remote access, like gitfs and svnfs.
From "Filesystem Interface for the Git Version Control System" (pdf, from Reilly GRANT):
The Filesystem Interface to Git (known by the acronym “figfs”, pronounced like “figs”) allows developers to work with a project in a Git repository just like a local filesystem. This means that all the branches, tags, and revisions are available for browsing without having to check anything out.
The ability to access past revisions in a repository via the filesystem has been implemented before.
Gitfs and svnfs[12] (which is the same as gitfs except that it uses Subversion)
implement a read-only view of repository history.
The advantage of gitfs over svnfs is that Git is a distributed system and thus maintains a copy of the entire repository on the local machine, eliminating network lag when fetching revisions.
A commercial system, Rational ClearCase[9], offers a writable filesystem view of the repository, MVFS (MultiVersion File System), as an alternative to checking out files to the local filesystem. As with svnfs the performance of this system suffers from the need to query over the network for uncached file data.
Figfs eliminates this problem because a Git repository is stored entirely locally.
FYI, one of the nice things about ClearCase is that it monitors system calls to typical file operations and can determine your real dependencies in a build. This can be important when building complex systems. This capability has been added to GNU make (runs on *nix systems only though) in http://sourceforge.net/projects/posixamake/; the author's currently working on adding a derived object cache using MySQL.

Version control of uploaded images to file system

After reading Storing Images in DB - Yea or Nay? I think that the file system is the right place for storing images. But I would like to know how you handle backup/version control of uploaded images in your different environments (dev/stage/prod) and for network load balancing?
These problems are pretty easy to handle when working with a database, e.g. making a backup from the production environment and restoring the DB in the development environment.
What do you think of using, for example, Git to handle version control of uploaded files, e.g.:
Production Environment:
An image is uploaded to a shared folder on the web server.
Metadata is stored in the database.
The image is automatically added to a git repository
Developer at work:
Checks out the source code.
Runs a script to restore the database.
Runs a script to get the latest images.
I think the solution above is pretty smooth for the developer; the images will be under version control and the environments can be isolated from each other.
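The server-side part of that idea could be as small as a cron job or post-upload hook along these lines (the upload path is hypothetical, and the folder is assumed to already be a Git repository):
$ cd /var/www/shared/uploads
$ git add -A
$ git commit -m "auto-commit uploaded images"
The developer's "script to get the latest images" then reduces to a git pull against that repository.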
For us, the version control isn't as important as the distribution. Metadata is added via the web admin and the images are dropped on the admin server. Rsync scripts push those out to the cluster that serves prod images. For dev/test, we just rsync from the prod master server back to the dev server.
rsync is great for load balancing and distribution. If you sub in Git for the admin/master server, you have a pretty good solution.
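The push and pull scripts amount to little more than this (host names and paths are invented for the example):
$ rsync -az --delete /data/images/ web1:/data/images/     # repeat for each node in the prod cluster
$ rsync -az --delete /data/images/ web2:/data/images/
$ rsync -az prodmaster:/data/images/ /data/images/        # on a dev/test box, pull from the prod master
--delete keeps the cluster nodes as exact mirrors of the admin server.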
If you're OK with a backup that preserves file history at the time of backup (as opposed to version control with every revision), then some adaptation of this may help:
Automated Snapshot-style backups with rsync.
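The core of that approach is rsync's --link-dest option: unchanged files in a new snapshot are hard links into the previous one, so each snapshot looks complete while only changed images cost space. Roughly (paths are illustrative):
$ rm -rf /backups/images/daily.2
$ mv /backups/images/daily.1 /backups/images/daily.2
$ mv /backups/images/daily.0 /backups/images/daily.1
$ rsync -a --delete --link-dest=/backups/images/daily.1 /data/images/ /backups/images/daily.0/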
It can work, but I would store those images in a git repository which would then be a submodule of the git repo with the source code.
That way, a strong relationship exists between the code and the images, even though the images are in their own repo.
Plus, it avoids issues with git gc or git prune being less efficient with a large number of binary files: if the images are in their own repo, with few variations for each of them, the maintenance on that repo is fairly light, whereas the source code repo can evolve much more dynamically, with the usual git maintenance commands in play.
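A minimal sketch of that layout (repository URLs and paths are made up):
$ cd ~/projects/website                                          # the source code repo
$ git submodule add ssh://server/git/site-images.git images
$ git commit -m "track images repo as a submodule"
$ git clone --recurse-submodules ssh://server/git/website.git    # fresh checkouts pull both repos
The source repo then records which commit of the images repo it expects, which gives you the strong code-to-images relationship mentioned above.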