Archiving succesive beta versions : how to save harddisk space? - diff

I archive successive versions of an in-progress work :
MySoftware-v1.01beta.rar [2 GB]
MySoftware-v1.02beta.rar [2 GB]
MySoftware-v1.03beta.rar [2 GB]
MySoftware-v1.04beta.rar [2 GB]
etc.
Lots of files are modified, so it's not possible to backup only modified files : most of the files are modified each time.
How can do a .rar file that only saves the "difference" (should I use something like "patch" or "diff" ? -> I never used them). There are lots of "difference" tool, okay, but the result file won't be a .rar, it will only be a "difference file" : so each time I would like to re-open such an archive, I'll have to "de-diff" it and only THEN I will have a .rar again.
I'm on Windows, and if possible, I'd like to use winrar or command-line tool (it would be great if no third party software is needed).
Thanks a lot in advance!

You say 90% of your product is .wav files. Since diff on two wav files that are different is likely to produce huge differences, this is not likely to save you any space. Nor are .wav files really compressible, so zip or rar likely doesn't help much, either.
However, if, like most of us programmers, you derive your next version of the product from the previous one, by mostly retaining files unchanged (whether that be source or be .wav files), then what you really want to do is simply store, for each version, the files that changed. This is called "de-duplication" in the backup/compression world.
You can organize a complicated scheme your self to do this. (e.g., your self-suggested "do this with winrar"). But if you use a decent "source control system" (SVN or GIT would be fine), this will happen automatically as you checkin changed (and don't re-checkin unchanged) files. These tools work by keeping track of "differences" between versions; you can tell the tools to track text ("diff") style differences, or simply store the entire thing.
Also, since your individual versions occupy 2GB, I'd go waste $100 on a 2 or 4 terabyte (external) drive. That should last you in worst case through some 1000 iterations. (SVN/GIT will likely extended this a lot further).

You should really be using a source control system. A popular one is called 'git'. There are many others, each with their own strengths and weaknesses and the debate about which is 'best' is long and tedious.
Source control systems take care of storing and managing revisions of your files. The actual methods vary, but as a programmer who uses version control you 'check in' files for storage and version control, 'tag' them with revision numbers and then 'check out' files for modifying.
If you've ever downloaded source off the Internet using 'svn' or 'cvs', that's the type of thing I mean.
The source control system usually uses some sort of difference system to only store differences between modified files. Its purpose is to save you from having to even think about copying and backing up files - all you have to do is ensure your 'repository' is backed up correctly.
Also, as an added advantage you can make changes to source files and always have backups in case your changes need reverting. So suppose you want to try out a new file handling system you can use the source control system to create a testing (or whatever you want to call it) 'branch' and do all your changes in there without damaging a working copy of your software. If the changes are good you can then 'merge' the changes into the non testing branch of your repository.

Related

Search code history for closest matching version based on content

I have a file that was forked from a project at an unknown moment in the past. I want to identify as closely as possible the moment of that fork. The file has been changed since the fork-moment.
Winmerge highlights about about 20% of the lines, with about half of those being just a few characters within the line, a path change or inline function turned into a variable or function call for instance. (20% after ignoring whitespace change and enabling moved-block detection that is, closer to ~40% without that.)
I don't have to worry about branches, the original version control system was CVS. (I don't have access to the CVS file system). I have a git imported version with tags corresponding to the CVS commits, and could generate the same with Mercurial for little effort if need be.
I don't care about matching the specific CSV commit date/time/number/whatever. The goal is to identify when the content of new file started drifting, and step forward through the revision history, cherry picking what to merge to the forked file.
For this project I could brute force it, there only a dozen or so revisions where the fork has mostly likely occurred and the file is less than 500 lines. However it's not hard to imagine a scenario where this is not feasible and I'm curious about what an elegant solution might be.
How would you go about solving this?
"Brute force" sounds as if you were contemplating testing all revisions. Normally one would use a binary search. To decide if it was a good match, I'd normally use just the numbers from diffstat (since you say there are post-fork changes). Accounting for block-moves complicates things, though.

tracking file-name version control in a real version control system?

Is there an established method to tell the SCM, mercurial in my case, that files of the pattern foobaz_1_2_3.csv should all be considered versions of foobaz.csv ?
In my application I rely on data tables from an external source that put the version number in the filename. The importance of tracking changes across their versions was made painfully sharp recently when I spend days troubleshooting a bug on my side of the fence, only to discover it was because they changed some data content and notification of said change did not reach me.
If the filename was constant Hg would have informed me immediately of the internal change and I could have responded appropriately in an hour or two, with very little stress. I could just adopt the habit of renaming foobaz_2_3_4 to foobaz myself before checking in, or running diff old new and one or both of those is likely what I'll do from now on.
The whole experience has me wondering though if there might be other methods I've not thought of that don't mess with the external file. (for example what of I have a downstream user who doesn't use SCM and relies on the filename+version number, which I've thrown away?)
If you get data in file with permanently changed name and (possibly) changeable data, you can:
Store data-file under version-control (mercurial is OK)
replace old file with new every time
hg addremove -s nnn (Check Manual hg help addremove) will detect possible rename and include new file in history of old

Multiple repositories in one directory (same level) - is it possible?

My original problem is that I have a directory where I write various scripts. Each of them is independent of others, and usually one-file-long. I want to have some versioning applied to them, but I have the following problems/requirements:
I don't want to have to store each small script in a separate directory!
I don't want to store them all in one repository OTOH, as they are completely unrelated, and:
some of them may later grow to more files (and then they will need a separate dir),
I sometimes want to copy one of them to a different machine (and I want to clone the whole repo).
I want to benefit from (distributed) version control mechanisms -- at least:
"infinite" number of revisions,
ability to clone repositories on different computers,
ability to do "atomic" multi-file commits.
Is it possible?
I'd prefer to do it in some mainstream distributed VCS (a solution using Mercurial would be preferable, but I'm not fixed).
EDIT: the solution has to be free (at least "as in beer") and cross-platform (at least Win32 & Linux).
Related, but didn't help:
"two-git-repositories-in-one-directory" -- didn't find it helpful: the accepted answer looks like point 2. (above) to me; the current "community voted" answer sounds like 1.
"Version control of single files using Subversion" -- also too much of 2. or 1.
These requirements seem pretty "special" to me, so here is a solution on par with them ^^
You may use two completely different VCS, in the same directory. Even two "instances" of SVN might work: SVN stores its metadata in a directory called .SVN and has (for historical reasons regarding ASP) the option to use _SVN. The Directory listing should look like this
.SVN // Metadata for rep1
_SVN // Metadata for rep2
script1 // in rep1
script2 // in rep2
...
Of course, you will need to hide or ignore the foreign scripts or folders from each VCS...
Added:
This only accounts for two scripts in one folder and needs one additional VCS per script beyond that, so if you even consider this route and need more repositories, rename each Metadir and use a script to rename it back before updating:
MOVE .SVN-script1 .SVN
svn update
MOVE .SVN .SVN-script1
Why don't you simply create a separate branch (in the git sense) for each (group of) script(s)?
You can develop them individually as you please. Switching to a branch will show you only the scripts from that branch. It's sort of like directories but managed by the version control system. If you later want to pluck a branch out into another repository, you can do that and if you want to combine two scripts into a single project, you can do that as well. The copying them to the different machine point might be a problem but you can clone the branch you're interested in and you it should work for you.
Another proposition for my own consideration is "Using Convert to Decompose Your Repository" article on hgtip.com. It fails as a "standalone" solution, but could be helpful as an addition to the "mv .hgN .hg / MOVE .SVN-script1 .SVN" idea.
You can create multiple hidden repository directories and symlink .hg to whichever one you want to be active. So if you have two repositories, create directories for them:
.hg_production
.hg_staging
Then to activate either of them just do:
ln -sf .hg_production .hg
You could easily create a bash command to do this. So instead you could write something like activate-repo production, which would run ln -sf .hg_production .hg.
Note: Mac doesn't seem to support ln -sf so instead you'll need to do:
rm .hg; ln -s .hg_production .hg
I can only think of these two lightweight versioning systems:
1) Using Dropbox with the Pack-Rat upgrade, to keep a full history of versions for each file automatically backed up and with the possibility to be shared with multiple Dropbox users: https://www.dropbox.com/help/113
If you have multiple machines managed by the same user (you), the synching would be automatic. Also if the machines are in the same LAN, Dropbox is smart enough to sync the files over the local network, so big files shouldn't be a worry.
2) Using a 'Versions' aware text editor for Mac OS X Lion. I'd expect TextMate, Coda and other popular Mac code editors to be updated to support this feature when Lion is released.
How about a compromise between 1 and 2? Instead of a folder+repo for each script, can you bundle them into loosely related groups, such as "database", "backup", etc. and then make one folder+repo for each group? Then if you clone a repo on another machine, you're only pulling down a smaller number of unrelated files. (Is the bandwidth/drivespace really a concern?) To me, this sounds WAAAY simpler than all of the other suggestions so far.
(Technically this approach meets your requirements because (1) each script isn't in its own directory, (2) not all scripts are in the same repository, and (3) you can easily do this with any popular DVCS. :D)
UPDATE (2016): Apparently, a guy named Cosmin Apreutesei created a tool named multigit, which seems to implement what I wished for in this question! If you ever read it, thanks a lot Cosmin! I've started using your tool this year and find it awesome.
I'm starting to think of some kind of an overlay over Mercurial/git/... which would keep a couple "disabled" repository meta-directories, let's say:
.hg1/
.hg2/
.hg3/
etc., and then on hg commit FILENAME would find the particular .hgN that is linked to FILENAME, and would then temporarily:
mv .hgN .hg
hg commit FILENAME
mv .hg .hgN
The main disadvantage is that it would require me to spend some time writing the tool. Or does anybody know of some ready-made one like this? If you do, please post as a full-featured answer (not a comment), I'm more than willing to accept it.

How to join two files in a version control system

I am doing a refactoring of my C++ project containing many source files.
The current refactoring step includes joining two files (say, x.cpp and y.cpp) into a bigger one (say, xy.cpp) with some code being thrown out, and some more code added to it.
I would like to tell my version control system (Perforce, in my case) that the resulting file is based on two previous files, so in future, when i look at the revision history of xy.cpp, i also see all the changes ever done to x.cpp and y.cpp.
Perforce supports renaming files, so if y.cpp didn't exist i would know exactly what to do. Perforce also supports merging, so if i had 2 different versions of xy.cpp it could create one version from it. From this, i figure out that joining two different files is possible (not sure about it); however, i searched through some documentation on Perforce and other source control systems and didn't find anything useful.
Is what i am trying to do possible at all?
Does it have a conventional name (searching the documentation on "merging" or "joining" was unsuccessful)?
You could try integrating with baseless merges (-i on the command line). If I understand the documentation correctly (and I've never used it myself), this will force the integration of two files. You would then need to resolve the integration however you choose, resulting in something close to the file you are envisioning.
After doing this, I assume the Perforce history would show the integration from the unrelated file in it's integration history, allowing you to track back to that file when desired.
I don't think it can be done in a classic VCS.
Those versioning systems come in two flavors (slide 50+ of Getting git by Scott Chacon):
delta-based history: you take one file, and record its delta. In this case, the unit being the file, you cannot associate its history with another file.
DAG-based history: you take one content and record its patches. In this case, the file itself can vary (it can be renamed/moved at will), and it can be the result of two other contents (so it is close of what you want)... but still within the history of one file (the contents coming from different branches of its DAG).
The easy part would be this:
p4 edit x.cpp y.cpp
p4 move x.cpp xy.cpp
p4 move y.cpp xy.cpp
Then the tricky part becomes resolving the move of y.cpp and doing your refactoring. But this will tell Perforce that the files are combined.

Sync (Diff) of the remote binary file

Usually both files are availble for running some diff tool but I need to find the differences in 2 binary files when one of them resides in the server and another is in the mobile device. Then only the different parts can be sent to the server and file updated.
There is the bsdiff tool. Debian has a bsdiff package, too, and there are high-level programming language interfaces like python-bsdiff.
I think that a jailbreaked iPhone, Android or similar mobile device can run bsdiff, but maybe you have to compile the software yourself.
But note! If you use the binary diff only to decide which part of the file to update, better use rsync. rsync has a built-in binary diff algorithm.
You're probably using the name generically, because diff expects its arguments to be text files.
If given binary files, it can only say they're different, not what the differences are.
But you need to update only the modified parts of binary files.
This is how the Open Source program called Rsync works, but I'm not aware of any version running on mobile devices.
To find the differences, you must compare. If you cannot compare, you cannot compute the minimal differences.
What kind of changes do you do to the local file?
Inserts?
Deletions?
Updates?
If only updates, ie. the size and location of unchanged data is constant, then a block-type checksum solution might work, where you split the file up into blocks, compute the checksum of each, and compare with a list of previous checksums. Then you only have to send the modified blocks.
Also, if possible, you could store two versions of the file locally, the old and modified.
Sounds like a job for rsync. See also librsync and pyrsync.
Cool thing about the rsync algorithm is that you don't need both files to be accessible on the same machine.