Binary Delta Storage

I'm looking for a binary delta storage solution to version large binary files (digital audio workstation files).
When working with DAW files, the majority of changes, especially near the end of a mix, are very small compared to the huge amount of data used to store the raw audio (waveforms).
It would be great to have a versioning system for our DAW files, allowing us to roll back to older versions.
The system would only save the difference between the binary files (diff) of each version. This would give us a list of instructions to change from the current version to the previous version without storing the full file for every single version.
Are there any current versioning systems that do this? I've read that SVN uses binary diffs to save space in the repo... but I've also read that it only actually does that for text files, not binary files... I'm not sure. Any ideas?
My plan of action right now is to continue researching preexisting tools, and if none exist, to get comfortable enough with reading binary data in C/C++ to create the tool myself.
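For what it's worth, the delta round trip described above can be tried today with a standalone tool such as xdelta3, independently of any version-control system. A minimal sketch, assuming xdelta3 is installed (the file names are hypothetical):
# store only the instructions that turn version 1 into version 2
xdelta3 -e -s mix_v1.daw mix_v2.daw v1_to_v2.xdelta
# later, rebuild version 2 from version 1 plus the delta
xdelta3 -d -s mix_v1.daw v1_to_v2.xdelta mix_v2_rebuilt.daw
For small edits to a large project file, the delta is typically a tiny fraction of the full file size.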

I can't comment on the reliability or connection issues that might exist when committing a large file across the network (one referenced post hinted at problems). But here is a little bit of empirical data that you may find useful (or not).
I have been doing some tests today studying disk seek times, so I had a reasonably good test case readily at hand. I found your question interesting, so I did a quick test with the files I am using/modifying. I created a local Subversion repository, added two binary files to it (sizes shown below), and then committed the files a couple of times after changes were made to them. The smaller binary file (0.85 GB) simply had data added to the end of it each time. The larger file (2.2 GB) contains data representing b-trees of "random" integer data. The updates to that file between commits involved adding approximately 4000 new random values, so the modified nodes were spread fairly evenly throughout the file.
Here are the original file sizes along with the size/count of all files in the local Subversion repository after the first commit:
file1 851,271,675
file2 2,205,798,400
1,892,512,437 bytes in 32 files and 32 dirs
After the second commit:
file1 851,287,155
file2 2,207,569,920
1,894,211,472 bytes in 34 files and 32 dirs
After the third commit:
file1 851,308,845
file2 2,210,174,976
1,897,510,389 bytes in 36 files and 32 dirs
The commits were somewhat lengthy. I didn't pay close attention because I was doing other work, but I think each one took maybe 10 minutes. Checking out a specific revision took about 5 minutes. I would not make a recommendation one way or the other based on my results. All I can say is that it seemed to work fine, no errors occurred, and the file differencing seemed to work well (for these files).

Subversion might work, depending on your definition of large. This question/answer says that it works well as long as your files are less than 1 GB.

Subversion will perform binary deltas on binary files as well as text files. Subversion is just incapable of providing human-readable deltas for binary files, and cannot assist with merging conflicts in binary files.

git compresses its object store (you may need to call git gc manually, though), and seemingly really well:
$ git init
$ dd if=/dev/urandom of=largefile bs=1M count=100
$ git add largefile
$ git commit -m 'first commit'
[master (root-commit) e474841] first commit
1 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 largefile
$ du -sh .
201M .
$ for i in $(seq 20); do date >> largefile; git commit -m "$i" -a; git gc; done
$ du -sh .
201M .

The total stays at 201M: roughly the 100 MB working file plus its (incompressible) packed copy inside .git; the 20 appended-date revisions delta away to almost nothing.

How can I see my place in history during an 'hg rebase' conflict?

Whenever I perform an hg rebase and there are merge conflicts, it immediately pulls up an editor for me to resolve the conflict. However, it doesn't give me any information about where I am in the rebase process. For example, if my history looks as follows:
o  12
|
o  11
|
| o  10
|/
o  9
Performing hg rebase -s 11 -d 10 may have a conflict trying to apply either 11 or 12. It is difficult to tell at a glance just from the merge conflict where I am stopped, especially when the graph is larger than this. How can I tell where in the rebase process the conflict is?
Very recent Mercurials have two configuration options: [ui] mergemarkertemplate, and [ui] pre-merge-tool-output-template, which can be used to improve this situation a bit.
pre-merge-tool-output-template
pre-merge-tool-output-template is printed before running any external merge tool, so it can be used to print something before your editor or kdiff3 pops up. Note that if you use a terminal-based merge tool (most editors, unless you run the GUI version), the output will likely be hidden by the merge tool itself; depending on your OS and the program you're using, you may be able to hit Ctrl-Z to suspend the merge tool and see this output.
Example output:
merging path/to/file
Running merge tool for path/to/file (/usr/bin/vim):
- local (working copy): 10:2d1f533d add binary file (#2) tip default
- base (base): 6:abcd1234 some other description default
- other (merge rev): 9:1e7ad7d7 add binary file (#1) default
... vim runs here ...
See https://www.mercurial-scm.org/repo/hg/file/14589f1989e9/tests/test-merge-tools.t#l1956 for the template that produced that output, hg help config.ui.pre-merge-tool-output-template and hg help templates for more information on that.
mergemarkertemplate
mergemarkertemplate controls the conflict markers you see in your editor. Set [ui] mergemarkers=detailed and see if this is sufficient; if not, you can use [ui] mergemarkertemplate to customize it; this can also be customized on a per-merge-tool basis, so see hg help config.ui.mergemarkers, hg help config.ui.mergemarkertemplate, and hg help config.merge-tools.
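A minimal sketch of the relevant ~/.hgrc settings (the custom template line is illustrative; see hg help templates for the available keywords):
[ui]
mergemarkers = detailed
# illustrative marker template, producing e.g. "11:2d1f533d add binary file (#2)"
mergemarkertemplate = {rev}:{node|short} {desc|firstline}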
Programs with customizable labels
Merge tools like kdiff3 often have the ability to customize the labels. In the default configuration, these should be the operation-provided names for base/local/other (in my example above: base, working copy, and merge rev, respectively). I believe that if you have [ui] mergemarkers=detailed or [merge-tools] kdiff3.mergemarkers=detailed, these will include additional information. See hg help config.merge-tools for more information on the per-merge-tool configuration options.
(Not an answer, exactly, but a bit long for a comment...)
When you write:
Performing hg rebase -s 11 -d 10 may have a conflict trying to apply
either 10 or 11.
do you mean to write
may have a conflict trying to apply either 11 or 12
? Because you are trying to rebase those csets to 10, so it doesn't make sense to talk about applying 10. Also, consider using the Evolve extension if you aren't already. It makes everything append-only, which is much better.
Also, test in a clone. And also, try rebasing 12 first, if possible. Anyway, Mercurial is just trying to rebase the changes from both 11 and 12, and I don't think it distinguishes between those changes. And why would you expect it to? Isn't it obvious to you which changes belong to which cset?
Also, consider configuring your merge setup for use with kdiff3, if you aren't already. It makes things much clearer to do things in a merge editor, and you can also see both sides of the merge clearly. See
https://www.mercurial-scm.org/wiki/MergeToolConfiguration and https://www.mercurial-scm.org/wiki/KDiff3
Personally, I have the following lines in my ~/.hgrc, but they've been there a long time, and I don't remember where I got them from. Also, I don't do merges much these days. But for whatever it is worth...
[merge-tools]
kdiff3.args=--auto --L1 base --L2 local --L3 other $base $local $other -o $output
kdiff3.regkey=Software\KDiff3
kdiff3.regappend=\kdiff3.exe
kdiff3.fixeol=True
kdiff3.gui=True
kdiff3.diffargs=--L1 '$plabel1' --L2 '$clabel' $parent $child
Hope that helps.

Creating a script that compares multiple files across multiple servers

I have several different Linux servers, all of which are essentially mirrors of each other. However, some of them have gone out of sync (a given file on machine 1 differs from its counterpart on machine 2).
I'm in the process of designing a script (shell or Perl only) that will systematically walk through certain directories and diff the corresponding files in the different machines against each other, and generate a meaningful report. Later on, I will try to sync up the files.
These are my thoughts so far on how to approach this:
sftp files to /tmp and diff locally
using ssh and diff
using rsync
My question is: what is the best way to systematically compare two files that are in different machines (but similar directory structure), and are there any built-in Perl utilities that may be helpful?
rsync will figure out the differences and sync your files by sending only the deltas. Once two folders are in sync, it will be pretty quick. (The first sync will take some time, though.)
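For the "meaningful report" part of the question, an rsync dry run with itemized changes lists what differs without copying anything (host and paths are placeholders):
rsync -avn --itemize-changes user@machine2:/path/to/dir/ /path/to/dir/
Lines beginning with >f indicate files whose contents differ and would be transferred.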
You can also use git here. One possible workflow: check in all the files you want to compare (or complete directories, using git add -A). Then create an empty git repository on your local workstation, which is used to fetch all the other repositories and to do the comparisons:
git init
git remote add firstmachine ssh://user@firstmachine/path/to/directory
git remote add othermachine ssh://user@othermachine/path/to/directory
git fetch --all
Now the contents of two machines may be compared:
git diff remotes/firstmachine/master remotes/othermachine/master
Or just compare the contents of a specific file:
git diff remotes/firstmachine/master remotes/othermachine/master -- file/to/compare
It's not strictly necessary to use a third machine for the comparisons. You can also git-fetch the contents from othermachine to firstmachine.
I had worked on a similar tool (in Python). It ran as a cron job at a given time of night, brought tar-bzipped files to one server, extracted the directories, and ran a recursive diff on them. The diff output was then run through some Python scripts that analysed the diff hunks (+ lines / ! lines, etc.) to gauge the amount of change.
I'm not sure if there are pre-built modules for this in Perl or Python, but some helper utilities are likely available in one of them.
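As a rough shell equivalent of that hunk analysis, you can count added and removed lines between two directory trees (directory names are placeholders):
diff -ru dir_a dir_b | awk '/^\+[^+]/ {a++} /^-[^-]/ {d++} END {print a+0, "lines added,", d+0, "lines removed"}'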
If you need to know the difference between some local and remote file systems, the following method minimizes the network load:
make a local copy ($C) of the local directory ($D) you want to compare. I.e.:
cp -R $D $C
use rsync to copy the remote directory ($R) you want to compare over $C:
rsync -av --delete $remote_host:$R $C
compare $D to $C:
diff -u $D $C
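Put together as a runnable sketch (paths and host name are placeholders; note the trailing slashes, which make rsync overlay $C rather than nest a new directory inside it):
#!/bin/sh
D=/srv/data             # local directory to compare (placeholder)
C=/tmp/remote_mirror    # scratch copy of $D
R=/srv/data             # corresponding directory on the remote host (placeholder)
remote_host=machine2

cp -R "$D" "$C"
rsync -av --delete "$remote_host:$R/" "$C/"
diff -ru "$D" "$C"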

Limit to number of files to cp in parallel

I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 4 GB JSON files (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it prints to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer of one of them is complete, it says that it will copy the last file.
The result is that one file only starts to copy when one of the others finishes, which significantly slows down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?
I was not able to find the .boto file on my Mac (referred to in jterrace's answer below), so instead I specified these values using the -o switch:
gsutil -m -o "Boto:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.
From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best values can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in 24 of your files being copied first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
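For example, you could uncomment those two lines in your .boto file and raise the thread count so that all 25 files start at once (illustrative values; per the generated comment above, check the linked issue before tuning these on macOS or Windows):
parallel_process_count = 1
parallel_thread_count = 25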

memory exhausted: for large files using diff

I am trying to create a patch using two large folders (~7 GB).
Here is how I'm doing it:
$ diff -Naurbw . ../other-folder > file.patch
But, maybe due to the file sizes, the patch is not getting created and diff gives an error:
diff: memory exhausted
I tried making more than 15 GB of space available, but the issue still persists. Could someone help me out with the flags that I should use?
Recently I came across this too, when I needed to diff two large files (>5 GB each).
I tried to use diff with different options, but even --speed-large-files had no effect. Other methods, like splitting the files into smaller ones, using xdelta, or sorting the files as per this suggestion, didn't help either. I even got my hands on a very powerful VM (>72 GB RAM), but still got this memory exhausted error.
I finally got it to work by adding the following parameter to sysctl.conf (sudo vim /etc/sysctl.conf):
vm.overcommit_memory=1
vm.overcommit_memory has three values (0,1,2) and sets the kernel virtual memory accounting mode. From the proc(5) man page:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
To make sure that the parameter is indeed applied you can run
sudo sysctl -p
Don't forget to change this parameter back when you finish!
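Alternatively, you can change the setting for the running kernel only, without editing sysctl.conf, so it resets at the next reboot:
sudo sysctl -w vm.overcommit_memory=1
# ... run the diff ...
sudo sysctl -w vm.overcommit_memory=0   # restore the heuristic default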
bsdiff is slow and requires a lot of memory, and xdelta creates large deltas for large files.
Try HDiffPatch for large files: https://github.com/sisong/HDiffPatch
It supports diffing between large binary files or directories;
it runs on Windows, macOS, Linux, and Android;
and both diff and patch can run with limited memory.
Usage example:
Creating a patch: hdiffz -s-256 [-c-lzma2] old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path
Try sdiff. It comes preinstalled in some Linux distributions.
sdiff a.txt b.txt --output=c.txt
will compare the files side by side and let you merge the differences into c.txt.
This worked perfectly for me.

flexible merge command for unison to pick newer or older file?

I've been using unison as my file synchronizer of choice and life has been great.
Essentially, I could modify any files on any side at any time without ever worrying about which is master and which is slave, etc. It's bidirectional.
However, with four roots failing over to each other when their primary partner cannot be reached, I'm starting to push the limits of this tool. Conflicts arise that halt automatic syncing for the files involved. Parts of my business logic are distributed across the different hosts, which sometimes modify the same files when run.
The merge option in the configuration file comes into play. It lets you specify different merge commands for different file types.
For example for log files only I like to interpolate their lines with:
merge = Name *.log -> diff3 -m CURRENT1 CURRENTARCH CURRENT2 > NEW || echo "differences detected"
Question: for *.last files only, what merge command would always favor the older copy?
For *.rb *.sh and other source files, I'm not looking to merge but always pick the newer version in case of conflicts. I can do that by default with the prefer = newer global option though.
For *.png files I typically prefer to keep the smaller (optimized) size.
Regarding the .rb and .sh files, you could use preferpartial = Name *.rb -> newer, and the same for .sh files. For .last files, you can use older instead.
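In other words, in your Unison profile (a sketch using the preferences above):
preferpartial = Name *.rb -> newer
preferpartial = Name *.sh -> newer
preferpartial = Name *.last -> older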
Regarding .png files, you could write your own merge command that checks the sizes of both files. I would then set merge = Name *.png -> mycmp CURRENT1 CURRENT2 NEW, and have the mycmp command take three file paths, compare the sizes of the first two, and copy the smaller one to the third path.
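A minimal sketch of such a mycmp script (mycmp is the placeholder name from above; error handling omitted):
#!/bin/sh
# mycmp CURRENT1 CURRENT2 NEW: copy the smaller of the two conflicting files to NEW
if [ "$(wc -c < "$1")" -le "$(wc -c < "$2")" ]; then
    cp "$1" "$3"
else
    cp "$2" "$3"
fi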