memory exhausted: for large files using diff

I am trying to create a patch from two large folders (~7 GB).
Here is how I'm doing it:
$ diff -Naurbw . ../other-folder > file.patch
But, perhaps due to the file sizes, the patch is not getting created and diff gives an error:
diff: memory exhausted
I tried making more than 15 GB of space available, but the issue still persists. Could someone help me out with the flags that I should use?

I recently came across this too, when I needed to diff two large files (>5 GB each).
I tried to use diff with different options, but even --speed-large-files had no effect. Other methods, like splitting the files into smaller ones, using xdelta, or sorting the files as per this suggestion, didn't help either. I even got my hands on a very powerful VM (>72 GB RAM), but still got this memory exhausted error.
I finally got it to work by adding the following parameter to sysctl.conf (sudo vim /etc/sysctl.conf):
vm.overcommit_memory=1
vm.overcommit_memory has three values (0,1,2) and sets the kernel virtual memory accounting mode. From the proc(5) man page:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
To make sure that the parameter is indeed applied you can run
sudo sysctl -p
Don't forget to change this parameter back when you finish!
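If you prefer not to edit sysctl.conf permanently for a one-off diff, the same setting can also be toggled at runtime with sysctl -w and reverted to the heuristic default (0) afterwards, for example:

# apply the overcommit setting for the running kernel only
sudo sysctl -w vm.overcommit_memory=1
# ... run the large diff ...
# revert to the default heuristic mode when done
sudo sysctl -w vm.overcommit_memory=0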

bsdiff is slow and requires a lot of memory, and xdelta creates large deltas for large files.
Try HDiffPatch for large files: https://github.com/sisong/HDiffPatch
It supports diffs between large binary files or directories;
it can run on Windows, macOS, Linux, and Android;
diff & patch both support running with limited memory.
Usage example:
Creating a patch: hdiffz -s-256 [-c-lzma2] old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path

Try sdiff. It's a pre-built tool in some Linux distributions.
sdiff a.txt b.txt --output=c.txt
will show the differences between the files and write the result to c.txt.
This worked perfectly for me.

Limit to number of files to cp in parallel

I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 4 GB JSON files (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it will print out to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer for one of them is complete, it says that it'll be copying the last file.
The result is that there is one file that only starts to copy when one of the others finishes copying, significantly slowing down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?
I was not able to find the .boto file on my Mac (as per jterrace's answer above), so instead I specified these values using the -o switch:
gsutil -m -o "Boto:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.
From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best value can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in copying 24 of your files first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
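For example (mirroring the -o syntax from the answer above and the original command; whether both keys take effect will depend on your gsutil/boto version), you could raise the values on the command line instead of editing .boto:

gsutil -m \
  -o "Boto:parallel_process_count=1" \
  -o "Boto:parallel_thread_count=25" \
  cp -z json -R dir_with_4g_chunks gs://my_bucket/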

flexible merge command for unison to pick newer or older file?

I've been using Unison as my file synchronizer of choice, and life has been great.
Essentially I could modify any files on any side at any time without ever worrying about which is master and which is slave, etc. It's bidirectional.
However, with four roots failing over to each other when each one's primary partner cannot be reached, I'm starting to push the limits of this tool. Conflicts arise that halt automatic syncing for the files involved. Aspects of my business logic are distributed across the different hosts, which sometimes modify the same files when run.
The merge option in the configuration file comes into play. It lets you specify different merge commands for different file types.
For example, for log files only, I like to merge their lines with:
merge = Name *.log -> diff3 -m CURRENT1 CURRENTARCH CURRENT2 > NEW || echo "differences detected"
Question: for *.last files only, what merge command would always favor the older copy?
For *.rb *.sh and other source files, I'm not looking to merge but always pick the newer version in case of conflicts. I can do that by default with the prefer = newer global option though.
For *.png files I typically prefer to keep the smaller(optimized) size.
Regarding the .rb and .sh files, you could use preferpartial = Name *.rb -> newer and the same for the .sh files. For the .last files, you can use older instead.
Regarding the .png files, you could write your own merge command that checks the sizes of both files. I would then set merge = Name *.png -> mycmp CURRENT1 CURRENT2 NEW, and have the mycmp command take three file paths, compare the sizes of the first two, and copy the smaller one to the third path.
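A minimal sketch of what such a mycmp helper could look like (mycmp is just the hypothetical name used above; the stat call assumes GNU coreutils, with a BSD/macOS fallback):

#!/bin/sh
# mycmp CURRENT1 CURRENT2 NEW
# Copy whichever of the first two files is smaller to the third path.
f1=$1; f2=$2; out=$3
s1=$(stat -c %s "$f1" 2>/dev/null || stat -f %z "$f1")
s2=$(stat -c %s "$f2" 2>/dev/null || stat -f %z "$f2")
if [ "$s1" -le "$s2" ]; then
    cp "$f1" "$out"
else
    cp "$f2" "$out"
fi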

Binary Delta Storage

I'm looking for a binary delta storage solution to version large binary files (digital audio workstation files).
When working with DAW files, the majority of changes, especially near the end of the mix, are very small in comparison to the huge amount of data used to store the raw data (waves).
It would be great to have a versioning system for our DAW files, allowing us to roll back to older versions.
The system would only save the difference between the binary files (diff) of each version. This would give us a list of instructions to change from the current version to the previous version, without storing the full file for every single version.
Are there any current versioning systems that do this? I've read that SVN uses binary diffs to save space in the repo... but I've also read that it doesn't actually do that for binary files, only text files... Not sure. Any ideas?
My plan of action as of right now is to continue researching preexisting tools, and if none exist, to get comfortable reading binary data in C/C++ and create the tool myself.
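For illustration, the kind of standalone delta described above can already be produced outside a version control system with a tool like xdelta3 (assuming its usual -e encode and -d decode flags; the file names are just placeholders):

# create a delta that turns mix_v1.daw into mix_v2.daw
xdelta3 -e -s mix_v1.daw mix_v2.daw v1_to_v2.xdelta
# later, reconstruct mix_v2.daw from mix_v1.daw plus the delta
xdelta3 -d -s mix_v1.daw v1_to_v2.xdelta mix_v2_restored.daw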
I can't comment on the reliability or connection issues that might exist when committing a large file across the network (one referenced post hinted at problems). But here is a little bit of empirical data that you may find useful (or not).
I have been doing some tests today studying disk seek times, so I had a reasonably good test case readily at hand. I found your question interesting, so I did a quick test with the files I am using/modifying. I created a local Subversion repository and added two binary files to it (sizes shown below) and then committed the files a couple of times after changes were made to them. The smaller binary file (0.85 GB) simply had data added to the end of it each time. The larger file (2.2 GB) contains data representing b-trees consisting of "random" integer data. The updates to that file between commits involved adding approximately 4000 new random values, so it would have modified nodes spread somewhat evenly throughout the file.
Here are the original file sizes along with the size/count of all files in the local subversion repository after the commit:
file1 851,271,675
file2 2,205,798,400
1,892,512,437 bytes in 32 files and 32 dirs
After the second commit:
file1 851,287,155
file2 2,207,569,920
1,894,211,472 bytes in 34 files and 32 dirs
After the third commit:
file1 851,308,845
file2 2,210,174,976
1,897,510,389 bytes in 36 files and 32 dirs
The commits were somewhat lengthy. I didn't pay close attention because I was doing other work, but I think each one took maybe 10 minutes. Checking out a specific revision took about 5 minutes. I would not make a recommendation one way or the other based on my results. All I can say is that it seemed to work fine and no errors occurred. And the file differencing seemed to work well (for these files).
Subversion might work, depending on your definition of large. This question/answer says that it works well as long as your files are less than 1 GB.
Subversion will perform binary deltas on binary files as well as text files. Subversion is just incapable of providing human-readable deltas for binary files, and cannot assist with merging conflicts in binary files.
git compresses (you may need to call git gc manually, though), and seemingly really well:
$ git init
$ dd if=/dev/urandom of=largefile bs=1M count=100
$ git add largefile
$ git commit -m 'first commit'
[master (root-commit) e474841] first commit
1 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 largefile
$ du -sh .
201M .
$ for i in $(seq 20); do date >> largefile; git commit -m "$i" -a; git gc; done
$ du -sh .
201M .

VMware Fusion: How do I combine multiple numbered vmdk files into a -flat.vmdk file?

I'm on Mac OS X 10.6.6 using VMware Fusion 3.1.2. I created a Windows 7 image, but when I examine the files that make up the image, there are 21 "extent" files -- e.g. files with names like
Windows 7-s001.vmdk
Windows 7-s002.vmdk
Windows 7-s003.vmdk
...
Ultimately I want to convert this to something that can be used by VirtualBox, and so to do that, I need to get a single vmdk (-flat.vmdk) file. Does anyone know how to generate a single file given the multiple files I have now?
Thanks, - Dave
Virtual Machine - Settings - Hard Disk -> Uncheck "Split into 2GB files" and press Apply :)
For those who (like me) may end up here looking for how to do this on ESXi (CLI): there is no vmware-vdiskmanager. Instead, use vmkfstools:
vmkfstools --clonevirtualdisk source.vmdk dest.vmdk
I have also had success doing this using the command line. It's kind of heavy lifting, though: one has to RTFM and search carefully to find the right incantations.
Look in
/Applications/VMware\ Fusion.app/Contents/Library/vmware-vdiskmanager
and see -r for the "convert" option.
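As a sketch of that incantation (assuming the documented -r convert and -t target-type options, where type 0 is a single growable file; the output name is just a placeholder):

/Applications/VMware\ Fusion.app/Contents/Library/vmware-vdiskmanager \
  -r "Windows 7.vmdk" -t 0 "Windows 7-single.vmdk"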

How can I resume downloads in Perl?

I have a project that depends upon some other binaries being downloaded from the web at install time. For this, what I do is:
if [ -e "src/$file" ]; then     # $file: placeholder for the binary's name
    : # skip that file
else
    # use wget to download the file; $url is a placeholder for its location
    wget -P src/ "$url"
fi
The problem with this approach is that when I interrupt a download in the middle and invoke the script the next time, the partially downloaded file is also skipped (which is not desired); I also want wget to resume the download of the partially downloaded file.
How should I go about it?
Possible solutions I could think of:
Let the file be downloaded to some temporary file, say download_tmp, and move it to the original file if successful.
Handle $SIG{'INT'} to write proper cleanup code.
But none of these could help resume the partial file download.
Any insights?
First, I don't understand what this has to do with Perl, since you're using wget to do the downloading... You could use libwww-perl (perldoc LWP) and have more control over the download process.
Then, I second your idea of downloading to a "tmp" filename and moving the file on success.
However, I think you need to go further and verify the integrity of the files. Computing an MD5 or SHA hash is very easy, and you can match the downloaded file's hash against what you're expecting. You can keep a short file on the server containing the checksum (filename.md5). Declare success only when you have a match.
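A minimal shell sketch of that tmp-directory plus checksum approach (the URL, file names, and paths are placeholders, and it assumes an md5sum-style filename.md5 published next to the binary):

url=http://example.com/tool.bin            # placeholder URL
tmpdir=src/.partial

mkdir -p "$tmpdir"
wget -c -P "$tmpdir" "$url"                # -c resumes a partial download into the temp dir
expected=$(wget -q -O - "$url.md5" | awk '{print $1}')   # expected hash published on the server
actual=$(md5sum "$tmpdir/tool.bin" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    mv "$tmpdir/tool.bin" src/tool.bin     # promote the file only after the checksum matches
else
    echo "checksum mismatch; keeping the partial file so the next run can resume" >&2
fi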
Note that catching all the signals and generally trying to make the process unkillable, and then expecting it to have worked, is bound to fail at one point or another. There could be a network timeout, a crash, a power failure, a configuration problem on the server... You should instead assume downloads can fail, because they will, and code so that your process can recover.
Finally, you're not telling us what kind of binaries you're downloading and what you're doing with them. Since you use wget, I'm going to assume you're on Unix; you should consider using RPM+Yum or the like, as they handle all this for you. RPMs are easy to write, really.
Use your first approach:
download to "FileName".tmp
move "FileName".tmp to "FileName" (move, not copy!)
once per diem, clean out all .tmp files (paranoia rulez)
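For that daily cleanup, something along these lines (src/ is a placeholder for wherever the temporary files live) could be run from cron:

# remove .tmp leftovers older than one day
find src/ -name '*.tmp' -mtime +1 -delete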
You could just use wget's -N and -c options and remove the entire "if file exists" logic.
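That is, something along these lines (the URL and target directory are placeholders): -c continues a partially downloaded file, and -N only re-downloads a file when the remote copy is newer than the local one:

wget -c -N -P src/ http://example.com/tool.bin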