How to create an identical gzip of the same file? - version-control

I have a file whose contents are identical from one run to the next. It is passed into gzip and only the compressed form is stored. I'd like to be able to regenerate the gzip and update my copy only when the contents actually differ. As it stands, diffing tools (diff, xdelta, subversion) see the compressed files as having changed.
Background: I'm storing a mysqldump of an important database in a Subversion repository. The intention is for a cron job to periodically dump the database, gzip it, and commit the file. Currently, every time the file is dumped and gzipped it is treated as having changed. I'd prefer not to have my revision numbers needlessly increase every 15 minutes.
I realize I could store the dump as plain text, but I'd prefer not to, as it's rather large.
The command I am currently using to generate the dumps is:
mysqldump $DB --skip-extended-insert | sed '$d' | gzip -n > $REPO/$DB.sql.gz
The -n instructs gzip not to store the original filename and timestamp in the archive header. The sed '$d' removes the last line of the file, where mysqldump places a timestamp.
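For example, a quick way to check that repeated runs produce byte-identical archives (a minimal sketch; the /tmp paths are placeholders, and it assumes the database content does not change between the two runs):
mysqldump $DB --skip-extended-insert | sed '$d' | gzip -n > /tmp/$DB.first.sql.gz
mysqldump $DB --skip-extended-insert | sed '$d' | gzip -n > /tmp/$DB.second.sql.gz
# with -n the two archives should be byte-for-byte identical
cmp /tmp/$DB.first.sql.gz /tmp/$DB.second.sql.gz && echo "identical"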
At this point, I'm probably going to revert to storing it in a plain text fashion, but I was curious as to what kind of solution there is.
Resolved: Mr. Bright was correct; I had mistakenly used a capital N when the correct argument was a lowercase one.

"The -N instructs gzip to remove the filename/timestamp information."
Actually, that does just the opposite. -n is what tells it to forget the original file name and time stamp.

I think gzip is preserving the original date and timestamp on the file(s), which will cause it to produce a different archive.
-N --name
    When compressing, always save the original file name and time stamp; this is the default. When decompressing, restore the original file name and time stamp if present. This option is useful on systems which have a limit on file name length or when the time stamp has been lost after a file transfer.

But watch out: two gzips of the same unchanged file, made at different times, can still differ. This is because the gzip archive is itself timestamped with its creation date, which is written into the gzip header; -n suppresses that header timestamp as well. Thus two apparently different gzips can contain exactly the same content.
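If you need to confirm that two archives hold the same data regardless of their headers, compare the decompressed streams rather than the .gz files themselves (the file names here are placeholders):
zcmp first.sql.gz second.sql.gz && echo "same uncompressed content"
# equivalent without zcmp:
cmp <(zcat first.sql.gz) <(zcat second.sql.gz)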

Related

How to merge gz files from a postgres dump into one big file?

There is a folder with postgres dump files like:
0001.dat.gz
0002.dat.gz
...
6000.dat.gz
toc.dat
How do I merge all these files into a single gz archive that postgres recognizes during restore?
So it looks like you have the directory format. pg_restore will recognize that format. I don't think there is any supported way to convert it to one of the other formats. You can tar it up into a single file, but you will have to untar it before restoring. Next time you run pg_dump, tell it to use the format you want.
There are subtle differences in the toc.dat file between the directory format and the tar format, so if you just uncompress and then tar up the directory, it will not work (at least in my hands). It does work the other way around, however.
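For instance, a minimal sketch (the database name and file names are placeholders) that asks pg_dump for a single-file archive up front, so no merging is needed afterwards:
# custom format: one compressed file that pg_restore reads directly
pg_dump --format=custom --file=mydb.dump mydb
pg_restore --dbname=mydb mydb.dump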

Google Cloud Storage : What is the easiest way to update timestamp of all files under all subfolders

I have date-wise folders of the form root-dir/yyyy/mm/dd,
under which many files are present.
I want to update the timestamp of all files falling under a certain date range,
for example 2 weeks, i.e. 14 folders, so that these files can be picked up by my file-streaming data ingestion process.
What is the easiest way to achieve this?
Is there a way in the UI console, or is it through gsutil?
Please help.
GCS objects are immutable, so the only way to "update" the timestamp would be to copy each object on top of itself, e.g., using:
gsutil cp gs://your-bucket/object1 gs://your-bucket/object1
(and looping over all objects you want to do this to).
This is a fast (metadata-only) operation, which will create a new generation of each object, with a current timestamp.
Note that if you have versioning enabled on the bucket, doing this will create an extra version of each object you copy this way.
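For a two-week range this could be scripted, for example (a rough sketch; the bucket name, root-dir prefix, and dates are hypothetical, and the * wildcard only matches objects directly under each day's prefix):
BKT=my-bucket
for day in $(seq -w 1 14); do
    # copying each object onto itself gives it a new generation and a current timestamp
    gsutil -m cp "gs://$BKT/root-dir/2016/12/$day/*" "gs://$BKT/root-dir/2016/12/$day/"
done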
When you say "folders in the form of root-dir/yyyy/mm/dd", do you mean that you're copying those objects into your bucket with names like gs://my-bucket/root-dir/2016/12/25/christmas.jpg? If not, see Mike's answer; but if they are named with that pattern and you just want to rename them, you could use gsutil's mv command to rename every object with that prefix:
$ export BKT=my-bucket
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/01/15/file1.txt
gs://my-bucket/2016/01/15/some/file.txt
gs://my-bucket/2016/01/15/yet/another-file.txt
$ gsutil -m mv gs://$BKT/2016/01/15 gs://$BKT/2016/06/20
[...]
Operation completed over 3 objects/12.0 B.
# We can see that the prefixes changed from 2016/01/15 to 2016/06/20
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/06/20/file1.txt
gs://my-bucket/2016/06/20/some/file.txt
gs://my-bucket/2016/06/20/yet/another-file.txt

Can we wget with file list and renaming destination files?

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will avoid the generation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nicely, no problem.
Now, to the issue:
Let's say my files-to-download.txt has these kinds of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of URLs the way I do now, keeping my parameters.
b.- get all the files into the same directory and be able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode: it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However, the logic is sound, and with a bit of modification you can get it to work on other platforms/scripts as well.
Sample pseudocode bash script -
for i in $(cat list-of-files-to-download.txt); do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check if the file exists on the disk, and then take a decision to rename the temp file to the name that you want.
This offers much more control over your downloads.
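As a rough sketch of that idea (the destination directory and the naming scheme are only examples, and since -O cannot be combined with -N, timestamp checking would have to be handled separately):
DEST=/path/to/custom/directory
while read -r url; do
    # flatten the URL path into a unique local name, e.g.
    # http://website/directory1/picture-same-name.jpg -> directory1_picture-same-name.jpg
    name=$(echo "${url#*://*/}" | tr '/' '_')
    wget --user-agent='some-agent' --referer=http://some-referrer.html --timeout=xxx -O "$DEST/$name" "$url"
done < list-of-files-to-download.txt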

how to stop creating temporary files

When I use sed -i it creates some temporary files.
When I use the above command to replace a string,
it creates temporary files like sed6Y5vk6 with the same size as the original file.
How can we avoid this?
Same bug over here: the backup files are not deleted.
I'm back on sed 4.1.5, which works as expected for now.
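If sed -i keeps leaving stray sedXXXXXX files behind, one workaround is to manage the temporary file yourself and replace the original only on success (a sketch; the file name and the substitution are placeholders):
# create the temp file next to the original so the final mv stays on one filesystem
tmp=$(mktemp input.txt.XXXXXX) &&
sed 's/old/new/g' input.txt > "$tmp" &&
mv "$tmp" input.txt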

php wget and provide a time for last modified

I know that wget has a -N option (and also a timestamping option for non-header-sending protocols like FTP), but how would I specify a time and date for wget?
For example, I don't want to compare local and remote files; I would like to directly specify a time and date for wget to use. I know the following is not correct, it just serves the purpose of the example:
wget -N **--jan-2-2013-05:00** -r ftp://user:myPassword@ftp.example.com/public_html
Is it possible to give wget a timestamp to use when checking last modified?
No, there is no such option.
Timestamping behavior is pretty automatic in wget. When you use the --timestamping or -N option, the file will not get downloaded if:
a) The file already exists locally
AND
b) The file has the same size locally and remotely
AND
c) The timestamp of the remote file is equal or older than the timestamp of the local file (in http it compares against the Last-Modified header)
So you can emulate the behavior you are asking for if you:
1. Create the files you are going to download with the same size as those in the remote location (conditions a + b)
2. Touch the files so that they have the timestamp you want (condition c)
Example:
For a single file whose name we already know, say foo.txt of size 7348, where we only want to fetch the remote copy if it is newer than 2013-07-24T14:27, that would be:
dd if=/dev/zero of=foo.txt bs=1 count=7348
touch -t 201307241427 foo.txt
Now if you use wget -N http://url/path/foo.txt, wget will work as you asked.
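If you need the same for many files, a small sketch that generalizes the trick (it assumes a hypothetical sizes.txt listing "filename size" pairs for the remote files):
while read -r name size; do
    dd if=/dev/zero of="$name" bs=1 count="$size" 2>/dev/null   # placeholder with the remote size
    touch -t 201307241427 "$name"                               # the cut-off timestamp you want
done < sizes.txt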