What is the MongoDB archive format?

I've backed up some MongoDB databases using their archive option, but I can't simply untar the result. After several attempts to decompress the data, it looks like the archive is the whole DB in one big file.
I wanted to get at the files for the individual collections.
Is there a way to do that?
$ tar -xvf valk.archive
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
$ file valk.archive
valk.archive: gzip compressed data, original size 13953183
$ gunzip valk.archive
gunzip: valk.archive: unknown suffix -- ignored
$ unzip valk.archive
Archive: valk.archive
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of valk.archive or
valk.archive.zip, and cannot find valk.archive.ZIP, period.
$ mv valk.archive valk.gz
$ gunzip valk.gz
$ open .
$ tar -xvf valk
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
$ head valk
TemplateDatametadata�{"options":{},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.TemplateData"}],"uuid":"f52402b5aba24856b072d57cc3e46a72"}size-dbvalkcollectioMetricsmetadata�{"options":{"capped":true,"size":10485760,"max":1000000},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.Metrics"},{"v":2,"key":{"openid":1},"name":"openid_1","ns":"valk.Metrics"}],"uuid":"43d92ff01815432c95dac5a2e05a64c0"}size�dbvalkcollection
AppConfigmetadata�{"options":{},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.AppConfig"}],"uuid":"df633b0a43184de38e8b8ea7489cda3e"}size�dbvalkcollecMinibotZonesmetadata�{"options":{},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.MinibotZones"}],"uuid":"095bbac0d17640be9e27dffe681b7d83"}size�dbvalkcollection ChatLogsmetadataQ{"options":{"capped":true,"size":104857600,"max":10000000},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"valk.ChatLogs"},{"v":2,"key":{"openid":1,"createdAt":1},"name":"openid_1_createdAt_1","ns":"valk.ChatLogs"},{"v":2,"key":{"createdAt":1},"name":"createdAt_1","ns":"valk.ChatLogs"}],"uuid":"70586c82b3ae42cf8d9c47ad339ea55b"}size�dbvalkcollection

The mongodump archive format is a special-purpose format; you need to use mongorestore --archive together with any other options that are appropriate.
For example, you can use the --nsInclude option (mongorestore 3.4+) to selectively restore multiple collections by namespace.
For more information on the MongoDB archive format (and why tar wasn't suitable), see:
Archiving and Compression in MongoDB Tools. The gist of this is:
General purpose archive formats, like tar, only support contiguous file packing within the archive. Using these archive formats for mongodump and mongorestore would create an unacceptable performance degradation, as data from all collections would have to be written and read in order. To support the concurrent behavior of these tools, we developed a special purpose archive format that supports non-contiguous file writes. The new archiving feature provides major gains in the efficiency of backup and restore operations.
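For example, to pull just one collection out of the archive from the question, something like this should work: the --gzip flag matches the fact that file identified the archive as gzip data, and valk.ChatLogs is one of the namespaces visible in the head output above.
mongorestore --gzip --archive=valk.archive --nsInclude='valk.ChatLogs'
This restores only that namespace into the target mongod; there is no tool that unpacks the archive into per-collection files directly.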

Related

Option to exclude files in pg_basebackup command (Postgres)

When cloning a standby, how can I prevent pg_basebackup from copying postgresql.conf and pg_hba.conf from the master into the /var/lib/pgsql/9.6/data directory?
Currently I am using this command:
[root@xyz ~]# pg_basebackup -h {master ipAddr} -D /var/lib/pgsql/9.6/data -U postgres -v -P
According to the docs:
The backup will include all files in the data directory and
tablespaces, including the configuration files and any additional
files placed in the directory by third parties. But only regular files
and directories are copied. Symbolic links (other than those used for
tablespaces) and special device files are skipped.
So there is no such option. If you still want to force it, move the config files out of the data directory (and optionally symlink them back into it).
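For example (the target paths here are illustrative), relying on the documented behavior quoted above that non-tablespace symlinks are skipped:
# Move the configs out of the data directory and symlink them back,
# so the server still finds them but pg_basebackup skips them:
mv /var/lib/pgsql/9.6/data/postgresql.conf /etc/pgsql/postgresql.conf
ln -s /etc/pgsql/postgresql.conf /var/lib/pgsql/9.6/data/postgresql.conf
mv /var/lib/pgsql/9.6/data/pg_hba.conf /etc/pgsql/pg_hba.conf
ln -s /etc/pgsql/pg_hba.conf /var/lib/pgsql/9.6/data/pg_hba.conf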
This answer is for Postgres 14. pg_basebackup takes a backup of the entire data directory. https://www.postgresql.org/docs/14/app-pgbasebackup.html states that the backup utility will skip any directory or file that is a symbolic link, so that can be a workaround to get only the desired content into the tarball.
I faced a similar situation where I wanted to exclude the contents of several directories like pg_replslot, pg_dynshmem, pg_notify, etc. I made the tarball the usual way: pg_basebackup -D /backup/ -F t -P -v. After the tarball was made, and before restoring it to another server, I updated the tar manually by removing the contents of all the required directories, as sketched below.
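A sketch of that manual cleanup, assuming GNU tar and the default base.tar produced by pg_basebackup -F t:
# Delete the *contents* of the unwanted directories from the tarball,
# keeping the (empty) directories themselves, which Postgres expects to exist:
tar -tf /backup/base.tar | grep -E '^(pg_replslot|pg_dynshmem|pg_notify)/.' \
  | xargs -r tar --delete -f /backup/base.tar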

Stop mongodump from overwriting existing files (rename instead)

From the MongoDB docs:
Overwrite Files
"Mongodump overwrites output files if they exist in the backup data folder. Before running the mongodump command multiple times, either ensure that you no longer need the files in the output folder (the default is the dump/ folder) or rename the folders or files."
Hey guys,
I want to do a daily backup and sometimes even two backups a day. The dump filename is the current date, so if I back up twice on the same day, the first backup gets overwritten because the names are identical.
Is there any way to tell mongodump to rename the file (to e.g. 5.9.2016(1)) if it already exists?
You can use the --out option of mongodump to specify the path where to dump the data.
Create a script that runs mongodump and gives your path a different name each time, e.g. using a date:
mongodump --out /data/dump/090516/
Shell script example:
#!/bin/sh
# Include the time of day so two backups on the same date get distinct directories.
DEST=/data/dump/$(date +%m%d%y_%H%M)
mkdir -p "$DEST"
mongodump --out="$DEST"
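To actually get the twice-daily backups from the question, you could call the script from cron (the script path here is hypothetical):
# run the backup at 03:00 and 15:00 every day
0 3,15 * * * /usr/local/bin/mongo_backup.sh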

gsutil rsync with gzip compression

I'm hosting publicly available static resources in a Google Storage bucket, and I want to use the gsutil rsync command to sync our local version to the bucket, saving bandwidth and time. Part of our build process is to pre-gzip these resources, but gsutil rsync has no way to set the Content-Encoding header. This means we must run gsutil rsync, then immediately run gsutil setmeta to set headers on all of the gzipped file types. This leaves the bucket in a BAD state until that header is set. Another option is to use gsutil cp, passing the -z option, but this requires us to re-upload the entire directory structure every time, and that includes a LOT of image files and other non-gzipped resources, which wastes time and bandwidth.
Is there an atomic way to accomplish the rsync and set proper Content-Encoding headers?
Assuming you're starting with gzipped source files in source-dir you can do:
gsutil -h content-encoding:gzip rsync -r source-dir gs://your-bucket
Note: If you do this and then run rsync in the reverse direction it will decompress and copy all the objects back down:
gsutil rsync -r gs://your-bucket source-dir
which may not be what you want to happen. Basically, the safest way to use rsync is to simply synchronize objects as-is between source and destination, and not try to set content encodings on the objects.
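Either way, you can spot-check whether an uploaded object ended up with the header (the object name here is hypothetical):
gsutil stat gs://your-bucket/js/app.js
The output lists the object's metadata and should include a Content-Encoding: gzip line once the header is set.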
I'm not completely answering the question but I came here as I was wondering the same thing trying to achieve the following:
how to efficiently deploy a static website to Google Cloud Storage
I was able to find an optimized way to deploy my static website from a local folder to a gs bucket:
1. Split my local folder into two folders with the same hierarchy: one containing the content to be gzipped (html, css, js, ...), the other everything else
2. Gzip each file in my gzip folder in place (a sketch of this step follows below)
3. Call gsutil rsync for each folder to the same gs destination
Of course, it is only a one-way synchronization and deleted local files are not deleted remotely
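For step 2, a minimal sketch (it assumes filenames without spaces, and keeps the original names so the object paths don't change):
for f in $(find src/gzip -type f); do
  gzip -9 "$f" && mv "$f.gz" "$f"   # gzip replaces $f with $f.gz; rename it back
done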
For the gzip folder the command is
gsutil -m -h Content-Encoding:gzip rsync -c -r src/gzip gs://dst
forcing the content encoding to be gzipped
For the other folder the command is
gsutil -m rsync -c -r src/none gs://dst
The -m option is used for parallel optimization. The -c option is needed to force checksum validation (see Why is gsutil rsync re-downloading all our files?), since my build process touches every local file. The -r option is used for recursion.
I even wrote a script for it (in dart): http://tekhoow.blogspot.fr/2016/10/deploying-static-website-efficiently-on.html

Possible backup corruption using pg_dump only with compress parameter?

I used this command to backup 200GB database (postgres 9.1, win7 x64):
pg_dump -Z 1 db_name > backup
It created a 16GB file, which I think is fine because previous backups that worked (and were packed by external tools) had a similar size. Now, when I try to restore into PG 9.2 using pg_restore, I get the error:
input file does not appear to be a valid archive
With pg_restore -Ft:
[tar archiver] corrupt tar header found in ▼ (expected 13500752, computed 78268) file position 512
Gzip also shows it's corrupted. When I open the backup file in Total Commander, the inner file is only 1.8GB.
From what I found while searching for a solution, the dump probably should have been made with the -Fc parameter.
What format does the file have right now? Is it just tar or gzip (WinRAR shows gzip)?
Is there any way to restore it properly, or is it corrupted somehow (there was no error while dumping)? Could it be due to file size limitations of tar or gzip?
What you have as output in "backup" is just gzipped plain SQL.
You could check it by prompting:
gzip -l backup
Unfortunately pg_restore does not provide the possibility to restore plain SQL, so you just need to decompress the file and use the psql -f <FILE> command:
zcat backup > backup.sql
psql -f backup.sql
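Or, equivalently, decompress and restore in one step (db_name is whatever database you are restoring into):
zcat backup | psql -d db_name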
It is not possible to make the dump with pg_dump -Fc from Postgres 9.1, as proposed by "Frank Heikens", because dump formats are not compatible between major versions, like 9.0 -> 9.1 -> 9.2, and pg_restore will give you an error on 9.2.
Mostly this error means that your restore used a format that does not match the dump.
From the pg_dump manual (pg_dump --help):
-F, --format=c|d|t|p     output file format (custom, directory, tar, plain text (default))
This means that if you create a dump with pg_dump without the --format / -F option, your dump will be created in plain text format.
NOTE: Plain text format cannot be restored with pg_restore tool. Use psql < dump.sql instead.
Examples:
# plain text export/import
pg_dump -Fp -d postgres://<db_user>:<db_password>@<db_host>:<db_port>/<db_name> > dump.sql
psql -d postgres://<target_db_user>:<target_db_password>@<target_db_host>:<target_db_port>/<target_db_name> -f dump.sql
# custom format
pg_dump -Fc -d postgres://<db_user>:<db_password>@<db_host>:<db_port>/<db_name> > dump.sql.custom
pg_restore -Fc -d postgres://<target_db_user>:<target_db_password>@<target_db_host>:<target_db_port>/<target_db_name> dump.sql.custom
# tar format
pg_dump -Ft -d postgres://<db_user>:<db_password>@<db_host>:<db_port>/<db_name> > dump.sql.tar
pg_restore -Ft -d postgres://<target_db_user>:<target_db_password>@<target_db_host>:<target_db_port>/<target_db_name> dump.sql.tar
The error from the subject can also occur when the format specified for the restore does not match the backup, e.g. the dump was created in custom format but tar was specified for the restore.
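A quick way to check whether a file is a valid pg_dump archive at all is to list its table of contents:
pg_restore -l dump.sql.custom
On a plain-SQL (or otherwise unreadable) file this fails with the same "input file does not appear to be a valid archive" error as in the question.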
Your dump is plain SQL; it is not in the tar format you are trying to use with pg_restore. Use --format=custom or -Fc when you want a compressed format, and use the same setting in pg_restore as well. Check the manual.
This is an old thread, but I had the exact same issue and managed to fix the somewhat corrupted dump with fixgz:
Short answer: run fixgz (http://www.gzip.org/fixgz.zip) on the compressed dump:
fixgz.exe bad.gz fixed.gz
Long answer:
So if you used pg_dump with --compress or -Z without specifying the custom format option (-Fc), what you actually get is a compressed file written in ASCII mode instead of BINARY mode.
Quoting from http://www.gzip.org/#faq1
If you have transferred a file in ASCII mode and you no longer have
access to the original, you can try the program fixgz to remove the
extra CR (carriage return) bytes inserted by the transfer. A Windows
9x/NT/2000/ME/XP binary is here. But there is absolutely no guarantee
that this will actually fix your file. Conclusion: never transfer
binary files in ASCII mode.
I got this problem when restoring using PGAdmin III. The problem doesn't occur with PGAdmin 4.

How to specify the download location with wget?

I need files to be downloaded to /tmp/cron_test/. My wget code is
wget --random-wait -r -p -nd -e robots=off -A".pdf" -U mozilla http://math.stanford.edu/undergrad/
So is there some parameter to specify the directory?
From the manual page:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the
directory where all other files and sub-directories will be
saved to, i.e. the top of the retrieval tree. The default
is . (the current directory).
So you need to add -P /tmp/cron_test/ (short form) or --directory-prefix=/tmp/cron_test/ (long form) to your command. Also note that if the directory does not exist it will get created.
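Applied to the command from the question:
wget --random-wait -r -p -nd -e robots=off -A".pdf" -U mozilla -P /tmp/cron_test/ http://math.stanford.edu/undergrad/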
-O is the option to specify the path of the file you want to download to:
wget <uri> -O /path/to/file.ext
-P is the directory prefix; the file will be downloaded into that directory:
wget <uri> -P /path/to/folder
Make sure you have the correct URL for whatever you are downloading. First of all, URLs with characters like ? in them may not be parsed and resolved as you expect. This can confuse the command line, and any characters that aren't resolved as part of the source URL become the name of the file you are downloading into.
For example:
wget "sourceforge.net/projects/ebosse/files/latest/download?source=typ_redirect"
will download into a file named ?source=typ_redirect.
As you can see, knowing a thing or two about URLs helps to understand wget.
I was booting from a Hiren's disk and only had Linux 2.6.1 as a resource (import os was unavailable). The correct syntax that solved my problem of downloading an ISO onto the physical hard drive was:
wget "(source url)" -O "(directory where HD was mounted)/isofile.iso"
You can figure out the correct URL by finding at what point wget downloads into a file named index.html (the default file name) that has the correct size and other attributes of the file you need, shown by the following command:
wget "(source url)"
Once that URL and source file are correct and it is downloading into index.html, you can stop the download (Ctrl+C) and change the output file by using:
-O "<specified download directory>/filename.extension"
after the source url.
In my case this results in downloading an ISO and storing it as a binary file under isofile.iso, which hopefully mounts.
"-P" is the right option, please read on for more related information:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
From man wget:
-O file
--output-document=file
Example:
wget "url" -O /tmp/cron_test/<file>