Limit to number of files to cp in parallel - google-cloud-storage

I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 JSON files of 4 GB each (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it prints to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer of one of them completes, it says that it will start copying the last file.
The result is that one file only begins to copy when one of the others has finished, significantly slowing down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?

I was not able to find the .boto file on my Mac (as per jterrace's answer above); instead, I specified these values using the -o switch:
gsutil -m -o "Boto:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.

From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best value can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in copying 24 of your files first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
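To experiment without editing .boto, both values can also be passed on the command line with -o, for example (a sketch; recent gsutil docs put these settings in the GSUtil section, whereas the answer above used Boto):
gsutil -m -o "GSUtil:parallel_process_count=1" -o "GSUtil:parallel_thread_count=25" cp -z json -R dir_with_4g_chunks gs://my_bucket/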

Related

How to append more than 33 files in a gcloud bucket?

I used to append datasets to a bucket in gcloud using:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
However, today when I tried to append some data, the terminal printed the error CommandException: The compose command accepts at most 33 arguments.
I didn't know about this restriction. How can I append more than 33 files in my bucket? Is there another command-line tool? I would like to avoid creating a virtual machine for what looks like a rather simple task.
I checked the help using gsutil help compose, but it didn't help much. There is only a warning saying "Note that there is a limit (currently 32) to the number of components that can be composed in a single operation." but no hint of a workaround.
Could you not do it recursively, in batches?
I've not tried this.
Given an arbitrary list of files (FILES):
While there is more than 1 file in FILES:
  Take the first n (n <= 32) from FILES and gsutil compose them into a temp file.
  If that succeeds, replace those n names in FILES with the 1 temp file.
  Repeat.
The file that remains is everything composed.
Update
The question piqued my curiosity and gave me an opportunity to improve my bash ;-)
A rough-and-ready proof-of-concept bash script that generates batches of gsutil compose commands for an arbitrary number of files (limited only by the %04 string formatting).
GSUTIL="gsutil compose"
BATCH_SIZE="32"

# These may be the same (or no) bucket
SRC="gs://bucket01"
DST="gs://bucket02"

# Generate test LST
FILES=()
for N in $(seq -f "%04g" 1 100); do
  FILES+=("${SRC}/file-${N}")
done

function squish() {
  LST=("$@")
  LEN=${#LST[@]}
  if [ "${LEN}" -le "1" ]; then
    # Zero or one file left; nothing to compose
    return 1
  fi
  # Only unique for this configuration; be careful
  COMPOSITE=$(printf "${DST}/composite-%04d" "${LEN}")
  if [ "${LEN}" -le "${BATCH_SIZE}" ]; then
    # Remaining files can be composed with one command
    echo "${GSUTIL} ${LST[@]} ${COMPOSITE}"
    return 1
  fi
  # Compose 1st batch of files
  # NB Array slice takes start:size
  echo "${GSUTIL} ${LST[@]:0:${BATCH_SIZE}} ${COMPOSITE}"
  # Remove the batch from LST
  # NB Providing only the start implies "to the end"
  REM=("${LST[@]:${BATCH_SIZE}}")
  # Prepend the composite from the above batch to the next run
  NXT=("${COMPOSITE}" "${REM[@]}")
  squish "${NXT[@]}"
}

squish "${FILES[@]}"
Running with BATCH_SIZE=3, no buckets and 12 files yields:
gsutil compose file-0001 file-0002 file-0003 composite-0012
gsutil compose composite-0012 file-0004 file-0005 composite-0010
gsutil compose composite-0010 file-0006 file-0007 composite-0008
gsutil compose composite-0008 file-0008 file-0009 composite-0006
gsutil compose composite-0006 file-0010 file-0011 composite-0004
gsutil compose composite-0004 file-0012 composite-0002
Note how composite-0012 is created by the first command but then knitted into the subsequent command.
I'll leave it to you to improve throughput: instead of threading the output of each step into the next, you could parallelize the gsutil compose commands across the list chopped into batches, and then compose the batches.
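A rough, untested sketch of that parallel variant, reusing FILES, BATCH_SIZE, and DST from the script above; each batch is composed concurrently, and it assumes the number of batches is itself at most 32:
BATCHES=()
i=0
for ((off=0; off<${#FILES[@]}; off+=BATCH_SIZE)); do
  B="${DST}/batch-$((i++))"
  # Compose this batch in the background
  gsutil compose "${FILES[@]:off:BATCH_SIZE}" "${B}" &
  BATCHES+=("${B}")
done
wait  # wait for all batch compositions to finish
gsutil compose "${BATCHES[@]}" "${DST}/final-composite"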
The docs say that you may only combine 32 components in a single operation, but there is no limit to the number of components that can make up a composite object.
So, if you have more than 32 objects to concatenate, you may perform multiple compose operations, composing 32 objects at a time until you eventually get all of them composed together.
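A rough, untested sketch of that loop in bash; OBJECTS is assumed to be an array holding the gs:// URLs of all source objects, and TARGET is a placeholder destination (compose allows the destination to also appear as a source, which is how the append works here):
TARGET="gs://bucket/composite"
BATCH=31   # 31 new sources + the running composite = 32 components per call

# Seed the composite with the first (up to) 32 sources...
gsutil compose "${OBJECTS[@]:0:32}" "${TARGET}"
REMAINING=("${OBJECTS[@]:32}")

# ...then keep folding the remaining objects in, 31 at a time.
while [ "${#REMAINING[@]}" -gt 0 ]; do
  gsutil compose "${TARGET}" "${REMAINING[@]:0:BATCH}" "${TARGET}"
  REMAINING=("${REMAINING[@]:BATCH}")
done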

GSUTIL CP using file size

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working; however, there are ~35k files but only ~5k have any data in them.
Is there any way to only copy files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5k (it defaults to 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
It may be advisable to set BOTO_CONFIG specifically for this copy, (a) to be intentional and (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads have the added benefit, of course, of resuming if there are any failures.
Recommendation: try this on a small subset first and confirm it works to your satisfaction.
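A sketch of how that might look (the config path, threshold, and bucket are placeholders; resumable_threshold is given in bytes in the [GSUtil] section of the boto config):
# /path/to/custom.boto
[GSUtil]
resumable_threshold = 5120

# then point gsutil at it for just this copy
BOTO_CONFIG=/path/to/custom.boto gsutil -m cp dir/* gs://my-bucket/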
While it's not possible to do this with gsutil alone, you can do it by parsing the file names and using the -I flag on the cp command to process them. If you're using a Linux Compute Engine instance, you can do it with the du and awk commands:
du * | awk '{if ($1 > 1000) print $2 }' | gsutil -m cp -I gs://bucket2
The command gets the size of each file in the current directory on your Compute Engine instance with du * and only copies to bucket2 the files whose reported size is larger than 1000 (note that du reports sizes in 1 KiB blocks by default, so use du -b if you want to filter by exact bytes); you can change that value to adjust it to your needs.
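If GNU find is available, a similar (untested) alternative filters by an explicit size threshold before handing the list to cp -I; the 5k threshold and bucket name are placeholders:
find . -maxdepth 1 -type f -size +5k | gsutil -m cp -I gs://bucket2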

Compare file sizes and download if they're different via wget

I'm downloading some .mp3 files (all legal) via wget :
wget -r -nc files.myserver.com
I have to stop the download sometimes, and at those times the file is left partially downloaded. For example, a 10-minute record.mp3 file becomes a 4-minute record.mp3 file. It plays correctly but is incomplete.
If I use the same command again, because record.mp3 already exists on my local computer, wget skips that file even though it isn't complete.
I wonder if there is a way to check the file sizes and, if the sizes on the remote server and on my local computer aren't the same, re-download the file. (I've learned that --spider gives the file size, but is there any other option that automatically checks the file sizes and decides whether to download?)
I would go with wget's -N option for timestamping, but note that wget will only compare the file sizes if you also specify the --no-if-modified-since option. Without it, incomplete files are indeed skipped on the next run because they receive a timestamp of the current time, which is newer than that on the server.
The reason is probably that with only -N, a GET request is sent for the file with the If-Modified-Since field set. The server responds with either 200 or 304, but the 304 doesn't contain the file size so wget can't check it.
With --no-if-modified-since wget sends a HEAD request instead to get the timestamp and file size, and checks both.
What I use for recursive download of a folder:
wget -T 300 -nv -t 1 -r -nd -np -l 1 -N --no-if-modified-since -P $my_folder $my_url
With:
-T 300: Set the network timeout to 300 seconds
-nv: Turn off verbose without being completely quiet
-t 1: Set number of tries to 1
-r: Turn on recursive retrieving
-nd: Do not create a hierarchy of directories when retrieving recursively
-np: Do not ever ascend to the parent directory when retrieving recursively
-l 1: Specify recursion maximum depth 1
-N: Turn on time-stamping
--no-if-modified-since: Do not send If-Modified-Since header in ā€˜-Nā€™ mode, send preliminary HEAD request instead
You may try the -c option to continue downloading partially downloaded files; however, the manual gives an explicit warning:
You need to be especially careful of this when using -c in conjunction
with -r, since every file will be considered as an "incomplete
download" candidate.
While there is no perfect solution to this problem, you could try the -N option to turn on timestamping. This might prevent errors when the file has changed on the server, but only if the server supports timestamping and partial downloads. Try it and see how it goes.
wget -r -N -c files.myserver.com
If you need to check whether a file was only partially downloaded (has a different size) or was updated on the remote server (by timestamp), and must therefore be updated locally, you need the -N option.
Here is some additional info about the -N (--timestamping) option from the Wget docs:
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the
time-stamps say.
From: https://www.gnu.org/software/wget/manual/wget.html (Chapter 5: Time-Stamping)

memory exhausted: for large files using diff

I am trying to create a patch using two large folders (~7 GB).
Here is how I'm doing it :
$ diff -Naurbw . ../other-folder > file.patch
But, perhaps due to the file sizes, the patch is not getting created and diff gives an error:
diff: memory exhausted
I tried with more than 15 GB of free space, but the issue persists. Could someone help me out with the flags that I should use?
Recently I came across this too when I needed to diff two large files (>5 GB each).
I tried to use diff with different options, but even --speed-large-files had no effect. Other methods, like splitting the files into smaller ones, using xdelta, or sorting the files as per this suggestion, didn't help either. I even got my hands on a very powerful VM (> 72 GB RAM), but still got this memory exhausted error.
I finally got it to work by adding the following parameter to sysctl.conf (sudo vim /etc/sysctl.conf):
vm.overcommit_memory=1
vm.overcommit_memory has three values (0,1,2) and sets the kernel virtual memory accounting mode. From the proc(5) man page:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
To make sure that the parameter is indeed applied you can run
sudo sysctl -p
Don't forget to change this parameter back when you finish!
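For example, assuming the previous value was the default 0 (heuristic overcommit):
sudo sysctl -w vm.overcommit_memory=0
and remove or comment out the line you added to /etc/sysctl.conf.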
bsdiff is slow and requires a lot of memory; xdelta creates large deltas for large files.
Try HDiffPatch for large files: https://github.com/sisong/HDiffPatch
supports diffs between large binary files or directories;
runs on Windows, macOS, Linux, and Android;
both diff and patch can run with limited memory.
Usage example:
Creating a patch: hdiffz -s-256 [-c-lzma2] old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path
Try sdiff. It's a pre-built tool in some Linux distributions.
sdiff a.txt b.txt --output=c.txt
will compare the files side by side and write the merged output to c.txt.
This worked perfectly for me.

How to read a block in a storage pool (zpool) using dd?

I want to read a block in a ZFS storage pool (zpool) using the dd command. Since zpool doesn't create a device file like other volume managers (e.g. VxVM) do, I don't know which block device to read from. Is there any way to read data block by block in a zpool?
You can probably use the zdb command. Here is a PDF about it, and the help output.
http://www.bruningsystems.com/osdevcon_draft3.pdf
# zdb --help
zdb: illegal option -- -
Usage: zdb [-CumdibcsDvhL] poolname [object...]
zdb [-div] dataset [object...]
zdb -m [-L] poolname [vdev [metaslab...]]
zdb -R poolname vdev:offset:size[:flags]
zdb -S poolname
zdb -l [-u] device
zdb -C
Dataset name must include at least one separator character '/' or '#'
If dataset name is specified, only that dataset is dumped
If object numbers are specified, only those objects are dumped
Options to control amount of output:
-u uberblock
-d dataset(s)
-i intent logs
-C config (or cachefile if alone)
-h pool history
-b block statistics
-m metaslabs
-c checksum all metadata (twice for all data) blocks
-s report stats on zdb's I/O
-D dedup statistics
-S simulate dedup to measure effect
-v verbose (applies to all others)
-l dump label contents
-L disable leak tracking (do not load spacemaps)
-R read and display block from a device
Below options are intended for use with other options (except -l):
-A ignore assertions (-A), enable panic recovery (-AA) or both (-AAA)
-F attempt automatic rewind within safe range of transaction groups
-U <cachefile_path> -- use alternate cachefile
-X attempt extreme rewind (does not work with dataset)
-e pool is exported/destroyed/has altroot/not in a cachefile
-p <path> -- use one or more with -e to specify path to vdev dir
-P print numbers parsable
-t <txg> -- highest txg to use when searching for uberblocks
Specify an option more than once (e.g. -bb) to make only that option verbose
Default is to dump everything non-verbosely
Unfortunately, I don't know how to use it.
# zdb
tank:
version: 28
name: 'tank'
...
vdev_tree:
...
children[0]:
...
children[0]:
...
path: '/dev/label/bank1d1'
phys_path: '/dev/label/bank1d1'
...
So I took the array indexes 0 0 to get my first disk (bank1d1) and did this command. It did something. I don't know how to read the output.
zdb -R tank 0:0:4e00:200 | strings
Have fun... try not to destroy anything. Here is your warning from the man page:
The zdb command is used by support engineers to diagnose failures and
gather statistics. Since the ZFS file system is always consistent on
disk and is self-repairing, zdb should only be run under the direction
of a support engineer.
And please tell us what you actually were looking for. Was Alan right that you wanted to do backups?
You can read from the underlying raw devices in the pool, but as far as I can tell there's no concept of a single contiguous block device representing the whole pool.
A ZFS pool is not the single contiguous block of sectors that 'classic' volume managers present. ZFS's internal structure is closer to a tree, which would be somewhat challenging to represent as a flat array of blocks.
Ben Rockwood's blog post "zdb: Examining ZFS At Point-Blank Range" may help getting better idea of what's under the hood.
I have no idea what doing so might be useful for, but you can certainly read blocks from the underlying devices used by the pool. They are shown by the zpool status command. If you are really asking about zvols instead of zpools, they are accessible under /dev/zvol/rdsk/pool-name/zvol-name. If you want to look at internal zpool data, you probably want to use zdb.
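For example, to peek at raw data from one of the pool's member devices (the device name is taken from the zdb output above; the offset and count are arbitrary):
dd if=/dev/label/bank1d1 bs=512 skip=100 count=4 | strings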
If you want to backup ZFS filesystems you should be using the following tools:
'zfs snapshot' to create a stable snapshot of the filesystem
'zfs send' to send a copy of the snapshot to somewhere else
'zfs receive' to go back from a snapshot to a filesystem.
'dd' is almost certainly not the tool you should be using. In your case you could use 'zfs send' and redirect the output into a file on your other filesystem.
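A minimal sketch of that snapshot/send flow, with placeholder pool, filesystem, and target path:
zfs snapshot tank/myfs@backup
zfs send tank/myfs@backup > /otherfs/tank-myfs-backup.zfs
# later, restore with: zfs receive tank/myfs-restored < /otherfs/tank-myfs-backup.zfs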
See chapter 7 of the ZFS administration guide for more details.