How to append more than 33 files in a gcloud bucket?

I usually append datasets to a bucket in gcloud using:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
However, today when I tried to append some data, the terminal printed the error CommandException: The compose command accepts at most 33 arguments.
I didn't know about this restriction. How can I append more than 33 files to my bucket? Is there another command-line tool? I would like to avoid creating a virtual machine for what looks like a rather simple task.
I checked the help using gsutil help compose, but it didn't help much. There is only a warning saying "Note that there is a limit (currently 32) to the number of components that can be composed in a single operation." but no hint at a workaround.

Could you not do it recursively, in batches?
I've not tried this.
Given an arbitrary list of files (FILES):
While there is more than 1 file in FILES:
Take the first n (n <= 32) files from FILES and gsutil compose them into a temp file.
If that succeeds, replace those n names in FILES with the 1 temp file.
Repeat.
The file that remains is everything composed.
Update
The question piqued my curiosity and gave me an opportunity to improve my bash ;-)
A rough-and-ready proof-of-concept bash script that generates batches of gsutil compose commands for an arbitrary number of files (limited only by the %04 string formatting).
GSUTIL="gsutil compose"
BATCH_SIZE="32"
# These may be the same (or no) bucket
SRC="gs://bucket01/"
DST="gs://bucket02/"
# Generate test LST
FILES=()
for N in $(seq -f "%04g" 1 100); do
FILES+=("${SRC}/file-${N}")
done
function squish() {
LST=("$#")
LEN=${#LST[#]}
if [ "${LEN}" -le "1" ]; then
# Empty array; nothing to do
return 1
fi
# Only unique for this configuration; be careful
COMPOSITE=$(printf "${DST}/composite-%04d" ${LEN})
if [ "${LEN}" -le "${BATCH_SIZE}" ]; then
# Batch can be composed with one command
echo "${GSUTIL} ${LST[#]} ${COMPOSITE}"
return 1
fi
# Compose 1st batch of files
# NB Provide start:size
echo "${GSUTIL} ${LST[#]:0:${BATCH_SIZE}} ${COMPOSITE}"
# Remove batch from LST
# NB Provide start (to end is implied)
REM=${LST[#]:${BATCH_SIZE}}
# Prepend composite from above batch to the next run
NXT=(${COMPOSITE} ${REM[#]})
squish "${NXT[#]}"
}
squish "${FILES[#]}"
Running with BATCH_SIZE=3, no buckets and 12 files yields:
gsutil compose file-0001 file-0002 file-0003 composite-0012
gsutil compose composite-0012 file-0004 file-0005 composite-0010
gsutil compose composite-0010 file-0006 file-0007 composite-0008
gsutil compose composite-0008 file-0008 file-0009 composite-0006
gsutil compose composite-0006 file-0010 file-0011 composite-0004
gsutil compose composite-0004 file-0012 composite-0002
NOTE how composite-0012 is created by the first command and then knitted into the subsequent command.
I'll leave it to you to improve throughput by not threading the output of each step into the next: chop the list into batches, run the gsutil compose commands for the batches in parallel, and then compose the resulting batch composites.
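A rough sketch of that two-level idea (untested; it reuses the FILES array and DST naming from the script above, and leans on plain background jobs for the parallelism):
# Level 1: compose each batch of up to 32 source files in parallel.
BATCH_SIZE=32
PARTS=()
I=0
P=0
while [ "${I}" -lt "${#FILES[@]}" ]; do
  PART=$(printf "${DST}part-%04d" "${P}")
  PARTS+=("${PART}")
  gsutil compose "${FILES[@]:${I}:${BATCH_SIZE}}" "${PART}" &
  I=$((I + BATCH_SIZE))
  P=$((P + 1))
done
wait  # wait for all level-1 compositions to finish
# Level 2: compose the per-batch composites into the final object.
# (With more than 32 * 32 source files this step itself needs batching.)
gsutil compose "${PARTS[@]}" "${DST}final"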

The docs say that you may only combine 32 components in a single operation, but there is no limit to the number of components that can make up a composite object.
So, if you have more than 32 objects to concatenate, you may perform multiple compose operations, composing 32 objects at a time until you eventually get all of them composed together.
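A minimal sketch of that accumulate-as-you-go approach (untested; the bucket, object prefix, and target names are placeholders):
BKT="gs://bucket"
TARGET="${BKT}/composite"
# Populate the source list, e.g. from a wildcard listing.
OBJS=( $(gsutil ls "${BKT}/obj-*") )

# Compose the first 32 objects, then fold the rest in 31 at a time
# (the running composite counts as one of the 32 sources per call).
gsutil compose "${OBJS[@]:0:32}" "${TARGET}"
I=32
while [ "${I}" -lt "${#OBJS[@]}" ]; do
  gsutil compose "${TARGET}" "${OBJS[@]:${I}:31}" "${TARGET}"
  I=$((I + 31))
done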

Related

GSUTIL CP using file size

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working; however, there are ~35k files but only ~5k have any data in them.
Is there any way to only copy files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5k (the default is 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
It may be advisable to set BOTO_CONFIG specifically for this copy, (a) to be intentional and (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads have the added benefit, of course, of resuming if there are any failures.
Recommend: try this on a small subset and confirm it works to your satisfaction.
While it's not possible to do it only with gsutil, it's possible to do it by parsing the names and using the -I flag on the cp command to process them. If you're using a Linux Compute Engine instance, you can do it with the du and awk commands:
du * | awk '{if ($1 > 1000) print $2 }' | gsutil -m cp -I gs://bucket2
The command gets the size of each file in the current directory on your instance with du * and only copies the files whose size is greater than 1000 (note that du reports sizes in disk blocks, typically 1 KiB each, rather than bytes) to bucket2; you can change that value to suit your needs.
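If you'd rather filter by an exact byte size, a similar sketch using find (GNU find assumed; the 1000-byte cutoff and bucket name are placeholders) feeds the same -I flag:
# Copy only regular files larger than 1000 bytes from the current directory.
find . -maxdepth 1 -type f -size +1000c | gsutil -m cp -I gs://bucket2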

LSF moving files into created output dir

When executing a job on LSF you can specify the working directory and create an output directory, e.g.
bsub -cwd /home/workDir -outdir /home/%J program inputfile
where it will look for inputfile in the specified working directory. The -outdir option creates a new directory based on the JobId.
What I'm wondering is how you pipe the results created from the run in the working directory to the newly created output dir.
You can't add a command like
mv * /home/%J
as the underlying OS has no understanding of the %J identifier. Is there an option in LSF for piping the data inside the job, where it knows the jobId?
You can use the environment variable $LSB_JOBID.
mv * /data/${LSB_JOBID}/
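For instance, a hypothetical job script (the program name and result files are placeholders) that moves its results into the directory created by bsub -outdir "/data/%J":
#!/bin/bash
# Run the program in the working directory, then move its outputs into
# the per-job output directory; LSB_JOBID is set by LSF inside the job.
./program inputfile
mv results.* "/data/${LSB_JOBID}/"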
If you copy the data inside your job script then it will hold the compute resource during the data copy. If you're copying a small amount of data then it's not a problem. But if it's a large amount of data, you can use bsub -f so that other jobs can start while the data copy is ongoing.
bsub -outdir "/data/%J" -f "/data/%J/final < bigfile" sh script.sh
bigfile is the file that your job creates on the compute host. It will be copied to /data/%J/final after the job finishes. It even works on a non-shared filesystem.

Google Cloud Storage : What is the easiest way to update timestamp of all files under all subfolders

I have date-wise folders in the form root-dir/yyyy/mm/dd, under which there are many files.
I want to update the timestamp of all the files falling within a certain date range, for example 2 weeks, i.e. 14 folders, so that these files can be picked up by my file-streaming data ingestion process.
What is the easiest way to achieve this?
Is there a way in the UI console, or is it through gsutil?
Please help.
GCS objects are immutable, so the only way to "update" the timestamp would be to copy each object on top of itself, e.g., using:
gsutil cp gs://your-bucket/object1 gs://your-bucket/object1
(and looping over all objects you want to do this to).
This is a fast (metadata-only) operation, which will create a new generation of each object, with a current timestamp.
Note that if you have versioning enabled on the bucket doing this will create an extra version of each file you copy this way.
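A minimal sketch of that loop (untested; the bucket name and the two date prefixes are placeholders you would replace with your own 14 days):
BKT="gs://your-bucket"
# One prefix per day in the range you care about.
for PREFIX in root-dir/2016/12/01 root-dir/2016/12/02; do
  # Copy every object under the prefix on top of itself; a metadata-only
  # operation that creates a new generation with a current timestamp.
  gsutil ls "${BKT}/${PREFIX}/**" | while read -r OBJ; do
    gsutil cp "${OBJ}" "${OBJ}"
  done
done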
When you say "folders in the form of root-dir/yyyy/mm/dd", do you mean that you're copying those objects into your bucket with names like gs://my-bucket/root-dir/2016/12/25/christmas.jpg? If not, see Mike's answer; but if they are named with that pattern and you just want to rename them, you could use gsutil's mv command to rename every object with that prefix:
$ export BKT=my-bucket
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/01/15/file1.txt
gs://my-bucket/2016/01/15/some/file.txt
gs://my-bucket/2016/01/15/yet/another-file.txt
$ gsutil -m mv gs://$BKT/2016/01/15 gs://$BKT/2016/06/20
[...]
Operation completed over 3 objects/12.0 B.
# We can see that the prefixes changed from 2016/01/15 to 2016/06/20
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/06/20/file1.txt
gs://my-bucket/2016/06/20/some/file.txt
gs://my-bucket/2016/06/20/yet/another-file.txt

grep command to print follow-up lines after a match

how to use "grep" command to find a match and to print followup of 10 lines from the match. this i need to get some error statements from log files. (else need to download use match for log time and then copy the content). Instead of downloading bulk size files i need to run a command to get those number of lines.
A default install of Solaris 10 or 11 will have the /usr/sfw/bin file tree. GNU grep (/usr/sfw/bin/ggrep) is there. ggrep supports /usr/sfw/bin/ggrep -A 10 [pattern] [file], which does what you want.
Solaris 9 and older may not have it, or your system may not have been a default install. Check.
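For example, applied to the log-file case (the pattern and log path here are placeholders):
# Print each line matching "ERROR" plus the 10 lines that follow it.
/usr/sfw/bin/ggrep -A 10 'ERROR' /var/adm/messages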
Suppose you have a file /etc/passwd and want to filter for the user "chetan".
Try the command below:
cat /etc/passwd | /usr/sfw/bin/ggrep -A 2 'chetan'
It will print the line containing "chetan" and the next two lines as well.
-- Tested in Solaris 10 --

Limit to number of files to cp in parallel

I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 4 GB JSON files (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it prints to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer for one of them is complete, it says that it'll be copying the last file.
The result is that one file only starts to copy when one of the others finishes copying, significantly slowing down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?
I was not able to find the .boto file on my Mac (as per jterrace's answer above), so instead I specified these values using the -o switch:
gsutil -m -o "GSUtil:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.
From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best value can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in copying 24 of your files first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
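For instance, one way to try that without editing .boto is to override both settings for a single run with -o (a sketch; the values here are illustrative, so experiment for your system):
# Raise the thread count so all 25 files can start copying at once.
gsutil -m \
  -o "GSUtil:parallel_process_count=1" \
  -o "GSUtil:parallel_thread_count=25" \
  cp -z json -R dir_with_4g_chunks gs://my_bucket/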