Removing an object whose name is "." - google-cloud-storage

I want to remove an object whose name is ".":
$ gsutil ls gs://{my_bucket}
gs://{my_bucket}/.
I tried, but it is not removed:
$ gsutil -m rm "gs://{my_bucket}/**"
Removing gs://{my_bucket}/....
CommandException: 1 files/objects could not be removed.
$ gsutil rm "gs://{my_bucket}/."
$ gsutil rm gs://{my_bucket}/.
BadRequestException: 400 Invalid field selection name
Please help.

You cannot easily remove objects with the name ".". This is a known bug.
Requests to delete objects are ultimately sent as HTTP DELETE requests with the object name as the last path segment of the URL. RFC 3986 calls for the path segments . or .. to be stripped out of URLs as if they were being resolved as Unix paths, and most HTTP clients and servers obey the RFC. Thus a request to delete such a path can't easily be constructed. This is true even if you were to try URL escaping the dots.
There are a few sneaky ways to get around this problem, but they're quite complex and arcane. The best way may be for you to contact support and ask for the object to be removed.
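For illustration only, here is the shape of one such low-level attempt: the JSON API delete endpoint called through curl with --path-as-is, a real curl flag that disables RFC 3986 dot-segment squashing on the request path. The bucket name is a placeholder, and per the caveats above even this percent-encoded form may be normalized or rejected along the way:
# Hypothetical sketch, not a guaranteed fix: --path-as-is stops curl from
# collapsing "." path segments before the request is sent.
curl -X DELETE --path-as-is \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://storage.googleapis.com/storage/v1/b/my-bucket/o/%2E"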

Related

gsutil: why does ls return the directory itself

I got:
$ gsutil ls gs://ml_models_c/ref7/test/model/2/
gs://ml_models_c/ref7/test/model/2/ <= why this?
gs://ml_models_c/ref7/test/model/2/saved_model.pb
gs://ml_models_c/ref7/test/model/2/variables/
$ gsutil ls gs://seldon-models/tfserving/mnist-model/1/
gs://seldon-models/tfserving/mnist-model/1/saved_model.pb
gs://seldon-models/tfserving/mnist-model/1/variables/
Why is there a gs://ml_models_c/ref7/test/model/2/ entry in the first command's output?
Why does the second command not return itself?
It seems that I can rm it.
Thanks
At the API level, Cloud Storage doesn't have the concept of folders; everything is stored under long object names that may contain slashes.
In this case, you likely have an object named gs://ml_models_c/ref7/test/model/2/, but no object named gs://seldon-models/tfserving/mnist-model/1/.
If you don't need the gs://ml_models_c/ref7/test/model/2/ object, you can delete it and it will no longer show up in the results of gsutil ls.
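A minimal sketch of that cleanup, deleting just the zero-byte placeholder object (the trailing slash is part of the object name):
gsutil rm "gs://ml_models_c/ref7/test/model/2/"
The real objects underneath the prefix, such as saved_model.pb, are untouched.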

GCS - Global Consistency with delete + rename

My issue may be a result of my misunderstanding with global consistency in google storage, but since I have not experienced this issue until just recently (mid November) and now it seems easily reproducible, I wanted some clarification. The issue started happening in a piece of spark code running on compute engine using bdutil but I can reproduce from the command line with gsutil.
My code deletes a destination path and then immediately renames a source path to that destination path. With global consistency I would expect that, since the destination path no longer exists, the src would simply be renamed to the destination; instead, the src is being nested inside the destination as if it still existed, which is not consistent.
The Hadoop code to reproduce it looks like:
fs.delete(new Path(dest), true)
fs.rename(new Path(src), new Path(dest))
From the command line I can reproduce it with:
gsutil -m rm -r gs://mybucket/dest
gsutil -m cp -r gs://mybucket/src gs://mybucket/dest
If the reason is because list operations are eventually consistent and the FileSystem implementation is using list operations to determine if the destination still exists, then I understand, and then is there a recommended solution to ensure the destination no longer exists before renaming?
Thanks,
Luke
Travis's answer is a couple of years old and no longer accurate: the object list operation is strongly consistent now. Read Google's post.
Read-after-write (including delete) operations are strongly consistent, so for example, if you did:
gsutil -m rm -r gs://mybucket/dest
# Command output shows it removed gs://mybucket/dest/file1
gsutil cp gs://mybucket/dest/file1 my_local_dir/file1
That would always fail.
However, to determine if a "directory" exists, gsutil must perform an eventually-consistent listing operation to find out if any object in Google Cloud Storage's flat name space has a prefix with the name of that "directory". This can lead to the problem you described, and I expect that the hadoop code behaves similarly.
There isn't a strongly consistent workaround for this problem because there's no way to check for the existence of a prefix in a strongly-consistent way.
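If you are working against a stack where listings are (or were) eventually consistent, a blunt mitigation is to poll the listing until the deleted prefix stops showing up before re-creating it. A sketch under that assumption (bucket and paths are the hypothetical ones from the question; with today's strongly consistent listing the loop exits immediately):
gsutil -m rm -r gs://mybucket/dest
# Poll until no object under the dest/ prefix is visible in listings.
while gsutil -q ls "gs://mybucket/dest/**" >/dev/null 2>&1; do
  sleep 1
done
gsutil -m cp -r gs://mybucket/src gs://mybucket/dest
This only narrows the race rather than eliminating it, since a listing that can lag behind can in principle also flap.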

Compare file sizes and download if they're different via wget

I'm downloading some .mp3 files (all legal) via wget:
wget -r -nc files.myserver.com
I sometimes have to stop the download, and at those times the file is only partially downloaded. For example, a 10-minute record.mp3 becomes a 4-minute record.mp3. It plays correctly but is incomplete.
If I use the same command again, wget skips the file because record.mp3 already exists on my local computer, even though it is incomplete.
I wonder if there is a way to check the file sizes and, if the sizes on the remote server and local computer differ, re-download the file. (I've learned that the --spider option gives the file size, but is there a way to automatically compare sizes and decide whether to download?)
I would go with wget's -N option for timestamping, but note that wget will only compare the file sizes if you also specify the --no-if-modified-since option. Without it, incomplete files are indeed skipped on the next run because they receive a timestamp of the current time, which is newer than that on the server.
The reason is probably that with only -N, a GET request is sent for the file with the If-Modified-Since field set. The server responds with either 200 or 304, but the 304 doesn't contain the file size so wget can't check it.
With --no-if-modified-since wget sends a HEAD request instead to get the timestamp and file size, and checks both.
What I use for recursive download of a folder:
wget -T 300 -nv -t 1 -r -nd -np -l 1 -N --no-if-modified-since -P $my_folder $my_url
With:
-T 300: Set the network timeout to 300 seconds
-nv: Turn off verbose without being completely quiet
-t 1: Set number of tries to 1
-r: Turn on recursive retrieving
-nd: Do not create a hierarchy of directories when retrieving recursively
-np: Do not ever ascend to the parent directory when retrieving recursively
-l 1: Specify recursion maximum depth 1
-N: Turn on time-stamping
--no-if-modified-since: Do not send If-Modified-Since header in ‘-N’ mode, send preliminary HEAD request instead
You may try the -c option to continue the download of partially downloaded files, however the manual gives an explicit warning:
You need to be especially careful of this when using -c in conjunction with -r, since every file will be considered as an "incomplete download" candidate.
While there is no perfect solution to this problem, you could try the -N option to turn on timestamping. This might prevent errors when the file has changed on the server, but only if the server supports timestamping and partial downloads. Try it and see how it goes.
wget -r -N -c files.myserver.com
If you need to check whether a file was partially downloaded (has a different size) or was updated on the remote server (by timestamp), and in either case should be refreshed locally, you need the -N option.
Here is some additional info about the -N (--timestamping) option from the Wget docs:
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
Added From: https://www.gnu.org/software/wget/manual/wget.html (Chapter: 5 Time-Stamping)
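For completeness, here is a rough sketch of the manual comparison the question asks about: read the server's Content-Length from a --spider request and re-download on a size mismatch. The URL and filename are placeholders, and this assumes the server reports Content-Length at all:
#!/bin/bash
url="http://files.myserver.com/record.mp3"   # hypothetical URL
file="record.mp3"
# --server-response prints the HTTP headers (on stderr); --spider downloads nothing.
remote_size=$(wget --spider --server-response "$url" 2>&1 |
  awk 'tolower($1) == "content-length:" { size = $2 } END { print size }')
local_size=$(stat -c %s "$file" 2>/dev/null || echo 0)   # GNU stat; BSD/macOS: stat -f %z
if [ "$remote_size" != "$local_size" ]; then
  wget -O "$file" "$url"   # sizes differ (or file missing): fetch a fresh copy
fi
In practice the -N --no-if-modified-since combination above performs the same check with less ceremony.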

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the url will have a query string and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?p=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#p=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing its output: the name that appears after "Saving to:" is the one I need.
However, this name is wrapped in a strange character (â). Rather than just removing it with something hardcoded: where does this character come from?
If you try the --adjust-extension parameter:
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *NiX systems. It is then simple to rename that file to index.html, even from a script.
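A sketch of that rename step (the bracket glob matches either the # or the ? variant, assuming exactly one such file):
cd www.onlinetechvision.com && mv index.html[#?]p=566.html index.html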
If you are on a Microsoft OS, make sure you have the latest version of wget - it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
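As a quick illustration with the hypothetical site above: two requests that differ only in the query string are two distinct resources, and wget's default naming keeps them apart on disk:
wget "http://www.onlinetechvision.com/?p=566"   # saved as index.html?p=566
wget "http://www.onlinetechvision.com/?p=567"   # saved as index.html?p=567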
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files)
loop over each directory to fetch its main entry file (index.html)
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get the directory structure (spider mode crawls without saving files)
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir and fetch its index page
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read -r line; do
  wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domains=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull in the content from another page, e.g. with a script on the server side (it may be client-side if you look in the JavaScript).
Have you tried using --no-cookies? The site could be storing this information in a cookie and pulling it when you hit the page. This could also be caused by URL-rewrite logic, which you have little control over from the client side.
Use the -O or --output-document option; see http://www.electrictoolbox.com/wget-save-different-filename/
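For example (note that -O funnels everything wget downloads into that single file, so it suits single-document fetches rather than -p/-r mirroring):
wget -O index.html "http://www.onlinetechvision.com/?p=566"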

Possible issue with international characters in objects and/or paths when copying recursively

I've run into a weird problem after uploading a lot of images with gsutil - the uploaded files cannot be seen via the Google Cloud Console and gsutil itself complains if I try to do a 'gsutil ls '. I am 99% sure it is related to the use of "å" or "Å" together with spaces in the directory name.
All uploads were done recursively from a root folder (large image collection in multiple levels of subdirectories). If I try to upload the files again, gsutil skips them since they are already there, so the upload feature does something - it just isn't working in the same way as the list and download.
An example:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Skipping existing item: gs://digitalfotografen/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2
...
OK - so the files are there, but browsing the directory through the Google Cloud Console shows "No results".
Also:
gsutil ls gs://digitalfotografen/2009/2009-06-27 Søgården - reklamefotos/20090627_IMG_0128.CR2
CommandException: "ls" command does not support "file://" URIs. Did you mean to use a gs:// URI?
I tried escaping spaces and used quotation marks in different ways with no luck.
Now, here is the interesting thing:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Copying file:///Volumes/Photos/digitalfotografen.dk/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2 [Content-Type=application/octet-stream]...
Here I copied the folder specifically with escaped spaces on the source side, and now the files are uploaded again. This creates a second folder with the same name (at least it appears so in the Cloud Console) and the files are now visible in both folders.
We use three different characters outside standard US ASCII in the Danish character set ("æøå" and the capital "ÆØÅ"), but the problem only seems to affect "å" and "Å" - the other two, alone or in combination, work fine. My hunch is that "å" and "Å" may translate into something entirely different that throws things off track when gsutil is allowed to handle the directory naming on its own, based on the name of the root folder (doing a multiple-level recursion), but works when the user specifies the escaped name of the root folder.
This may be a python issue rather than a gsutil issue, but I am in no way qualified to identify this since I have very close to zero knowledge of programming outside a bit of hodgepodge shell scripts.
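A hedged guess at the mechanism (mine, not confirmed in this thread): "å" has two valid Unicode forms - the precomposed NFC code point, and the decomposed NFD sequence that macOS filesystems prefer - and the two are different byte strings, hence different GCS object names, even though they render identically:
# NFC "å": single code point U+00E5, UTF-8 bytes c3 a5
printf '\xc3\xa5' | xxd
# NFD "å": 'a' plus combining ring U+030A, UTF-8 bytes 61 cc 8a
printf 'a\xcc\x8a' | xxd
If the recursive upload stored NFD names while ls and the Cloud Console query with NFC (or vice versa), you would see exactly this kind of split behaviour.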
We ran into trouble with gsutil in Ubuntu WSL on Windows 10.
The gsutil command works perfectly in the shell, but not when it is included in a shell script:
gsutil -m ls -lr gs://project.appspot.com/
Error:
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
A workaround could be to call the script /usr/lib/google-cloud-sdk/platform/gsutil/gsutil directly, instead of the /usr/bin/gsutil link:
/usr/lib/google-cloud-sdk/platform/gsutil/gsutil -m ls -lr gs://project.appspot.com/
I don't know why, but it works.
Thanks, Marion, for providing such an uncommon bug :-)
I know this is an old error, but nevertheless I had a similar issue to the one described above.
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
I was using gsutil from Scala code:
import sys.process._

object Main {
  def main(args: Array[String]): Unit = {
    // List the top-level "folders" in the bucket.
    val clients = s"gsutil ls gs://<bucket name>".!!
    // Shelling out to date leaves a trailing newline on the result!
    val beforeDate: String = "date +%Y-%m-%d -d '-8 days'".!!
    val clientList = clients.split("\n").map(f => f.split('/').apply(1)).toList
    for (x <- clientList) {
      // stripLineEnd removes the newline that otherwise corrupts the gs:// URL.
      val countImg = (s"gsutil -m ls gs://<bucket name>/$x/${beforeDate.stripLineEnd}" #| "wc -l").!!
      println(countImg)
    }
  }
}
What I found was that there was a line-end character on beforeDate; when I stripped it, the error went away. So the error occurs when there is a "special" character in the gs://... path - be sure to strip variables of any "special" characters.
And all this happened just because I was too lazy to use java.time.LocalDate to generate the beforeDate variable. I hope this helps others who encounter the same error.