Possible issue with international characters in objects and/or paths when copying recursively - google-cloud-storage

I've run into a weird problem after uploading a lot of images with gsutil - the uploaded files cannot be seen via the Google Cloud Console and gsutil itself complains if I try to do a 'gsutil ls '. I am 99% sure it is related to the use of "å" or "Å" together with spaces in the directory name.
All uploads were done recursively from a root folder (large image collection in multiple levels of subdirectories). If I try to upload the files again, gsutil skips them since they are already there, so the upload feature does something - it just isn't working in the same way as the list and download.
An example:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Skipping existing item: gs://digitalfotografen/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2
...
OK - so the files are there, but browsing the directory through the Google Cloud Console shows "No results".
Also:
gsutil ls gs://digitalfotografen/2009/2009-06-27 Søgården - reklamefotos/20090627_IMG_0128.CR2
CommandException: "ls" command does not support "file://" URIs. Did you mean to use a gs:// URI?
I tried escaping spaces and used quotation marks in different ways with no luck.
Now, here is the interesting thing:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Copying file:///Volumes/Photos/digitalfotografen.dk/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2 [Content-Type=application/octet-stream]...
Here I copied the folder specifically with escaped spaces on the source side, and now the files are uploaded again. This creates a second folder with the same name (at least it appears so in the Cloud Console) and the files are now visible in both folders.
We use three characters outside standard US-ASCII in the Danish alphabet ("æøå" and the capitals "ÆØÅ"), but the problem only seems to affect "å" and "Å"; the other two, alone or in combination, work fine. My hunch is that "å" and "Å" may translate into something entirely different in ASCII, which throws things off track when gsutil handles the directory naming on its own based on the name of the root folder (during a multi-level recursion), but works when the user specifies the escaped name of the root folder.
This may be a Python issue rather than a gsutil issue, but I am in no way qualified to identify this, since I have very close to zero knowledge of programming beyond a few hodgepodge shell scripts.
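One possible explanation for why only "å" and "Å" are affected, offered as an unverified assumption rather than a diagnosis of gsutil: Unicode allows "å" to be written either as one precomposed code point (U+00E5) or as "a" followed by a combining ring (U+030A), whereas "æ" and "ø" have no decomposed form. Tools that disagree on the form see two different names that look identical on screen. The Python sketch below only demonstrates the two forms; the helper names are made up for illustration.

```python
import unicodedata

# The Danish letters from the question, written as escapes so the
# encoding of this source file cannot influence the result.
DANISH = "\u00e6\u00f8\u00e5\u00c6\u00d8\u00c5"  # ae-ligature, o-slash, a-ring (lower, upper)


def forms(text):
    """Return the (composed NFC, decomposed NFD) normalizations of text."""
    return unicodedata.normalize("NFC", text), unicodedata.normalize("NFD", text)


def changes_when_decomposed(text):
    """True if the NFC and NFD spellings of text differ."""
    composed, decomposed = forms(text)
    return composed != decomposed


# Print code points only, so the output is plain ASCII on any terminal.
for ch in DANISH:
    composed, decomposed = forms(ch)
    print([f"U+{ord(c):04X}" for c in composed], "->",
          [f"U+{ord(c):04X}" for c in decomposed],
          "differs" if changes_when_decomposed(ch) else "same")
```

If this turns out to be the cause, normalizing the names to a single form before uploading would be the first thing to try.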

We ran into trouble with gsutil on Ubuntu under WSL (Windows 10).
The gsutil command works perfectly when typed into the shell but fails when it is included in a shell script:
gsutil -m ls -lr gs://project.appspot.com/
Error:
commandexception: "ls" command does not support "file://" urls. did you mean to use a gs:// url?
A workaround could be to call the script /usr/lib/google-cloud-sdk/platform/gsutil/gsutil directly instead of the link /usr/bin/gsutil:
/usr/lib/google-cloud-sdk/platform/gsutil/gsutil -m ls -lr gs://project.appspot.com/
I don't know why, but it works.
Thanks to Marion for providing us with such an uncommon bug :-)

I know this is an old error, but I had a similar issue to the one described above:
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
I was using gsutil from Scala code:
import sys.process._
object Main {
  def main(args: Array[String]): Unit = {
    // `.!!` returns the command's stdout, including the trailing newline
    val clients = s"gsutil ls gs://<bucket name>".!!
    val beforeDate: String = "date +%Y-%m-%d -d '-8 days'".!!
    val clientList = clients.split("\n").map(f => f.split('/').apply(1)).toList
    for (x <- clientList) {
      // stripLineEnd removes the newline that would otherwise end up inside the gs:// path
      val countImg = (s"gsutil -m ls gs://<bucket name>/$x/${beforeDate.stripLineEnd}" #| "wc -l").!!
      println(countImg)
    }
  }
}
What I found was that beforeDate ended with a line-end character; when I stripped it, the error went away. So the error occurs when there is a "special" character in the gs://... path. Be sure to strip any "special" characters from your variables.
And all this happened just because I was too lazy to use java.time.LocalDate to generate the beforeDate variable. I hope this helps others who encounter the same error.
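For comparison, the same pitfall and its fix sketched in Python rather than Scala: computing the date in-process avoids the shell's trailing newline entirely. The function names and the bucket and client values are placeholders, not part of the original code.

```python
from datetime import date, timedelta


def before_date(today, days_back=8):
    """ISO date (YYYY-MM-DD) for `days_back` days before `today`, with no newline."""
    return (today - timedelta(days=days_back)).isoformat()


def listing_url(bucket, client, today):
    """Build the gs:// prefix that would be handed to `gsutil ls`."""
    return f"gs://{bucket}/{client}/{before_date(today)}"


# Captured shell output typically ends in "\n"; repr() makes it visible.
raw_from_shell = "2024-01-02\n"
print(repr(f"gs://bucket/client/{raw_from_shell}"))          # newline ends up in the path
print(repr(f"gs://bucket/client/{raw_from_shell.strip()}"))  # stripped: clean path
print(listing_url("bucket", "client", date.today()))
```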

Related

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or a directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
<a href="/files">Lots o' files!</a>
If I were to request this with wget -r, then wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
<a href="/files/">Lots o' files!</a>
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output from adding --debug to your command line should clarify things in that case.
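The decision described in this answer can be pictured with a toy model. This is not wget's source code, only a Python sketch of the behaviour seen in the --debug output above; `would_enqueue` and its plain suffix test are assumptions made for illustration.

```python
from urllib.parse import urlparse


def would_enqueue(url, accept_suffixes):
    """Toy model: follow directory-looking URLs, filter file-looking ones."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        # Trailing slash (or bare host): treated as a directory and followed.
        return True
    filename = path.rsplit("/", 1)[-1]
    # No trailing slash: treated as a file and checked against the accept list.
    return any(filename.endswith(suffix) for suffix in accept_suffixes)


for url in ("http://localhost:8080/files",
            "http://localhost:8080/files/",
            "http://localhost:8080/files/archive.zip"):
    print(url, would_enqueue(url, ["zip"]))
```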
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match for both the linked file, and the downloaded file. So my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx that are linked in the webpage, then when wget does the second check, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.

Can we wget with file list and renaming destination files?

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will avoid the generation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nicely, no problem.
Now, to the issue:
Let's say my list-of-files-to-download.txt has these kinds of URLs:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of url's the way I do now, keeping my parameters.
b.- get all files at the same directory and being able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However the logic is sound, and with a bit of modifications, you can get it to work on other platforms/scripts as well.
Sample pseudocode bash script -
# Read the list line by line so each URL is passed to wget intact
while IFS= read -r i; do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done < list-of-files-to-download.txt
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check whether the file exists on disk, and then decide whether to rename the temp file to the name that you want.
This offers much more control over your downloads.
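One way to get distinct local names when every URL ends in the same filename is to fold the URL's directory into the name. A minimal Python sketch of that idea follows; `flat_name` is a hypothetical helper, and the underscore separator is an arbitrary choice.

```python
from urllib.parse import urlparse


def flat_name(url):
    """Join the URL's path segments with "_" so the directory stays in the name."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "_".join(segments)


urls = [
    "http://website/directory1/picture-same-name.jpg",
    "http://website/directory2/picture-same-name.jpg",
    "http://website/directory3/picture-same-name.jpg",
]
for url in urls:
    print(url, "->", flat_name(url))
```

The resulting name could then be passed to wget's -O for each URL, keeping in mind the conflict between -O and -N mentioned in the question.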

GCS - Global Consistency with delete + rename

My issue may be a result of my misunderstanding with global consistency in google storage, but since I have not experienced this issue until just recently (mid November) and now it seems easily reproducible, I wanted some clarification. The issue started happening in a piece of spark code running on compute engine using bdutil but I can reproduce from the command line with gsutil.
My code is deleting a destination path and then immediately renaming a source path as the destination path. With global consistency I would expect since the destination path no longer exists, the src would be renamed to the destination, but instead the src is being nested inside destination as if the destination still exists and that is not consistent.
The hadoop code to reproduce looks like:
fs.delete(new Path(dest), true)
fs.rename(new Path(src), new Path(dest))
From command line I can reproduce with:
gsutil -m rm -r gs://mybucket/dest
gsutil -m cp -r gs://mybucket/src gs://mybucket/dest
If the reason is because list operations are eventually consistent and the FileSystem implementation is using list operations to determine if the destination still exists, then I understand, and then is there a recommended solution to ensure the destination no longer exists before renaming?
Thanks,
Luke
Travis's answer is a couple of years old and no longer true. Object list operations are strongly consistent now. Read Google's post.
Read-after-write (including delete) operations are strongly consistent, so for example, if you did:
gsutil -m rm -r gs://mybucket/dest
# Command output shows it removed gs://mybucket/dest/file1
gsutil cp gs://mybucket/dest/file1 my_local_dir/file1
That would always fail.
However, to determine if a "directory" exists, gsutil must perform an eventually-consistent listing operation to find out if any object in Google Cloud Storage's flat name space has a prefix with the name of that "directory". This can lead to the problem you described, and I expect that the hadoop code behaves similarly.
There isn't a strongly consistent workaround for this problem because there's no way to check for the existence of a prefix in a strongly-consistent way.
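To make the "directory" point concrete, here is a toy Python model of a flat namespace. It does not call Cloud Storage; `prefix_exists` and `copy_target` are illustrative names, and the nesting rule is a simplification of the behaviour reported in the question.

```python
def prefix_exists(listing, directory):
    """A "directory" exists only if some object name starts with its prefix."""
    prefix = directory.rstrip("/") + "/"
    return any(name.startswith(prefix) for name in listing)


def copy_target(listing, src_dir, dest_dir, object_name):
    """Where a recursive copy would place object_name, given what the listing shows."""
    src_dir, dest_dir = src_dir.rstrip("/"), dest_dir.rstrip("/")
    relative = object_name[len(src_dir) + 1:]
    if prefix_exists(listing, dest_dir):
        # Destination looks like an existing directory: the source is nested inside it.
        return f"{dest_dir}/{src_dir.rsplit('/', 1)[-1]}/{relative}"
    # Destination looks absent: the source takes the destination's name.
    return f"{dest_dir}/{relative}"


stale_listing = ["dest/file1", "src/a.txt"]  # listing still shows the deleted object
fresh_listing = ["src/a.txt"]                # listing reflects the delete
print(copy_target(stale_listing, "src", "dest", "src/a.txt"))
print(copy_target(fresh_listing, "src", "dest", "src/a.txt"))
```

With a stale listing the object lands under dest/src/, which matches the nesting described in the question; with a fresh listing it lands directly under dest/.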

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the url will have a query string and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can get the first filename that wget saves by parsing its output: the name that appears after "Saving to:" is the one I need.
However, the name is wrapped in this strange character: â. Rather than just stripping it as a hardcoded special case, I'd like to know where it comes from.
If you try the "--adjust-extension" parameter:
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you get closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *NIX systems. It is now simple to rename that file to index.html, even with a script.
If you are on a Microsoft OS, make sure you have a later version of wget; it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do recursive crawling outside wget:
get the directory structure with wget (no files)
loop to get the main entry file (index.html) from each dir
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while IFS= read -r line; do
    wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domain=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull in the content from another page, for example with a server-side script. (It may be done client side instead; check the JavaScript.)
Have you tried using --no-cookies? The site could be storing this information in a cookie and pulling it when you hit the page. This could also be caused by URL rewrite logic, which you will have little control over from the client side.
Use the -O or --output-document option. See http://www.electrictoolbox.com/wget-save-different-filename/
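On the question's update: wget wraps the saved filename in typographic quotes (U+2018 and U+2019) when the locale is UTF-8, and those bytes show up as "â" followed by other characters if the output is decoded with a single-byte encoding. Below is a Python sketch, assuming the line has the form "Saving to:" followed by the quoted name; `saved_name` and `as_seen_in_cp1252` are hypothetical helpers.

```python
import re

# Opening quote may be U+2018, an ASCII quote or a backtick;
# closing quote may be U+2019 or an ASCII quote.
SAVING_RE = re.compile("Saving to: [\u2018'`\"](.+?)[\u2019'\"]\\s*$")


def saved_name(line):
    """Extract the filename from a "Saving to:" line, or None if absent."""
    match = SAVING_RE.search(line)
    return match.group(1) if match else None


def as_seen_in_cp1252(text):
    """Show how UTF-8 bytes look when wrongly decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


line = "Saving to: \u2018index.html?p=566\u2019"
print(saved_name(line))
# ascii() keeps the output printable on any terminal.
print(ascii(as_seen_in_cp1252("\u2018")))
```

Decoding wget's output as UTF-8 before parsing avoids the stray character without any hardcoded stripping.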

wget to exclude certain naming structures

My company has a local production server I want to download files from that have a certain naming convention. However, I would like to exclude certain elements based on a portion of the name. Example:
folder client_1234
file 1234.jpg
file 1234.ai
file 1234.xml
folder client_1234569
When wget is run, I want it to bypass all folders and files with "1234". I have researched and run across ‘--exclude list’, but that appears to be only for directories, and ‘reject = rejlist’, which appears to be for file extensions. Am I missing something in the manual here?
EDIT:
this should work.
wget has options -A <accept_list> and -R <reject_list>, which from the manual page, appear to allow either suffixes or patterns. These are separate from the -I <include_dirs> and -X <exclude_dirs> options, which, as you note, only deal with directories. Given the example you list, something along the lines of -A "folder client_1234*" -A "file 1234.*" might be what you need, although I'm not entirely sure that's exactly the naming convention you're after...
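Since the accept and reject lists take wildcard patterns, a pattern can be tried out before running a long download. The sketch below uses Python's fnmatch as a stand-in for wget's matching, which is only an approximation; `is_rejected` and the sample names are illustrative.

```python
from fnmatch import fnmatchcase


def is_rejected(name, reject_patterns):
    """True if name matches any shell-style wildcard pattern in the list."""
    return any(fnmatchcase(name, pattern) for pattern in reject_patterns)


names = ["1234.jpg", "1234.ai", "1234.xml", "5678.jpg", "12345.jpg"]
for pattern in ("*1234*", "1234.*"):
    kept = [n for n in names if not is_rejected(n, [pattern])]
    print(pattern, "keeps", kept)
```

Note that "*1234*" also matches longer numbers such as 12345 or client_1234569, so a tighter pattern like "1234.*" may be closer to what is wanted.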