How to download a file from box using wget? - wget

I've created a direct link to a file in box:
The previous link is to the browser web interface, so I've then shared with a direct link:
However, if I download the file with a wget I receive garbage.
How can I download the file with wget?

I was able to download the file by making the link public, then replacing /s/ in the url with /shared/static
So my final command was:
curl -L https://MYUNI.box.com/shared/static/EXAMPLEtzwosac6pz --output myfile.zip
This can probably be modified for wget.

I might be a bit late to the party, but FWIW:
I tried to do the same things in order to download a folder.
I went to the box UI and opened the browser's network tab on the developer tools.
Then I clicked on download and copied as cURL the first link generated, it was something like (removed many headers and options for readability)
curl 'https://app.box.com/index.php?folder_id=122215143745&rm=box_v2_zip_folder'
The response of this request is a json object containing a link for downloading the folder:
{
"use_zpdl": "true",
"result": "success",
"download_url": <somg long url>,
"progress_reporting_url": <some other url>
}
I then executed wget -L <download_url> and was able to download the file using wget

The solution was to add the -L option to follow the HTTP redirect:
wget -v -O myfile.tgz -L https://ibm.box.com/shared/static/xxxxx.tgz

What you can do in 2022 is something like this:
wget "https://your_university.app.box.com/index.php?rm=box_download_shared_file&vanity_name=your_private_name&file_id=f_your_file_id"
You can find this link in the POST method in an incognito under Google Chrome's network tab. Note that the double quotes escape characters.

Related

Do not create directory when using wget with `--page-requisites` option

I'm using wget with --page-requisites option. I'd like to combine this option with --directory-prefix. So for example when calling wget --page-requisites --directory-prefix=/tmp/1 https://google.com would download the google page to /tmp/1/ directory without creating it's own folder (like google.com).
I'd expect the google homepage to end up at /tmp/1/index.html
Is there a way to do this without creating some kind of script that would move the files where I want them to be?
Ok using option --no-directories seems to do the trick.

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a links refers to a file or directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
Lots o' files!
If I were to request this with wget -r, then I wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
Lots o' files!
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output of adding --debug to your command line clarify things in that case.
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match for both the linked file, and the downloaded file. So my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx that are linked in the webpage, then when wget does the second check, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.

Creating a static copy of a MoinMoin site

I have a MoinMoin site which I've inherited from a previous system
administrator. I'd like to shut it down but keep a static copy of the
content as an archive, ideally with the same URLs. At the moment I'm
trying to accomplish this using wget with the following parameters:
--mirror
--convert-links
--page-requisites
--no-parent
-w 1
-e robots=off
-user-agent="Mozilla/5.0"
-4
This seems to work for getting the HTML and CSS, but it fails to
download any of the attachments. Is there an argument I can add to wget
which will get round this problem?
Alternatively, is there a way I can tell MoinMoin to link directly to
files in the HTML it produces? If I could do that then I think wget
would "just work" and download all the attachments. I'm not bothered
about the attachment URLs changing as they won't have been linked to
directly in other places (e.g. email archives).
The site is running MoinMoin 1.9.x.
My version of wget:
$ wget --version
GNU Wget 1.16.1 built on linux-gnu.
+digest +https +ipv6 +iri +large-file +nls +ntlm +opie -psl +ssl/openssl
The solution in the end was to use MoinMoin's export dump functionality:
https://moinmo.in/FeatureRequests/MoinExportDump
It doesn't preserve the file paths in the way that wget does, but has the major advantage of including all the files and the attachments.

How do I get the raw version of a gist from github?

I need to load a shell script from a raw gist but I can't find a way to get raw URL.
curl -L address-to-raw-gist.sh | bash
And yet there is, look for the raw button (on the top-right of the source code).
The raw URL should look like this:
https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{commit_hash}/{file}
Note: it is possible to get the latest version by omitting the {commit_hash} part, as shown below:
https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{file}
February 2014: the raw url just changed.
See "Gist raw file URI change":
The raw host for all Gist files is changing immediately.
This change was made to further isolate user content from trusted GitHub applications.
The new host is
https://gist.githubusercontent.com.
Existing URIs will redirect to the new host.
Before it was https://gist.github.com/<username>/<gist-id>/raw/...
Now it is https://gist.githubusercontent.com/<username>/<gist-id>/raw/...
For instance:
https://gist.githubusercontent.com/VonC/9184693/raw/30d74d258442c7c65512eafab474568dd706c430/testNewGist
KrisWebDev adds in the comments:
If you want the last version of a Gist document, just remove the <commit>/ from URL
https://gist.githubusercontent.com/VonC/9184693/raw/testNewGist
One can simply use the github api.
https://api.github.com/gists/$GIST_ID
Reference: https://miguelpiedrafita.com/github-gists
Gitlab snippets provide short concise urls, are easy to create and goes well with the command line.
Sample example: Enable bash completion by patching /etc/bash.bashrc
sudo su -
(curl -s https://gitlab.com/snippets/21846/raw && echo) | patch -s /etc/bash.bashrc

wget download name

I'd like to write a function that, given a URL, returns the name of the file downloaded by wget URL.
I don't understand the behavior of wget very well. If I do wget on python.org, www.python.org, http://www.python.org, or http://www.python.org/, the name of the file downloaded is index.html.
However, if I do www.python.org/about, the name of the file downloaded is about, instead of index.html.
The reason your wget fetches index.html in the first cases is because that's the default "home page" that the server points to. python.org, www.python.org, http://www.phython.org, and http://www.python.org/ aren't files, so the server points wget to index.html. It points your browser there, too, though you don't usually see it. www.python.org/about is a different page, so it makes sense that the file it downloads has a different name.
Might I recommend the man page for wget if you want to know how it works? If it's the name of the downloaded file that concerns you, you have the option to change it via the -O option.