How to write PHP so that the correct file can be downloaded over the Linux command line? - wget

So basically I have a problem where the user sends a request to test.php?getnextfile=1, and I need to process the request, figure out which file is next in line to be downloaded, and deliver it to the user. The part I'm stuck on is how to get the correct filename to the user (the server knows the correct filename, the user doesn't).
Currently I've tried using wget on test.php?getnextfile=1, and it doesn't actually save with the correct filename. I've also tried a header redirect to the correct file, and that doesn't work either.
Any ideas?
Thanks a lot!
Jason

Since July 2010, this has been impossible with wget's default configuration. In the process of fixing a security bug, the developers switched the trust-server-names option off by default. For more information, see this answer:
https://serverfault.com/questions/183959/how-to-tell-wget-to-use-the-name-of-the-target-file-behind-the-http-redirect
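If you control the client, newer wget builds let you opt back in to the old behaviour. A minimal sketch, assuming the test.php URL from the question (the hostname is a placeholder):
# trust the filename coming from the redirect target (off by default since the fix)
wget --trust-server-names "http://yourserver/test.php?getnextfile=1"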

In your PHP script you need to set the Content-Disposition header:
<?php
// Advertise the payload type and suggest the filename the client should save as.
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="myfile.zip"');
header('Content-Length: ' . filesize('myfile.zip'));
readfile('myfile.zip');
?>
Use curl instead of wget to test it; --remote-header-name makes curl honor the server-supplied filename:
curl --remote-name --remote-header-name http://127.0.0.1:8080/download
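If you still want to test with wget, its manual documents an experimental flag that honors the Content-Disposition filename; a sketch against the same local server:
# save under the name sent in the Content-Disposition header (experimental flag)
wget --content-disposition http://127.0.0.1:8080/download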

Related

How to download a file from Box using wget?

I've created a direct link to a file in Box:
The previous link is to the browser web interface, so I've then shared it with a direct link:
However, if I download the file with wget I receive garbage.
How can I download the file with wget?
I was able to download the file by making the link public, then replacing /s/ in the URL with /shared/static.
So my final command was:
curl -L https://MYUNI.box.com/shared/static/EXAMPLEtzwosac6pz --output myfile.zip
This can probably be modified for wget.
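For wget, a minimal sketch under the same assumptions (the /shared/static link is the example placeholder above; wget follows the redirect by default):
# -O sets the local filename explicitly
wget -O myfile.zip https://MYUNI.box.com/shared/static/EXAMPLEtzwosac6pz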
I might be a bit late to the party, but FWIW:
I tried to do the same thing in order to download a folder.
I went to the Box UI and opened the browser's network tab in the developer tools.
Then I clicked on download and copied the first request generated as cURL; it was something like this (many headers and options removed for readability):
curl 'https://app.box.com/index.php?folder_id=122215143745&rm=box_v2_zip_folder'
The response to this request is a JSON object containing a link for downloading the folder:
{
    "use_zpdl": "true",
    "result": "success",
    "download_url": <some long url>,
    "progress_reporting_url": <some other url>
}
I then executed wget -L <download_url> and was able to download the file using wget.
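A rough end-to-end sketch of that flow, assuming jq is installed and that the stripped session headers are still supplied with the request (the folder_id is the example value above):
# pull download_url out of the JSON, then fetch the zip
# (in practice the curl call also needs the cookie/session headers omitted above)
url=$(curl -s 'https://app.box.com/index.php?folder_id=122215143745&rm=box_v2_zip_folder' | jq -r '.download_url')
wget -O folder.zip "$url"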
The solution was to name the output file with -O and let wget handle the HTTP redirect (wget follows redirects by default; it is curl that needs -L for that):
wget -v -O myfile.tgz https://ibm.box.com/shared/static/xxxxx.tgz
What you can do in 2022 is something like this:
wget "https://your_university.app.box.com/index.php?rm=box_download_shared_file&vanity_name=your_private_name&file_id=f_your_file_id"
You can find this link in the POST request shown in Google Chrome's network tab in an incognito window. Note the double quotes around the URL; they keep the shell from treating characters like & specially.

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the URL will have a query string, and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?p=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#p=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the filename that wget saves by parsing the output: the name that appears after Saving to: is the one I need.
However, this name is wrapped in a strange character (â); rather than just removing it with hardcoded logic, where does it come from?
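A minimal sketch of that parsing approach (an illustration, not from the original post; it assumes a UTF-8 English locale, and the wrapping characters are wget's Unicode quotation marks ‘ and ’, which render as â bytes when the terminal isn't decoding UTF-8):
# wget logs to stderr; extract the name from between its quotation marks
fname=$(wget "http://www.onlinetechvision.com/?p=566" 2>&1 | sed -n "s/.*Saving to: ‘\(.*\)’.*/\1/p")
echo "$fname"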
If you try the --adjust-extension parameter
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you get closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is now simple to rename that file to index.html, even from a script.
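For example, a one-line cleanup (the saved name is the one above; the quotes matter because of the ? in the filename):
# rename the adjusted file back to a plain index.html
mv 'www.onlinetechvision.com/index.html?p=566.html' www.onlinetechvision.com/index.html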
If you are on a Microsoft OS, make sure you have a recent version of wget; it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
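To make that concrete (the example.com URLs are hypothetical):
# two query strings are two resources, so wget keeps them apart on disk
wget "http://www.example.com/index.html?page=52"   # saved as index.html?page=52
wget "http://www.example.com/index.html?page=53"   # saved as index.html?page=53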
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files)
loop over each directory to get its main entry file (index.html)
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# Step 1: get the directory structure (spider mode: nothing is saved)
#
wget --spider -r --no-parent http://<site>/
#
# Step 2: loop through each directory and fetch its index page
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read -r line; do
    wget --wait=5 --tries=20 --page-requisites --html-extension \
         --convert-links --execute=robots=off --domains=<domain> \
         --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull the content in from another page, with a script on the server side (it may be client side if you look in the JavaScript).
Have you tried using --no-cookies? The site could be storing this information in a cookie and pulling it in when you hit the page. It could also be caused by URL rewrite logic, which you will have little control over from the client side.
Use the -O or --output-document option. See http://www.electrictoolbox.com/wget-save-different-filename/
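For a single page that would look like this (note that -O does not combine well with -p or -r, because wget would then write every downloaded document into that one file):
# force the local filename, regardless of the query string
wget -O index.html "http://www.onlinetechvision.com/?p=566"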

Can't gzip my mysqldump on the command line windows

I'm trying to make a backup of my MySQL db and zip the file.
Every time I try to run this command...
C:\wamp\bin\mysql\mysql5.5.24\bin\mysqldump.exe -u username -ppassword db_name | gzip > sites\www.site.com-local\backups\backup-date.sql.gz
...all I get is an error saying "'gzip' is not recognized as an internal or external command, operable program or batch file".
I've used the following resources hoping they would help, but they haven't:
http://www.zigpress.com/2009/04/09/enabling-gzip-on-wamp/
http://dnhome.wordpress.com/2011/10/06/apache-wampserver-enable-compression-gzip/
http://www.dewebbouwmeester.nl/enabling-gzip-on-wamp/
All say the same thing, but nothing happens.
Could someone please shine some light on what I am doing wrong?
I had this same problem myself with WAMP.
All you need to do is:
Download gzip.exe from here: http://www.gzip.org/#exe
Place gzip.exe in your PHP folder
That error is telling you that gzip is nowhere your system can find it. You can blame however you installed gzip.
Check the value of your PATH environment (or system) variable, then install a copy of gzip in one of the directories it lists.
Yes, you have to put gzip.exe in your WAMP tree, e.g. wamp/bin/mysql/[your mysql version]/bin.
Find or install gzip locally. Then, directly in the PHP folder, create a gzip.bat file that forwards to the fully qualified path of the existing gzip binary, and save it.
For example: C:\git\bin\gzip.exe
This way it's not necessary to place gzip.exe itself in your PHP folder.
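A minimal sketch of such a wrapper, saved as gzip.bat in the PHP folder (the binary path is the hypothetical example above):
@echo off
rem forward all arguments to the real gzip binary
C:\git\bin\gzip.exe %*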

How do I get the raw version of a gist from github?

I need to load a shell script from a raw gist but I can't find a way to get the raw URL.
curl -L address-to-raw-gist.sh | bash
And yet there is one: look for the Raw button (at the top right of the source code).
The raw URL should look like this:
https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{commit_hash}/{file}
Note: it is possible to get the latest version by omitting the {commit_hash} part, as shown below:
https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{file}
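With that, the use case from the question becomes (placeholders as above):
# fetch the latest raw version of the gist and pipe it to bash
curl -L https://gist.githubusercontent.com/{user}/{gist_hash}/raw/{file} | bash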
February 2014: the raw URL just changed.
See "Gist raw file URI change":
The raw host for all Gist files is changing immediately.
This change was made to further isolate user content from trusted GitHub applications.
The new host is
https://gist.githubusercontent.com.
Existing URIs will redirect to the new host.
Before it was https://gist.github.com/<username>/<gist-id>/raw/...
Now it is https://gist.githubusercontent.com/<username>/<gist-id>/raw/...
For instance:
https://gist.githubusercontent.com/VonC/9184693/raw/30d74d258442c7c65512eafab474568dd706c430/testNewGist
KrisWebDev adds in the comments:
If you want the latest version of a Gist document, just remove the <commit>/ part from the URL
https://gist.githubusercontent.com/VonC/9184693/raw/testNewGist
One can simply use the GitHub API:
https://api.github.com/gists/$GIST_ID
Reference: https://miguelpiedrafita.com/github-gists
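The JSON it returns includes a raw_url for each file; a minimal sketch for extracting the first one (assuming jq is installed):
# query the Gist API, then pull out the first file's raw_url
curl -s "https://api.github.com/gists/$GIST_ID" | jq -r '.files | to_entries | .[0].value.raw_url'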
GitLab snippets provide short, concise URLs, are easy to create, and go well with the command line.
Sample example: Enable bash completion by patching /etc/bash.bashrc
sudo su -
(curl -s https://gitlab.com/snippets/21846/raw && echo) | patch -s /etc/bash.bashrc

wget download name

I'd like to write a function that, given a URL, returns the name of the file downloaded by wget URL.
I don't understand the behavior of wget very well. If I do wget on python.org, www.python.org, http://www.python.org, or http://www.python.org/, the name of the file downloaded is index.html.
However, if I do www.python.org/about, the name of the file downloaded is about, instead of index.html.
The reason your wget fetches index.html in the first cases is that index.html is the default "home page" the server points to. python.org, www.python.org, http://www.python.org, and http://www.python.org/ aren't files, so the server points wget to index.html. It points your browser there too, though you don't usually see it. www.python.org/about is a different page, so it makes sense that the file it downloads has a different name.
Might I recommend the man page for wget if you want to know how it works? If it's the name of the downloaded file that concerns you, you can change it with the -O option.
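A minimal sketch (URL from the question):
# pick the local filename yourself instead of letting wget derive it
wget -O about.html http://www.python.org/about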