wget: download a site without a specific folder

How do I download a site for offline viewing while excluding a specific folder?
For example, I want to download the site but skip the http://site.com/forum/ sub-directory.

wget --help might lead you to:
-nH, --no-host-directories    don't create host directories
I'd try that first, but I'm not sure whether it will do what you want.
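A minimal sketch of how that suggestion fits into a recursive download; the first command uses only the flag from the answer above, while --exclude-directories on the second line is not mentioned in the answer and is an assumption about what might actually skip the /forum/ sub-directory:
wget -r -nH http://site.com/
wget -r -nH --exclude-directories=/forum http://site.com/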

Related

Wget creates some directories only at the end of the mirroring

I'm currently mirroring www.typingstudy.com
wget --mirror --page-requisites --convert-links --no-clobber --no-parent --domains typingstudy.com https://www.typingstudy.com/
wget creates the directories that contain HTML files only at the end of the scraping, so when it tries to download those HTML files before the directories they belong in have been created, it reports an error (shown only as a screenshot of PowerShell output in the original post).
Sometimes it downloads only one file to a directory (for example, one named "part") and then refuses to see that directory while trying to download the other ~10 files from it, saying the directory does not exist (again shown only as a screenshot in the original post).
Can someone help me understand what's wrong with my command, or is it a bug in wget? (Probably not.)
Thanks in advance.
When I start the download again, everything works: wget downloads the other ~10 HTML files into the directories (such as "part") that were created during the previous session. So I have to start the download twice, at least for this site.
I don't understand why this is happening.
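For reference, a sketch of the two-pass workaround described above: the same mirror command from the question is simply run twice, on the assumption (based on the observation above) that the second pass fetches the files whose directories did not yet exist during the first pass.
for i in 1 2; do
  wget --mirror --page-requisites --convert-links --no-clobber --no-parent --domains typingstudy.com https://www.typingstudy.com/
done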

Download Github public assets from CLI and Dockerfile

I'm trying to download the network models of a neural-network project used to detect NSFW images. The packed models are available in the assets section of a release at this URL: https://github.com/notAI-tech/NudeNet/releases
I would like to create a Dockerfile that downloads these assets when it is built. I started with the ADD instruction, but the files were not downloaded entirely (only a few kB out of the 120 MB of some files). So I tried wget and curl from my Linux CLI, but nothing worked as expected. For example, the command:
curl -OJL https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
starts the download but only retrieves an HTML file instead of the actual ONNX file. It seems GitHub is doing some kind of redirection, and I don't know how to handle it with curl/wget, and ultimately with the ADD instruction of a Dockerfile.
I did visit
https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
in my browser and I got a login page, so apparently it is not publicly available. That would explain why you got a small HTML file (one containing a login form).
Github is doing some kind of redirection and I don't know how I can handle it with (...) wget
You need to provide authentication data. I do not know exactly how it is done in this case, but I suspect they use one of the popular methods: basic authentication (see the wget options --http-user=user and --http-password=pass) or a cookie-based solution (see the wget options --load-cookies file, --save-cookies file, and --keep-session-cookies).
The mentioned options are described in the wget man page, which you can read by running man wget in a terminal.
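A minimal sketch of the two approaches named above; the credentials and the cookies.txt file are placeholders (for example, cookies exported from a logged-in browser session), not values from the source, and it is not verified that GitHub accepts either method for this URL:
wget --http-user=USER --http-password=PASS https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
wget --load-cookies cookies.txt --keep-session-cookies https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx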

How to download files from adf.ly to a remote server?

I am trying to download a file from the website adf.ly to a remote server, but when I try wget it downloads the website's HTML code instead. Help?
First start the download on your own computer (and watch the ad and everything). Then, once it is downloading the file, you should be able to get the URL it is downloading from by looking at your "Downloads" section (in Chrome, for example). Copy that URL and use it to download from the remote server.
Make sure you are wget'ing the link to the actual file, not the adf.ly link. adf.ly has it set up this way to force you to watch their ads.
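A minimal sketch of that workflow on the remote server; the URL is a placeholder for whatever direct link the browser's Downloads section shows, not a real address from the source:
wget "https://example-file-host.com/files/archive.zip" -O archive.zip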

Wget - Overwrite files that are different

So, I'm making an updater for my game using wget. The latest version of the game (with all its files) is on my server. I want wget to download only the files that differ from a directory on the server into the www folder in the root of the game files (this has to be recursive, since not all of the game's files are stored directly in that folder). When the command is run, it should check whether each file's hash (if possible, otherwise its size) on the server matches the local copy in the game's files. If it doesn't, it should download the file from the server and replace the one in the game's directory. This way, the player won't have to re-download all of the game's files.
Here's the command I used for testing:
wget.exe -N -r ftp://10.42.0.1/gamupd/arpg -O ./www
The server is running on my local network, so I use that IP address.
The thing is that it doesn't save the contents of /gamupd/arpg to the www folder; instead, it seems to copy the directory tree of arpg.
Maybe the timestamping flag will satisfy you. From wget --help:
-N, --timestamping    don't re-retrieve files unless newer than local
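A sketch of how -N might be combined with options that control where files are placed; the -nH, --cut-dirs, and -P flags are not part of the answer above and are an assumption about how to drop the gamupd/arpg directory tree and write the files directly into ./www:
wget.exe -N -r -nH --cut-dirs=2 -P ./www ftp://10.42.0.1/gamupd/arpg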

Rename the Directory Index of a web page downloaded with wget to index.html

I am currently using a wget command that is fairly complicated, but the essence of it is the -p and -k flags to download all the pre-requisites. How do I rename the main downloaded file to index.html?
For instance, I download a webpage
http://myawesomewebsite.com/something/derp.html
This will, for example, download:
derp.html
style.css
firstimage.png
secondimage.jpg
And maybe even an iFrame:
iframe.html
iframe-style.css
So the question is: how do I rename derp.html to index.html without accidentally renaming iframe.html to index.html as well, given that I don't know in advance what the name of the downloaded file will be?
When I tried this method on a Tumblr page with the URL http://something.tumblr.com/34324/post, it downloaded as page.html.
I've tried the --output-document flag, but that results in nothing being downloaded at all.
Thanks!
This is what I ended up doing:
If no index.html was found after downloading, I used Ruby to extract the derp.html part of the URL, then searched for derp.html and renamed it to index.html.
It's not as elegant as I would like, but it works.
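A shell translation of that approach (the original used Ruby); the URL is the example from the question and the logic is only an assumption based on the description above:
URL="http://myawesomewebsite.com/something/derp.html"
NAME="$(basename "$URL")"                 # -> derp.html
if [ ! -f index.html ] && [ -f "$NAME" ]; then
    mv "$NAME" index.html                 # rename only the main page, not iframe.html
fi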