Wget creates some directories only at the end of mirroring

I'm currently mirroring www.typingstudy.com with:
wget --mirror --page-requisites --convert-links --no-clobber --no-parent --domains typingstudy.com https://www.typingstudy.com/
wget creates the directories that hold the site's HTML files only at the end of the crawl, so when it tries to download those HTML files before the directories they belong in have been created, it reports an error (see the PowerShell output screenshot).
Sometimes it downloads only one file into a directory, as in this example with the "part" directory, and then refuses to see that directory while downloading the other ~10 files from it, saying the directory does not exist.
Can someone help me understand what's wrong with my command? Or is it a bug in wget? (Probably not.)
Thanks in advance.
When I start the download again, everything is fine: wget downloads the other ~10 HTML files into the directories (such as "part") that were created during the previous session. So the problem is that I have to run the download twice, at least for this site.
And I totally do not understand why this is happening.
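For reference, here is a minimal sketch of the two-pass workaround described above (an illustration, not part of the original post): run the same mirror command twice, so the second pass fills in the files whose directories were only created during the first. --no-clobber is omitted here because it can conflict with the timestamping that --mirror implies in some wget versions.

# Two-pass sketch: the second run picks up the files whose target
# directories did not yet exist during the first run.
for pass in 1 2; do
  wget --mirror --page-requisites --convert-links --no-parent \
       --domains typingstudy.com https://www.typingstudy.com/
done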

Related

WGet multiple directories

I am relatively new to the scene and am not that experienced with WGet (I'm currently using VisualWGet, but I also have the command-line WGet). I'm trying to download many images (182,218, to be precise). I can get as far as downloading the first directory and all of the images within it; after that, it downloads only one image from each directory.
I am making sure to use a recursive search, but it seems like it does not want to enter the other directories after it exits the first one.
Here's the process:
downloads everything in directory 0
backtracks to the parent directory
downloads the first image in directory 1
downloads the first image in directory 2
etc.
The directory I'm trying to download from is http://66.11.126.173/images/, and each directory doesn't seem to be a link, but rather an image that doesn't link to another directory.
The images are listed in directories as such
http://66.11.126.173/images/0/
http://66.11.126.173/images/1/
http://66.11.126.173/images/2/
etc
Each directory has 31 variations of the same image, and there are 5878 directories. I start my downloads in images/0/, because otherwise wget wants to download the index.html file for /images/.
Any help will be greatly appreciated.
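For illustration only (this is not from the original thread), one hedged way around the recursion problem is to loop over the numbered directories explicitly instead of relying on wget to descend from /images/. The sketch below assumes the directories are numbered 0 through 5877, based on the counts given in the question.

# Hypothetical loop over the numbered image directories; -R rejects the
# per-directory index pages so only the images are kept.
for i in $(seq 0 5877); do
  wget -r -np -R "index.html*" "http://66.11.126.173/images/$i/"
done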

Wget - Overwrite files that are different

So, I'm making an updater for my game using wget. The latest version of the game (with all its files) is on my server. I want wget to download only the files that are different from a directory on the server into the www folder in the root of the game files (this has to be recursive, since not all of the game's files are stored directly in that folder). When the command is run, it should check whether the file's hash on the server (if possible, otherwise its size) matches the one in the game's files. If it doesn't, it should download the file from the server and replace the one in the game's directory. This way, the player won't have to re-download all of the game's files.
Here's the command I used for testing:
wget.exe -N -r ftp://10.42.0.1/gamupd/arpg -O ./www
The server is running on my local network, so I use that IP address.
The thing is that it doesn't save the contents of /gamupd/arpg to the www folder, instead it seems to copy the directory tree of arpg.
Maybe the timestamping flag will satisfy you. From wget --help:
-N, --timestamping         don't re-retrieve files unless newer than local
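A hedged sketch of how this could fit the setup described in the question (the server address and path are taken from the question; the --cut-dirs depth is an assumption). Note that -O names a single output file and doesn't combine usefully with -r, so -P is used here to choose the target folder instead. Also, -N compares timestamps and file sizes rather than hashes, which is the closest wget gets to the "only changed files" behaviour described.

# Sketch: mirror gamupd/arpg into ./www, re-fetching only files that are
# newer (or differently sized) on the server. -nH drops the host
# directory and --cut-dirs=2 drops gamupd/arpg, so the contents land
# directly inside ./www.
wget.exe -N -r -nH --cut-dirs=2 -P ./www ftp://10.42.0.1/gamupd/arpg/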

wget download and rename files that originally have no file extension

I have a wget download I'm trying to perform.
It downloads several thousand files unless I start to restrict the file type (junk files, etc.). In theory, restricting the file type is fine.
However, there are lots of files that wget downloads without a file extension which, when opened manually with Adobe for example, are actually PDFs. These are the files I actually want.
Restricting the wget to filetype PDF does not download these files.
So far my syntax is wget -r --no-parent -A.pdf www.websitehere.com
Using wget -r --no-parent www.websitehere.com brings me every file type, so in theory I have everything. But this means I have thousands of junk files to remove, and then several hundred useful files of unknown type to rename.
Any ideas on how to wget and save the files with the appropriate file extension?
Alternatively, is there a way to restrict wget to only files without a file extension, and then a separate batch method to determine each file's type and rename it appropriately?
Manually testing every file to determine the appropriate application will take a lot of time.
Appreciate any help!
wget has an --adjust-extension option, which will add the correct extensions to HTML and CSS files. Other files (like PDFs) may not be handled, though; see the wget documentation for the details.
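As for the "separate batch method" mentioned in the question, here is a hedged sketch (not from the original answer) that uses the file utility to detect PDFs among extension-less files after a full download and rename them. It assumes a Unix-like shell and that the download landed under the www.websitehere.com directory wget creates by default.

# Sketch: find files with no extension, check their MIME type with
# file(1), and append .pdf to the ones that are really PDFs.
find www.websitehere.com -type f ! -name "*.*" | while read -r f; do
  if [ "$(file --brief --mime-type "$f")" = "application/pdf" ]; then
    mv "$f" "$f.pdf"
  fi
done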

Rename the Directory Index of a web page downloaded with wget to index.html

I am currently using a wget command that is fairly complicated, but the essence of it is the -p and -k flags to download all the pre-requisites. How do I rename the main downloaded file to index.html?
For instance, I download a webpage
http://myawesomewebsite.com/something/derp.html
This will, for example, download:
derp.html
style.css
firstimage.png
secondimage.jpg
And maybe even an iFrame:
iframe.html
iframe-style.css
So now the question is how do I rename derp.html to index.html, without accidentally renaming iframe.html to index.html as well, given that I don't know what the name of the resolved downloaded file may be?
When I tried this method on a Tumblr page with URL http://something.tumblr.com/34324/post it downloaded as page.html.
I've tried the --output-document flag, but that results in nothing being downloaded at all.
Thanks!
This is what I ended up doing:
If there was no index.html after downloading, I used Ruby to get the derp.html part of the URL, searched for that file, and renamed it to index.html.
It's not as elegant as I would like, but it works.
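For illustration, here is a rough shell equivalent of that post-processing step (the answer used Ruby; this sketch is an assumption, not the author's script, and it relies on GNU find being available):

# Sketch: take the final path segment of the URL, locate the downloaded
# file, and rename it to index.html if no index.html exists beside it.
url="http://myawesomewebsite.com/something/derp.html"   # example URL from the question
page=$(basename "$url")                                  # -> derp.html
dir=$(find . -name "$page" -printf '%h\n' | head -n 1)   # directory wget placed it in
if [ -n "$dir" ] && [ ! -e "$dir/index.html" ]; then
  mv "$dir/$page" "$dir/index.html"
fi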

Download a site with wget without a specific folder

How do I download a site for offline viewing while skipping a specific folder?
For example, I want to download the site without the http://site.com/forum/ sub-directory.
wget --help
might lead you to
-nH, --no-host-directories don't create host directories.
I'd try that first, but I'm not sure whether it will do what you want.
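For completeness, a hedged sketch that is not part of the answer above: wget also has an -X/--exclude-directories option, which addresses the "skip one sub-directory" part of the question more directly.

# Sketch: mirror the site for offline viewing while excluding /forum/.
# -X takes a comma-separated list of directories to skip.
wget --mirror --convert-links --page-requisites --no-parent \
     -X /forum http://site.com/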