How do I download pages from a site that aren't linked? - wget

I am trying to mirror the entire site "citypaper.com" using wget.
At first all it would do was download index.html and stop.
Then I found the solution:
wget -r -p -e robots=off http://www.citypaper.com
Now it downloads the pages linked from index.html, as well as the pages linked from those, and so on...
The problem is that there are thousands of pages that are no longer linked from any of these pages.
Is there a way for wget to download these pages as well?

What you want is a web crawler, I think. You can start with a tool like this to get a feel for it:
https://www.screamingfrog.co.uk/crawl-javascript-seo/
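For example, if a crawler (or the site's sitemap.xml, if it publishes one) can give you a plain list of URLs, you can feed that list straight to wget; urls.txt here is just a hypothetical file name:
wget -e robots=off -p -i urls.txt
The -i (--input-file) option makes wget fetch every URL listed in the file instead of only the ones it discovers by following links.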

Related

Wget creates some of the directories only at the end of the mirroring

I'm currently mirroring www.typingstudy.com
wget --mirror --page-requisites --convert-link --no-clobber --no-parent --domains typingstudy.com https://www.typingstudy.com/
And wget creates the directories that contain the site's HTML files only at the end of the scraping, so when it tries to download those HTML files before the directories they belong in have been created, it reports errors in the console output.
Sometimes it downloads only one file into a directory (for example "part") and then refuses to see that directory while trying to download the other ~10 files from that same directory, saying that the directory does not exist.
Can someone help me understand what's wrong with my command? Or is it a bug in wget? (Probably not.)
Thanks in advance.
When I start the download a second time, everything is fine: wget downloads the other ~10 HTML files into the directories (such as "part") that were created during the previous session. So the problem is that I have to run the download twice, at least for this site.
And I totally do not understand why this is happening.
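One thing that may be worth checking: according to the wget manual, --mirror turns on timestamping (-N), which does not mix well with --no-clobber, and --convert-links only rewrites links once the whole download has finished. A sketch of the same command without --no-clobber and with --convert-links spelled out, in case that interaction is involved (this is a guess, not a confirmed diagnosis):
wget --mirror --page-requisites --convert-links --no-parent --domains typingstudy.com https://www.typingstudy.com/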

Download GitHub public assets from the CLI and a Dockerfile

I'm trying to download the network models of a neural network project used to detect NSFW images. The packed models are available in the assets section of a release at this URL: https://github.com/notAI-tech/NudeNet/releases
I would like to create a Dockerfile which downloads these assets when built. I started with the ADD command, but the files didn't get downloaded entirely (only a few kB out of the 120 MB of some files). So I tried from my Linux CLI using wget and curl... but nothing worked as expected. For example, the command:
curl -OJL https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
Starts the download but fetches only an HTML file instead of the actual ONNX file... It seems GitHub is doing some kind of redirection, and I don't know how to handle it with curl/wget and, finally, with the ADD command of a Dockerfile.
I did visit
https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
in my browser and I got a login page, so apparently the file is not publicly available. That would explain why you got a small HTML file (the login form).
GitHub is doing some kind of redirection and I don't know how I can handle it with (...) wget
You need to provide authentication data. I do not know exactly how it is done in this case, but I suspect they use one of the popular methods: basic authentication (see the wget options --http-user=user and --http-password=pass) or a cookie-based solution (see the wget options --load-cookies file, --save-cookies file and --keep-session-cookies).
The mentioned options are described in the wget man page, which you can read online or by running man wget in a terminal.
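If the asset does turn out to be downloadable without logging in once the redirect is followed, a minimal Dockerfile step could look like the sketch below; the base image and the /models/ target path are arbitrary choices for illustration, not anything the project prescribes:
FROM debian:bookworm-slim
# fetch the model at build time; -L makes curl follow GitHub's redirect to the asset
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
 && mkdir -p /models \
 && curl -fSL -o /models/classifier_model.onnx \
      https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
If the release still requires authentication, you would additionally have to pass credentials or cookies along, as described above.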

wget saving HTML files as .html.en files instead of .html files

I am using wget to download help.ubuntu.com so I will be able to use it offline. I downloaded the entire site with the command
wget -U firefox -m -l -D help.ubuntu.com --follow-ftp -np "https://help.ubuntu.com/stable/ubuntu-help/" -e robots=off
Everything was fine until I tried navigating the saved pages. The index.html is fine, but when I follow a link, it breaks and just shows the HTML code of the next page. I think it has to do with how the files were saved, because the pages are saved with names like example.ubuntu.help.html.en
So my question is, assuming I entered the command correctly, how do I change the saved pages to end with just .html and not .en? I'm fine with having to re-download the pages, but I wonder whether this is normal or whether I messed up the command.
I'm running Ubuntu 20.04 LTS.
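For reference, if the problem really were only the file names, a small rename pass after the download would strip the .en suffix; this is just a sketch, assuming every affected file ends in .html.en and that it is run from the top of the downloaded tree, and note that links inside the saved pages may still point at the old names:
find . -type f -name '*.html.en' -exec sh -c 'mv "$1" "${1%.en}"' _ {} \;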
I found out the issue was that I was viewing the pages in Brave. Everything worked fine when I switched to Firefox. If only I had realized that sooner, I'd have saved myself some embarrassment, lol.

Rename the Directory Index of a web page downloaded with wget to index.html

I am currently using a wget command that is fairly complicated, but the essence of it is the -p and -k flags to download all the pre-requisites. How do I rename the main downloaded file to index.html?
For instance, I download a webpage
http://myawesomewebsite.com/something/derp.html
This will, for example, download:
derp.html
style.css
firstimage.png
secondimage.jpg
And maybe even an iFrame:
iframe.html
iframe-style.css
So now the question is how do I rename derp.html to index.html, without accidentally renaming iframe.html to index.html as well, given that I don't know what the name of the resolved downloaded file may be?
When I tried this method on a Tumblr page with URL http://something.tumblr.com/34324/post it downloaded as page.html.
I've tried the --output-document flag, but that results in nothing being downloaded at all.
Thanks!
This is what I ended up doing:
If there was no index.html after downloading, I used Ruby to extract the derp.html part from the URL, then searched for that file and renamed it to index.html.
It's not as elegant as I would like, but it works.
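A rough shell equivalent of that approach, using the hypothetical URL and file names from the question; it only renames the file whose name matches the last path segment of the URL, so iframe.html is left alone:
url="http://myawesomewebsite.com/something/derp.html"
name="$(basename "$url")"                              # -> derp.html
target="$(find . -type f -name "$name" | head -n 1)"   # locate the downloaded copy
if [ -n "$target" ] && [ ! -e "$(dirname "$target")/index.html" ]; then
  mv "$target" "$(dirname "$target")/index.html"
fi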

Download a site with wget without a specific folder

How do I download a site for offline viewing while skipping a specific folder?
For example, I want to download the site without the http://site.com/forum/ subdirectory.
wget --help
might lead you to
-nH, --no-host-directories don't create host directories.
I'd try that first, but I'm not sure whether it will do what you want.
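For what it's worth, -nH only changes the local directory layout (it stops wget from creating a top-level site.com/ folder). If the goal is to skip the /forum/ part of the site itself, the --exclude-directories (-X) option may be closer to what is wanted; a sketch using the example URL from the question:
wget -r -np -nH -X /forum http://site.com/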