Update: I upgraded wget from 1.10 to 1.12 and that solved the problem.
For example, given
www.example.com/level1/level2/../test.html
both wget and a browser will visit
www.example.com/level1/test.html
But for
www.example.com/../test.html
wget will visit
www.example.com/../test.html
while a browser will visit
www.example.com/test.html
I was using wget to parse some web pages to get their size and the elements inside them.
Now I found that some web pages use "../css/xxx.jpg" instead of "css/xxx.jpg".
Visiting such a page works fine in a browser, but not with wget.
Is there a way to solve this? Thank you.
Before passing URLs to wget, trim "../" from the beginning of the path (splitting the URLs into components would help).
How to do this depends on what language or framework you are using.
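For example, here is a minimal bash sketch (the URL is just a placeholder based on the example above) that strips leading "../" segments from the path before handing the URL to wget:
#!/bin/bash
# Rough sketch: remove leading "../" segments from a URL path before calling wget.
url="http://www.example.com/../test.html"   # placeholder input URL
proto="${url%%://*}://"                     # -> "http://"
rest="${url#*://}"                          # -> "www.example.com/../test.html"
host="${rest%%/*}"                          # -> "www.example.com"
path="/${rest#*/}"                          # -> "/../test.html"
# Drop any number of leading "../" segments.
while [[ "$path" == /../* ]]; do
    path="/${path#/../}"
done
wget "${proto}${host}${path}"               # fetches http://www.example.com/test.html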
I'm trying to download the network models of a Neural Network project used to detect NSFW images. The packed models are available in the assets section of a release at this URL: https://github.com/notAI-tech/NudeNet/releases
I would like to create a Dockerfile which downloads these assets when it is built. I started with the ADD command, but the files didn't get downloaded entirely (only a few kB of the 120 MB of some files). So I tried wget and curl from my Linux CLI... but nothing worked as expected. For example, the command:
curl -OJL https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
starts the download but only fetches an HTML file instead of the actual ONNX file. It seems GitHub is doing some kind of redirection, and I don't know how to handle it with curl/wget and, ultimately, with the ADD command of a Dockerfile.
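One common alternative, assuming the download itself can be made to work (see the answer below about authentication), is to replace ADD with a RUN step that calls curl -L so redirects are followed. A rough sketch, where the base image and target path are assumptions:
FROM python:3.10-slim
# Install curl in the image (assumes a Debian-based base image).
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# -L follows redirects; -o writes to an explicit file name.
RUN mkdir -p /models && \
    curl -L -o /models/classifier_model.onnx \
    https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx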
I did visit
https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
in my browser and I did get a login page, so apparently it is not publicly available. That would explain why you got a small HTML file (a file with a login form).
GitHub is doing some kind of redirection and I don't know how I can handle it with (...) wget
You need to provide authentication data. I do not know exactly how it is done in this case, but I suspect they use one of the popular methods: basic authentication (see the wget options --http-user=user and --http-password=pass) or a cookie-based solution (see the wget options --load-cookies file, --save-cookies file and --keep-session-cookies).
The mentioned options are described in the wget man page, which you can access by clicking the link or running man wget in a terminal.
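A rough sketch of both approaches; the credentials, the login URL and the form field names are placeholders, and which variant applies depends on how the site actually authenticates:
# Basic authentication (placeholder credentials):
wget --http-user=user --http-password=pass \
    https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx
# Cookie-based session: log in once (login URL and form fields are assumptions),
# keep the session cookies, then reuse them for the download.
wget --save-cookies cookies.txt --keep-session-cookies \
    --post-data 'login=user&password=pass' https://example.com/login
wget --load-cookies cookies.txt \
    https://github.com/notAI-tech/NudeNet/releases/download/v0/classifier_model.onnx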
I am trying to mirror the entire site "citypaper.com" using wget.
At first all it would do is download the index.html and stop.
Then I found the solution:
wget -r -p -e robots=off http://www.citypaper.com
Now it downloads pages that are linked to index.html as well as pages linked to those, and so on...
The problem is that there are thousands of pages that are no longer linked from any of these pages.
Is there a way for wget to download these pages as well?
I think what you want is a web crawler. You can start with a tool like this to get a feel for it:
https://www.screamingfrog.co.uk/crawl-javascript-seo/
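Another hedged option, assuming the site publishes a sitemap (many CMS-driven sites do, but this is an assumption; check robots.txt for the real location): pull the URLs out of sitemap.xml and feed them to wget, so pages that are no longer linked internally still get fetched.
# Sketch: extract URLs from the sitemap and fetch them with wget.
# Requires GNU grep (-P); the sitemap path is assumed.
wget -qO- http://www.citypaper.com/sitemap.xml \
    | grep -oP '(?<=<loc>).*?(?=</loc>)' > urls.txt
wget -p -e robots=off -i urls.txt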
When I try to migrate my TYPO3 6.2.31 to 7.6.23, I run into some problems.
In particular, the page tree is missing, so I get this error:
The requested resource "%2Fmain" was not found
I've tried this way to migrate:
1.) Copying the whole site
2.) Changing the symlinks to the new sources
3.) Starting the migration wizard in the Install Tool
And now, when I want to access the backend, I get the above-mentioned error.
What can I do?
Thanks.
When I call url.de/typo3, the following URL is called:
index.php?route=%252Fmain&token=XXX
The correct one should be
index.php?route=%2Fmain&token=XXX
What could be the problem with the URL?
Please follow the steps below:
Download the latest TYPO3 7 LTS source and create the symlink.
Add your typo3conf, uploads and fileadmin folders.
Open the Install Tool and clear both the PHP and TYPO3 caches.
Compare the current database specification and perform all suggested steps.
Go to the Upgrade Wizard and complete all needed steps.
Clear the caches, remove the typo3temp files and open the BE.
As mentioned here: Need to allow encoded slashes on Apache
Issue 1: Apache believes that's an invalid url
Solution: AllowEncodedSlashes On in httpd.conf
Issue 2: Apache decodes the encoded slashes
Solution: AllowEncodedSlashes NoDecode in httpd.conf (Requires Apache 2.3.12+)
Issue 3: mod_proxy attempts to re-encode (double encode) the URL, changing %2F to %252F (e.g. /example/http:%252F%252Fwww.someurl.com/)
Solution: In httpd.conf, use the ProxyPass keyword nocanon to pass the raw URL through the proxy.
ProxyPass http://anotherserver:8080/example/ nocanon
httpd.conf file:
AllowEncodedSlashes NoDecode
<Location /example/>
ProxyPass http://anotherserver:8080/example/ nocanon
</Location>
I'm trying to download a .exe using the command line.
download link: https://go.microsoft.com/fwlink/?LinkId=691980&clcid=0x409
Doing wget <link> results in a file named index.html#LinkId=691980&clcid=0x409.
How do you deal with links that contain parameters at the end? The LinkId is necessary to download the correct .exe, so I can't just drop or ignore it.
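The usual fixes, as a sketch: quote the URL so the shell does not treat & as a background operator, and either give wget an explicit output name with -O (the vs_installer.exe name is only a guess at what the link serves) or let the server's suggested filename win with --content-disposition (if the server sends one):
# Quote the URL and name the output file explicitly:
wget -O vs_installer.exe "https://go.microsoft.com/fwlink/?LinkId=691980&clcid=0x409"
# Or honour the Content-Disposition filename, if the server provides it:
wget --content-disposition "https://go.microsoft.com/fwlink/?LinkId=691980&clcid=0x409"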
I am currently using a wget command that is fairly complicated, but the essence of it is the -p and -k flags to download all the prerequisites. How do I rename the main downloaded file to index.html?
For instance, I download a webpage
http://myawesomewebsite.com/something/derp.html
This will, for example, download:
derp.html
style.css
firstimage.png
secondimage.jpg
And maybe even an iFrame:
iframe.html
iframe-style.css
So now the question is: how do I rename derp.html to index.html, without accidentally renaming iframe.html to index.html as well, given that I don't know what the name of the resolved downloaded file may be?
When I tried this method on a Tumblr page with the URL http://something.tumblr.com/34324/post, it downloaded as page.html.
I've tried the --output-document flag, but that results in nothing being downloaded at all.
Thanks!
This is what I ended up doing:
If no index.html was found after downloading, I used Ruby to extract the derp.html part of the URL, searched for derp.html, and renamed it to index.html.
It's not as elegant as I would like, but it works.
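A rough shell equivalent of that Ruby step, assuming the saved file keeps the basename of the original URL (which, as the Tumblr case above shows, is not always true):
# Sketch: derive the expected filename from the URL and rename it if index.html is missing.
url="http://myawesomewebsite.com/something/derp.html"   # placeholder URL from the question
name="$(basename "$url")"                               # -> derp.html
dir="$(echo "$url" | sed -E 's|^https?://||')"          # wget -p mirrors into host/path directories
dir="$(dirname "$dir")"                                 # -> myawesomewebsite.com/something
if [ ! -f "$dir/index.html" ] && [ -f "$dir/$name" ]; then
    mv "$dir/$name" "$dir/index.html"
fi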