Selectively remove subdirectory structure using wget

In order to download a page I use the following command:
wget -kmnH -p http://www.example.com/my/own/folder/page.htm
I get the following directory structure:
/js_folder
/css_folder
/my
  /own
    /folder
      page.htm
What I need is the following directory structure:
/js_folder
/css_folder
/page.htm
I tried using --cut-dirs=3, but it puts all the js and css files in the same folder as page.htm.
Is there a way to avoid this problem while calling wget only once?
Other solutions are welcome as well!
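One workaround (not a single wget invocation, but the question welcomes other solutions) is to let wget build the full tree as above and then flatten the page path afterwards. A rough sketch, reusing the example paths from the question; note that --convert-links rewrites links relative to the file's original location, so the links inside page.htm may need fixing after the move.

wget -kmnH -p http://www.example.com/my/own/folder/page.htm
mv my/own/folder/page.htm .   # pull the page up to the top level
rm -r my                      # remove the now-empty my/own/folder tree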

Related

wget: download file to filename using another level of the URL path

I have a list of URL paths saved as, say, listOfFiles.txt, containing the following:
https://domain.path/level1/level2/name-of-file-01/index.html
https://domain.path/level1/level2/name-of-file-02/index.html
https://domain.path/level1/level2/name-of-file-03/index.html
...
where the name-of-file-xx has no pattern. For example,
https://domain.path/level1/level2/cR2xcet/index.html
https://domain.path/level1/level2/fse4scx/index.html
...
Question: How do you download each index.html here and save it under its own name-of-file-xx name, using wget?
EDIT: What other options/arguments do we need to add to the following command to solve this problem?
wget -np -l1 -i listOfFiles.txt
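As far as I know, wget has no built-in way to name the output file after a different level of the URL path, so one approach is a small shell loop that extracts the name-of-file-xx segment and passes it to -O. A minimal sketch, assuming listOfFiles.txt looks exactly as above:

while read -r url; do
    name=$(basename "$(dirname "$url")")   # the segment before index.html, i.e. name-of-file-xx
    wget -O "${name}.html" "$url"
done < listOfFiles.txt

Since each URL is fetched individually, the -i, -l1 and -np options from the original command are no longer needed.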

Do not create directory when using wget with `--page-requisites` option

I'm using wget with the --page-requisites option. I'd like to combine this option with --directory-prefix, so that, for example, calling wget --page-requisites --directory-prefix=/tmp/1 https://google.com would download the Google page to the /tmp/1 directory without creating its own folder (like google.com).
I'd expect the google homepage to end up at /tmp/1/index.html
Is there a way to do this without creating some kind of script that would move the files where I want them to be?
OK, using the --no-directories option seems to do the trick.
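For reference, the combined command would look something like this (same URL and prefix as in the question):

# -nd/--no-directories drops the host and path folders, so the page and its
# requisites all land directly in /tmp/1
wget --page-requisites --no-directories --directory-prefix=/tmp/1 https://google.com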

Monitor directory for files and issue wget command?

I have a wget command running, but due to the website it regularly downloads "fake data", so I currently download it dozens of times to get the correct content.
I now want the following:
use the wget command
check if there are files <100kB in that directory
if yes -> use robocopy to save the >100kB files to another folder; repeat with wget
if no -> stop
I have the wget command and the robocopy command.
Now I only need a way to check whether a directory contains a file <100kB, and how to put this together into a cmd/PowerShell/batch script or whatever.
What would the command for that look like? In particular, I have no idea how to check the files.
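A minimal PowerShell sketch of that loop; the directory paths and the wget invocation are placeholders to be replaced with your own, and robocopy's /MIN threshold is given in bytes (100 kB = 102400):

$downloadDir = 'C:\downloads'   # placeholder: folder your wget command downloads into
$archiveDir  = 'C:\archive'     # placeholder: folder where the good (>100kB) files go

do {
    & wget.exe -P $downloadDir "https://example.com/data.html"   # placeholder for your existing wget command
    # check whether the download folder contains any file smaller than 100kB
    $small = Get-ChildItem -Path $downloadDir -File | Where-Object { $_.Length -lt 100KB }
    if ($small) {
        # yes: save the >100kB files to another folder, drop the fake ones, then retry
        robocopy $downloadDir $archiveDir /MIN:102400 /MOV | Out-Null
        $small | Remove-Item
    }
} while ($small)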

Creating a static copy of a MoinMoin site

I have a MoinMoin site which I've inherited from a previous system
administrator. I'd like to shut it down but keep a static copy of the
content as an archive, ideally with the same URLs. At the moment I'm
trying to accomplish this using wget with the following parameters:
--mirror
--convert-links
--page-requisites
--no-parent
-w 1
-e robots=off
--user-agent="Mozilla/5.0"
-4
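For reference, those flags assemble into a single invocation along these lines (the wiki URL is a placeholder):

wget --mirror --convert-links --page-requisites --no-parent \
     -w 1 -e robots=off --user-agent="Mozilla/5.0" -4 \
     http://wiki.example.org/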
This seems to work for getting the HTML and CSS, but it fails to
download any of the attachments. Is there an argument I can add to wget
which will get round this problem?
Alternatively, is there a way I can tell MoinMoin to link directly to
files in the HTML it produces? If I could do that then I think wget
would "just work" and download all the attachments. I'm not bothered
about the attachment URLs changing as they won't have been linked to
directly in other places (e.g. email archives).
The site is running MoinMoin 1.9.x.
My version of wget:
$ wget --version
GNU Wget 1.16.1 built on linux-gnu.
+digest +https +ipv6 +iri +large-file +nls +ntlm +opie -psl +ssl/openssl
The solution in the end was to use MoinMoin's export dump functionality:
https://moinmo.in/FeatureRequests/MoinExportDump
It doesn't preserve the file paths in the way that wget does, but has the major advantage of including all the files and the attachments.

Using wget to fetch only the folders needed

I'm trying to get some files in a specific website directory using wget:
wget -r http://www.example.com/Item/<folder>/<files>
Given that the page http://www.example.com/Item/ itself returns a 404, how can I get the files and folders inside it while ignoring other folders and extra files such as the following:
http://www.example.com/b/something/index.html
http://www.example.com/Category/file.html
http://www.example.com/Collection/demo.html
http://www.example.com/Javascript/data.js
http://www.example.com/Stylesheet/type.css
etc.
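One option worth sketching here is wget's --include-directories (-I), which limits both downloading and link-following to the listed directory trees. Because /Item/ itself returns a 404, the crawl has to start from a URL that does exist; the command below reuses the question's <folder> placeholder for a known subfolder:

# -I /Item keeps the recursion inside the /Item tree, so /b, /Category,
# /Collection, /Javascript, /Stylesheet, etc. are never fetched
wget -r --include-directories=/Item http://www.example.com/Item/<folder>/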