Make wget convert links to other pages in input-file

I'm using wget to archive a discussion from a forum. The discussion spans several pages, navigated with next and previous buttons.
I generated a list of the page URLs and used it as the input file (--input-file). However, the --convert-links option is not converting the next and previous links, only the images.
Is there any way to make it do that?
I could use -r, but that would need a depth of 64 to get the whole discussion, so it would also fetch a lot of unwanted extra material.

I figured out a workaround. It was easy enough to turn the input file into an HTML page of links and upload it. Then, with -r and -l1, wget correctly converted the links.
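The workaround can be sketched as a small shell function. This is only a sketch under assumptions: the function name make_index, the file names urls.txt and index.html, and the example.org address in the usage comment are hypothetical and do not come from the original post.

```shell
# Sketch: turn a plain list of URLs (one per line) into a minimal HTML page
# of links. make_index and the file names are hypothetical.
make_index() {
    # $1: text file with one URL per line, $2: HTML file to write
    {
        printf '<html><body>\n'
        # Drop blank lines and escape "&" so query strings stay valid HTML.
        sed -e '/^$/d' -e 's/&/\&amp;/g' "$1" |
        # The "|| [ -n ... ]" part keeps a last line without a trailing newline.
        while IFS= read -r url || [ -n "$url" ]; do
            printf '<a href="%s">%s</a><br>\n' "$url" "$url"
        done
        printf '</body></html>\n'
    } > "$2"
}

# Usage (not run here): generate the page, upload it, then crawl one level.
#   make_index urls.txt index.html
#   wget -r -l1 --convert-links http://example.org/index.html
```

If the uploaded page lives on a different host than the forum, note that wget's recursion stays on the starting host by default, so --span-hosts (-H), ideally restricted with --domains, may also be needed.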

Related

ESGF `wget` scripts incorrectly generating; linking to a random unrelated file

About half the time when I click on WGET script following a CMIP6 data search on the ESGF (LLNL node), I get a wget script that only points to one, unrelated file. It's always the same one, too. Here's the relevant line that shows up in each wget file:
download_files="$(cat <<EOF--dataset.file.url.chksum_type.chksum
'famipc5_ne120_v0.3_00001_01_198001_198401_climo.nc' 'http://esgf.anl.gov/thredds/fileServer/esg_dataroot/ACME/climo/amip/v0_3/atm/mon/native/ne120/ens1/famipc5_ne120_v0.3_00001_01_198001_198401_climo.nc' 'SHA256' 'e5040c5df9d080437418943f02a41e84712dbe1c4a69982447712d7c7334241d'
EOF--dataset.file.url.chksum_type.chksum
)"
This happens with a wide variety of datasets. Here's one file where that happens, for example:
CMIP6.CMIP.CCCma.CanESM5.amip.r1i1p1f1.day.pr.gn
I've been searching for a reason, so far without success. A workaround is to hit the "download HTML" button 1000 times for each individual needed file instead (or set up a Globus endpoint for the files where that's possible), but it's very inconvenient and doesn't provide the functionality of a bash script.
Does anyone know what may be causing this? Is there some sort of limit to how many wget scripts an ESGF user can generate per day and these are downloaded as placeholders afterwards instead?
Grateful for any insight!
PS: I apologize for the cdo tag; I know this isn't a cdo problem, but it's hard to find relevant tags for this, and I figured that community may know what's up.

Turns out this is a browser issue. Repeating the search with Chrome fixed it.
(Also, Stack Overflow may not have been the right venue for this question, but I want it to be searchable somewhere at least.)

wget behaves differently with different addresses

I have these two URLs:
https://cdn.pixabay.com/photo/2017/06/24/09/13/dog-2437110_960_720.jpg
and
http://www.deutschland-machts-effizient.de/SiteGlobals/KAENEF/StyleBundles/Bilder/sublogo.png;jsessionid=DF603F2801D8F686FD4BCFAD770C3FC9?__blob=normal&v=3
Fetching the pictures with wget works for the first URL but not for the second. Of course, the first looks more like a picture (it ends in .jpg), but every browser I tested displayed both as pictures I could download.
For the second URL, instead of a picture I get a 2000-line HTML file that contains several img tags. I guess I could try any of the URLs in those tags, but I want to automate this for the general case, so that doesn't really help me.
What is the inherent difference between both pictures in the way they are stored on their respective server?
How can I download the second picture using wget?

Wget to download html

I have been trying to download the HTML page at http://osu.ppy.sh/u/2330158 to get the Historical data,
but wget doesn't download that part. It doesn't download General, Top Ranks, etc. either.
Is there a way to make wget download it?

That part of the page is loaded dynamically, so wget won't see it, as it doesn't support JavaScript. However, if you open the web developer tools in your browser of choice and then load the main page, you can get the URL you're really after. For this page, it's: http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0
Luckily, it's another simple, parameterised URL so you can feed that to wget:
wget "http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0"
That'll get you an html document containing just the historic data you're looking for.
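Because the URL is parameterised, the same request can be built for other profiles. The following is a minimal sketch: history_url is a hypothetical helper, and reading u as the user ID and m as some kind of mode selector is an assumption based only on the two URLs in this thread.

```shell
# Sketch: build the include URL from the answer for a given profile.
# history_url is a hypothetical helper; the meaning of "u" (user ID) and
# "m" (mode) is an assumption, not documented behaviour.
history_url() {
    # $1: user ID, $2: value for the m parameter (defaults to 0)
    printf 'http://osu.ppy.sh/pages/include/profile-history.php?u=%s&m=%s\n' "$1" "${2:-0}"
}

# Usage (not run here); the quotes keep the shell from treating "&" as the
# background operator:
#   wget -O history.html "$(history_url 2330158)"
```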

Perl Mechanize module for scraping pdfs

I have a website to which many PDFs have been uploaded, and I want to download all of them. To do so, I first need to supply a username and password to the website. After searching for some time, I found the WWW::Mechanize package, which handles this. The problem is that I want to search the website recursively: if a link does not point to a PDF, I should not simply discard it, but follow it and check whether the new page has links to PDFs. In this way I want to search the entire website exhaustively and download every uploaded PDF. Any suggestions on how to do this?

I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting PDF URLs from actually downloading them. Crawling is already a lengthy process, and it might be advantageous to be able to hand off the download tasks to separate worker processes.
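The decoupling could look roughly like the sketch below, which only covers the collection step for pages that have already been saved locally. The function name extract_pdf_links and the file names in the usage comment are hypothetical, the grep pattern is a crude stand-in for a real HTML parser, and the sketch assumes the links are absolute URLs.

```shell
# Sketch: collect PDF link targets from already-downloaded HTML files.
# extract_pdf_links is hypothetical; the pattern only handles double-quoted
# href attributes and is not a real HTML parser.
extract_pdf_links() {
    # $@: one or more local HTML files
    grep -ohiE 'href="[^"]+\.pdf"' "$@" |
    sed -E 's/^[Hh][Rr][Ee][Ff]="//; s/"$//' |
    sort -u
}

# Usage (not run here): collect first, download later, possibly elsewhere.
#   extract_pdf_links pages/*.html > pdf-urls.txt
#   wget --user=LOGIN --password=PASSWORD -i pdf-urls.txt
```

Writing the URLs to a file first means the list can be split and handed to several download processes independently of the crawl.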

You are right about using the WWW::Mechanize module. It has a method, find_all_links(), to which you can pass a regex matching the kind of links you want to grab or follow.
For example:
my $obj = WWW::Mechanize->new;
# ...
# ...
my @pdf_links = $obj->find_all_links( url_regex => qr/^.+?\.pdf/ );
This gives you all the links pointing to PDF files. Now iterate through these links and issue a get call on each of them.

I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/

How can I get all HTML pages from a website subfolder with Perl?

Can you give me an idea of how to get all the HTML files in a subfolder of a website, including all the folders inside it?
For example:
www.K.com/goo
I want all the HTML files that are in: www.K.com/goo/1.html, ......n.html
Also, if there are subfolders, I want to get those too: www.K.com/goo/foo/1.html...n.html

Assuming you don't have access to the server's filesystem, then unless each directory has an index of the files it contains, you can't be guaranteed to achieve this.
The normal way would be to use a web crawler, and hope that all the files you want are linked to from pages you find.

Look at lwp-mirror and follow its lead.

I would suggest using the wget program to download the website rather than Perl, which is not that well suited to the problem.

There are also a number of useful modules on CPAN which will be named things like "Spider" or "Crawler". But ishnid is right. They will only find files which are linked from somewhere on the site. They won't find every file that's on the file system.

You can also use curl to get all the files from a website folder.
Look at the curl man page and go to the section on -o/--output, which gives you a good idea of how to do that.
I have used this a couple of times.

Read perldoc File::Find, then use File::Find.
