WGET incompletely spidered the target website - wget

My idea was to use wget to create a complete list of all the threads of a very big blog (according to the blog itself, the total number of threads is 50,000). I used wget in --spider mode to crawl the website and output the URLs to a text file. After 1d 3h 3m 3s wget completed its work, but I identified 'only' 9,668 files against the 50,000 claimed on the website. According to wget, 643 links were broken, so my first idea was to check whether the missing threads were somehow related to the broken links, but apparently they are not. The blog keeps its threads in folders named after year and month (e.g. /2012/01/name_of_thread.html). Some broken links appeared to refer to folders from which wget did download some threads, so I would rule out a blanket block on selected folders. Also, missing threads (which I located by browsing the blog) appeared to be in the same folders from which wget correctly downloaded a few other threads.
Where do I start to understand what went wrong?

Sometimes blog publishing/hosting platforms use unusual robots.txt files, so -e robots=off may be called for. I had a similar issue crawling a WordPress blog, where the output was being shaped by robots exclusions in a way that sounds quite similar to your problem.
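For example, a spider run that ignores robots.txt and then extracts the visited URLs from the log could look roughly like this (a sketch; the blog URL and file names are placeholders, and the grep/awk step depends on wget's log format):
# crawl while ignoring robots.txt, logging every request
wget --spider -r -e robots=off -o crawl.log http://blog.example.com/
# pull the visited URLs out of the log
grep '^--' crawl.log | awk '{print $3}' | sort -u > urls.txt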
Depending on how the blog is structured, you may have better results with a more careful crawl. If it uses pagination (www.site.com/archive/1/, www.site.com/archive/2/...), you could crawl each page via a for loop and parse the content of each. This will give you more controlled results, as you can verify and test against a small subset of data (e.g. ten pages of threads at a time) instead of all 50k threads at once.
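A minimal sketch of that approach, assuming a paginated archive like www.site.com/archive/1/ as in the example above (the page count, link pattern, and file names are placeholders):
for i in $(seq 1 10); do
  wget -q -O "archive-$i.html" "http://www.site.com/archive/$i/"
  # pull thread links out of each archive page; adjust the pattern to the blog's markup
  grep -Eo 'href="[^"]*\.html"' "archive-$i.html" | cut -d'"' -f2 >> thread-urls.txt
done
sort -u thread-urls.txt -o thread-urls.txt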
It's also possible that the site is reporting bad numbers - are you certain that there should be 50k threads?

Related

ESGF `wget` scripts incorrectly generating; linking to a random unrelated file

About half the time when I click on the WGET script option following a CMIP6 data search on the ESGF (LLNL node), I get a wget script that points to only one, unrelated file. It's always the same one, too. Here's the relevant line that shows up in each wget file:
download_files="$(cat <<EOF--dataset.file.url.chksum_type.chksum
'famipc5_ne120_v0.3_00001_01_198001_198401_climo.nc' 'http://esgf.anl.gov/thredds/fileServer/esg_dataroot/ACME/climo/amip/v0_3/atm/mon/native/ne120/ens1/famipc5_ne120_v0.3_00001_01_198001_198401_climo.nc' 'SHA256' 'e5040c5df9d080437418943f02a41e84712dbe1c4a69982447712d7c7334241d'
EOF--dataset.file.url.chksum_type.chksum
)"
This happens with a wide variety of datasets. Here's one file where that happens, for example:
CMIP6.CMIP.CCCma.CanESM5.amip.r1i1p1f1.day.pr.gn
I've been searching for a reason, so far without success. A workaround is to hit the "download HTML" button 1000 times for each individual needed file instead (or set up a Globus endpoint for the files where that's possible), but it's very inconvenient and doesn't provide the functionality of a bash script.
Does anyone know what may be causing this? Is there some sort of limit on how many wget scripts an ESGF user can generate per day, with these being served as placeholders afterwards instead?
Grateful for any insight!
PS: I apologize for the cdo tag; I know this isn't a cdo problem, but it's hard to find relevant tags for this, and I figured that community may know what's up.
Turns out this is a browser issue. Repeating the search with Chrome fixed it.
(Also, Stack Overflow may not have been the right venue to post this question, but I want this to be searchable somewhere at least.)

Noindex in a robots.txt

I've always stopped Google from indexing my website using a robots.txt file. Recently I read an article by a Google employee in which he stated you should do this using meta tags. Does this mean robots.txt won't work? Since I'm working with a CMS, my options are very limited, and it's a lot easier just using a robots.txt file. My question is: what's the worst that could happen if I proceed using a robots.txt file instead of meta tags?
Here's the difference in simple terms:
A robots.txt file controls crawling. It instructs robots (a.k.a. spiders) that are looking for pages to crawl to “keep out” of certain places. You place this file in your website’s root directory.
A noindex tag controls indexing. It tells spiders that the page should not be indexed. You place this tag in the code of the relevant web page.
Use the robots.txt file when you want control at the directory level or across your site. However, keep in mind that robots are not required to follow these directives. Most will, such as Googlebot, but it is safer to keep any highly sensitive information out of publicly-accessible areas of the site.
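For example, a minimal robots.txt placed in the site root might look like this (the directory names here are only placeholders):
# ask all compliant crawlers to stay out of these directories
User-agent: *
Disallow: /private/
Disallow: /drafts/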
As with robots.txt files, noindex tags will exclude a page from search results. The page will still be crawled, but it won’t be indexed. Use these tags when you want control at the individual page level.
An aside on the difference between crawling and indexing: Crawling (via spiders) is how a search engine’s spider tracks your website; the results of the crawling go into the search engine’s index. Storing this information in an index speeds up the return of relevant search results—instead of scanning every page related to a search, the index (a smaller database) is searched to optimize speed.
If there was no index, the search engine would look at every single bit of data or info in existence related to the search term, and we’d all have time to make and eat a couple of sandwiches while waiting for search results to display. The index uses spiders to keep its database up to date.
Here is an example of the tag:
<meta name="robots" content="noindex,follow"/>
Now that you read and understand the above information, I think you are able to answer your question on your own ;)
Indeed, Googlebot used to recognize the following directives in robots.txt:
Noindex
Nofollow
Crawl-delay
But as announced on Google's webmaster blog, they no longer support those (roughly 0.001% used) directives as of September 2019. So to be safe for the future, you should only use meta tags for these on your pages.

Make wget convert links to other pages in input-file

I'm using wget to archive a discussion from a forum. The discussion is over several pages, navigated to with next and previous buttons.
I generated a list of the page URLs and used that as the input file; however, the convert-links option is not converting the next and previous links, only the images.
Is there any way to make it do that?
I could use -r, but that would need a depth of 64 to get the whole discussion, and therefore it would get a whole load of extra unwanted stuff as well.
I figured out a workaround. It was easy enough to turn the input file into an HTML page of links and upload it. Then, with -r and -l 1, wget correctly converted the links.
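Roughly, that workaround could look like this (a sketch; the file names and URL are placeholders):
# turn the URL list into a minimal HTML page of links
{ echo '<html><body>'; sed 's|.*|<a href="&">&</a>|' urls.txt; echo '</body></html>'; } > index.html
# upload index.html somewhere reachable, then recurse one level from it
wget -r -l 1 --convert-links http://example.com/path/to/index.html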

Perl Mechanize module for scraping pdfs

I have a website into which many PDFs are uploaded. What I want to do is download all the PDFs present on the website. To do so, I first need to provide a username and password to the website. After searching for some time, I found the WWW::Mechanize package, which does this work. Now the problem is that I want to make a recursive search of the website, meaning that if a link does not contain a PDF, I should not simply discard the link but should navigate it and check whether the new page has links that contain PDFs. In this way, I should exhaustively search the entire website to download all the uploaded PDFs. Any suggestion on how to do this?
I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting PDF URLs from actually downloading them. Crawling is already a lengthy process, and it might be advantageous to be able to hand off downloading tasks to separate worker processes.
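With wget, that separation could look roughly like this (a sketch; the URL, credentials, and file names are placeholders, and it assumes the site accepts HTTP authentication):
# step 1: spider the site and collect candidate PDF URLs only
wget --spider -r --user=LOGIN --password=PASSWORD -o crawl.log http://www.server.com/dir/
grep -Eio 'https?://[^ ]*\.pdf' crawl.log | sort -u > pdf-urls.txt
# step 2: hand the list off to a separate downloader run (or split it across several workers)
wget --user=LOGIN --password=PASSWORD -i pdf-urls.txt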
You are right about using the WWW::Mechanize module. It has a method, find_all_links(), in which you can specify a regex to match the kind of links you want to grab or follow.
For example:
use WWW::Mechanize;

my $obj = WWW::Mechanize->new;
# ... log in to the site with your credentials here ...
my @pdf_links = $obj->find_all_links( url_regex => qr/^.+?\.pdf/ );
This gives you all the links pointing to PDF files. Now iterate over these links and issue a get() call on each of them.
I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/

Product idea/approach : Folder based disk organization

Sweet... I bought myself a 1TB portable hard drive this week. Don't you just love how much data you can store on one of these disks? The fact that I can store my Blu-ray rips on my portable hard disk, and that my LG LCD TV can play HD rips right from the drive - that's amazing practicality right there! However, life, it seems, is never so simple. I have hundreds of movies unorganized in one huge folder, which is exactly what I needed to annoy myself while browsing them on my TV to play a single movie. That got me thinking...
What if I had an automated way to organize movies into folders such that my folder-browsing-on-a-lcd-tv-or-a-comp would make my life a little easy?
I started thinking about this... I browsed a little in this context and I realized that if only I could "tag my movies somehow and create folders on-the-fly based on tags using hardlinks", I would have addressed my problem. I googled a bit to find software that works in the above fashion, only to find none.
A few more days of serious thought (as you know by now.. I think a lot.. and I guess this question is starting to sound like a blog rant/post of sorts...), in the interest of humanity, I thought I should come up with a generic way to address this: What if someone wanted to organize photos... organize music.. organize software?!
Turned my grey cells off for a while and here is an approach I came up with to solving my what-if scenario.
Tag / Group tag individual files (rely on a slick GUI to do it fast and do it good) - Adobe Flex/Eclipse RCP to do this?
Create hardlinks to each of the tagged files.
The first point is self-explanatory. The second (because I am talking Windows here) refers to making use of mklink.exe.
Consider a scenario where I have 2 movie files: a movie file "Transformers.avi" tagged as "english, action, bluray, sci-fi, imdb-top-50, must-watch-with-kids" and another movie file "The Specialist.avi" tagged as "english, bluray, thriller, adult". Here are a few of the possible locations where I want my Transformers file to be found:
[root directory]->all-tags->english
[root directory]->all-tags->bluray
[root directory]->all-tags->english->all-tags->bluray
[root directory]->all-tags->bluray->all-tags->action
[root directory]->all-tags->english->all-tags->action->bluray->all-tags->imdb-top-50
Given that Windows has a limit of 1024 hardlinks to a single file, I would probably be allowed 7 unique tags per file. Each sub-folder will have an "all-tags" folder. Having it named "all-tags" makes it more accessible when ordered by name.
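For instance, exposing the Transformers.avi file from the example above under an "english" tag folder could look roughly like this from a Windows command prompt (the drive letters and paths are placeholders):
mkdir "D:\all-tags\english"
mklink /H "D:\all-tags\english\Transformers.avi" "D:\Movies\Transformers.avi"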
I believe this approach, when automated so that it lets you configure the tags you want and creates the hardlinks for you, helps you organize stuff effectively.
I don't know if there are better things out there. I would like your input on this approach and other possible ideas. I would like to gather input here and release something to SourceForge for everyone to use in a couple of weeks. I am sure I can count on your positive response, as always.
I believe hardlinks are not a good approach. Why? A standalone player won't play them, and I wouldn't like a program that's made for tagging to tell me to stop making so many tags because of a Windows limitation on hardlinks (remember that each tag increases the number of links exponentially).
Plus, "help" is not a good tag.
And I once had an idea, which I'm still planning to build some day, for sorting my own files: put each file into one big storage area under a GUID folder name (filename untouched) and store the metadata in a SQLite database used by a smart file browser.
I was considering doing something similar to this with music, for detecting duplicate songs and auto-organize functionality.
For your application, I wouldn't recommend calling shell programs from Java. Exception handling becomes difficult, and your application becomes bound to the shell interface and implementation (i.e. Windows versions or installations affect your application's behavior).
I would use a database with a few tables: Files, Tags, and an association table.
The Files table would list the physical location of each file, the filename, and a unique identifier. This way, you can maintain information about each file without having to modify it for every tag association.
The Tags table would list each tag, and any metadata you want to store for each tag.
A third table, maybe 'FileTags', would store the association between tags and files. When adding tags to the stack, you would add a condition to the WHERE clause, and the list of files with all of the tags would be returned. This structure would also open your codebase up to other designs, such as include/exclude (autocomplete with X buttons), or possibly search.
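As a rough sketch of that schema, using the SQLite database another answer mentions (all table and column names here are illustrative, not prescribed):
sqlite3 media-tags.db <<'SQL'
-- one row per physical file
CREATE TABLE Files (
    id       INTEGER PRIMARY KEY,
    path     TEXT NOT NULL,
    filename TEXT NOT NULL
);
-- one row per tag, with room for per-tag metadata
CREATE TABLE Tags (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- the association table between files and tags
CREATE TABLE FileTags (
    file_id INTEGER NOT NULL REFERENCES Files(id),
    tag_id  INTEGER NOT NULL REFERENCES Tags(id),
    PRIMARY KEY (file_id, tag_id)
);
-- example query: files carrying both the 'english' and 'bluray' tags
SELECT f.path, f.filename
FROM Files f
JOIN FileTags ft ON ft.file_id = f.id
JOIN Tags t ON t.id = ft.tag_id
WHERE t.name IN ('english', 'bluray')
GROUP BY f.id
HAVING COUNT(DISTINCT t.name) = 2;
SQL
Keeping the tag logic in the database like this means adding or removing a tag never touches the files themselves.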
If implemented in Java, your app would be platform independent, and would allow a very large number of tags and files. You can then use the system default application for opening the media file, and the user can make the selection in their native OS.
Reiser4?
...
(I mean nevermind Hans, but the tech...)
[disclaimer: Not a hacker. I know nothing of programming/coding, never mind filesystems & databases. I can barely code decent HTML even, if at all. Hey y'all! :D]
[footnote: does plain HTML5 work here? Too lazy to close my tags hehe :p]