Using wget to download dokuwiki pages in plain xhtml format only - wget

I'm currently modifying the offline-dokuwiki[1] shell script to grab the latest documentation for an application, for automatic embedding within instances of that application. This works quite well, except that in its current form it grabs three versions of each page:
1. The full page, including header and footer
2. Just the content, without header and footer
3. The raw wiki syntax
I'm only actually interested in (2). That version is linked to from the main pages by an HTML <link> tag in the <head>, like so:
<link rel="alternate" type="text/html" title="Plain HTML"
href="/dokuwiki/doku.php?do=export_xhtml&id=documentation:index" />
and is the same URL as the main wiki page, only with 'do=export_xhtml' in the query string. Is there a way of instructing wget to download only these versions, or to automatically add '&do=export_xhtml' to the end of any links it follows? If so, this would be a great help.
[1] http://www.dokuwiki.org/tips:offline-dokuwiki.sh (author: samlt)

DokuWiki accepts the do parameter as an HTTP header as well. You could run wget with the parameter --header "X-DokuWiki-Do: export_xhtml".
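For example, a recursive fetch along these lines should then return only the plain XHTML export (the wiki URL and recursion depth here are placeholders, adjust them to your setup):
wget --recursive --level=5 --page-requisites --convert-links \
  --header "X-DokuWiki-Do: export_xhtml" \
  "http://www.example.com/dokuwiki/doku.php?id=documentation:index"
Every page fetched this way should come back as the stripped-down export version, without wget having to rewrite the query strings it follows.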

Related

Specify sitemap language (same language for the whole sitemap)

According to Google, you can specify languages in a sitemap like this:
<url>
  <loc>http://www.example.com/english/page.html</loc>
  <xhtml:link
    rel="alternate"
    hreflang="de"
    href="http://www.example.com/deutsch/page.html"/>
  <xhtml:link
    rel="alternate"
    hreflang="en"
    href="http://www.example.com/english/page.html"/>
</url>
However, I just need to specify that the ENTIRE sitemap/website is in Spanish; it's not a multi-language sitemap, it's a one-language sitemap, but that language happens to be Spanish.
Should I include an hreflang tag for each and every URL? Or is there a better way to do this, like specifying it in the header section?
No, setting the header for the sitemap XML only sets it for sitemap.xml itself and not for all the locations declared in the sitemap. You have to declare it for every location.
Check with the URL Inspection tool to see whether there are any errors when Google tries to index your site.
If you have access to the server, you can set the Link response header along the following lines.
Nginx
add_header Link '<$scheme://$host$request_uri>; rel="alternate"; hreflang="es"';
Apache
Header set Link "<%{REQUEST_SCHEME}://%{HTTP_HOST}%{REQUEST_URI}>; rel=\"alternate\"; hreflang=\"es\""
Also, you could set the "lang" attribute on the <html> tag, or an hreflang <link> tag, on every page of your website. You can use a template for this if your site is built with a static site generator.
If you only have access to the Cloud console, you have to make an entry for every location on your website in the sitemap.xml.
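In that case each <url> entry carries a single self-referencing Spanish alternate, roughly like this (example.com stands in for your own domain):
<url>
  <loc>http://www.example.com/pagina.html</loc>
  <xhtml:link
    rel="alternate"
    hreflang="es"
    href="http://www.example.com/pagina.html"/>
</url>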

GitHub Pages: Image From Link Not Showing

I used to have my website hosted through Shopify, and when I linked to it in my LinkedIn job description the logo showed up. I've since moved my website to GitHub Pages, and now the logo is blank when I link to it in LinkedIn (or anywhere else for that matter). Is there something I can do to fix this, or is it just a con of GH Pages?
It always helps to include a link to the codebase for reference, but it looks like you're likely working with this repo on your GitHub profile.
It's possible that Shopify or a theme you were using before included these by default, but typically you have to specify the preview image in your site's metadata. The preview images for formatted links are pulled from an Open Graph image property, which you define in a meta tag in your HTML's <head> section (see the OG documentation here). So, in your head include file, you'd add a meta tag like this:
<meta property="og:image" content="https://graemeharrison.com/assets/img/logo.png" />
Then, ideally, you'll include this head file in each layout file so that it's included in each page's HTML.
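Assuming a typical Jekyll setup on GitHub Pages, the shared head include would end up looking roughly like this (the file name and paths are only illustrative):
<!-- _includes/head.html -->
<head>
  <meta charset="utf-8">
  <title>{{ page.title }}</title>
  <meta property="og:title" content="{{ page.title }}" />
  <meta property="og:image" content="https://graemeharrison.com/assets/img/logo.png" />
</head>
Each layout then pulls it in with {% include head.html %}, so every generated page carries the og:image tag.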
A couple of things that worked for me:
Put your image in your 'public' directory near index.html, and in your meta tag retrieve it with content="http://yourdomain.com/yourimage.png". (https didn't work for me but http did)
Also, https://www.linkedin.com/post-inspector is a good tool to check your og image appears.

Facebook Object Debugger returns 404 not found when trying to scrape

I have a simple Tumblr website blog, upon which I post content.
However, since I changed my DNS, the Facebook Object Debugger sees really old data for my root URL, http://www.kofferbaque.nl/, and for every post (for instance: http://kofferbaque.nl/post/96638253942/moodoid-le-monde-moo) it shows a 404 Not Found, which is bullshit because the actual content is there.
The full error message: Error parsing input URL, no data was cached, or no data was scraped.
I have tried the following things to fix it:
clear browser cache / cookies / history
using ?fbrefresh=1 after the URL (didn't work)
I've added an FB app_id to the page (made sure the app was in production, added the correct namespaces, etc. - this also didn't change anything)
Checked out other questions regarding this subject
Rechecked all my meta tags a dozen times
What other options are there to fix this issue?
If you need more info please ask in the comments.
2014-09-08 - Update
When throwing my URL into the static debugger (https://developers.facebook.com/tools/debug/og/echo?q=http://www.kofferbaque.nl/), the 'Net' tab in Firebug gives the following response:
<meta http-equiv="refresh" content="0; URL=/tools/debug/og/echo?q=http%3A%2F%2Fwww.kofferbaque.nl%2F&_fb_noscript=1" /><meta http-equiv="X-Frame-Options" content="DENY" />
2014-09-11 - Update
removed duplicate <!DOCTYPE html> declaration
cleaned up the <html> start tag (i.e. removed IE support temporarily)
I've placed a test blog post to see if it would work; it didn't. Somehow my root URL started 'magically' updating itself. Or let's say, it removed the old data, probably because I removed the old app it was still referring to. However, it still doesn't see the 'newer' tags correctly.
Still no success.
2014-09-12 - Update
Done:
moving <meta> tags to the top of the <head> element
removed fb:app_id from the page + the body script, as it serves no purpose.
This apparently doesn't change anything. It also appears that Tumblr injects lots of script tags at the start of the head element. Maybe that is the reason the Facebook scraper doesn't 'see' the meta tags.
The frustrating bit is that through some other og tag scanner: http://iframely.com/debug?uri=http%3A%2F%2Fkofferbaque.nl%2F, it shows all the correct info.
First, the HTML is not valid. You have the doctype twice (at least on the post page), and there is content before the <html> tag (a script tag and IE conditionals).
This may be the problem, but also make sure you put the og tags together at the beginning of the <head> section - the debugger only reads part of the page afaik, so make sure the og tags are in that part. Put all the other og tags right after "og:site_name".
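As a rough sketch (the property values are placeholders, not your actual tags), the top of the <head> would then look something like this:
<head>
  <meta charset="utf-8">
  <meta property="og:site_name" content="Kofferbaque" />
  <meta property="og:title" content="Title of the post" />
  <meta property="og:type" content="article" />
  <meta property="og:url" content="http://kofferbaque.nl/post/96638253942/moodoid-le-monde-moo" />
  <meta property="og:image" content="http://kofferbaque.nl/path/to/image.jpg" />
  <!-- all other head content (Tumblr scripts, styles, remaining meta tags) comes after the og tags -->
</head>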
Btw: ?fbrefresh=1 is not really necessary; you can use ANY parameter, just to create a different URL. But the debugger offers a button to refresh the scraping, so it's useless anyway.

using .php with .css or some thing better?

I have a login page for my website. The login file is "index.php"; this will be the first page you come to when visiting my site. The rest of my site is HTML, with a style.css file providing the look for my site. Now my question is: how do I get my index.php file to look like the rest of my website?
Right now, when you come to mydomain.com/index.php it is just a white page with a login and password box. I would like my login page to look like the rest of my website. Can someone please point me to how to do this?
I have other .php files that would also need to be linked with the .css, such as register.php and so forth. Thanks guys.
If there is a different/better method of doing what I need, please feel free to chime in; I'm all ears at this point, I've been trying to do this for 2 days.
Just like in every other HTML page, you have to link the stylesheet the same way.
I guess you have already seen that every PHP file also contains HTML code?
Just stay outside of the PHP brackets:
<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
  <?php
    // PHP code in here
  ?>
</body>
</html>
If you don't find the usual HTML markup somewhere, search for an include call in the PHP file.
Maybe the HTML header is in another PHP file and it is being pulled in from there.
It would be included like this:
include '_header.php';
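If so, that header file is usually where the shared markup, including the stylesheet link, lives. As a minimal sketch (the name _header.php is just an example), _header.php would contain roughly:
<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
and each page would then call include '_header.php'; first and only output its own content after it.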
You can use the CSS file, similar to how you use it in your HTML files. You can either place the <link> tag in the HTML below your PHP code, or you can echo it from within your PHP code.
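For example, a minimal sketch of the echo variant (assuming the stylesheet is the same style.css the rest of your site uses):
<?php
// Output the stylesheet link from within the PHP code
echo '<link rel="stylesheet" type="text/css" href="style.css">';
?>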
If you're using a login page, though, are you maintaining that security with the rest of your site by using PHP on your other pages?

Get google to index links from javascript generated content

On my site I have a directory of things which is generated through jQuery AJAX calls, which subsequently create the HTML.
To my knowledge, Google and other bots aren't aware of DOM changes after the page load, and won't index the directory.
What I'd like to achieve, is to serve the search bots a dedicated page which only contains the links to the things.
Would adding a noscript tag to the directory page be a solution? (in the noscript section, I would link to a page which merely serves the links to the things.)
I've looked at both the robots.txt and the meta tag, but neither seem to do what I want.
It looks like you stumbled on the answer to this yourself, but I'll post the answer to this question anyway for posterity:
Implement Google's AJAX crawling specification. If links to your page contain #! (a URL fragment starting with an exclamation point), Googlebot will send everything after the ! to the server in the special query string parameter _escaped_fragment_.
You then look for the _escaped_fragment_ parameter in your server code, and if present, return static HTML.
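For example, a link such as http://www.example.com/directory#!things would be requested by Googlebot as http://www.example.com/directory?_escaped_fragment_=things. A minimal sketch of the server-side check, in PHP (your stack isn't specified here, and render_static_directory() is a hypothetical helper that returns the directory links as plain HTML):
<?php
// Serve a static HTML snapshot to crawlers following Google's AJAX crawling scheme
if (isset($_GET['_escaped_fragment_'])) {
    echo render_static_directory($_GET['_escaped_fragment_']);  // hypothetical helper
    exit;
}
// Otherwise fall through to the normal JavaScript-driven page
?>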
(I went into a little more detail in this answer.)