Explanation of the Doctype Syntax - doctype

There are plenty of threads explaining what Doctype to choose, but I can't find any explaining the actual syntax. Take for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd ">
In particular:
Can PUBLIC be replaced with other values and what does it mean?
Why does the url need to be surrounded with quotes?
What is the "-"?
Why is the first string separated by two slashes rather than 1?
Does the EN stand for English? If so, why do websites also use lang=en?

While it does not answer all of your questions, but I think it's a good start. Wikipedia is your friend. ;)
http://en.wikipedia.org/wiki/Document_Type_Declaration
p.s. For the double quote question being left out, I think the quotes are there in order to interpret the strings with whitespaces correctly

Nice question. I never really thought twice about it.
I found http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm that explains each section in detail.

Why is the first string separated by two slashes rather than 1?
The SGML syntax is "Owner//Keyword Description//Language".
But in practice, it's irrelevant, as browsers don't actually use SGML parsers for HTML. DOCTYPE is just a switch to decide between quirks mode and standards mode.

Doctypes aren't limited to HTML pages. Doctypes are used to reference Document Type Definitions (DTD), which define the constraint on the structure of a XML document.
There are different types possible, OP's example follows a "public external DTD" format:
<!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location">
where:
root_element: is the root of the xml
DTD_name: an identifier of the DTD, so that processors could use a local version of it rather than having to download it
DTD_location: the location of the DTD in case it isn't locally available.
The DTD_name also has a defined format:
"prefix//owner_of_the_DTD//description_of_the_DTD//ISO 639_language_identifier"
where prefix is one of the following:
ISO: The DTD is an ISO standard. All ISO standards are approved.
+: The DTD is an approved non-ISO standard.
-: The DTD is an unapproved non-ISO standard.
Source: http://xmlwriter.net/xml_guide/doctype_declaration.shtml

Related

Markdown metadata format

Is there a standard or convention for embedding metadata in a Markdown formatted post, such as the publication date or post author for conditional rendering by the renderer?
Looks like this Yaml metadata format might be it.
There are all kinds of strategies, e.g. an accompanying file mypost.meta.edn, but I'm hoping to keep it all in one file.
There are two common formats that look very similar but are actually different in some very specific ways. And a third which is very different.
YAML Front Matter
The Jekyll static site generator popularized YAML front matter which is deliminated by YAML section markers. Yes, the dashes are actually part of the YAML syntax. And the metadata is defined using any valid YAML syntax. Here is an example from the Jekyll docs:
---
layout: post
title: Blogging Like a Hacker
---
Note that YAML front matter is not parsed by the Markdown parser, but is removed prior to parsing by Jekyll (or whatever tool you're using) and could actually be used to request a different parser than the default Markdown parser for that page (I don't recall if Jekyll does that, but I have seen some tools which do).
MultiMarkdown Metadata
The older and simpler MultiMarkdown Metadata is actually incorporated into a few Markdown parsers. While it has more recently been updated to optionally support YAML deliminators, traditionally, the metadata ends and the Markdown document begins upon the first blank line (if the first line was blank, then no metadata). And while the syntax looks very similar to YAML, only key-value pairs are supported with no implied types. Here is an example from the MultiMarkdown docs:
Title: A Sample MultiMarkdown Document
Author: Fletcher T. Penney
Date: February 9, 2011
Comment: This is a comment intended to demonstrate
metadata that spans multiple lines, yet
is treated as a single value.
CSS: http://example.com/standard.css
The MultiMarkdown parser includes a bunch of additional options which are unique to that parser, but the key-value metadata is used across multiple parsers. Unfortunately, I have never seen any two which behaved exactly the same. Without the Markdown rules defining such a format everyone has done their own slightly different interpretation resulting in a lot of variety.
The one thing that is more common is the support for YAML deliminators and basic key-value definitions.
Pandoc Title Block
For completeness there is also the Pandoc Title Block. If has a very different syntax and is not easily confused with the other two. To my knowledge, it is only supported by Pandoc (if enabled), and it only supports three types of data: title, author, and date. Here is an example from the Pandoc documentation:
% title
% author(s) (separated by semicolons)
% date
Note that Pandoc Title Blocks are one of two style supported by Pandoc. Pandoc also supports YAML Metadata as described above.
A workaround use standard syntax and compatible with all other viewers.
I was also looking for a way to add application specific metadata to markdown files while make sure the existing viewers such as vscode and github page will ignore added metadata. Also to use extended markdown syntax is not a good idea because I want to make sure my files can be rendered correctly on different viewers.
So here is my solution: at beginning of markdown file, use following syntax to add metadata:
[_metadata_:author]:- "daveying"
[_metadata_:tags]:- "markdown metadata"
This is the standard syntax for link references, and they will not be rendered while your application can extract these data out.
The - after : is just a placeholder for url, I don't use url as value because you cannot have space in urls, but I have scenarios require array values.
Most Markdown renderers seem to support this YAML format for metadata at the top of the file:
---
layout: post
published-on: 1 January 2000
title: Blogging Like a Boss
---
Content goes here.
The most consistent form of metadata that I've found for Markdown is actually HTML meta tags, since most Markdown interpreters recognize HTML tags and will not render meta tags, meaning that metadata can be stored in a way that will not show up in rendered HTML.
<title>Hello World</title>
<meta name="description" content="The quick brown fox jumped over the lazy dog.">
<meta name="author" content="John Smith">
## Heading
Markdown content begins here
You can try this in something like GitHub Gist or StackEdit.
Correct.
Use the yaml front matter key-value syntax — like MultiMarkdown supports — but (ab)use the official markdown URL syntax to add your metadata.
… my workaround looks like this:
---
[//]: # (Title: My Awesome Title)
[//]: # (Author: Alan Smithee)
[//]: # (Date: 2018-04-27)
[//]: # (Comment: This is my awesome comment. Oh yah.)
[//]: # (Tags: #foo, #bar)
[//]: # (CSS: https://path-to-css)
---
Put this block at the top of your .md doc, with no blank line between the top of the doc and the first ---.
Your fake yaml won't be included when you render to HTML, etc. … it only appears in the .md.
You can also use this technique for adding comments in the body of a markdown doc.
This is not a standard way, but works with Markdown Extra.
I wanted something that worked in the parser, but also didn't leave any clutter when I browse the files on Bitbucket where I store the files.
So I use Abbreviations from the Markdown Extra syntax.
*[blog-date]: 2018-04-27
*[blog-tags]: foo,bar
then I parse them with regexp:
^\*\[blog-date\]:\s*(.+)\s*$
As long as I don't write the exact keywords in the text, they leave no trace. So use some prefix obscure enough to hide them.
I haven't seen this mentioned elsewhere here or in various blogs discussing the subject, but in a project for my personal website, I've decided to use a simple JSON object at the top of each markdown file to store metadata. It's a little more cumbersome to type compared to some of the more textual formats above, but it's super easy to parse. Basically I just do a regex such as ^\s*({.*?})\s*(.*)$ (with the s option on to treat . as \n) to capture the json and markdown content, then parse the json with the language's standard method. It allows pretty easily for arbitrary meta fields.

Unicode alternative for <wbr> tag

I am looking for Unicode solution which will be an alternative to <wbr> tag in following regards:
Allowing line breaks at given position.
Browser can find string in the page even when 'broken' apart.
When copied from page, these characters will not be transfered.
This is exactly what <wbr> does in modern browsers. ­, ­ and ​ does not seem to conform.
Here is a http://jsfiddle.net/qY6mp/7/ to demonstrate.
I will be thankful for any pointers.
The Unicode counterpart of <wbr> is U+200B ZERO WIDTH SPACE, representable in HTML as ​ if you don’t want or can’t use it as such. It’s not clear from the question why you think it does not conform. The main problem with it as questionable support in IE 6, but this is not a conformance issue.
According to Unicode line breaking rules, U+200B “is used to enable additional (invisible) break opportunities wherever SPACE cannot be used”. HTML specifications do not require conformance to the Unicode standard, in this issue or otherwise, but modern browsers generally implement U+200B this way.
What happens when text is copied from an HTML document is outside the scope of specifications. The same applies to requirement 2 in the question. Since generally copy and paste copies characters, including zero width characters, and search functionality operates on characters, requirements 2 and 3 are really asking for a character that does not behave like character.
Note that hyphenation is completely different issue.

Wikipedia (MediaWiki) URI encoding scheme

How do Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not normal URI encoding, since spaces are replaced with underscores and double quotes are not encoded and things like that.
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - here you've got some kind of description of what their engine enforces on article names.
They should have something like this in their LocalSettings.php:
$wgArticlePath = '/wiki/$1';
and proper server URI rewrites configuration - they seem to be using Apache (HTTP header), so it's probably mod_rewrite. http://www.mediawiki.org/wiki/Manual:Short_URL
You can also refer to the index.php file for an article on Wikipedia like this: http://en.wikipedia.org/w/index.php?title=Foo%20bar and get redirected by the engine to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates it into /index.php?title=Foo_bar. For the MediaWiki engine it's the same as if you visited http://en.wikipedia.org/w/index.php?title=Foo_bar - this page doesn't redirect you.
The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.
Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.
The logic looks something like this:
Decode character references (e.g. é)
Convert spaces to underscores
Check whether the title is a reference to a namespace or interwiki
Remove hash fragments (e.g. Apple#Name
Remove forbidden characters
Forbid subdirectory links (e.g. ../directory/page)
Forbid triple tilde sequences (~~~) (for some reason)
Limit the size to 255 bytes
Capitalise the first letter
Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.
I hope that helps!

URL Shortening: What's the best encoding to use?

I'm adding a feature to my project where we are generating links to internal stuff of our website, and we want these links to be as short as possible, so we'll be making our own "URL Shortener".
I'm wondering what's the best encoding / alphabet to use for the generated short URLs.
This is largely a subjective question, I'd like to know what your opinions are regarding the best approach / trade-off.
Several options I've thought of:
- Digits, uppercase + lowercase (base 62)
- Digits, only lowercase (base 36)
- Base 32 (http://www.crockford.com/wrmg/base32.html)
- linkpot.net (using common short english words)
Of course, the second two are better for uses other than clicking, and the first two are better for Twitter.
Also, if I'm going with "clickable-only" URLs, I'd like to make the alphabet as large as possible, adding other symbols.
What symbols can I use in URLs that won't get URL encoded?
What symbols should I use? Could some of these prove problematic? I'm thinking slash and dot, for example.
What do you think?
NOTE: The main target for these URLs is Twitter. Keeping this in mind, we should probably have the largest alphabet possible, since most people will be clicking. However, I'm interested in your experience with people using short URLs in other ways (over the phone, in printed paper, etc). How likely is it this could happen?
NOTE 2: I'm not making "yet another URL shortener", please don't condemn me with downvotes. We are generating short URLs for internal stuff in our site, not allowing anyone to shorten any URL. Imagine Google Maps giving you short URLs when you generate a link to a specific coordinate.
I would go with Base-62, it's the shortest. Shortened URL is not meant for someone to manually enter anyway so don't worry about case-sensitivity.
If these are "clickable only URLS" I'd probably go with a base-64 encoding. MIME's base-64 uses a couple of characters you shouldn't use, but there are enough unreserved safe characters in URLs that you can just swap them out. (Also, you don't need the padding that MIME's base-64 uses, since you know when your URL ends.)
Here's a page that discusses one way to do this.
You can look at RFC2396 to figure out exactly what characters are safe in URIs if you want to double check.
I'd be curious to know a little more about the implementation. How will these URLs be "unshortened", or will the internal pages being accessed be saved as shortened URLs? In either case, even if you went with the encoding set of [A-Z] you'd be able to reference 26 * 26 * 26 = 17,576 pages with only 3 characters; how many internal web pages are you talking about?
In general I would lean on what your use case requirements are for picking the right encoding set. Are you planning on having these links available for "uses other than clicking"? What would those uses be, and how do you suspect they'll alter the encoding? (For example, using parts of the URL as a file name on a case-insensitive file system reduces the available character set.)
Here's an informative page on the character set you have available to you when writing a URL.

Should I use accented characters in URLs?

When one creates web content in languages different than English the problem of search engine optimized and user friendly URLs emerge.
I'm wondering whether it is the best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or it is better to stick to the usage of non-english characters where appropriate sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC3986 says no, that is, URIs cannot contain unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI, for example some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appears to be unicode chars in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or unicode, the encoded version won't work because the underlying definition of URLs simply doesn't support it, so for it to work consistently, you need to % encode.
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
Considering URLs with accents often tend to end up looking like this :
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as accented URLs are now accepted by web browsers, it seems.
The firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, btw ; this seems to be "new" since firefox 3.0 (see Firefox 3: UTF-8 support in location bar) ; so, not probably not supported in IE 6, at least -- and there are still quite too many people using this one :-(
Maybe URL with no accent are not looking the best that could be ; but, still, people are used to them, and seem to generally understand them quite well.
You should avoid non-ASCII characters in URLs that may be entered in browser manually by users. It's ok for embedded links pre-encoded by server.
We found out that browser can encode the URL in different ways and it's very hard to figure out what encoding it uses. See my question on this issue,
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one might has different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain (most) of the Unicode characters.
The path (after the first /), the user name and the password can again be everything. They are escaped (as %XX), but those are just bytes. What is the encoding of these bytes is difficult to know (is interpreted by the http server).
The parameters part (after the first ?) is passed "as is" (after %XX unescapeing) to some server-side application thing (php, asp, jsp, cgi), and how that interprets the bytes is another story).
It is recommended that the path/user/password/arguments are utf-8, but not mandatory, and not everyone respects that.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with that might be tricky. Try to use Unicode and stay away from legacy code pages, tag your content with the proper encoding/charset if you can (using meta in html, language directives for asp/jsp, etc.)