Markdown metadata format - metadata

Is there a standard or convention for embedding metadata in a Markdown formatted post, such as the publication date or post author for conditional rendering by the renderer?
Looks like this Yaml metadata format might be it.
There are all kinds of strategies, e.g. an accompanying file mypost.meta.edn, but I'm hoping to keep it all in one file.

There are two common formats that look very similar but are actually different in some very specific ways. And a third which is very different.
YAML Front Matter
The Jekyll static site generator popularized YAML front matter, which is delimited by YAML section markers. Yes, the dashes are actually part of the YAML syntax. And the metadata is defined using any valid YAML syntax. Here is an example from the Jekyll docs:
---
layout: post
title: Blogging Like a Hacker
---
Note that YAML front matter is not parsed by the Markdown parser, but is removed prior to parsing by Jekyll (or whatever tool you're using) and could actually be used to request a different parser than the default Markdown parser for that page (I don't recall if Jekyll does that, but I have seen some tools which do).
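To illustrate the "removed prior to parsing" step, here is a minimal Python sketch. It is not Jekyll's actual code: `split_front_matter` is a hypothetical helper, and the flat `key: value` parsing is a simplifying assumption. A real tool would hand the block to a proper YAML parser.

```python
def split_front_matter(text):
    """Split a document into (metadata dict, Markdown body).

    Assumes the front matter is delimited by lines containing only '---'
    and holds flat 'key: value' pairs; real front matter can be any YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter at all
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # opening marker was never closed
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return meta, body


post = "---\nlayout: post\ntitle: Blogging Like a Hacker\n---\n# Hello\n"
meta, body = split_front_matter(post)
print(meta)
print(body)
```

Only `body` would then be passed on to the Markdown parser.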
MultiMarkdown Metadata
The older and simpler MultiMarkdown Metadata is actually incorporated into a few Markdown parsers. While it has more recently been updated to optionally support YAML delimiters, traditionally the metadata ends and the Markdown document begins at the first blank line (if the first line is blank, there is no metadata). And while the syntax looks very similar to YAML, only key-value pairs are supported, with no implied types. Here is an example from the MultiMarkdown docs:
Title: A Sample MultiMarkdown Document
Author: Fletcher T. Penney
Date: February 9, 2011
Comment: This is a comment intended to demonstrate
metadata that spans multiple lines, yet
is treated as a single value.
CSS: http://example.com/standard.css
The MultiMarkdown parser includes a bunch of additional options which are unique to that parser, but the key-value metadata is used across multiple parsers. Unfortunately, I have never seen any two which behaved exactly the same. Without the Markdown rules defining such a format, everyone has done their own slightly different interpretation, resulting in a lot of variety.
The one thing that is more common is support for YAML delimiters and basic key-value definitions.
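The traditional rule (metadata runs until the first blank line, values may continue on following lines) can be sketched as below. This is only one possible interpretation, which is the point made above about parsers disagreeing. `parse_mmd_metadata` is a hypothetical name, and treating any line without a leading `key:` as a continuation is an assumption of this sketch.

```python
def parse_mmd_metadata(text):
    """Parse MultiMarkdown-style metadata; returns (metadata dict, body)."""
    lines = text.split("\n")
    first = lines[0]
    # A blank (or non key-value) first line means there is no metadata.
    if not first.strip() or first[0].isspace() or ":" not in first:
        return {}, text
    meta = {}
    last_key = None
    consumed = 0
    for line in lines:
        consumed += 1
        if not line.strip():
            break  # the first blank line ends the metadata
        key, sep, value = line.partition(":")
        if sep and not line[0].isspace():
            last_key = key.strip()
            meta[last_key] = value.strip()
        else:
            # No "key:" prefix: treat the line as a continuation.
            meta[last_key] = (meta[last_key] + " " + line.strip()).strip()
    return meta, "\n".join(lines[consumed:])


sample = (
    "Title: A Sample MultiMarkdown Document\n"
    "Comment: This is a comment intended to demonstrate\n"
    "metadata that spans multiple lines, yet\n"
    "is treated as a single value.\n"
    "CSS: http://example.com/standard.css\n"
    "\n"
    "# Heading\n"
)
mmd_meta, mmd_body = parse_mmd_metadata(sample)
print(mmd_meta["Comment"])
```

Note that an unindented continuation line that happens to contain a colon would be read as a new key here, which is exactly the kind of edge case where implementations differ.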
Pandoc Title Block
For completeness, there is also the Pandoc Title Block. It has a very different syntax and is not easily confused with the other two. To my knowledge, it is only supported by Pandoc (if enabled), and it only supports three types of data: title, author, and date. Here is an example from the Pandoc documentation:
% title
% author(s) (separated by semicolons)
% date
Note that Pandoc Title Blocks are one of two styles supported by Pandoc. Pandoc also supports YAML Metadata as described above.

A workaround that uses standard syntax and is compatible with all other viewers.
I was also looking for a way to add application-specific metadata to Markdown files while making sure that existing viewers, such as VS Code and GitHub, ignore the added metadata. Using extended Markdown syntax is not a good idea either, because I want my files to render correctly in different viewers.
So here is my solution: at the beginning of the Markdown file, use the following syntax to add metadata:
[_metadata_:author]:- "daveying"
[_metadata_:tags]:- "markdown metadata"
This is the standard syntax for link reference definitions, so they will not be rendered, while your application can still extract the data.
The - after : is just a placeholder for the URL. I don't use the URL as the value because URLs cannot contain spaces, and I have scenarios that require array values.
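On the application side, extracting these values could look roughly like the following Python sketch. The regular expression and the name `extract_link_metadata` are my own assumptions; the only thing taken from the answer is the `[_metadata_:key]:- "value"` line shape.

```python
import re

# One metadata line: [_metadata_:KEY]:- "VALUE"
META_LINE = re.compile(
    r'^\[_metadata_:([^\]]+)\]:-[ \t]+"(.*)"[ \t]*$', re.MULTILINE
)


def extract_link_metadata(text):
    """Collect all _metadata_ link reference definitions into a dict."""
    return {key: value for key, value in META_LINE.findall(text)}


doc_text = (
    '[_metadata_:author]:- "daveying"\n'
    '[_metadata_:tags]:- "markdown metadata"\n'
    "\n"
    "# Title\n"
)
link_meta = extract_link_metadata(doc_text)
print(link_meta)
```

An array value such as the tags could then be obtained with `link_meta["tags"].split()`.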

Most Markdown renderers seem to support this YAML format for metadata at the top of the file:
---
layout: post
published-on: 1 January 2000
title: Blogging Like a Boss
---
Content goes here.

The most consistent form of metadata that I've found for Markdown is actually HTML meta tags, since most Markdown interpreters recognize HTML tags and will not render meta tags, meaning that metadata can be stored in a way that will not show up in rendered HTML.
<title>Hello World</title>
<meta name="description" content="The quick brown fox jumped over the lazy dog.">
<meta name="author" content="John Smith">
## Heading
Markdown content begins here
You can try this in something like GitHub Gist or StackEdit.
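If an application needs to read these tags back, Python's standard `html.parser` module is one option. The sketch below is an assumption about how that could be done, not something from the answer; `MetaCollector` and `extract_html_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = data.strip()


def extract_html_metadata(text):
    """Feed the whole Markdown file to the parser; plain text is ignored."""
    collector = MetaCollector()
    collector.feed(text)
    return collector.meta


md = (
    "<title>Hello World</title>\n"
    '<meta name="description" content="The quick brown fox jumped over the lazy dog.">\n'
    '<meta name="author" content="John Smith">\n'
    "## Heading\n"
    "Markdown content begins here\n"
)
html_meta = extract_html_metadata(md)
print(html_meta)
```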

Correct.
Use the yaml front matter key-value syntax — like MultiMarkdown supports — but (ab)use the official markdown URL syntax to add your metadata.
… my workaround looks like this:
---
[//]: # (Title: My Awesome Title)
[//]: # (Author: Alan Smithee)
[//]: # (Date: 2018-04-27)
[//]: # (Comment: This is my awesome comment. Oh yah.)
[//]: # (Tags: #foo, #bar)
[//]: # (CSS: https://path-to-css)
---
Put this block at the top of your .md doc, with no blank line between the top of the doc and the first ---.
Your fake yaml won't be included when you render to HTML, etc. … it only appears in the .md.
You can also use this technique for adding comments in the body of a markdown doc.

This is not a standard way, but works with Markdown Extra.
I wanted something that worked in the parser, but also didn't leave any clutter when I browse the files on Bitbucket where I store the files.
So I use Abbreviations from the Markdown Extra syntax.
*[blog-date]: 2018-04-27
*[blog-tags]: foo,bar
then I parse them with regexp:
^\*\[blog-date\]:\s*(.+)\s*$
As long as I don't write the exact keywords in the text, they leave no trace. So use some prefix obscure enough to hide them.
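For reference, a generalised version of that regexp in Python might look like this. It is a sketch under my own assumptions: every key shares the `blog-` prefix from the example, and `extract_abbr_metadata` is a hypothetical name.

```python
import re

# Matches lines such as: *[blog-date]: 2018-04-27
ABBR_META = re.compile(
    r"^\*\[blog-([\w-]+)\]:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def extract_abbr_metadata(text):
    """Return {key: value} for every *[blog-KEY]: VALUE line."""
    return dict(ABBR_META.findall(text))


page = "*[blog-date]: 2018-04-27\n*[blog-tags]: foo,bar\n\nPost body.\n"
abbr_meta = extract_abbr_metadata(page)
print(abbr_meta)
```

The tags can then be split with `abbr_meta["tags"].split(",")`.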

I haven't seen this mentioned elsewhere here or in various blogs discussing the subject, but in a project for my personal website, I've decided to use a simple JSON object at the top of each markdown file to store metadata. It's a little more cumbersome to type compared to some of the more textual formats above, but it's super easy to parse. Basically I just do a regex such as ^\s*({.*?})\s*(.*)$ (with the s option on, so that . also matches \n) to capture the JSON and the Markdown content, then parse the JSON with the language's standard method. It allows pretty easily for arbitrary meta fields.
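In Python, that approach could be sketched as follows, with `re.DOTALL` playing the role of the s option. The function name `split_json_header` is hypothetical, and the braces are escaped here only to make the pattern unambiguous.

```python
import json
import re

# Non-greedy match of the first {...}, then the rest of the file.
JSON_HEADER = re.compile(r"^\s*(\{.*?\})\s*(.*)$", re.DOTALL)


def split_json_header(text):
    """Return (metadata dict, Markdown body); ({}, text) if no header."""
    match = JSON_HEADER.match(text)
    if match is None:
        return {}, text
    return json.loads(match.group(1)), match.group(2)


source = '{"title": "Hello", "tags": ["a", "b"]}\n\n# Heading\nBody text\n'
json_meta, json_body = split_json_header(source)
print(json_meta)
print(json_body)
```

One caveat of the non-greedy `{.*?}`: it stops at the first closing brace, so a header containing a nested object would be cut short and fail to parse. If nested metadata is needed, `json.JSONDecoder().raw_decode` may be a more robust way to find where the object ends.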

Related

How are GitHub markdown anchor links constructed?

From a GitHub Markdown header
# Söme/title-header_
GitHub's renderer creates the anchor
#sömetitle-header_
Apparently, spaces and / are removed, letters (ASCII and Unicode) are lowercased, and - and _ are preserved.
Is this correct; are there other rules?
GitHub.com's process for converting Markdown heading text to id="" attributes for automatic #fragment links is not defined by any of the Markdown specifications nor implementations.
For example, it isn't described in the GitHub Flavored Markdown Spec.
Instead, it's something that GitHub do themselves privately after the initial conversion from Markdown to HTML is completed. This is described in step 4 of GitHub's own readme file on the topic (emphasis mine):
This library is the first step of a journey that every markup file in a repository goes on before it is rendered on GitHub.com:
1. github-markup selects an underlying library to convert the raw markup to HTML.
2. The HTML is sanitized, aggressively removing things that could harm you and your kin—such as script tags, inline-styles, and class or id attributes.
3. Syntax highlighting is performed on code blocks. See github/linguist for more information about syntax highlighting.
4. The HTML is passed through other filters that add special sauce, such as emoji, task lists, named anchors, CDN caching for images, and autolinking.
5. The resulting HTML is rendered on GitHub.com.
.md / Markdown files are processed by CommonMarker + libcmark, which does not include id="" attribute and #fragment URI generation as a built-in feature, but CommonMarker's documentation actually provides a sample implementation of Markdown header id="" attributes for #fragment links on the front-page, repeated below:
# doc is assumed to be a document parsed earlier,
# e.g. with CommonMarker.render_doc
class MyHtmlRenderer < CommonMarker::HtmlRenderer
  def initialize
    super
    @headerid = 1 # id to assign to the next header
  end

  # Emit each header as <hN id="K">, with K increasing per header
  def header(node)
    block do
      out("<h", node.header_level, " id=\"", @headerid, "\">",
          :children, "</h", node.header_level, ">")
      @headerid += 1
    end
  end
end

# this renderer prints directly to STDOUT, instead
# of returning a string
myrenderer = MyHtmlRenderer.new
print(myrenderer.render(doc))

# Print any warnings to STDERR
myrenderer.warnings.each do |w|
  STDERR.write("#{w}\n")
end
The above generates monotonically increasing numeric header id="" values, which helps prevent id="" collisions (though this is not a perfect solution), whereas, as you've observed, GitHub prefers to use the header's textContent as the basis for id="" attribute values.
...which means that GitHub is simply doing their own thing when it comes to generating id="" attributes, and there is no published specification for whatever transformation GitHub is using.
The following code gives me quite some mileage:
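def github_anchor(header):
    # Lowercase, turn spaces into hyphens, then keep only
    # letters, digits, "-" and "_" (Unicode letters included).
    lower = header.strip().lower().replace(" ", "-")
    anchor = ""
    for c in lower:
        if c.isalnum() or c in "-_":
            anchor += c
    return anchor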
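Building on the same idea, here is a self-contained sketch that also handles repeated headings. The `-1`, `-2` suffix for duplicates is an assumption based on commonly observed GitHub behaviour, not on any published specification, and both function names are hypothetical.

```python
def github_anchor(header):
    """Approximate GitHub's heading-to-anchor transformation."""
    slug = header.strip().lower().replace(" ", "-")
    return "".join(c for c in slug if c.isalnum() or c in "-_")


def unique_anchors(headers):
    """Anchors for a list of headings, disambiguating repeats.

    Assumption: the second occurrence of an anchor gets "-1", the
    third "-2", and so on.
    """
    seen = {}
    anchors = []
    for header in headers:
        base = github_anchor(header)
        count = seen.get(base, 0)
        anchors.append(base if count == 0 else f"{base}-{count}")
        seen[base] = count + 1
    return anchors


print(github_anchor("Söme/title-header_"))
print(unique_anchors(["Setup", "Usage", "Setup"]))
```

The first call reproduces the anchor from the question for the heading `Söme/title-header_`.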

pandoc markdown to docx - keep list on one page

I have a markdown list like so:
* Question A
    - Answer 1
    - Answer 2
    - Answer 3
I need to ensure that all the answers (1 - 3) appear on the same page as Question A when I convert the markdown document to docx using pandoc. How can I do this?
Use custom styles in your Markdown and then define those styles in a custom docx template.
It's important to note that Pandoc's documentation states (emphasis added):
Because pandoc’s intermediate representation of a document is less
expressive than many of the formats it converts between, one should
not expect perfect conversions between every format and every other.
Pandoc attempts to preserve the structural elements of a document, but
not formatting details...
Of course, Markdown has no concept of "pages" or "page breaks," so that is not something Pandoc can handle by default. However, Pandoc is aware of docx styles. As the documentation explains:
By default, pandoc’s docx output applies a predefined set of styles
for blocks such as paragraphs and block quotes, and uses largely
default formatting (italics, bold) for inlines. This will work for
most purposes, especially alongside a reference.docx file. However, if
you need to apply your own styles to blocks, or match a preexisting
set of styles, pandoc allows you to define custom styles for blocks
and text using divs and spans, respectively.
If you define a div or span with the attribute custom-style, pandoc
will apply your specified style to the contained elements. So, for
example using the bracketed_spans syntax,
[Get out]{custom-style="Emphatically"}, he said.
would produce a docx file with “Get out” styled with character style
Emphatically. Similarly, using the fenced_divs syntax,
Dickinson starts the poem simply:
::: {custom-style="Poetry"}
| A Bird came down the Walk---
| He did not know I saw---
:::
would style the two contained lines with the Poetry paragraph style.
If the styles are not yet in your reference.docx, they will be defined
in the output file as inheriting from normal text. If they are already
defined, pandoc will not alter the definition.
If you don't want to define the style manually, but would like it applied to every list automatically (or perhaps to every list which follows a specific pattern), you could define a custom filter which applied the style(s) to every matching element in the document.
Of course, that only adds the style names to the output. You still need to define the styles (tell Word how to display elements assigned those styles). As the documentation for the --reference-doc option explains :
For best results, the reference docx should be a modified version of a
docx file produced using pandoc. The contents of the reference docx
are ignored, but its stylesheets and document properties (including
margins, page size, header, and footer) are used in the new docx. If
no reference docx is specified on the command line, pandoc will look
for a file reference.docx in the user data directory (see --data-dir).
If this is not found either, sensible defaults will be used.
To produce a custom reference.docx, first get a copy of the default
reference.docx: pandoc --print-default-data-file reference.docx >
custom-reference.docx. Then open custom-reference.docx in Word, modify
the styles as you wish, and save the file.
Of course, when modifying the custom-reference.docx in Word, you can add your new custom style which you have used in your Markdown. As @CindyMeister points out in a comment:
Word would handle this using styles, where the Question style would
have the paragraph setting "Keep with Next". the Answer style would
have this as well. A third style, for the last entry, would NOT have
the setting activated. In addition, all three styles would have the
paragraph setting "Keep together" activated.
Finally, when using pandoc to convert your Markdown to a Word docx file, use the option --reference-doc=custom-reference.docx and your custom style definitions will be included in the generated docx file. As long as you also properly identify which elements in the Markdown document get which styles, you should have a list which doesn't get broken across a page break, as long as the entire list fits on one page.
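The custom filter idea mentioned earlier could be sketched as below, working directly on pandoc's JSON representation of the document using only the standard library. This is an untested assumption about one way to do it, not a recipe from the Pandoc documentation: the style name `KeepTogether` and all function names are hypothetical, and a single style for the whole list is a simplification of the three-style setup described in the comment above.

```python
import json


def wrap_in_style(block, style):
    """Wrap one pandoc AST block in a Div with a custom-style attribute."""
    # Div attributes are [identifier, classes, key-value pairs].
    attr = ["", [], [["custom-style", style]]]
    return {"t": "Div", "c": [attr, [block]]}


def style_bullet_lists(doc, style="KeepTogether"):
    """Wrap every top-level BulletList of a pandoc JSON document."""
    doc["blocks"] = [
        wrap_in_style(block, style) if block["t"] == "BulletList" else block
        for block in doc["blocks"]
    ]
    return doc


def run_filter(json_text, style="KeepTogether"):
    """Transform the JSON text pandoc hands to a filter on stdin."""
    return json.dumps(style_bullet_lists(json.loads(json_text), style))
```

To use this as an actual filter, a script would write `run_filter(sys.stdin.read())` to standard output and be passed to pandoc with `--filter`, together with `--reference-doc=custom-reference.docx`. The style itself still has to be defined in the reference docx.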

GitHub satanically messing with Markdown - changes 666 to DCLXVI

My GitHub repository has nothing but a readme in it. In this readme, locally I wrote this:
Factoids:
- There are about six different ways to do everything in Forked.
- There are actually six different ways to enter loops.
- There are six directionals and six I/O commands.
- 666. ha.
Emphasis on the last line.
What GitHub decided to show was not 666.
DCLXVI is the Roman Numeral number for 666.
This really creeped me out. My local file and the raw file both show 666.
What is GitHub doing, and why is the indentation on the un-numbered list messed up? Is this an easter egg, or some satanic bug?
This seems to be covered by github/markup issue 991, where, in an ordered sub-list, decimal numerals automatically turn into Roman numerals.
I have found the cause of problem. It is CSS
This is the expected way for nested ordered lists to render in HTML.
This is not expected in HTML. https://jsfiddle.net/tf5jtv8s
We don't make any modifications to the default HTML behavior.
ol ol,ul ol{list-style-type:lower-roman}
I don't know CSS but my understanding is that this is the cause of problem. I can get expected result by disabling CSS. (I am from my mobile so I can't use browser inspector)
As mentioned in "A formal spec for GitHub Flavored Markdown", the GitHub Flavored Markdown (GFM) Spec is built on top of the CommonMark Spec.
And as Tommi Kaikkonen mentioned in his answer, the ordered list is because of the dot following 666. See GFM Spec section 5.2.
As mentioned in section 6.1, any ASCII punctuation character may be backslash-escaped, to avoid this issue.
That means:
- 666\. ha.
(as explicitly shown in ForNeVeR's answer)
That is why that 666 number is changed to roman numerals in a GitHub README markdown.
Mike Lippert commented:
It's the 1st element in that list so it should show as i not dclxvi.
Markdown ordered lists ignore the actual number used and number sequentially, and I haven't seen a way to change that.
However, no: it shows dclxvi, because the generated html code is <ol start="666">, which is consistent with the GFM specs:
"If the list item is ordered, then it is also assigned a start number, based on the ordered list marker"
(here, '666' is the ordered list marker)
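The arithmetic behind what the browser displays can be sketched in a few lines of Python. This is only an illustration of how a start value of 666 combined with list-style-type: lower-roman ends up as dclxvi; browsers have their own implementation, and `to_lower_roman` is a hypothetical helper.

```python
def to_lower_roman(number):
    """Convert a positive integer to a lower-case Roman numeral."""
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, numeral in pairs:
        count, number = divmod(number, value)
        parts.append(numeral * count)
    return "".join(parts)


start = 666  # the start attribute taken from the list marker
print(to_lower_roman(start))      # marker of the first item
print(to_lower_roman(start + 1))  # marker a second item would get
```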
Mike adds:
@VonC For anyone else here's another useful excerpt from VonC's doc link:
"The start number of an ordered list is determined by the list number of its initial list item. The numbers of subsequent list items are disregarded."
Also, why is the spacing messed up? I didn't catch that in your answer
You get an ordered list <ol> within an un-ordered list item <li>:
<ul>
  <li>
    <ol start="666">
      <li>ha.</li>
    </ol>
  </li>
</ul>
GitHub CSS rules include:
.markdown-body ol {
  padding-left: 2em;
}
If you put 3em instead of 2em, the nested list is indented further.
Adding a period after 666 makes it an ordered list marker.
GitHub declares CSS that renders ordered list markers using roman numerals:
ol ol,ul ol {
  list-style-type: lower-roman
}
Escape the period with a backslash, and you should see the correct output.
While other answers are good at explaining why you have the problem, they haven't given you an exact example of how to fix that.
And it seems that you've already solved it in an imperfect manner, replacing your text with
- `666`. ha.
There's a common trick to escape the dot after the number to make it look like a normal text (and not an ordered list label):
- 666\. ha. (this will render as you probably want)
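If you generate Markdown programmatically and want to apply this escape automatically, a small Python sketch could look like this. It follows the definition of an ordered list marker in the CommonMark/GFM spec (1 to 9 digits followed by `.` or `)`); the regular expression and the name `escape_ordered_marker` are my own assumptions.

```python
import re

# Optional indentation and bullet, then 1-9 digits followed by "." or ")",
# which must in turn be followed by whitespace or the end of the line.
MARKER = re.compile(r"^(\s*(?:[-+*]\s+)?)(\d{1,9})([.)])(?=\s|$)")


def escape_ordered_marker(line):
    """Backslash-escape a leading ordered-list marker, if there is one."""
    return MARKER.sub(r"\1\2\\\3", line)


print(escape_ordered_marker("- 666. ha."))
print(escape_ordered_marker("- There are six directionals."))
```

Lines that do not start with such a marker, including ones that are already escaped, are returned unchanged.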

strikethrough code in markdown on github

I am talking about github markdown here, for files like README.md.
Question:
Is it possible to strikethrough a complete code block in markdown on github?
I know how to mark text as a block of code
this is
multiline code
and
this
this
also
by indenting by 4 spaces or by using ``` or `...
I also know how to strike through texts using
del tag
s tag
~~
Temporary solution:
Independently they work fine, but together not as expected or desired. I tried several combinations of the above mentioned.
For now, I use this:
striked
through
by using ~~ and ` for every single line.
Requirement:
I would like to have a code formatted text striked through, where the code block is continuous:
unfortunately, this is
not striked through
or at least with only a small paragraph in between:
unfortunately, also not
striked through
Is this possible at all?
I found some old posts and hints on using jekyll, but what I was searching for is a simple way, preferably in markdown.
This would only be possible with raw HTML, which GitHub doesn't allow. But you may be able to use a diff instead.
Code blocks are for "pre-formatted" text only. The only formatting you can get in a code block is the formatting that can be represented in plain text (indentation, capitalization, etc). There is no mechanism to mark up the content of a code block (as bold, italic, stricken, underlined, etc). This was an intentional design decision. Otherwise, how would you be able to show Markdown text in a code block? If you want formatted text, then you need to use something other than a code block.
As the rules state:
HTML is a publishing format; Markdown is a writing format. Thus, Markdown’s formatting syntax only addresses issues that can be conveyed in plain text.
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself.
Therefore you would need to format your own custom HTML code block with the various bits marked up properly:
<pre><code><del>some stricken code</del>
<del>A second line of stricken code</del>
</code></pre>
However, for security reasons, GitHub will strip out any such raw HTML in your Markdown. So while this works where you have full control of the entire stack, on a hosted service it is most likely not possible.
However, I'm assuming you want to show some changes made to a block of code. As it turns out, a specific format already exists for that, namely, a diff. Just use a fenced code block with diff as the language and GitHub will format it correctly:
```diff
Unchanged Line
- Removed Line
+ Added Line
```
You can see how GitHub displays the above code block live (you can also see that in raw), but I've included a screenshot below for convenience.
I realize that the formatting does not use strike-through, but it does use a commonly used and understood format. For more complex blocks, you should probably use the diff utility program to generate the diff for you.
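If you would rather generate the diff from a script than with the diff utility, Python's standard `difflib` module can produce the same format. The sketch below is just one way to do it; `as_diff_block` and the `before`/`after` labels are hypothetical names.

```python
import difflib

# Three backticks, built this way so the fence inside this example
# does not clash with the fence around it.
FENCE = "`" * 3


def as_diff_block(old_code, new_code):
    """Return a fenced diff block describing old_code -> new_code."""
    diff = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return "\n".join([FENCE + "diff", *diff, FENCE])


block = as_diff_block(
    "Unchanged Line\nRemoved Line\n",
    "Unchanged Line\nAdded Line\n",
)
print(block)
```

The result can be pasted straight into a README; removed lines start with `-` and added lines with `+`, as in the example above.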
Expanding on Waylan's answer:
This may be obvious to others, but it caught me. When you have indented lines, be sure + or - is the first character on the line or it won't highlight.
```diff
<div>
  Unchanged Line
  <ul>
    - <li>This won't work</li>
-   <li>This will</li>
+   <li>1st character, then indent</li>
  </ul>
</div>
```
After much much trying, I finally got it to work! It boils down to this:
- inside ``` block, nothing is rendered (other than syntax highlights for language specified)
- inside <code> block, markdown won't render, only HTML. You can use <strike>. It's fine, but you don't get the syntax coloring
- now for the magic: use HTML for striking, and markdown for coloring:
<strike>
```language
this is
multiline code
```
</strike>
P.S. ``` blocks should always be surrounded by blank lines to work
On the subject of marking up the content of a code block, to tack an italicized string on to the end of a line of "code", try something like:
<code>id\_pn\_aside\_subscriber\_form\__form\_id_</code>
(You can see this in action at: https://github.com/devonostendorf/post-notif#how-do-you-use-the-stylesheet_filename-attribute-with-the-shortcode)
I had a hard time finding an example that matched this precise use case, so I hope this proves useful for anyone else trying to accomplish a similar effect.

How to get rid of Org-mode link translation?

I write articles with Org-mode, and it works very well, but I have run into one very annoying problem.
I post my articles to a forum, along with a lot of pictures.
I use IMG codes to post the pictures,
e.g. [IMG]http://abc.com/a.jpg[/IMG]
When I export my Org file to ASCII, HTML, or any other format, Org-mode always treats "http" specially. The export looks like this:
[IMG][http://abc.com/a.jpg[/IMG]]
An extra "[" is always inserted before "http", with a matching "]" at the end, and every time I have to remove them myself.
I wish Org-mode would not touch the http string.
Any idea?
Org mode actually parses that particular markup poorly (with the square brackets). If your image links are on a separate line, for example between paragraphs, you can use some markup to disable org formatting:
#+BEGIN_EXAMPLE
[IMG]http://abc.com/a.jpg[/IMG]
#+END_EXAMPLE
A shorthand for this is simply to start the line with a colon followed by a space:
: [IMG]http://abc.com/a.jpg[/IMG]