RMarkdown with Knit to HTML: How to increase section number? - knitr

I am generating HTML pages with KnitR HTML from RMarkdown (using R Studio).
I need section numbering starting at an arbitrary number, not 0 or 1.
Example of what I am trying to achieve:
TITLE
2 SECTION
2.1 SUB-SECTION
I am using a code such as:
---
title: "TITLE"
output:
html_document:
number_sections: yes
---
# SECTION
## SUBSECTION

I found the following in this question.
The pandoc option --number-offset allows to specify the "offset for section headings in HTML output":
--number-offset=NUMBER[,NUMBER,…]
Offset for section headings in HTML output (ignored in other output formats). The first number is added to the section number for top-level headers, the second for second-level headers, and so on. So, for example, if you want the first top-level header in your document to be numbered “6”, specify --number-offset=5. If your document starts with a level-2 header which you want to be numbered “1.5”, specify --number-offset=1,4. Offsets are 0 by default. Implies --number-sections.
Quoted from the pandoc README file.

Related

How are GitHub markdown anchor links constructed?

From a GitHub Markdown header
# Söme/title-header_
GitHub's renderer creates the anchor
#sömetitle-header_
Apparently, spaces and / are removed, letters (ASCII and Unicode) are lowercased, and - and _ are preserved.
Is this correct; are there other rules?
GitHub.com's process for converting Markdown heading text to id="" attributes for automatic #fragment links is not defined by any of the Markdown specifications nor implementations.
For example, it isn't described in the GitHub Flavored Markdown Spec.
Instead, it's something that GitHub do themselves privately after initial conversion from Markdown to HTML is completed, this is described in Step 4 in GitHub's own readme file on the topic (emphasis mine):
This library is the first step of a journey that every markup file in a repository goes on before it is rendered on GitHub.com:
github-markup selects an underlying library to convert the raw markup to HTML.
The HTML is sanitized, aggressively removing things that could harm you and your kin—such as script tags, inline-styles, and class or id attributes.
Syntax highlighting is performed on code blocks. See github/linguist for more information about syntax highlighting.
The HTML is passed through other filters that add special sauce, such as emoji, task lists, named anchors, CDN caching for images, and autolinking.
The resulting HTML is rendered on GitHub.com.
.md / Markdown files are processed by CommonMarker + libcmark, which does not include id="" attribute and #fragment URI generation as a built-in feature, but CommonMarker's documentation actually provides a sample implementation of Markdown header id="" attributes for #fragment links on the front-page, repeated below:
class MyHtmlRenderer < CommonMarker::HtmlRenderer
def initialize
super
#headerid = 1
end
def header(node)
block do
out("<h", node.header_level, " id=\"", #headerid, "\">",
:children, "</h", node.header_level, ">")
#headerid += 1
end
end
end
# this renderer prints directly to STDOUT, instead
# of returning a string
myrenderer = MyHtmlRenderer.new
print(myrenderer.render(doc))
# Print any warnings to STDERR
renderer.warnings.each do |w|
STDERR.write("#{w}\n")
end
The above above generates numeric monotonically increasing header id="" values, which helps prevent id="" collisions (though this is not a perfect solution), whereas as you've observed GitHub prefers to use the header's textContent as the basis for id="" attribute values.
...which means that GitHub is simply doing their own thing when it comes to generating id="" attributes, and there is no published specification for whatever transformation GitHub is using.
The following code gives me quite some mileage:
lower=header.strip().lower().replace(" ","-")
anchor=""
for c in lower:
if c.isalnum() or c in "-_":
anchor+=c

rMarkdown: compile word docx document with line numbers

I am trying to produce a document with rMarkdown code that will render a word docx document that contains continuous line numbers. This is to facilitate workflow for publication submissions where line numbers are often requested.
I have tried using a style docx document with line numbers that was originally rendered from rMarkdown and subsequently edited to change heading style and line numbers. Heading style does change according to that style document but line numbers are not introduced. Desperate latex package use also didn't work in yaml as below.
---
output:
pdf_document:
citation_package: natbib
keep_tex: yes
fig_caption: yes
latex_engine: pdflatex
word_document:
reference_docx: ele_style.docx
geometry: margin=1in
fontfamily: mathpazo
fontsize: 11pt
mainfont: Times New Roman
linespread: 2
header-includes:
-\usepackage{helvet}
-\usepackage{setspace}
-\doublespacing
-\usepackage{lineno}
-\linenumbers
bibliography: LitRev_Bibli.bib
biblio-style: Zotero/styles/ecology-letters.csl
---
Does anyone know how to render line numbers into word docx from rMarkdown?

pandoc markdown to docx - keep list on one page

I have a markdown list like so:
* Question A
- Answer 1
- Answer 2
- Answer 3
I need to ensure that all the answers (1 - 3) appear on the same page as Question A when I convert the markdown document to docx using pandoc. How can I do this?
Use custom styles in your Markdown and then define those styles in a custom docx template.
It's important to note that Pandoc's documentation states (emphasis added):
Because pandoc’s intermediate representation of a document is less
expressive than many of the formats it converts between, one should
not expect perfect conversions between every format and every other.
Pandoc attempts to preserve the structural elements of a document, but
not formatting details...
Of course, Markdown has no concept of "pages" or "page breaks," so that is not something Pandoc can handle by default. However, Pandoc is aware of docx styles. As the documentation explains:
By default, pandoc’s docx output applies a predefined set of styles
for blocks such as paragraphs and block quotes, and uses largely
default formatting (italics, bold) for inlines. This will work for
most purposes, especially alongside a reference.docx file. However, if
you need to apply your own styles to blocks, or match a preexisting
set of styles, pandoc allows you to define custom styles for blocks
and text using divs and spans, respectively.
If you define a div or span with the attribute custom-style, pandoc
will apply your specified style to the contained elements. So, for
example using the bracketed_spans syntax,
[Get out]{custom-style="Emphatically"}, he said.
would produce a docx file with “Get out” styled with character style
Emphatically. Similarly, using the fenced_divs syntax,
Dickinson starts the poem simply:
::: {custom-style="Poetry"}
| A Bird came down the Walk---
| He did not know I saw---
:::
would style the two contained lines with the Poetry paragraph style.
If the styles are not yet in your reference.docx, they will be defined
in the output file as inheriting from normal text. If they are already
defined, pandoc will not alter the definition.
If you don't want to define the style manually, but would like it applied to every list automatically (or perhaps to every list which follows a specific pattern), you could define a custom filter which applied the style(s) to every matching element in the document.
Of course, that only adds the style names to the output. You still need to define the styles (tell Word how to display elements assigned those styles). As the documentation for the --reference-doc option explains :
For best results, the reference docx should be a modified version of a
docx file produced using pandoc. The contents of the reference docx
are ignored, but its stylesheets and document properties (including
margins, page size, header, and footer) are used in the new docx. If
no reference docx is specified on the command line, pandoc will look
for a file reference.docx in the user data directory (see --data-dir).
If this is not found either, sensible defaults will be used.
To produce a custom reference.docx, first get a copy of the default
reference.docx: pandoc --print-default-data-file reference.docx >
custom-reference.docx. Then open custom-reference.docx in Word, modify
the styles as you wish, and save the file.
Of course, when modifying the custom-reference.docx in Word, you can add your new custom style which you have used in your Markdown. As #CindyMeister points out in a comment:
Word would handle this using styles, where the Question style would
have the paragraph setting "Keep with Next". the Answer style would
have this as well. A third style, for the last entry, would NOT have
the setting activated. In addition, all three styles would have the
paragraph setting "Keep together" activated.
Finally, when using pandoc to convert your Markdown to a Word docx file, use the option --reference-doc=custom-reference.docx and your custom style definitions will be included in the generated docx file. As long as you also properly identify which elements in the Markdown document get which styles, your should have a list which doesn't get broken across a page break as long at the entire list fits on one page.

Markdown metadata format

Is there a standard or convention for embedding metadata in a Markdown formatted post, such as the publication date or post author for conditional rendering by the renderer?
Looks like this Yaml metadata format might be it.
There are all kinds of strategies, e.g. an accompanying file mypost.meta.edn, but I'm hoping to keep it all in one file.
There are two common formats that look very similar but are actually different in some very specific ways. And a third which is very different.
YAML Front Matter
The Jekyll static site generator popularized YAML front matter which is deliminated by YAML section markers. Yes, the dashes are actually part of the YAML syntax. And the metadata is defined using any valid YAML syntax. Here is an example from the Jekyll docs:
---
layout: post
title: Blogging Like a Hacker
---
Note that YAML front matter is not parsed by the Markdown parser, but is removed prior to parsing by Jekyll (or whatever tool you're using) and could actually be used to request a different parser than the default Markdown parser for that page (I don't recall if Jekyll does that, but I have seen some tools which do).
MultiMarkdown Metadata
The older and simpler MultiMarkdown Metadata is actually incorporated into a few Markdown parsers. While it has more recently been updated to optionally support YAML deliminators, traditionally, the metadata ends and the Markdown document begins upon the first blank line (if the first line was blank, then no metadata). And while the syntax looks very similar to YAML, only key-value pairs are supported with no implied types. Here is an example from the MultiMarkdown docs:
Title: A Sample MultiMarkdown Document
Author: Fletcher T. Penney
Date: February 9, 2011
Comment: This is a comment intended to demonstrate
metadata that spans multiple lines, yet
is treated as a single value.
CSS: http://example.com/standard.css
The MultiMarkdown parser includes a bunch of additional options which are unique to that parser, but the key-value metadata is used across multiple parsers. Unfortunately, I have never seen any two which behaved exactly the same. Without the Markdown rules defining such a format everyone has done their own slightly different interpretation resulting in a lot of variety.
The one thing that is more common is the support for YAML deliminators and basic key-value definitions.
Pandoc Title Block
For completeness there is also the Pandoc Title Block. If has a very different syntax and is not easily confused with the other two. To my knowledge, it is only supported by Pandoc (if enabled), and it only supports three types of data: title, author, and date. Here is an example from the Pandoc documentation:
% title
% author(s) (separated by semicolons)
% date
Note that Pandoc Title Blocks are one of two style supported by Pandoc. Pandoc also supports YAML Metadata as described above.
A workaround use standard syntax and compatible with all other viewers.
I was also looking for a way to add application specific metadata to markdown files while make sure the existing viewers such as vscode and github page will ignore added metadata. Also to use extended markdown syntax is not a good idea because I want to make sure my files can be rendered correctly on different viewers.
So here is my solution: at beginning of markdown file, use following syntax to add metadata:
[_metadata_:author]:- "daveying"
[_metadata_:tags]:- "markdown metadata"
This is the standard syntax for link references, and they will not be rendered while your application can extract these data out.
The - after : is just a placeholder for url, I don't use url as value because you cannot have space in urls, but I have scenarios require array values.
Most Markdown renderers seem to support this YAML format for metadata at the top of the file:
---
layout: post
published-on: 1 January 2000
title: Blogging Like a Boss
---
Content goes here.
The most consistent form of metadata that I've found for Markdown is actually HTML meta tags, since most Markdown interpreters recognize HTML tags and will not render meta tags, meaning that metadata can be stored in a way that will not show up in rendered HTML.
<title>Hello World</title>
<meta name="description" content="The quick brown fox jumped over the lazy dog.">
<meta name="author" content="John Smith">
## Heading
Markdown content begins here
You can try this in something like GitHub Gist or StackEdit.
Correct.
Use the yaml front matter key-value syntax — like MultiMarkdown supports — but (ab)use the official markdown URL syntax to add your metadata.
… my workaround looks like this:
---
[//]: # (Title: My Awesome Title)
[//]: # (Author: Alan Smithee)
[//]: # (Date: 2018-04-27)
[//]: # (Comment: This is my awesome comment. Oh yah.)
[//]: # (Tags: #foo, #bar)
[//]: # (CSS: https://path-to-css)
---
Put this block at the top of your .md doc, with no blank line between the top of the doc and the first ---.
Your fake yaml won't be included when you render to HTML, etc. … it only appears in the .md.
You can also use this technique for adding comments in the body of a markdown doc.
This is not a standard way, but works with Markdown Extra.
I wanted something that worked in the parser, but also didn't leave any clutter when I browse the files on Bitbucket where I store the files.
So I use Abbreviations from the Markdown Extra syntax.
*[blog-date]: 2018-04-27
*[blog-tags]: foo,bar
then I parse them with regexp:
^\*\[blog-date\]:\s*(.+)\s*$
As long as I don't write the exact keywords in the text, they leave no trace. So use some prefix obscure enough to hide them.
I haven't seen this mentioned elsewhere here or in various blogs discussing the subject, but in a project for my personal website, I've decided to use a simple JSON object at the top of each markdown file to store metadata. It's a little more cumbersome to type compared to some of the more textual formats above, but it's super easy to parse. Basically I just do a regex such as ^\s*({.*?})\s*(.*)$ (with the s option on to treat . as \n) to capture the json and markdown content, then parse the json with the language's standard method. It allows pretty easily for arbitrary meta fields.

Determine whether file is a PDF in perl?

Using perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewer require only that the header appears within the first 1024 bytes of a file. (I've seen some cases where a job control prefix was added to the start of a PDF file, so '%PDF-1.' weren't the first seven bytes of the file)
The subsequent implementation note from the third edition (PDF 1.4) states: Acrobat viewers will also accept a header of the form: %!PS-Adobe-N.n PDF-M.m but note that this isn't part of the ISO32000:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful because I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. so a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF',
The third edition of the PDF Reference has an implementation note that Acrobat viewer requires only that the %%EOF marker appears within the last 1024 bytes of a file.
Two lines above the %%EOF should be the 'startxref' token and the line in between should be a number for the byte offset from the start of the file to the last cross reference table.
In sum, read in the first and last 1kb of the file into a byte buffer, check that the relevant identifying byte string tokens are approximately where they are supposed to be and if they are then you have a reasonable expectation that you have a PDF file on your hands.
The module PDF::Parse has method called IsaPDF which
Returns true, if the file could be parsed and is a PDF-file.