Convert Stack Exchange Markdown to Github Markdown

Has anyone documented the differences between Stack Exchange Markup and Github Markup?
I'm in the midst of a project to convert Stack Exchange Markdown to Github Markdown. It might be a little more complicated because Jekyll on Github Pages uses a Markdown derivative called "Kramdown".
I've already written some of the conversion in my Python program. For example old SE posts with #Header must be converted to # Header.
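For illustration, here is a minimal Python sketch of that header fix. The function name `fix_headers` is hypothetical and not taken from any existing converter. It assumes headers start in column one, skips fenced code blocks, and leaves shebang lines such as `#!/bin/bash` alone. A line like `#1 priority` would still be treated as a header, so treat this as a starting point only.

```python
import re

FENCE = "`" * 3  # code fence marker

# One to six '#' characters glued directly to the header text.
# '!' is excluded so that shebang lines (#!/bin/bash) are not touched.
_GLUED_HEADER = re.compile(r"^(#{1,6})([^#\s!])")


def fix_headers(markdown):
    """Insert the missing space between leading '#' characters and the header text."""
    fixed = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            # Entering or leaving a fenced code block.
            in_fence = not in_fence
        elif not in_fence:
            line = _GLUED_HEADER.sub(r"\1 \2", line)
        fixed.append(line)
    return "\n".join(fixed)
```

Four-space indented code is not affected because the pattern only matches a `#` in the first column.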
Another example are "> Block quote" lines have two spaces appended to the end of the line.
Now it's starting to get tricky (for me at least), where an image in SE is specified as:
[![Ubuntu 5 DE.png][1]][1]
**Note:** Blah, blah, blah
[1]: https://i.stack.imgur.com/MoxHd.jpg
It has to be converted to Github image markdown format:
![Ubuntu 5 DE.png](https://i.stack.imgur.com/MoxHd.jpg)
**Note:** Blah, blah, blah
Another example of "footer hyper links" (for lack of a better noun) in Stack Exchange Markdown is:
- [Jack Master Volume?][1]
The simplest solution then is to install [JackMix][2]:
find listed [here][3].
[this script][4] is where you are heading:
[1]: https://discourse.ardour.org/t/jack-master-volume/84650
[2]: http://www.arnoldarts.de/jackmix/.
[3]: http://jackaudio.org/applications/
[4]: https://unix.stackexchange.com/questions/374085/lower-or-increase-pulseaudio-volume-on-all-outputs
Needs to be converted to Github Markdown format of:
- [Jack Master Volume?](https://discourse.ardour.org/t/jack-master-volume/84650)
The simplest solution then is to install [JackMix](http://www.arnoldarts.de/jackmix/.):
find listed [here](http://jackaudio.org/applications/).
[this script](https://unix.stackexchange.com/questions/374085/lower-or-increase-pulseaudio-volume-on-all-outputs) is where you are heading:
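A rough Python sketch of this rewrite is shown here. The function name `inline_reference_links` is made up for illustration. It assumes simple definitions of the form `[label]: url` without titles, and it drops the wrapping link around an image so the result matches the image example above.

```python
import re

# A definition line such as "  [1]: https://example.com/page"
_DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:[ \t]*(\S+)[ \t]*\n?", re.MULTILINE)
# An image wrapped in a link to the same picture: [![alt][1]][1]
_WRAPPED_IMAGE = re.compile(r"\[(!\[[^\[\]]*\]\[[^\[\]]+\])\]\[[^\[\]]+\]")
# A reference use such as "[text][1]" or "![alt][1]"
_USAGE = re.compile(r"(!?)\[([^\[\]]*)\]\[([^\[\]]+)\]")


def inline_reference_links(markdown):
    """Rewrite reference-style links and images as inline links."""
    # Labels are matched case-insensitively, as in Markdown itself.
    refs = {label.lower(): url for label, url in _DEFINITION.findall(markdown)}

    def to_inline(match):
        bang, text, label = match.groups()
        url = refs.get(label.lower())
        if url is None:
            return match.group(0)  # unknown label: leave untouched
        return f"{bang}[{text}]({url})"

    body = _DEFINITION.sub("", markdown)      # drop the definition lines
    body = _WRAPPED_IMAGE.sub(r"\1", body)    # keep only the image itself
    return _USAGE.sub(to_inline, body)
```

Definitions with titles, shortcut references like `[text][]`, and references inside code blocks are not handled.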
Finally tonight I discovered that in Stack Exchange you can have:
<!-- language: bash -->
#!/bin/bash
cat "$Filename.zip" | base64 > "$Filename64"
That needs reformatting to Github Markdown like this:
``` bash
#!/bin/bash
cat "$Filename.zip" | base64 > "$Filename64"
```
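One possible way to do this in Python is sketched here. It assumes the hint comment is followed by a classic four-space indented code block; `hints_to_fences` is an illustrative name, and the `language-all` form is deliberately not handled.

```python
import re

FENCE = "`" * 3  # code fence marker

# <!-- language: bash --> or <!-- language: lang-bash -->
_HINT = re.compile(r"^\s*<!--\s*language:\s*(?:lang-)?([\w+#.-]+)\s*-->\s*$")


def hints_to_fences(markdown):
    """Turn an SE language hint plus indented block into a fenced code block."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        hint = _HINT.match(lines[i])
        if not hint:
            out.append(lines[i])
            i += 1
            continue
        # Skip blank lines between the hint and the code block.
        j = i + 1
        while j < len(lines) and not lines[j].strip():
            j += 1
        # Collect the indented block, removing the four-space indent.
        block = []
        while j < len(lines) and (lines[j].startswith("    ") or not lines[j].strip()):
            block.append(lines[j][4:])
            j += 1
        # Blank lines at the end belong to the text after the block.
        trailing = 0
        while block and not block[-1].strip():
            block.pop()
            trailing += 1
        if not block:
            out.append(lines[i])  # hint without code: keep it unchanged
            i += 1
            continue
        out.append(FENCE + hint.group(1))
        out.extend(block)
        out.append(FENCE)
        out.extend([""] * trailing)
        i = j
    return "\n".join(out)
```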
It gets even more complicated when SE Markdown has:
<!-- language-all: lang-bash -->
Or it has this:
<pre><code>Some lines of code
some more lines of
code </code></pre>
An existing Github Repo to convert would be awesome! If not then if someone has documented the differences between Stack Exchange Markup and Github Markup that would be great too.
If this question goes unanswered for a month then I guess I'll be answering it eventually after the trial-error-fix process is finished.

The self-answer seems to answer a different question, so I'll answer the one posed above.
Has anyone documented the differences between Stack Exchange Markup and Github Markup?
As of mid-2020, Stack Exchange uses CommonMark with support for a few custom features "like spoilers, MathJax, circuit diagrams, stack snippets, etc."
GitHub uses their own dialect, GitHub Flavored Markdown (GFM). The most notable extensions that GFM introduces are probably tables (supported on SE since late 2020), fenced code blocks (supported on SE since early 2019), and task lists (unsupported but also not really necessary on SE).
Most of the examples shown above work fine when rendered on Stack Exchange or by a compliant GFM implementation, but let's look at them in turn:
For example old SE posts with #Header must be converted to # Header.
This should have been cleaned up by the migration to CommonMark:
we will run a big migration across the network that will convert existing posts to use the new CommonMark format
Another example are "> Block quote" lines have two spaces appended to the end of the line.
Two spaces at the end of the line indicate a line break in Markdown, going all the way back to the original implementation.
Blockquotes do not require such line breaks, though they will rewrap without them. Line breaks and blockquotes are unrelated features.
Your image examples are interchangeable in both formats. Let's look at the SE one:
[![Ubuntu 5 DE.png][1]][1]
**Note:** Blah, blah, blah
[1]: https://i.stack.imgur.com/MoxHd.jpg
There is nothing specific to Stack Exchange here.
The [foo][1]… [1]: https://... syntax is a reference-style link which, again, comes from the original project. It is equivalent to the inline form [foo](https://...). Both forms work on both platforms. ![Ubuntu 5 DE.png][1]… [1]: https://i.stack.imgur.com/MoxHd.jpg is the same as ![Ubuntu 5 DE.png](https://i.stack.imgur.com/MoxHd.jpg).
On SE images are wrapped by a link by default, which introduces the wrapping [...][1]. But, again, this is not SE-specific. It's like comparing <a href="..."><img src="..."></a> to <img src="...">. This also works on both platforms.
Another example of "footer hyper links" (for lack of a better noun)
These are reference-style links already discussed and present in every version of Markdown I've ever seen, including the original project. They work in GFM as well as they do in CommonMark. No conversion is necessary.
Finally tonight I discovered that in Stack Exchange you can have:
<!-- language: bash -->
#!/bin/bash
cat "$Filename.zip" | base64 > "$Filename64"
That needs reformatting to Github Markdown like this:
``` bash
#!/bin/bash
cat "$Filename.zip" | base64 > "$Filename64"
```
This is the only example above that requires any special behaviour at all. However, it mostly works out of the box. GFM supports indented code blocks and SE has supported fenced code blocks for over three years.
GFM doesn't understand legacy HTML-style SE language hints <!-- language: bash --> and <!-- language-all: ... -->, so it will render such code blocks without syntax highlighting. But they will still render as a code block.
The last example is just embedded HTML that both platforms (and the original) know how to render:
<pre><code>Some lines of code
some more lines of
code </code></pre>
With the exception of HTML-style language hints, every example you show above works out of the box on both Stack Exchange and GitHub Flavored Markdown. No conversion necessary.

Converting thousands of Stack Exchange Q&A in markdown format isn't as easy
as simply copying them over to GitHub Pages. The python program
stack-to-blog.py was used to convert Stack Exchange posts to
GitHub Pages Posts.
The full stack-to-blog.py program can be accessed on the
Pippim Website repo 🔗.
The program automatically:
Creates Jekyll front matter on posts and front matter totals for site.
Selects Stack Exchange Posts based on meeting minimum criteria such as up-votes or accepted answer status.
If self-answered question, the answer is included and not the question.
If self-answered question, the accepted answer alone doesn't qualify. Votes from others are the qualifier.
Initial testing allows selecting a small set of random record numbers to convert.
Converts Stack Exchange Markdown formats to GitHub Pages Kramdown Markdown format.
Creates hyperlinks to original Answer in Stack Exchange and Kramdown in GitHub Pages.
Creates search word to URL indices excluding 50% of words like "a", "the", etc. to save space.
Selectively inserts Table of Contents based on minimum criteria settings.
Selectively inserts Section Navigation Buttons for: Top (Top of Page), ToS (Top of Section), ToC (Table of Contents) and Skip (Skip section).
Selectively inserts "Copy Code Block to System Clipboard" button based on lines of code.
Creates HTML with "Top Ten Answers" with the most votes.
Creates powerful nested expandable/collapsible detail/summary HTML for many thousands of tags by post.
Remaps hyperlinks in Stack Exchange Posts to {{ site.title }} website posts if they were converted.
Fixes old broken #header Stack Exchange Markdown.
Converts > block quote Stack Exchange Markdown into what works in Jekyll Kramdown.
Converts Stack Exchange <!-- language --> tags to fenced code block language.
When no fenced code block language is provided, uses shebang language first (if available).
Converts older four-space indented code blocks to fenced code blocks.
Converts Stack Exchange Hyperlinks where the website post title is implied and not explicit.
Prints list of self-answered questions that were not accepted after the mandatory two day wait period.
Prints list of Rouge Syntax Highlighting languages not supported in fenced code blocks.
Prints summary totals when finished.
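As an illustration of the "uses shebang language first" step, a small sketch could look like the following. It is not the code from stack-to-blog.py. The function name and the interpreter table are assumptions, and the language names should be checked against what the syntax highlighter actually accepts.

```python
from typing import Optional

# Hypothetical mapping from interpreter name to fenced code block language.
_INTERPRETER_LANGUAGES = {
    "bash": "bash",
    "sh": "bash",
    "python": "python",
    "python3": "python",
    "perl": "perl",
    "ruby": "ruby",
}


def language_from_shebang(code: str) -> Optional[str]:
    """Guess the fence language from the first line of a code block."""
    first_line = code.split("\n", 1)[0].strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    program = parts[0].rsplit("/", 1)[-1]   # "/bin/bash" -> "bash"
    if program == "env" and len(parts) > 1:
        program = parts[1]                  # "#!/usr/bin/env python3"
    return _INTERPRETER_LANGUAGES.get(program)
```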
Full documentation is provided here.

Related

How to write diff code with syntax highlight in Github

Github supports syntax highlight as follows:
```javascript
let message = 'hello world!'
```
And it supports diff as follows: (but WITHOUT syntax highlight)
```diff
-let message = 'hello world!'
+let message = 'hello stackoverflow!'
```
How can I get both 'syntax highlight' AND 'diff'?
No, this is not a supported feature at this time.
GitHub documents their processing of lightweight markup languages (including Markdown, among others) in github/markup. Note step 3:
Syntax highlighting is performed on code blocks. See github/linguist for more information about syntax highlighting.
If we follow that link, we find a list of grammars that Linguist uses to provide syntax highlighting on GitHub. Linguist can only apply one of the grammars in that list to a block of code at a time. Of course, one of the grammars is Diff. However, that grammar knows nothing about the language of code being diffed, so you don't get syntax highlighting of that.
Of course, there are other languages which are often combined. For example, HTML is often included in a templating language. Therefore, in addition to the HTML grammar, we also find grammars for HTML+Django, HTML+ECR, HTML+EEX, HTML+ERB, and HTML+PHP. In each case, the single grammar is aware of two languages: the specific templating language and the HTML which is interspersed within the template.
To accomplish the same thing with a diff, you would need a separate "diff" grammar for every single language listed. In other words, the number of grammars would double. Of course, a way to avoid this might be to treat diff differently. When diff is specified, they could run the block through the syntax highlighter twice, once for diff and once for the source language. However, at least when processing code blocks in lightweight markup languages, they have not implemented such a feature.
And if they ever were to implement such a feature in the future, it would likely be more complicated than simply running the code block through twice. After all, every line of the diff has diff-specific content which would confuse the other language grammar. Therefore, every grammar would need to be diff aware, or each line would need to be fed to the grammar separately with the diff parts removed. The problem with the latter is that the grammar would not have the context of each line and is more likely to get things wrong. Whether such a solution is possible is outside the scope of this answer, but the point is that it is reasonable to expect that such a feature would be much lower priority to support due to the complexity involved.
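To make the "diff parts removed" idea concrete, here is a minimal Python sketch that separates each line's diff marker from the code. It only illustrates the preprocessing step, not anything GitHub or Linguist does, and the function name is made up. Note that the code halves come out as isolated lines, which is exactly the loss of context described above.

```python
def split_diff_lines(diff_text):
    """Return (marker, code) pairs for each line of a unified diff body."""
    pairs = []
    for line in diff_text.splitlines():
        if line.startswith(("--- ", "+++ ", "@@")):
            # File headers and hunk headers carry no source code.
            # (A removed line that itself begins with "-- " would be
            # misread here; a real tool would track hunk boundaries.)
            pairs.append(("", line))
        elif line[:1] in ("+", "-", " "):
            pairs.append((line[0], line[1:]))
        else:
            pairs.append(("", line))
    return pairs
```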
So why does GitHub do syntax highlighting in other places on its website? Because, in those cases, it has access to the two source files being diffed and it generates the diff itself. Each source is first highlighted (avoiding the complexity mentioned above), then the diff is created from the two highlighted source files. However, a diff included in a Markdown code block is already a diff when GitHub first sees it. There is no way for them to highlight the pre-diff code first. In other words, the process they currently use would not be transferable to supporting the requested feature.
You would need to post-process the output of the git diff in order to add syntax highlighting for the right language of the file being diff'ed.
But since you are asking for GitHub, that post-processing is not in your control, and is not provided by GitHub at the moment in its GFM (GitHub Flavored Markdown Spec).
It is supported for source files, in a regular diff like this one or in a PR: GitHub does the syntax highlighting of the two versions of the file, and then computes the diff.
It is not supported in a regular markdown fenced code block, where the +/- of a diff would throw off the syntax highlighting engine, considering there is no "diff" operation done there (just the writer adding the diff +/- symbols by hand).

How to escape symbols in GitHub-flavored markdown internal links / heading anchors?

Does anybody know how to maintain symbols in markdown internal links?
For example:
[A](#A) works fine
[A and B](#a-and-b) works fine
...whereas:
[A/B](#a-b) does not work
[A-B](#a-b) does not work
Thanks for your help!
I remember running into this problem too.
[A/B](#ab) should work, instead of using [A/B](#a-b).
The / character is simply dropped when the anchor is generated. When it stands on its own between spaces (ex. Movies / Shows / Videos), the space on each side still becomes a hyphen, so you end up with two hyphens in a row:
[Movies / Shows / Videos](#movies--shows--videos)
I'm not sure how [A-B](#a-b) isn't working for you, because it should work?
I recommend checking here, every now and then, for additional information being added to the conversation around Github Markdown Heading Anchors: https://gist.github.com/asabaylus/3071099
This is also known as github-slugging or GitHub-style slugging.
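A rough Python approximation of the slugging rule described here: lowercase the heading, drop punctuation other than hyphens and underscores, and turn each space into a hyphen. This is a sketch based on the examples in this thread, not GitHub's actual implementation, and it ignores details such as the numeric suffix added to duplicate headings.

```python
import re


def github_slug(heading):
    """Approximate the anchor GitHub generates for a heading."""
    slug = heading.strip().lower()
    # Keep word characters, hyphens and spaces; drop everything else.
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Every remaining space becomes a hyphen (so two spaces give "--").
    return slug.replace(" ", "-")
```

With this rule, `github_slug("A/B")` gives `ab` and `github_slug("Movies / Shows / Videos")` gives `movies--shows--videos`, matching the links above.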
After copying and pasting your code into my markdown editor, Mou, I see no issues with either statement. In fact, I copied the link as well, and it keeps the symbols you want.
Perhaps this is an issue with your version of markdown or your editor. If you are using a different flavor of markdown, like github, I'd be sure to specify that with tags as that may be your issue. Basic markdown should handle escaping characters though unless it's a bracket. If you want some helpful information, please visit this stack overflow thread: Escaping Brackets

Make backticks and links overlap work with GitHub Markdown

We are trying to implement an automatic markdown generator for an easily maintainable documentation.
When mentioning a variable's type, we would like to prefix it with ? when it is nullable, use backticks around it and add a link to its description. For example: `?[Article](#article)`.
However, the backticks break the link syntax because of the overlap. We use `?`[`Article`](#article) instead to make the link work, but it creates a space between ? and Article as follows: ?Article.
Is it possible to make it look like ?Article with a link on Article only?
I just tested this out and discovered that there is no space between ? and Article. What appears to be a space is simply GitHub's styling of two <code> blocks up against each other.
Wrapping the whole thing in backticks won't work because backticks indicate code, and Markdown treats the contents as if they are a code sample where you want to show the source.
The best workaround I can find is to use <code> tags directly:
<code>?[Article](https://stackoverflow.com/)</code>
On both GitHub and Stack Overflow this renders like so:
?Article
(I have used a link to Stack Overflow as the link target here simply so we get a rendered link as an example. I expect that #article will work equally well in your environment.)
In my opinion this is even a reasonable way of doing what you want. Markdown's backticks compile to <code> tags, and inline HTML code is expressly permitted by Markdown:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
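Since the documentation is generated automatically, the generator can emit this form directly. A minimal sketch in Python follows; the function name is hypothetical, and it assumes the anchor is simply the lowercased type name.

```python
def type_reference(type_name, nullable=False):
    """Build a <code>-wrapped, linked type name for generated Markdown."""
    anchor = type_name.lower()  # assumption: heading text equals the type name
    prefix = "?" if nullable else ""
    return f"<code>{prefix}[{type_name}](#{anchor})</code>"
```

For example, `type_reference("Article", nullable=True)` produces `<code>?[Article](#article)</code>`.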

Translating longer texts (view and email templates) with gettext

I'm developing a multilingual PHP web application, and I've got long(-ish) texts that I need to translate with gettext. These are email templates (usually short, but still several lines) and parts of view templates (longer descriptive blocks of text). These texts would include some simple HTML (things like bold/italic for emphasis, probably a link here or there). The templates are PHP scripts whose output is captured.
The problem is that gettext seems very clumsy for handling longer texts. Longer texts would generally have more changes over time than short texts — I can either change the msgid and make sure to update it in all translations (could be lots of work and very error-prone when the msgid is long), or I can keep the msgid unchanged and modify only the translations (which would leave misleading outdated texts in the templates). Also, I've seen advice against including HTML in gettext strings, but avoiding it would break a single natural piece of text into lots of chunks, which will be an even bigger nightmare to translate and reassemble, and I've also seen advice against unnecessary splitting of gettext strings into separate msgids.
The other approach I see is to ignore gettext altogether for these longer texts, and to separate those blocks in external subtemplates for each locale, and just include the one for the current locale. The disadvantage is that I'm separating the translation effort between gettext .po files and separate templates located in a completely different location.
Since this application will be used as a starting point for other applications in the future, I'm trying to come up with the best approach for the long term. I need some advice for best practices in such scenarios. How have you implemented similar cases? What turned out to work and what turned out a bad idea?
Here's the workflow I used, on a very heavily-trafficked site that had several dozen long-ish blocks of styled textual content, translated into six languages:
Pick a text-based markup language (we used Markdown)
For long strings, use fixed message IDs like "About_page_intro_markdown" that:
describes the intent of the text
makes clear that it will be interpreted in markdown format
Have our app render "*_markdown" strings appropriately, making sure to allow only a few safe HTML tags
Build a tool for translators that:
shows them their Markdown rendered in realtime (sort of like the Markdown dingus)
makes it easy for them to see the now-authoritative base language translation of the text (since that's no longer in the msgid)
Teach translators how to use the new workflow
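The rendering step for "*_markdown" strings might be sketched like this. It is written in Python purely for illustration and is not the code we ran. The catalog is a plain dict standing in for the gettext lookup, and the tiny renderer (bold, italic, paragraphs) stands in for a real Markdown library. As a simplification it escapes all raw HTML rather than allowing a few safe tags.

```python
import html
import re

# Stand-in for the translation catalog (normally a gettext lookup).
CATALOG = {
    "About_page_intro_markdown": "We build **reliable** tools.\n\nRead *more* below.",
    "OK": "OK",
}


def render_markdown_subset(text):
    """Render a tiny Markdown subset after escaping any raw HTML."""
    escaped = html.escape(text, quote=False)
    escaped = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", escaped)
    escaped = re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped)
    paragraphs = [p.strip() for p in escaped.split("\n\n") if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


def translate(msgid, catalog=CATALOG):
    """Look up a message; IDs ending in _markdown are rendered to HTML."""
    text = catalog.get(msgid, msgid)
    if msgid.endswith("_markdown"):
        return render_markdown_subset(text)
    return text
```

The suffix convention keeps the decision in one place: templates call the same lookup for every string and only the message ID says how it is rendered.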
Pros of this workflow:
Message IDs don't change all the time
Because translators are editing in a safe higher-level syntax, hard to mess up HTML
Non-technical translators found it very easy to write in Markdown, vs. HTML
Cons of this workflow:
Having static unchanging message IDs means changes in the text need to be transmitted out of band (which we'd do anyway, as long text can raise questions about tone or emphasis)
I'm very happy with the way this workflow operated for our website, and would absolutely recommend it, and use it again. It took a couple of days to get started, but it was easy to build, train, and launch.
Hope this helps, and good luck with your project.
I just had this particular problem, and I believe I solved it in an elegant way.
The problem: We wanted to use Gettext in PHP, and use primary language strings as translation keys. However, for large blocks of HTML (with h1, h2, p, a, etc...) I'd either have to:
Create a translation for each tag with content.
or
Put the entire block with tags in one translation.
Neither of those options appealed to me, so this is what I did:
Keep simple strings ("OK","Add","Confirm","My Awesome App") as regular Gettext .po entries, with the original text as the key
Write content (large text blocks) in markdown, and keep them in files.
Example files would be /homepage/content.md (primary / source text), /homepage/content.da-DK.md, /homepage/content.de-DE.md
Write a class that fetches the content files (for the current locale) and parses it. I then used it like:
<?=Template::getContent("homepage/content")?>
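The file lookup itself is simple. Here is a sketch of the idea, in Python rather than PHP, with a made-up function name. It assumes the naming scheme above and falls back to the source-language file when no translation exists.

```python
from pathlib import Path


def find_content_file(root, name, locale):
    """Return the localized content file, or the source file as a fallback."""
    root = Path(root)
    localized = root / f"{name}.{locale}.md"   # e.g. homepage/content.da-DK.md
    if localized.is_file():
        return localized
    return root / f"{name}.md"                 # primary / source text
```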
However, what about dynamic large text? Simple. Use a templating engine. I decided on Smarty, and used it in my Template class.
I could now use templating logic.. within markdown! How awesome is that?!
Then came the tricky part..
For content to look good, at times you need to structure your HTML differently. Consider a campaign area with 3 "feature boxes" beneath it. The easy solution: Have a file for the campaign area, and one for each of the 3 boxes.
But I could do better than that.
I wrote a quick block parser, so I would write all the content in one file, and then render each block separately.
Example file:
[block campaign]
Buy this now!
=============
Blaaaah... And a smarty tag: {$cool}
[/block]
[block feature 1]
Feature 1
---------
asdasd you get it..
[/block]
[block feature 2] ...
And this is how I would render them in the markup:
<?php
// At the top of the document...
// Class handles locale. :)
$template = Template::getContent("homepage/content", [
"cool" => "Smarty variable! AWESOME!"
]);
?>
...
<title><?=_("My Awesome App")?></title>
...
<div class="hero">
<!-- Template data already processed! :) -->
<?=$template->renderBlock("campaign")?>
</div>
<div class="featurebox">
<?=$template->renderBlock("feature 1")?>
</div>
<div class="featurebox">
<?=$template->renderBlock("feature 2")?>
</div>
I'm afraid I can't provide any source code, as this was for a company project, but I hope you get the idea.
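To give a rough idea of the block parser, though, here is a small sketch of the same technique in Python. It is not the original code. It only splits the file into named blocks and leaves the Smarty and Markdown rendering out; the function name is illustrative.

```python
import re

# "[block name]" on its own line, then everything up to "[/block]".
_BLOCK = re.compile(
    r"^\[block ([^\]]+)\]\n(.*?)^\[/block\]",
    re.DOTALL | re.MULTILINE,
)


def parse_blocks(text):
    """Split a content file into a dict mapping block name to raw body."""
    return {name.strip(): body.strip() for name, body in _BLOCK.findall(text)}
```

Each body can then be passed through the templating engine and the Markdown renderer when the block is requested.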
gettext wasn't really designed for translating large pieces of text.
fwiw I've included basic HTML (strong, a, etc) in gettext strings as I was confident our translators knew what they were doing (mostly right) and that the translations would be well tested.
I've tried the approach of breaking up the text into roughly one string per paragraph. Because it looks odd if there's one paragraph of English in the middle of translated text, whenever one of those strings changed we had to wait for translations before releasing a new version, which has slowed us down. On the plus side it's easy for translators to see which part of the text has changed. This approach worked well for the one application I've tried it with.
Splitting some text out into external locations also worked, but it caused management overhead: rather than just a .po file or two, there was a whole bunch of other text that had to be manually compared to the English version and updated accordingly. This is doable if you remember to provide notes to your translators explaining where and what the difference was in the English version.
I'm still not sold on either approach myself.

Best way to author man pages?

What's the best way to author man pages? Should I write using the standard man macros, or is there some clever package available now that takes some kind of XML-ified source and can output man pages, HTML, ASCII, and what not?
Thanks
I have previously used the GNU version of nroff called groff to write man pages.
Nice intro article on it here:
http://www.linuxjournal.com/article/1158
Doxygen is what you are looking for.
Keep in mind that it is designed to document source code but you could easily adapt it.
It can generate html, pdf, and latex documentation too.
If you are looking at writing once and generating different output formats such as manpages, HTML, plain txt, or even PDF, then docbook should work best.
A tool that is commonly used in the Tcl community is doctools which can produce a restricted (but useful) subset of the manpage format, suitable for rendering with groff or nroff. It can also generate both plain text and HTML directly.
For my atinout program I have been using ronn which lets you write man pages in a very, very readable markdown like syntax. I am extremely happy with it.
atinout(1) -- Send AT commands to modem, capturing the response
===============================================================
## SYNOPSIS
`atinout` <input_file>|`-` <modem_device> <output_file>|`-`<br>
`atinout` `--version`<br>
`atinout` `--usage`<br>
`atinout` `--help`<br>
## DESCRIPTION
**Atinout** reads a list of AT commands. It sends those commands one by one
to the modem, waiting for the final result code for the
currently running command before continuing with the next command in
the list. The output from the commands is saved.
...
see the whole page here.