JSON-LD: HTML code examples in the articleBody - schema.org

In the article, I have a couple of examples of an HTML code (text not images). Should I skipped them and include content without the code examples? I'm using TechArticle and the article content is placed in articleBody.

You can include the HTML examples in the articleBody property.
In JSON-LD values, the characters < and > have no special meaning, so the HTML won’t get interpreted, it’s just plain text.
Alternative
If you want to provide metadata about the code examples, or if they are more like attachments instead of inlined, you can use the hasPart property (on Article) to provide SoftwareSourceCode items (bold emphasis mine):
Computer programming source code. Example: Full (compile ready) solutions, code snippet samples, scripts, templates.
While HTML is not a programming language, I think it should be fine to use it for HTML, too. The programmingLanguage property expects a ComputerLanguage value, which says (bold emphasis mine):
This type covers computer programming languages such as Scheme and Lisp, as well as other language-like computer representations.

Related

ICU Message Format and Interaction / Links

In my application I use the ICU message format to localize user-visible strings for different languages. A relevant (contrived) example would be:
(EN) Click on this link to find out more.
(DE) Dieser Link führt zu weiteren Informationen.
The issue I ran into is interactivity and styling for this sentence. I want the bold fragments to be clickable links in the sentence and style them differently.
The go-to solution for such cases would be to separate the bold words and make them a separate ICU message and handle this message separately in the app (to apply interactivity and styling), perhaps like a button. The problem is how this should be implemented in the context of a sentence, since the language prescribes different number and order of sentence-fragments:
(EN) (Click on) (this link) (to find out more)
(DE) (Dieser Link) (führt zu weiteren Informationen)
I could of course feed the string of the link as a parameter into an ICU message containing the surrounding sentence to obtain the sentence as one string, but then I can't apply the link-action or styling in-app, since I end up with one sentence.
What is a solution to this problem? For context: The app is written in flutter targeting mobile.

Markdown metadata format

Is there a standard or convention for embedding metadata in a Markdown formatted post, such as the publication date or post author for conditional rendering by the renderer?
Looks like this Yaml metadata format might be it.
There are all kinds of strategies, e.g. an accompanying file mypost.meta.edn, but I'm hoping to keep it all in one file.
There are two common formats that look very similar but are actually different in some very specific ways. And a third which is very different.
YAML Front Matter
The Jekyll static site generator popularized YAML front matter which is deliminated by YAML section markers. Yes, the dashes are actually part of the YAML syntax. And the metadata is defined using any valid YAML syntax. Here is an example from the Jekyll docs:
---
layout: post
title: Blogging Like a Hacker
---
Note that YAML front matter is not parsed by the Markdown parser, but is removed prior to parsing by Jekyll (or whatever tool you're using) and could actually be used to request a different parser than the default Markdown parser for that page (I don't recall if Jekyll does that, but I have seen some tools which do).
MultiMarkdown Metadata
The older and simpler MultiMarkdown Metadata is actually incorporated into a few Markdown parsers. While it has more recently been updated to optionally support YAML deliminators, traditionally, the metadata ends and the Markdown document begins upon the first blank line (if the first line was blank, then no metadata). And while the syntax looks very similar to YAML, only key-value pairs are supported with no implied types. Here is an example from the MultiMarkdown docs:
Title: A Sample MultiMarkdown Document
Author: Fletcher T. Penney
Date: February 9, 2011
Comment: This is a comment intended to demonstrate
metadata that spans multiple lines, yet
is treated as a single value.
CSS: http://example.com/standard.css
The MultiMarkdown parser includes a bunch of additional options which are unique to that parser, but the key-value metadata is used across multiple parsers. Unfortunately, I have never seen any two which behaved exactly the same. Without the Markdown rules defining such a format everyone has done their own slightly different interpretation resulting in a lot of variety.
The one thing that is more common is the support for YAML deliminators and basic key-value definitions.
Pandoc Title Block
For completeness there is also the Pandoc Title Block. If has a very different syntax and is not easily confused with the other two. To my knowledge, it is only supported by Pandoc (if enabled), and it only supports three types of data: title, author, and date. Here is an example from the Pandoc documentation:
% title
% author(s) (separated by semicolons)
% date
Note that Pandoc Title Blocks are one of two style supported by Pandoc. Pandoc also supports YAML Metadata as described above.
A workaround use standard syntax and compatible with all other viewers.
I was also looking for a way to add application specific metadata to markdown files while make sure the existing viewers such as vscode and github page will ignore added metadata. Also to use extended markdown syntax is not a good idea because I want to make sure my files can be rendered correctly on different viewers.
So here is my solution: at beginning of markdown file, use following syntax to add metadata:
[_metadata_:author]:- "daveying"
[_metadata_:tags]:- "markdown metadata"
This is the standard syntax for link references, and they will not be rendered while your application can extract these data out.
The - after : is just a placeholder for url, I don't use url as value because you cannot have space in urls, but I have scenarios require array values.
Most Markdown renderers seem to support this YAML format for metadata at the top of the file:
---
layout: post
published-on: 1 January 2000
title: Blogging Like a Boss
---
Content goes here.
The most consistent form of metadata that I've found for Markdown is actually HTML meta tags, since most Markdown interpreters recognize HTML tags and will not render meta tags, meaning that metadata can be stored in a way that will not show up in rendered HTML.
<title>Hello World</title>
<meta name="description" content="The quick brown fox jumped over the lazy dog.">
<meta name="author" content="John Smith">
## Heading
Markdown content begins here
You can try this in something like GitHub Gist or StackEdit.
Correct.
Use the yaml front matter key-value syntax — like MultiMarkdown supports — but (ab)use the official markdown URL syntax to add your metadata.
… my workaround looks like this:
---
[//]: # (Title: My Awesome Title)
[//]: # (Author: Alan Smithee)
[//]: # (Date: 2018-04-27)
[//]: # (Comment: This is my awesome comment. Oh yah.)
[//]: # (Tags: #foo, #bar)
[//]: # (CSS: https://path-to-css)
---
Put this block at the top of your .md doc, with no blank line between the top of the doc and the first ---.
Your fake yaml won't be included when you render to HTML, etc. … it only appears in the .md.
You can also use this technique for adding comments in the body of a markdown doc.
This is not a standard way, but works with Markdown Extra.
I wanted something that worked in the parser, but also didn't leave any clutter when I browse the files on Bitbucket where I store the files.
So I use Abbreviations from the Markdown Extra syntax.
*[blog-date]: 2018-04-27
*[blog-tags]: foo,bar
then I parse them with regexp:
^\*\[blog-date\]:\s*(.+)\s*$
As long as I don't write the exact keywords in the text, they leave no trace. So use some prefix obscure enough to hide them.
I haven't seen this mentioned elsewhere here or in various blogs discussing the subject, but in a project for my personal website, I've decided to use a simple JSON object at the top of each markdown file to store metadata. It's a little more cumbersome to type compared to some of the more textual formats above, but it's super easy to parse. Basically I just do a regex such as ^\s*({.*?})\s*(.*)$ (with the s option on to treat . as \n) to capture the json and markdown content, then parse the json with the language's standard method. It allows pretty easily for arbitrary meta fields.

Unicode alternative for <wbr> tag

I am looking for Unicode solution which will be an alternative to <wbr> tag in following regards:
Allowing line breaks at given position.
Browser can find string in the page even when 'broken' apart.
When copied from page, these characters will not be transfered.
This is exactly what <wbr> does in modern browsers. ­, ­ and ​ does not seem to conform.
Here is a http://jsfiddle.net/qY6mp/7/ to demonstrate.
I will be thankful for any pointers.
The Unicode counterpart of <wbr> is U+200B ZERO WIDTH SPACE, representable in HTML as ​ if you don’t want or can’t use it as such. It’s not clear from the question why you think it does not conform. The main problem with it as questionable support in IE 6, but this is not a conformance issue.
According to Unicode line breaking rules, U+200B “is used to enable additional (invisible) break opportunities wherever SPACE cannot be used”. HTML specifications do not require conformance to the Unicode standard, in this issue or otherwise, but modern browsers generally implement U+200B this way.
What happens when text is copied from an HTML document is outside the scope of specifications. The same applies to requirement 2 in the question. Since generally copy and paste copies characters, including zero width characters, and search functionality operates on characters, requirements 2 and 3 are really asking for a character that does not behave like character.
Note that hyphenation is completely different issue.

Superscript in markdown (Github flavored)?

Following this lead, I tried this in a Github README.md:
<span style="vertical-align: baseline; position: relative;top: -0.5em;>text in superscript</span>
Does not work, the text appears as normal. Help?
Use the <sup></sup>tag (<sub></sub> is the equivalent for subscripts). See this gist for an example.
You have a few options for this. The answer depends on exactly what you're trying to do, how readable you want the content to be when viewed as Markdown and where your content will be rendered:
HTML Tags
As others have said, <sup> and <sub> tags work well for arbitrary text. Embedding HTML in a Markdown document like this is well supported so this approach should work with most tools that render Markdown.
Personally, I find HTML impairs the readable of Markdown somewhat, when working with it "bare" (eg. in a text editor) but small tags like this aren't too bad.
LaTeX (New!)
As of May 2022, GitHub supports embedding LaTeX expressions in Markdown docs directly. This gives us new way to render arbitrary text as superscript or subscript in GitHub flavoured Markdown, and it works quite well.
LaTeX expressions are delineated by $$ for blocks or $ for inline expressions. In LaTeX you indicate superscript with the ^ and subscript with _. Curly braces ({ and }) can be used to group characters. You also need to escape spaces with a backslash. The GitHub implementation uses MathJax so see their docs for what else is possible.
You can use super or subscript for mathematical expressions that require it, eg:
$$e^{-\frac{t}{RC}}$$
Which renders as..
Or render arbitrary text as super or subscript inline, eg:
And so it was indeed: she was now only $_{ten\ inches\ high}$, and her face brightened up at the thought that she was now the right size for going through the little door into that lovely garden.
Which renders as..
I've put a few other examples here in a Gist.
Unicode
If the superscript (or subscript) you need is of a mathematical nature, Unicode may well have you covered.
I've compiled a list of all the Unicode super and subscript characters I could identify in this gist. Some of the more common/useful ones are:
⁰ SUPERSCRIPT ZERO (U+2070)
¹ SUPERSCRIPT ONE (U+00B9)
² SUPERSCRIPT TWO (U+00B2)
³ SUPERSCRIPT THREE (U+00B3)
ⁿ SUPERSCRIPT LATIN SMALL LETTER N (U+207F)
People also often reach for <sup> and <sub> tags in an attempt to render specific symbols like these:
™ TRADE MARK SIGN (U+2122)
® REGISTERED SIGN (U+00AE)
℠ SERVICE MARK (U+2120)
Assuming your editor supports Unicode, you can copy and paste the characters above directly into your document or find them in your systems emoji and symbols picker.
On MacOS, simultaneously press the Command ⌘ + Control + Space keys to open the emoji picker. You can browse or search, or click the small icon in the top right to open the more advanced Character Viewer.
On Windows, you can a emoji and symbol picker by pressing ⊞ Windows + ..
Alternatively, if you're putting these characters in an HTML document, you could use the hex values above in an HTML character escape. Eg, ² instead of ². This works with GitHub (and should work anywhere else your Markdown is rendered to HTML) but is less readable when presented as raw text.
Images
If your requirements are especially unusual, you can always just inline an image. The GitHub supported syntax is:
![Alt text goes here, if you'd like](path/to/image.png)
You can use a full path (eg. starting with https:// or http://) but it's often easier to use a relative path, which will load the image from the repo, relative to the Markdown document.
If you happen to know LaTeX (or want to learn it) you could do just about any text manipulation imaginable and render it to an image. Sites like Quicklatex make this quite easy. Of course, if you know your document will be rendered on GitHub, you can use the new (2022) embedded LaTeX syntax discussed earlier)
Comments about previous answers
The universal solution is using the HTML tag <sup>, as suggested in the main answer.
However, the idea behind Markdown is precisely to avoid the use of such tags:
The document should look nice as plain text, not only when rendered.
Another answer proposes using Unicode characters, which makes the document look nice as a plain text document but could reduce compatibility.
Finally, I would like to remember the simplest solution for some documents: the character ^.
Some Markdown implementation (e.g. MacDown in macOS) interprets the caret as an instruction for superscript.
Ex.
Sin^2 + Cos^2 = 1
Clearly, Stack Overflow does not interpret the caret as a superscript instruction. However, the text is comprehensible, and this is what really matters when using Markdown.
If you only need superscript numbers, you can use pure Unicode. It provides all numbers plus several additional characters as superscripts:
x⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱ
However, it might be that the chosen font does not support them, so be sure to check the rendered output.
In fact, there are even quite a few superscript letters, however, their intended use might not be for superscript, and font support might be even worse. Use your own judgement.

Website localization for multibyte languages

I have started to code a multi-language feature for a medium-sized website with a lot of hardcoded text. As the website is supposed to be translated into Japanese and Korean (multibyte character set) I am considering the following:
If I use string externalization, do the strings for Japanese or Korean need to be in unicode form within the locale file (i.e. 台北 instead of 台北 as string value)?
Would it make more sense to store the localization in a DB (i.e. MySQL) and retrieve the respective values via a localization function in PHP?
Your thought input is much appreciated.
Best regards
$0.02 from someone who has some experience with i18n...
Keep your translations in human-readable form, as it will likely be translators and not coders managing these resources.
If this text (hard-coded, you say) is not subject to frequent change, then you may wish to store these resources as files that you read in at runtime.
If this text is subject to frequent change, then you may wish to explore other alternatives for storing resources, such as databases or in-memory key-value stores.
Depending upon your requirements, you may want to consider a mixture of the above.
But I strongly suggest that you avoid mixing code (the HTML character entities) with your translation resources. Most translators will not understand what they mean and may break them when they are translating. And on the flip-side, a programmer may not understand how to insert code or formatting into the translation resources properly, unless they actually understand that language.
tl;dr
- use UTF-8
- don't mix any code/formatting into the translations themselves
- how you store the translations depends upon your requirements
I doubt that string externalization would be your biggest problem. But let me give you some advise.
String externalization
Of course you would need to separate translatable strings from the code. I would recommend storing translation in plain text, UTF-8 encoded file containing key-value pairs:
some.key=some translation
Of course you would need to write a helper script to resolve this at runtime. The script would need to detect end-user's language.
Language detection
Web browsers are so nice to send AcceptLanguage header each time they send a request. What you need to do, is to read the content of this header and check if you support any of the language user has listed. If so, read the resource file (as defined above) and return strings for given language, return your default language otherwise. The code example below will give you the most desired language (which is not necessary the one you support):
<?php
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
echo $locale;
?>
This is still, not the biggest of your challenges.
Styles and style sheets
The real problem with multilingual web sites or web applications are styles. People tend to put style definitions in-line, which is problematic to say the least. Also, designers tend to think that Arial is the best font for entire Universe, as well as emphasis always have to come with bolded font. The only problem is, the font might be unreadable under some circumstances.
I must admit, I don't know why it happens, but most of the times web browsers tend to ignore bold attribute for Asian scripts (which is good), but sometimes they do not and it could became a major challenge for end users if your font definition is say font-family:Arial; font-size:10px;.
The other problem could be colors. Depending on your web site design, some colors used might be inappropriate for target customers. That is because we all tend to assign meaning to colors based on our cultural background.
Images containing localizable text could also give you a headache, you would need to either externalize such texts (and write them down just like any other HTML element), or prepare multilingual resources structure (i.e. put all images to directories named after language code ("en", "ja", "ko")).
The real challenge however, are hard-coded formatting tags like <b>, <i>, <u>, <strong>, etc. Nobody should use them nowadays, style classes should be used instead but the common practice is different. You would probably need to replace them with style classes; each element could have more than one style class, which to my surprise is not common knowledge (for example <p class="main boldText">).
OK, once you have your styles externalized, you would probably be forced to implement some sort of CSS Localization Mechanism. This is needed in the lights of what I wrote above. The easiest way to do that is to create directory structure similar to the one I mentioned before - "en" for English base CSS files, "ja" for Japanese and "ko" for Korean, so each language would have their own, separate set of CSS files. This is similar to UI skins, only in that case user won't be able to choose the skin, you will decide on which CSS to present them - you would detect language anyway.
As for in-line style definitions (<p style="whatever">), after you define CSS L10n Mechanism, you could override any style by forcing it with !important keyword. That is, unless somebody in his very wrong mind put this keyword to in-line style definition.
Concatenations
Well, this is your biggest challenge. Even people who understand the need of string externalization tend to concatenate the strings like this:
$result = $label + ": " + $product;
$message = "$your_basket_is + $basket_status + ".";
This poses serious problem for Internationalization (and if it is not resolved for Localization as well). That is because, the order of the sentence tend to be different after translating text into different language (this especially regards to Korean). Also, I showed you hard-coded punctuations, which are not necessary correct for Asian languages. That is what I have to go through on a daily basis :/
What you would probably need to do, is to remove such concatenations, or use some means of message formatting. The PHP example (taken directly from web page I am referencing) would be:
<?php
$fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
echo $fmt->format(array(4560, 123, 4560/123));
$fmt = new MessageFormatter("de", "{0,number,integer} Affen auf {1,number,integer} Bäumen sind {2,number} Affen pro Baum");
echo $fmt->format(array(4560, 123, 4560/123));
?>
As you can see in this example, numbers are also formatted to much locale style. This leads us to:
Locale aware formatting
Dates, times, numbers and currencies or other similar information need to be formatted according to user-detected Locale. There is a slight difference here: you should attempt to do that, even if you do not support related language resources (do not have translations). Of course for currency symbol, you would use whatever is your real currency, not the user's default, but the format should respect end user's cultural background.
Summary
I have just presented you with a short introduction to multilingual web site design with focus on Japanese and Korean target markets. If at some point you would need to support Chinese Simplified as well, support for GB18030 encoding would be probably needed as well. This would be very challenging...
You do not want to store all your text as HTML entities. It'll drive you mad. The only reason to do this is if you need to serve your document in an ASCII encoding and cannot embed the characters directly. But in this day and age there's no reason for that; serve your document as UTF-8 and write and store your contents in UTF-8 and be done with it.
Whether or not to store translations in the database depends on many factors, including performance, caching, whether you need to be able to search for the text, whether the text should be editable by non-programmers etc. Usually .mo/.po translation files with gettext are a good way to go unless proven otherwise.