Why does Github Flavored Markup only add newlines for lines that start with [\w\<]? - github

In our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two lines to make newlines actually work” quirk.
For that reason mainly, we are using a variant of Github Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?

'<' character is used at the beginning of a line to quote messages. I guess that is the reason.

The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
^ = anchor start of string.
[\w\<] matches a word character or the start of word boundary. \< is not a literal, but rather a GNU word boundary. See here (do a ctrl+f for \<).
[^\n]* matches any length of non-newline characters
\n matches a new line.
+ is, I believe, a possessive quantifier.
I believe, but am not 100% sure, that this simply is used to set x to a line of text. Then, the heavy work is done with the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
This says "if x satisfies the regex \n{2} (that is, has two line breaks), leave x as is. Otherwise, strip x and append a newline character.

Related

What counts as a newline for Raku *source* files?

I was somewhat surprised to observe that the following code
# comment 
say 1;
# comment 
say 2;
# comment say 3;
# comment say 4;
prints 1, 2, 3, and 4.
Here are the relevant characters after "# comment":
say "

".uninames.raku;
# OUTPUT: «("PARAGRAPH SEPARATOR", "LINE SEPARATOR", "<control-000B>", "<control-000C>").Seq»
Note that many/all of these characters are invisible in most fonts. At least with my editor, none cause the following text to be printed on a new line. And at least one (<control-000C>, aka Form Feed, sometimes printed as ^L) is in fairly wide use in Vim/Emacs as a section separator.
This raises a few questions:
Is this intentional, or a bug?
If intentional, what's the use-case (other than winning obfuscated code contests!)
Is it just these 4 characters, or are there others? (I found these because they share the mandatory break Unicode property. Does that property (or some other Unicode property?) govern what Raku considers as a newline?)
Just, really, wow.
(I realize #4 is not technically a question, but I feel it needed to be said).
Raku's syntax is defined as a Raku grammar. The rule for parsing such a comment is:
token comment:sym<#> {
'#' {} \N*
}
That is, it eats everything after the # that is not a newline character. As with all built-in character classes in Raku, \n and its negation are Unicode-aware. The language design docs state:
\n matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.
Which is a reference to the Unicode standard for regular expressions.
I somewhat doubt there was ever a specific language design discussion along the lines of "let's enable all the kinds of newlines in Unicode, it'll be cool!" Rather, the decisions were that Raku should follow the Unicode regex technical report, and that Raku syntax would be defined in terms of a Raku grammar and thus make use of the Unicode-aware character classes. That a range of different newline characters are supported is a consequence of consistently following those principles.

Is there a Unicode Character to indicate possible linebreak in a string

We are looking for a "BREAK NO-SPACE" character reverse to NO-BREAK SPACE. It should not print anything, just indicate the components down the line, the word can be split and linebroken at these positions.
Is there anything similar to this in Unicode or any other encoding scheme? It would make life easier since we could then rely on built-in methods for line split in our framework instead of introducing custom logic and some "Magic Character".
Soft hyphen U+00AD is invisible but indicates where a word should be broken.
So I found the Zero Width Space character 200B. The documentation describes exactly what I was looking for.

Perl regex presumably removing non ASCII characters

I found a code with regex where it is claimed that it strips the text of any non-ASCII characters.
The code is written in Perl and the part of code that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works and in order to do this I have used regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173, \377 mean separate characters as NULL, TAB, VERTICAL TAB ... But I still do not get why "-" symbols are used in the regex. Do they really mean "dash symbol" as shown in regexr or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex. The dash indicates a character range, like inside a regex character class [a-z].
The expression deletes some ASCII characters, too (mainly whitespace) and spares a range of characters which are not ASCII; the full ASCII range would simply be \000-\177.
To be explicit, the d flag says to delete any characters not between the first pair of slashes. See further the documentation.

Does Unicode have a special marker character?

My father created in mid 90's an encoding for his engineering purposes for his company's computers. It was close to ISO 8859-2 (Latin 2), but with some differences.
For example there was added a special "MARKER CHARACTER". This character wasn't determined to be a literal, but also it wasn't a control character.
The purpose of this character was to be inserted by machine when needed to split text into parts. See the following Python parser script:
re.sub(r'\{\{', r'~{{', text)
re.sub(r'\[\[', r'~[[', text)
re.sub(r'\]\]', r']]~', text)
re.sub(r'\}\}', r'}}~', text)
parts = text.strip('~').split('~')
inCurly = [False]
inSharp = [False]
whereAmI = ['']
for part in parts:
if part[:2] == '{{':
inCurly.append(True)
whereAmI.append('Curly')
elif part[:2] == '[[':
inSharp.append(True)
whereAmI.append('Sharp')
if whereAmI[-1] == 'Sharp' and not inCurly[-1]:
# some advanced magic on current part,
# if it is directly surrounded by sharp brackets,
# but these sharp brackets are not in curly brackets anyhow
# (not: "{{ (( [[ some text ]] )) }}")
# detecting closing brackets and popping inSharp, inCurly, whereAmI
# joining parts back to text
This is an easy parser for advanced purposes, you can detect more parenthesis or quotation marks as you want. But this have one huge fault. It break things when a ~ is in text.
For this purpose and similar purposes like this (but in C lang I think) he added to his encoding/character set that marker character.
For years I use for this purpose three german "sharp s": ßßß, because it is almost impossible to see three of them in a row. But this is not an ideal solution.
Yesterday my father told me this story and I immediatelly thought: is there some equivalent in an Unicode family? Unicode is a modern developing standard spreaded all over the world in past decade drastically. There should be a special character only for this particular purpose, or not?
I don't think there's anything called that specifically, but you might find zero-width space or information separator, among others, suitable for the purpose. You can arbitrarily select any character as your marker, and use an escape character if it occurs within the string.
In the control pictures block, there is a symbol for the group separator.

complete list of special characters (e.g \nquit)

I desperatly tried to find out what symbol '\nquit' is... and I couldnt find any reference in the web.
What I tried to find is a complete list of all of those characters (\n, \p, \0, ...) but I couldn't find any.
cheers usche
Wikipedia has a list of C language escapes here.
As noted in my comment, I believe this represents the newline (linefeed) character \n followed by the word quit (which would be forced by the newline to the beginning of the next line of output). But in that case the string should be "double"-quoted rather than 'single'-quoted.