What counts as a newline for Raku *source* files?

I was somewhat surprised to observe that the following code
# comment 
say 1;
# comment 
say 2;
# comment say 3;
# comment say 4;
prints 1, 2, 3, and 4.
Here are the relevant characters after "# comment":
say "

".uninames.raku;
# OUTPUT: «("PARAGRAPH SEPARATOR", "LINE SEPARATOR", "<control-000B>", "<control-000C>").Seq»
Note that many/all of these characters are invisible in most fonts. At least with my editor, none cause the following text to be printed on a new line. And at least one (<control-000C>, aka Form Feed, sometimes printed as ^L) is in fairly wide use in Vim/Emacs as a section separator.
This raises a few questions:
1. Is this intentional, or a bug?
2. If intentional, what's the use case (other than winning obfuscated code contests)?
3. Is it just these 4 characters, or are there others? (I found these because they share the mandatory break Unicode property. Does that property, or some other Unicode property, govern what Raku considers a newline?)
4. Just, really, wow.
(I realize #4 is not technically a question, but I feel it needed to be said).

Raku's syntax is defined as a Raku grammar. The rule for parsing such a comment is:
token comment:sym<#> {
'#' {} \N*
}
That is, it eats everything after the # that is not a newline character. As with all built-in character classes in Raku, \n and its negation are Unicode-aware. The language design docs state:
\n matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.
Which is a reference to the Unicode standard for regular expressions.
I somewhat doubt there was ever a specific language design discussion along the lines of "let's enable all the kinds of newlines in Unicode, it'll be cool!" Rather, the decisions were that Raku should follow the Unicode regex technical report, and that Raku syntax would be defined in terms of a Raku grammar and thus make use of the Unicode-aware character classes. That a range of different newline characters are supported is a consequence of consistently following those principles.
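To make the effect of the comment rule concrete, here is a minimal sketch in Python rather than Raku. It assumes that the logical newlines are the single characters listed in UTS #18 section 1.6 (LF, VT, FF, CR, NEL, LS, PS); the names LOGICAL_NEWLINES, COMMENT and strip_comments are made up for this illustration and are not part of any Raku implementation. It also ignores the case of a # inside a string literal.

```python
import re

# Assumption: the logical newline characters from UTS #18 section 1.6.
LOGICAL_NEWLINES = "\n\x0b\x0c\r\x85\u2028\u2029"

# Rough analogue of the grammar rule: '#' followed by a run of non-newlines.
COMMENT = re.compile("#[^" + LOGICAL_NEWLINES + "]*")

def strip_comments(source: str) -> str:
    """Remove everything from '#' up to (not including) the next logical newline."""
    return COMMENT.sub("", source)

# PARAGRAPH SEPARATOR and FORM FEED both end the comment,
# so the say statements after them survive.
source = "# comment\u2029say 1;\n# comment\x0csay 2;"
stripped = strip_comments(source)
print(repr(stripped))
```

With a comment pattern that only stopped at \n, both say statements on these lines would have been swallowed by the comments instead.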

Related

Do information separators constitute line-breaks in Unicode?

This Wikipedia article which lists all Unicode whitespaces mentions 7 of them as line/paragraph separating characters (LF, VT, FF, CR, NEL, LS, PS). Here there is nothing given about ASCII 'information separator' characters (FS, GS, RS, US). But surprisingly FS, GS, RS have 'paragraph separator (B)' as their bidirectional class. This is confusing.
Now, when I encounter one of these 'information separator' characters in a text, should I consider them as line-break or not? In other words, if I am writing a function which splits at line breaks, then should I split at these three characters? (string.splitlines() function in Python does consider them as line breaks. I don't know about other implementations.)
For example:
Both in the linked Wikipedia table and in the Unicode bidi class database, LF is considered as line-break. So I can break line when I encounter that character.
Both in the linked Wikipedia table and in the Unicode bidi class database, SP is not considered as line-break. So I can't break a line when I encounter that character. (suppose no word-wrap).
The linked Wikipedia table does not mention GS as a line-break. But the Unicode bidi class database does mention it as line-break. I'm confused: what should I do in this case? What does bidi class refer to in this case?
Here I'm only asking about the Unicode standard. But if you know, you can also mention how line breaks are handled in the ASCII standard.
PS: I'm not sure whether the table in the linked Wikipedia page is correct. But I wasn't able to find any other good resource which lists all whitespaces.
FS, GS, RS, and US belong to the line break class Combining_Mark (CM). The relevant file in the Unicode Character Database for this information is LineBreak.txt.
UAX #14 (Unicode Line Breaking Algorithm) describes class CM as follows:
Combining character sequences are treated as units for the purpose of
line breaking. The line breaking behavior of the sequence is that of
the base character.
In other words: Class CM characters prohibit line breaks before them – they essentially “glue” themselves to the previous character. However, for all other purposes, the line breaking algorithm completely ignores the presence of class CM characters. Whether or not a line break opportunity exists after a class CM character depends solely* on the line break class of the base character it has been applied to, i.e. the first character going backwards that is not of class CM.
*There are some exceptions to this rule involving mandatory breaks and a few special formatting characters, but they shouldn’t be relevant for your purposes.
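The mismatch described in the question can be inspected directly. The following is a small sketch using only Python's standard library; the CANDIDATES table and the describe helper are names invented for this example. It reports the bidirectional class of each character next to whether str.splitlines() splits on it. Note that the bidi class comes from a different part of the Unicode Character Database than the LineBreak.txt classes discussed above.

```python
import unicodedata

# Characters mentioned in the question, keyed by their usual abbreviations.
CANDIDATES = {
    "LF": "\n",
    "FS": "\x1c",
    "GS": "\x1d",
    "RS": "\x1e",
    "US": "\x1f",
    "SP": " ",
}

def describe(ch):
    """Return (bidi class, whether str.splitlines() splits on ch)."""
    bidi = unicodedata.bidirectional(ch)
    splits = len(("a" + ch + "b").splitlines()) == 2
    return bidi, splits

report = {name: describe(ch) for name, ch in CANDIDATES.items()}
for name, (bidi, splits) in report.items():
    print(name, bidi, splits)
```

In this sketch, FS, GS and RS come out with bidi class B and are split on by splitlines(), while US and SP are not, which matches the observation in the question about Python's behaviour.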

Do Unicode's line breaking rules require the last character to be a mandatory break?

I'm trying to use libunibreak (https://github.com/adah1972/libunibreak) to mark the possible line breaks in some given unicode text.
Libunibreak gives back four possible options for each code unit in some text:
LINEBREAK_MUSTBREAK
LINEBREAK_ALLOWBREAK
LINEBREAK_NOBREAK
LINEBREAK_INSIDEACHAR
Hopefully these are self-explanatory. I would expect that MUSTBREAK corresponds to newline characters like LF. However, for any given text, libunibreak always indicates that the last character is MUSTBREAK.
So for example with the string "abc", the output would be [NOBREAK,NOBREAK,MUSTBREAK]. For "abc\n" the output would be [NOBREAK,NOBREAK,NOBREAK,MUSTBREAK]. I use the MUSTBREAK attribute to start a new line when drawing text so the first case ("abc") creates an extra linebreak that shouldn't be there.
Is this behaviour what Unicode specifies or is this a quirk of the library implementation I'm using?
Yes, this is what the Unicode line breaking algorithm specifies. Rule LB3 in UAX #14: Unicode Line Breaking Algorithm, section 6.1 "Non-tailorable Line Breaking Rules" says:
Always break at the end of text.
The spec further explains:
[This rule is] designed to deal with degenerate cases, so that there is [...] at least one line break for the whole text.
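One way to live with rule LB3 when drawing text is to treat a mandatory break as closing the current line rather than as opening a new one. The sketch below shows that idea in Python; the constants are hypothetical stand-ins for libunibreak's values, the function name is invented, and the break list is assumed to have one entry per character (so LINEBREAK_INSIDEACHAR does not occur).

```python
# Hypothetical stand-ins for the libunibreak constants named in the question.
MUSTBREAK, ALLOWBREAK, NOBREAK = 0, 1, 2

def split_on_mandatory_breaks(text, breaks):
    """Split text into lines, where each MUSTBREAK ends the line it belongs to."""
    lines = []
    start = 0
    for i, kind in enumerate(breaks):
        if kind == MUSTBREAK:
            lines.append(text[start:i + 1])
            start = i + 1
    # Only reached with leftover text if the final entry was not MUSTBREAK.
    if start < len(text):
        lines.append(text[start:])
    return lines

print(split_on_mandatory_breaks("abc", [NOBREAK, NOBREAK, MUSTBREAK]))
print(split_on_mandatory_breaks("ab\ncd", [NOBREAK, NOBREAK, MUSTBREAK, NOBREAK, MUSTBREAK]))
```

Because the end-of-text break only closes the last line, "abc" yields a single line and no extra empty line is started after it.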

Why does Github Flavored Markup only add newlines for lines that start with [\w\<]?

In our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two lines to make newlines actually work” quirk.
For that reason mainly, we are using a variant of Github Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
  x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?
The '<' character is used at the beginning of a line to quote messages. I guess that is the reason.
The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
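^ anchors the match at the start of a line (in Ruby, ^ matches at the start of every line, not only at the start of the string).
[\w\<] matches a single word character or a literal <. Inside a character class, \< is just an escaped <, not a GNU-style word boundary.
[^\n]* matches any number of non-newline characters.
\n matches a newline.
+ is a plain "one or more" quantifier applied to the preceding \n, so any run of trailing newlines is included in the match.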
I believe, but am not 100% sure, that this simply is used to set x to a line of text. Then, the heavy work is done with the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
This says: "if x matches the regex \n{2} (that is, contains two consecutive line breaks), leave x as is. Otherwise, strip x and append two spaces and a newline character." The two trailing spaces are what Markdown turns into a <br /> tag.
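For experimenting with which lines do and do not get a hard break, here is a rough Python port of the Ruby snippet. It is a sketch, not the GFM source: the names LINE and add_hard_breaks are invented, and Python's \w may match more characters than Ruby's does.

```python
import re

# Same shape as the Ruby pattern; MULTILINE makes ^ match at every line start.
LINE = re.compile(r"^[\w<][^\n]*\n+", re.MULTILINE)

def add_hard_breaks(text):
    """Append two spaces before single newlines on lines starting with \\w or '<'."""
    def fix(match):
        chunk = match.group(0)
        if "\n\n" in chunk:          # already a paragraph break: leave untouched
            return chunk
        return chunk.strip() + "  \n"
    return LINE.sub(fix, text)

print(repr(add_hard_breaks("one\ntwo\n\nthree\n")))
print(repr(add_hard_breaks("- item\nnext\n")))
```

The second call shows the behaviour asked about in the question: the line starting with a hyphen is skipped, while the following line that starts with a word character gets the two trailing spaces.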

What is the syntax for new line in Objective-C?

Can anyone tell me what is the symbol used for new line?
In the C language we use '\n' for new line. What do we use in Objective-C?
Is it the same?
Objective-C is an extension of C. So '\n' works too in Objective-C.
It's the same (\n), but there's a lot more to the topic depending on whether it's just a new line or a new paragraph, what context the text will be processed in, etc. From the documentation (referencing the Cocoa docs here because they cover both Objective-C [implicitly] and Cocoa, since you have the iphone tag on your question):
There are a number of ways in which a line or paragraph break may be represented. Historically \n, \r, and \r\n have been used. Unicode defines an unambiguous paragraph separator, U+2029 (for which Cocoa provides the constant NSParagraphSeparatorCharacter), and an unambiguous line separator, U+2028 (for which Cocoa provides the constant NSLineSeparatorCharacter).
In the Cocoa text system, the NSParagraphSeparatorCharacter is treated consistently as a paragraph break, and NSLineSeparatorCharacter is treated consistently as a line break that is not a paragraph break—that is, a line break within a paragraph. However, in other contexts, there are few guarantees as to how these characters will be treated. POSIX-level software, for example, often recognizes only \n as a break. Some older Macintosh software recognizes only \r, and some Windows software recognizes only \r\n. Often there is no distinction between line and paragraph breaks.
Which line or paragraph break character you should use depends on how your data may be used and on what platforms. The Cocoa text system recognizes \n, \r, or \r\n all as paragraph breaks—equivalent to NSParagraphSeparatorCharacter. When it inserts paragraph breaks, for example with insertNewline:, it uses \n. Ordinarily NSLineSeparatorCharacter is used only for breaks that are specifically line breaks and not paragraph breaks, for example in insertLineBreak:, or for representing HTML <br> elements.
If your breaks are specifically intended as line breaks and not paragraph breaks, then you should typically use NSLineSeparatorCharacter. Otherwise, you may use \n, \r, or \r\n depending on what other software is likely to process your text. The default choice for Cocoa is usually \n.
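The distinction the documentation draws can be sketched outside of Cocoa as well. The following Python example is only an illustration of the rule quoted above, not Cocoa API: it treats \r\n, \n, \r and U+2029 as paragraph breaks and leaves U+2028 alone as a line break inside a paragraph. The names PARAGRAPH_BREAK and split_paragraphs are made up for this example.

```python
import re

LINE_SEPARATOR = "\u2028"        # line break within a paragraph
PARAGRAPH_SEPARATOR = "\u2029"   # unambiguous paragraph break

# \r\n is listed first so that it is consumed as a single break.
PARAGRAPH_BREAK = re.compile("\r\n|[\n\r" + PARAGRAPH_SEPARATOR + "]")

def split_paragraphs(text):
    """Split on paragraph breaks only; U+2028 stays inside its paragraph."""
    return PARAGRAPH_BREAK.split(text)

sample = "a" + LINE_SEPARATOR + "b\r\nc\rd" + PARAGRAPH_SEPARATOR + "e"
print(split_paragraphs(sample))
```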
It's the same, but if you are printing to the console, you should use
NSLog(@"This is a console statement\n on two different lines");
Hope this helps.
It's the same, dude. Objective-C is a superset of C, so most things from C will work in Objective-C too.
It's the same. In Objective-C, "\n" is used for a new line.
Straight from the dragon's mouth.
http://developer.apple.com/library/mac/#documentation/cocoa/conceptual/Strings/Articles/stringsParagraphBreaks.html

What characters are allowed in Perl identifiers?

I'm working on regular expressions homework where one question is:
Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.
I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.
EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".
Perl Integer Constants
Integer constants in Perl can be
in base 16 if they start with ^0x
in base 2 if they start with ^0b
in base 8 if they start with 0
otherwise they are in base 10.
Following that leader is any number of valid digits in that base and also optional underscores.
Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
Please note that any leading minus sign is not part of the integer constant, which is easily proven by:
$ perl -MO=Concise,-exec -le '$x = -3**$y'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <$> const(IV 3) s
4 <$> gvsv(*y) s
5 <2> pow[t1] sK/2
6 <1> negate[t2] sK/1
7 <$> gvsv(*x) s
8 <2> sassign vKS/2
9 <#> leave[1 ref] vKP/REFC
-e syntax OK
See the 3 const, and much later on the negate op-code? That tells you a bunch, including a curiosity of precedence.
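Since the homework asks for a regular expression, the integer constant rules above can be written down roughly as follows. This is a simplified sketch in Python: it assumes ASCII digits only, lowercase 0x and 0b prefixes, and it does not try to police where underscores may appear. PERL_INT and is_perl_int are names invented for the example.

```python
import re

PERL_INT = re.compile(r"""
    0x[0-9a-fA-F_]+     # base 16
  | 0b[01_]+            # base 2
  | 0[0-7_]*            # base 8 (this branch also covers a bare 0)
  | [1-9][0-9_]*        # base 10
""", re.VERBOSE)

def is_perl_int(token):
    """True if the whole token looks like an integer constant under the rules above."""
    return PERL_INT.fullmatch(token) is not None

for token in ["0x1F", "0b1010", "0755", "1_000_000", "-3", "08"]:
    print(token, is_perl_int(token))
```

The leading minus sign is deliberately not part of the pattern, in line with the op-code listing above.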
Perl Identifiers
Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.
For example, 100->(200) calls the function named 100 with the argument list (200).
For another, ${"What’s up, doc?"} refers to the scalar package variable by that name in the current package.
On the other hand, ${"What's up, doc?"} refers to the scalar package variable whose name is ${"s up, doc?"} and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly, $Who's is the $s variable in the Who package.
One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.
Identifiers consisting of a single character can be a punctuation character, including $$ or %!.
Identifiers can also be of the form $^C, which is either a control character or a circumflex followed by a non-control character.
If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the properties ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these:
the Alphabetic property — which includes far more than just Letters; it also contains various combining characters and the Letter_Number code points, plus the circled letters
the Decimal_Number property, which is rather more than merely [0-9]
Any and all characters with the Mark property, not just those marks that are deemed Other_Alphabetic
Any characters with the Connector_Punctuation property, of which underscore is just one such.
So either ^\d+$ or else
^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$
ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
But you should cover the nonsimple ones I describe earlier.
And we haven’t talked about packages yet.
Perl Packages in Identifiers
Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.
The package separator is either :: or ' at your whim.
You do not have to specify a package if it is the first component in a fully qualified identifier, in which case it means the package main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().
Finally, as a special case, a trailing double-colon (but not a single-quote) at the end of a hash is permitted, and this then refers to the symbol table of that name.
Thus %main:: is the main symbol table, and because you can omit main, so too is %::.
Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
Summary
It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.
And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.
There, aren’t you glad you asked? Hope this helps. Or something. ☺
The homework requests that you use the reference manuals, so I'll answer in those terms.
The Perl documentation is available at http://perldoc.perl.org/. The section that deals with variables is perldata. That will easily give you a usable answer.
In reality, I doubt that the complete answer is available in the documentation. There are special variables (see perlvar), and "use utf8;" can greatly affect the definition of "letter" and "number".
$ perl -E'use utf8; $é=123; say $é'
123
[ I only covered the identifier part. I just noticed the question is larger than that ]
The perlvar page of the Perl documentation has a section at the end roughly outlining the allowable syntax. In summary:
1. Any combination of letters, digits, underscores, and the special sequence :: (or '), provided it starts with a letter or underscore.
2. A sequence of digits.
3. A single punctuation character.
4. A single control character, which can also be written as caret-{letter}, e.g. ^W.
5. An alphanumeric string starting with a control character.
Note that most of the identifiers other than the ones in set 1 are either given a special meaning by Perl, or are reserved and may gain a special meaning in later versions. But if you're just trying to work out what is a valid identifier, then that doesn't really matter in your case.
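The five sets in that summary translate fairly directly into a regular expression. The sketch below is written in Python and restricted to ASCII for simplicity, so it is an approximation of the summary rather than of what perl actually accepts; ALTERNATIVES, IDENT and is_identifier_name are names invented for this example, and the sigil is assumed to have been removed already.

```python
import re
import string

ALTERNATIVES = [
    r"[A-Za-z_](?:[A-Za-z0-9_]|::|')*",          # set 1: ordinary names, with :: or '
    r"[0-9]+",                                   # set 2: a sequence of digits
    "[" + re.escape(string.punctuation) + "]",   # set 3: one punctuation character
    r"\^[A-Z]",                                  # set 4: caret-letter spelling
    r"[\x00-\x1f][A-Za-z0-9]*",                  # sets 4 and 5: control char, optional tail
]
IDENT = re.compile("|".join("(?:" + alt + ")" for alt in ALTERNATIVES))

def is_identifier_name(name):
    """True if name fits one of the five sets in the summary (ASCII only)."""
    return IDENT.fullmatch(name) is not None

for name in ["foo::bar", "isn't_it", "123", "!", "^W", "1abc"]:
    print(name, is_identifier_name(name))
```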
Since Perl has no official specification (Perl is whatever the perl interpreter can parse), these can be a little tricky to discern.
This page has examples of all the integer constant formats. The format of identifiers will need to be inferred from various pages in perldoc.