Do Unicode's line breaking rules require the last character to be a mandatory break?

I'm trying to use libunibreak (https://github.com/adah1972/libunibreak) to mark the possible line breaks in some given unicode text.
Libunibreak gives back four possible options for each code unit in some text:
LINEBREAK_MUSTBREAK
LINEBREAK_ALLOWBREAK
LINEBREAK_NOBREAK
LINEBREAK_INSIDEACHAR
Hopefully these are self-explanatory. I would expect that MUSTBREAK corresponds to newline characters like LF. However, for any given text libunibreak always indicates that the last character is MUSTBREAK.
So for example with the string "abc", the output would be [NOBREAK,NOBREAK,MUSTBREAK]. For "abc\n" the output would be [NOBREAK,NOBREAK,NOBREAK,MUSTBREAK]. I use the MUSTBREAK attribute to start a new line when drawing text so the first case ("abc") creates an extra linebreak that shouldn't be there.
Is this behaviour what Unicode specifies or is this a quirk of the library implementation I'm using?

Yes, this is what the Unicode line breaking algorithm specifies. Rule LB3 in UAX #14: Unicode Line Breaking Algorithm, section 6.1 "Non-tailorable Line Breaking Rules" says:
Always break at the end of text.
The spec further explains:
[This rule is] designed to deal with degenerate cases, so that there is [...] at least one line break for the whole text.
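For the drawing problem in the question, one option is to treat the final break as "end of the last line" rather than "start a new line". Here is a minimal sketch in Python, not actual libunibreak code: the constants are hypothetical stand-ins for the library's result codes, `split_lines` is an illustrative name, and it assumes one break value per code point (as with UTF-32 input).

```python
# Hypothetical stand-ins for the libunibreak result codes; the numeric values
# are arbitrary and not taken from the library.
MUSTBREAK, ALLOWBREAK, NOBREAK, INSIDEACHAR = range(4)

# Characters in the UAX #14 hard-break classes (BK, CR, LF, NL).
HARD_BREAKS = {"\n", "\r", "\v", "\f", "\x85", "\u2028", "\u2029"}


def split_lines(text, breaks):
    """Split text after every MUSTBREAK position.

    The break at the end of text (rule LB3) only closes the last line.
    A new, empty line is opened only if the text really ends with a
    hard break character.
    """
    lines, start = [], 0
    for i, kind in enumerate(breaks):
        if kind == MUSTBREAK:
            lines.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        # Defensive: keep trailing text if the final MUSTBREAK is missing.
        lines.append(text[start:])
    if text and text[-1] in HARD_BREAKS:
        lines.append("")
    return lines


print(split_lines("abc", [NOBREAK, NOBREAK, MUSTBREAK]))
print(split_lines("abc\n", [NOBREAK, NOBREAK, NOBREAK, MUSTBREAK]))
```

With this approach "abc" yields a single line, while "abc\n" yields the line plus an empty one after it.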


Do information separators constitute line-breaks in Unicode?

This Wikipedia article, which lists all Unicode whitespace characters, mentions 7 of them as line/paragraph separating characters (LF, VT, FF, CR, NEL, LS, PS). It says nothing about the ASCII 'information separator' characters (FS, GS, RS, US). But surprisingly, FS, GS, and RS have 'paragraph separator (B)' as their bidirectional class. This is confusing.
Now, when I encounter one of these 'information separator' characters in a text, should I consider it a line break or not? In other words, if I am writing a function which splits at line breaks, should I split at these three characters? (The str.splitlines() method in Python does consider them line breaks. I don't know about other implementations.)
For example:
Both in the linked Wikipedia table and in the Unicode bidi class database, LF is considered as line-break. So I can break line when I encounter that character.
Both in the linked Wikipedia table and in the Unicode bidi class database, SP is not considered as line-break. So I can't break a line when I encounter that character. (suppose no word-wrap).
The linked Wikipedia table does not mention GS as a line-break. But the Unicode bidi class database does mention it as line-break. I'm confused: what should I do in this case? What does bidi class refer to in this case?
Here I'm only asking about the Unicode standard. But if you know, you can also mention about line-breaks in the ASCII standard.
PS: I'm not sure whether the table in the linked Wikipedia page is correct. But I wasn't able to find any other good resource which lists all whitespaces.
FS, GS, RS, and US belong to the line break class Combining_Mark (CM). The relevant file in the Unicode Character Database for this information is LineBreak.txt.
UAX #14 (Unicode Line Breaking Algorithm) describes class CM as follows:
Combining character sequences are treated as units for the purpose of
line breaking. The line breaking behavior of the sequence is that of
the base character.
In other words: Class CM characters prohibit line breaks before them – they essentially “glue” themselves to the previous character. However, for all other purposes, the line breaking algorithm completely ignores the presence of class CM characters. Whether or not a line break opportunity exists after a class CM character depends solely* on the line break class of the base character it has been applied to, i.e. the first character going backwards that is not of class CM.
*There are some exceptions to this rule involving mandatory breaks and a few special formatting characters, but they shouldn’t be relevant for your purposes.
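The bidi class is a separate property used by the Unicode Bidirectional Algorithm (UAX #9), which is why it can say "paragraph separator" while the line break class says CM. A quick way to see the two views side by side is Python's standard `unicodedata` module together with `str.splitlines()`. This is only an illustration of how Python behaves, not a statement about other implementations.

```python
import unicodedata

SEPARATORS = {"FS": "\x1c", "GS": "\x1d", "RS": "\x1e", "US": "\x1f"}

# Bidi class: a property for the bidirectional algorithm, not for line breaking.
bidi = {name: unicodedata.bidirectional(ch) for name, ch in SEPARATORS.items()}
print(bidi)

# str.splitlines() uses its own fixed list of boundaries, which includes
# FS, GS and RS but not US.
parts = "a\x1cb\x1dc\x1ed\x1fe".splitlines()
print(parts)
```

So Python's splitting behaviour matches the bidi class B of FS, GS and RS, not the UAX #14 line break class.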

What counts as a newline for Raku *source* files?

I was somewhat surprised to observe that the following code
# comment 
say 1;
# comment 
say 2;
# comment say 3;
# comment say 4;
prints 1, 2, 3, and 4.
Here are the relevant characters after "# comment":
say "

".uninames.raku;
# OUTPUT: «("PARAGRAPH SEPARATOR", "LINE SEPARATOR", "<control-000B>", "<control-000C>").Seq»
Note that many/all of these characters are invisible in most fonts. At least with my editor, none cause the following text to be printed on a new line. And at least one (<control-000C>, aka Form Feed, sometimes printed as ^L) is in fairly wide use in Vim/Emacs as a section separator.
This raises a few questions:
1. Is this intentional, or a bug?
2. If intentional, what's the use-case (other than winning obfuscated code contests!)?
3. Is it just these 4 characters, or are there others? (I found these because they share the mandatory break Unicode property. Does that property (or some other Unicode property?) govern what Raku considers as a newline?)
4. Just, really, wow.
(I realize #4 is not technically a question, but I feel it needed to be said).
Raku's syntax is defined as a Raku grammar. The rule for parsing such a comment is:
token comment:sym<#> {
'#' {} \N*
}
That is, it eats everything after the # that is not a newline character. As with all built-in character classes in Raku, \n and its negation are Unicode-aware. The language design docs state:
\n matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.
Which is a reference to the Unicode standard for regular expressions.
I somewhat doubt there was ever a specific language design discussion along the lines of "let's enable all the kinds of newlines in Unicode, it'll be cool!" Rather, the decisions were that Raku should follow the Unicode regex technical report, and that Raku syntax would be defined in terms of a Raku grammar and thus make use of the Unicode-aware character classes. That a range of different newline characters are supported is a consequence of consistently following those principles.
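To make the consequence concrete outside of Raku, here is a rough Python sketch of the same idea: a comment runs from '#' to the next logical newline, where "logical newline" means the line boundaries listed in UTS #18 section 1.6. The names `LOGICAL_NEWLINE` and `strip_comments` are made up for this example, and the sketch deliberately ignores '#' inside string literals.

```python
import re

# Line boundaries from UTS #18 section 1.6: LF, VT, FF, CR, NEL, LS, PS,
# with CR LF treated as a single boundary (so it is listed first).
LOGICAL_NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def strip_comments(source):
    """Remove '#' comments, each of which ends at the next logical newline."""
    kept = []
    for line in LOGICAL_NEWLINE.split(source):
        code = line.split("#", 1)[0].strip()
        if code:
            kept.append(code)
    return kept


# PARAGRAPH SEPARATOR and FORM FEED both end the comment here.
print(strip_comments("# comment\u2029say 1;\n# comment\x0csay 3;"))
```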

Is there a Unicode Character to indicate possible linebreak in a string

We are looking for a "BREAK NO-SPACE" character, the reverse of NO-BREAK SPACE. It should not print anything; it should just tell the components further down the line that the word can be split and line-broken at these positions.
Is there anything similar to this in Unicode or any other encoding scheme? It would make life easier since we could then rely on built-in methods for line split in our framework instead of introducing custom logic and some "Magic Character".
Soft hyphen U+00AD is invisible but indicates where a word should be broken.
So I found the Zero Width Space character, U+200B. The documentation describes exactly what I was looking for.
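If the framework's built-in line splitting turns out not to honour U+200B, the fallback logic is small. The following is a hedged Python sketch of a greedy wrapper that breaks only at zero width spaces and drops them from the output. `wrap_at_zwsp` is a hypothetical helper, and it measures width in code points, which is only an approximation of rendered width.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible, marks a break opportunity


def wrap_at_zwsp(text, width):
    """Greedy wrap that breaks only at ZWSP marks and drops the marks.

    A single piece longer than `width` is kept whole, since there is no
    permitted break inside it.
    """
    lines, current = [], ""
    for piece in text.split(ZWSP):
        if current and len(current) + len(piece) > width:
            lines.append(current)
            current = piece
        else:
            current += piece
    if current:
        lines.append(current)
    return lines


print(wrap_at_zwsp("super\u200bcali\u200bfragilistic", 10))
```

Unlike the soft hyphen, a break taken at U+200B is not expected to show a hyphen at the end of the line.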

Unicode character for marking

We are going to digitize a lot of books. We want to mark the places of line breaks in the original book without influencing the flow of the digital book. Which invisible Unicode character can be used to mark some special places in a raw file?
(\n will be used to indicate end of paragraph)
This is a sentence
in the original book that
I want to mark line
break places.
What is the proper character to replace *:
This is a sentence * in the original book that * I want to mark line * break places.
Unicode has no concept of a hidden character that represents a line break in some original but does not cause line break in rendering. Unicode encodes plain text data, and its control characters for line breaks have an effect when plain text is rendered.
What matters here is how the files will be used. If they need to be processable with plain text editors, then you need to decide: either the line breaks are replicated in default rendering, or they are omitted when creating the file. You can’t make them invisible. And different text editors, like Notepad and Emacs, may well use different line control conventions; one program’s end of line is another program’s end of paragraph.
If the files will only be processed by programs that you create, then you can use whatever conventions you like. The most logical one is this:
“Line and Paragraph Separator. The Unicode Standard provides two unambiguous characters,
U+2028 line separator and U+2029 paragraph separator, to separate lines and
paragraphs. They are considered the default form of denoting line and paragraph boundaries
in Unicode plain text. A new line is begun after each line separator. A new paragraph
is begun after each paragraph separator. As these characters are separator codes, it is not necessary either to start the first line or paragraph or to end the last line or paragraph with them. Doing so would indicate that there was an empty paragraph or line following. The paragraph separator can be inserted between paragraphs of text. Its use allows the creation of plain text files, which can be laid out on a different line width at the receiving end. The line separator can be used to indicate an unconditional end of line.”
http://www.unicode.org/versions/Unicode6.1.0/ch16.pdf (pages 6 and 7 in the PDF)
Beware that U+2028 and U+2029 are generally not understood by text editors. They are suitable for storing data in plain text format. When the text is to be rendered, the rendering software has the option of ignoring the original division into lines and treating U+2028 as equivalent to a space, except if preceded by a hyphen (which poses a problem that cannot be resolved without higher level information: a line that ends with “foo-” and is followed by a line beginning with “bar” could represent the word “foobar” as hyphenated for line breaking, or a hyphenated compound “foo-bar” or, in some cases, the combination “foo- bar”).
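If you go with your own programs, the storage convention could look roughly like the Python sketch below. The function names `encode_book` and `reflow` are invented for this example, and the reflow step simply turns each U+2028 into a space; lines ending in a hyphen are left untouched because, as noted above, they cannot be resolved automatically.

```python
LS = "\u2028"  # LINE SEPARATOR: original line break inside a paragraph
PS = "\u2029"  # PARAGRAPH SEPARATOR: boundary between paragraphs


def encode_book(paragraphs):
    """paragraphs: a list of paragraphs, each a list of original lines."""
    return PS.join(LS.join(lines) for lines in paragraphs)


def reflow(stored):
    """Ignore the original line division; return one string per paragraph."""
    result = []
    for para in stored.split(PS):
        lines = para.split(LS)
        result.append(" ".join(line.strip() for line in lines))
    return result


stored = encode_book([[
    "This is a sentence",
    "in the original book that",
    "I want to mark line",
    "break places.",
]])
print(reflow(stored))
```

The original line division stays recoverable from the stored text by splitting on U+2028 again.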
Use the line feed character (LF, "\n", 0x0A) and/or maybe carriage return (CR, "\r", 0x0D).
I.e., the regular characters for this purpose.

What is the syntax for new line in Objective-C?

Can anyone tell me what is the symbol used for new line?
In the C language we use '\n' for new line. What do we use in Objective-C?
Is it the same?
Objective-C is an extension of C. So '\n' works too in Objective-C.
It's the same (\n), but there's a lot more to the topic depending on whether it's just a new line or a new paragraph, what context the text will be processed in, etc. From the documentation (referencing the Cocoa docs here because they cover both Objective-C [implicitly] and Cocoa, since you have the iphone tag on your question):
There are a number of ways in which a line or paragraph break may be represented. Historically \n, \r, and \r\n have been used. Unicode defines an unambiguous paragraph separator, U+2029 (for which Cocoa provides the constant NSParagraphSeparatorCharacter), and an unambiguous line separator, U+2028 (for which Cocoa provides the constant NSLineSeparatorCharacter).
In the Cocoa text system, the NSParagraphSeparatorCharacter is treated consistently as a paragraph break, and NSLineSeparatorCharacter is treated consistently as a line break that is not a paragraph break—that is, a line break within a paragraph. However, in other contexts, there are few guarantees as to how these characters will be treated. POSIX-level software, for example, often recognizes only \n as a break. Some older Macintosh software recognizes only \r, and some Windows software recognizes only \r\n. Often there is no distinction between line and paragraph breaks.
Which line or paragraph break character you should use depends on how your data may be used and on what platforms. The Cocoa text system recognizes \n, \r, or \r\n all as paragraph breaks—equivalent to NSParagraphSeparatorCharacter. When it inserts paragraph breaks, for example with insertNewline:, it uses \n. Ordinarily NSLineSeparatorCharacter is used only for breaks that are specifically line breaks and not paragraph breaks, for example in insertLineBreak:, or for representing HTML <br> elements.
If your breaks are specifically intended as line breaks and not paragraph breaks, then you should typically use NSLineSeparatorCharacter. Otherwise, you may use \n, \r, or \r\n depending on what other software is likely to process your text. The default choice for Cocoa is usually \n.
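The distinction the documentation draws can be mirrored in a few lines of plain string handling. This is a language-neutral sketch in Python rather than Cocoa code, and `paragraphs_and_lines` is a made-up name: \n, \r, \r\n and U+2029 are treated as paragraph breaks, while U+2028 only splits lines within a paragraph.

```python
import re

# \r\n comes first so that it counts as one paragraph break, not two.
PARAGRAPH_BREAK = re.compile("\r\n|[\n\r\u2029]")
LINE_SEPARATOR = "\u2028"


def paragraphs_and_lines(text):
    """Return a list of paragraphs, each a list of its lines."""
    return [para.split(LINE_SEPARATOR) for para in PARAGRAPH_BREAK.split(text)]


print(paragraphs_and_lines("a\u2028b\r\nc\rd\u2029e"))
```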
It's the same, but if you are printing to the console, you should use
NSLog(@"This is a console statement\n on two different lines");
Hope this helps.
It's the same. Objective-C is a superset of C, so most things from C will work in Objective-C too.
It's the same. In Objective-C, "\n" is used for a new line.
Straight from the dragon's mouth.
http://developer.apple.com/library/mac/#documentation/cocoa/conceptual/Strings/Articles/stringsParagraphBreaks.html