Any subtleties in source map line/column definitions when using source-maps for non-JS languages - unicode

I am getting a code generator to produce source maps that tracks line/column based on Unicode code-points with lines broken by (LF, CR, CRLF). I worry about other line-breaks embedded in comments and supplementary characters might cause source-map consumers to disagree on which part of a source text a (line, column) pair references.
Specifically, I am confused by the source-map v3 specification's use of terms like
the zero-based starting line in the original source
the zero-based starting column of the line in the source represented
I imagine that these numbers are used by programmer tools to navigate to/highlight chunks of code in the original source.
Problems due to different line break definitions
JavaScript recognizes CR, LF, CRLF, U+2028, U+2029 as line breaks
Rust recognizes those and U+85 though or only recognizes LF and CRLF as breaks depending on how you slice the grammar.
Java only treats CR, LF, CRLF as line breaks.
Languages that follow Unicode TR #13 might additionally treat VT and FF as line breaks.
So if a source-map generator treats U+0085 (for example) as a newline and a source-map consumer does not, might they disagree about where a (source-line, source-column) pair points?
Problems due to differing column definitions.
Older versions of JavaScript defined a source text as UTF-16 code-units, suggesting that a column count is the number of UTF-16 code-units since the end of the last line-break.
ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 2.1 or later, using the UTF-16 transformation format.
But the current spec does not describe source texts in terms of UTF-16:
SourceCharacter::
    any Unicode code point
Are column counts likely to be thrown off if source-map consumers differently treat supplementary characters as occupying one code-point column or two UTF-16 columns?
For example, since '𝄬' is U+1d12C, a supplementary code-point encoded using two UTF-16 code-units, might column counts disagree for a line like
let 𝄬 = "𝄬" /* 𝄬 */ + ""
Is the + symbol at column 20 (zero-indexed by code-point) or column 23 (zero-indexed by UTF-16 code-unit)?
Am I missing something in the specification that clarifies this, or is there a de facto rule used by most source-map producers/consumers?
If these are problems, are there known workarounds or best practices when tracking line/column counts for source languages that translate to JS?
I will probably have to reverse-engineer what implementations like Mozilla's source-map.js or Chrome's dev console do, but thought I'd try and find a spec reference so I know whom to file bugs against and who is correct.

Related

Do information separators constitute line-breaks in Unicode?

This Wikipedia article which lists all Unicode whitespaces mentions 7 of them as line/paragraph separating characters (LF, VT, FF, CR, NEL, LS, PS). Here there is nothing given about ASCII 'information separator' characters (FS, GS, RS, US). But surprisingly FS, GS, RS have 'paragraph separator (B)' as their bidirectional class. This is confusing.
Now, when I encounter one of these 'information separator' characters in a text, should I consider them as line-break or not? In other words, if I am writing a function which splits at line breaks, then should I split at these three characters? (string.splitlines() function in Python does consider them as line breaks. I don't know about other implementations.)
For example:
Both in the linked Wikipedia table and in the Unicode bidi class database, LF is considered as line-break. So I can break line when I encounter that character.
Both in the linked Wikipedia table and in the Unicode bidi class database, SP is not considered as line-break. So I can't break a line when I encounter that character. (suppose no word-wrap).
The linked Wikipedia table does not mention GS as a line-break. But the Unicode bidi class database does mention it as line-break. I'm confused: what should I do in this case? What does bidi class refer to in this case?
Here I'm only asking about the Unicode standard. But if you know, you can also mention about line-breaks in the ASCII standard.
PS: I'm not sure whether the table in the linked Wikipedia page is correct. But I wasn't able to find any other good resource which lists all whitespaces.
FS, GS, RS, and US belong to the line break class Combining_Mark (CM). The relevant file in the Unicode Character Database for this information is LineBreak.txt.
UAX #14 (Unicode Line Breaking Algorithm) describes class CM as follows:
Combining character sequences are treated as units for the purpose of
line breaking. The line breaking behavior of the sequence is that of
the base character.
In other words: Class CM characters prohibit line breaks before them – they essentially “glue” themselves to the previous character. However, for all other purposes, the line breaking algorithm completely ignores the presence of class CM characters. Whether or not a line break opportunity exists after a class CM character depends solely* on the line break class of the base character it has been applied to, i.e. the first character going backwards that is not of class CM.
*There are some exceptions to this rule involving mandatory breaks and a few special formatting characters, but they shouldn’t be relevant for your purposes.

Is U+0085 NEXT LINE (NEL) deprecated?

There is few information about NEL. I think it was supposed to replace LF and CR LF, but it seems like it wasn't very used.
Is it somehow deprecated and should applications interpret it as a new line?
There are many different code points which can be used as line separators, mostly for legacy reasons. Unicode's design seeks to preserve all information, so text can be roundtrip en- and decoded to/from Unicode without information loss. NEL is/was used in EBCDIC to represent newlines, so it made its way into Unicode as a separate character.
It's not "deprecated" in the sense that it still fulfils the function it was designed for, but you will hardly find it in actual use unless you're dealing with legacy EBCDIC somehow. If you do, you may want to treat it as newline, but many modern systems to date just treat it as whitespace.
The Wikipedia article on NEL has more information.

Unicode version of ABNF?

I want to write a grammar for a file format whose content can contain characters other than US-ASCII ones. Since I am used to ABNF, I try to use it...
However, none of RFCs 5234 and 7405 are very friendly towards people who DO NOT use US ASCII.
In fact, I'm looking for an ABNF version (and possibly some basic rules as well) which is character oriented rather than byte oriented; the only thing which RFC 5234 has to say about this is in section 2.4:
2.4. External Encodings
External representations of terminal value characters will vary
according to constraints in the storage or transmission environment.
Hence, the same ABNF-based grammar may have multiple external
encodings, such as one for a 7-bit US-ASCII environment, another for
a binary octet environment, and still a different one when 16-bit
Unicode is used. Encoding details are beyond the scope of ABNF,
although Appendix B provides definitions for a 7-bit US-ASCII
environment as has been common to much of the Internet.
By separating external encoding from the syntax, it is intended that
alternate encoding environments can be used for the same syntax.
That doesn't really clarify matters.
Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?
Refer to section 2.3 of RFC 5234, which says:
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
Unicode is just the set of non-negative integers U+0000 through U+10FFFF minus the surrogate range D800-DFFF and there are various RFCs that use ABNF accordingly. An example is RFC 3987.
If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes instead. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).
E.g. the PowerShell grammar has lines like this:
double-quote-character:
       " (U+0022)
       Left double quotation mark (U+201C)
       Right double quotation mark (U+201D)
       Double low-9 quotation mark (U+201E)
Or in the Java specification:
UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter
UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
       u
       UnicodeMarker u
RawInputCharacter:
       any Unicode character
HexDigit: one of
       0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
The \, u, and hexadecimal digits here are all ASCII characters.
Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.
If it's for automatic parser generation, you may be better off finding a tool that allows you to specify a grammar both in Unicode and ABNF-like form and publish that instead. People writing parsers should be expected to understand either, though.

XML and Unicode specifications: what’s a legal character?

My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.
XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.
Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Question: What is/was a legal Unicode character?
The Unicode Glossary defines it thus:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)
NUL is a codepoint, and it falls under the definition of "abstract character" so it is a character by sense 2 above.
Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?
NUL has been a control character from early versions.
Appendix D contains a list of changes.
It says in table D.2 that there have been 65 control characters from Version 1 through Version 3 without change.
Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls 65 65 65 65 65
And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.
The use of the word “character” is intentionally fuzzy in the Unicode standard, but mostly it is used in a technical sense: a code point designated as an assigned character code point. This does not completely coincide with the intuitive concept of character. For example, the intuitive character that consists of letter i with macron and grave accent does not exist as a code point; in Unicode, it can only be represented as a sequence of two or three code points. As another example, the so-called control characters are not characters in the intuitive sense.
When other standards and specifications refer to “Unicode characters,” they refer to code points designated as assigned character code points. The set of Unicode characters varies by Unicode standard version, since new code points are assigned. Technically, the UnicodeData.txt file (at ftp://ftp.unicode.org/Public/UNIDATA/) indicates which code points are characters.
U+0000, conventionally denoted by NUL, has been a Unicode character since the beginning.
The XML specifications are inexact in many ways as regards to characters, as you have observed. But the essential definition is the BNF production for “Char” and the statement “XML processors MUST accept any character in the range specified for Char.” This means that in XML specifications, the concept of character is broader than Unicode character. The ranges in the production contain unassigned code points, actually a huge number of them.
The comment to the “Char” production in XML specifications is best ignored. It is very confusing and even incorrect. The “Char” production simply refers to a set of Unicode code points (different sets in different versions of XML). The set includes code points that you should never use in character data, as well as code points that should be avoided for various reasons. But such rules are at a level different from the formal rules of XML and requirements on XML implementations.
When selecting or writing a routine for checking character data, it depends on the application and purpose what should be accepted and what should be done with code points that fail the test. Even surrogate code points might be processed in some way instead of being just discarded; they may well appear due to confusions with encodings (or e.g. when a Java string has been naively taken as a string of Unicode characters – it is as such just a sequence of 16-bit code units).
I would ignore the verbage and just focus on the definitions:
XML 1.0:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
XML 1.1:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
It sounds stupid because it is stupid. The First Edition of XML (1998) read "the legal graphic characters of Unicode." For whatever reason, the word "graphic" was removed from the Second Edition of 2000, perhaps because it is inaccurate: XML allows many characters that are not graphic characters.
The definition in the Char production is indeed the right place to look.

What issues would come from treating UTF-16 as a fixed 16-bit encoding?

I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a
variable-length encoding, which is
more complex to process than a
fixed-length encoding. Also, see my
comments on Gumbo's answer: basically,
combining characters exist in all
encodings (UTF-8, UTF-16 & UTF-32) and
they require special handling. You can
use the same special handling that you
use for combining characters to also
handle surrogate pairs in UTF-16, so
for the most part you can ignore
surrogates and treat UTF-16 just like
a fixed encoding.
I've a little confused by the last part ("for the most part"). If UTF-16 is treated as fixed 16-bit encoding, what issues could this cause? What are the chances that there are characters outside of the BMP? If there are, what issues could this cause if you'd assumed two-byte characters?
I read the Wikipedia info on Surrogates but it didn't really make things any clearer to me!
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance:
UTF-8 is much harder to decode than
UTF-16. In UTF-16 characters are
either a Basic Multilingual Plane
character (2 bytes) or a Surrogate
Pair (4 bytes). UTF-8 characters can
be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 would not have any three-byte characters, so by assuming 16bits, you wouldn't "totally screw up" by ending up one-byte off. But I'm still not convinced this is any different to assuming UTF-8 is single-byte characters!
UTF-16 includes all "base plane" characters. The BMP covers most of the current writing systems, and includes many older characters that one can practically encounter. Take a look at them and decide whether you really are going to encounter any characters from the extended planes: cuneiform, alchemical symbols, etc. Few people will really miss them.
If you still encounter characters that require extended planes, these are encoded by two code points (surrogates), and you'll see two empty squares or question marks instead of such a non-character. UTF is self-synchronizing, so a part of a surrogate character never looks like a legitimate character. This allows things like string searches to work even if surrogates are present and you don't handle them.
Thus issues arising from treating UTF-16 as effectively USC-2 are minimal, aside from the fact that you don't handle the extended characters.
EDIT: Unicode uses 'combining marks' that render at the space of previous character, like accents, tilde, circumflex, etc. Sometimes a combination of a diacritic mark with a letter can be represented as a distinct code point, e.g. á can be represented as a single \u00e1 instead of a plain 'a' + accent which are \u0061\u0301. Still you can't represent unusual combinations like z̃ as one code point. This makes search and splitting algorithms a bit more complex. If you somehow make your string data uniform (e.g. only using plain letters and combining marks), search and splitting become simple again, but anyway you lose the 'one position is one character' property. A symmetrical problem happens if you're seriously into typesetting and want to explicitly store ligatures like fi or ffl where one code point corresponds to 2 or 3 characters. This is not a UTF issue, it's an issue of Unicode in general, AFAICT.
It is important to understand that even UTF-32 is fixed-length when it comes to code points, not characters. There are many characters that are composed from multiple code points, and therefore you can't really have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).
To answer your question - the most obvious issue with treating UTF-16 as fixed-length encoding form would be to break a string in a middle of a surrogate pair so you get two invalid code points. It all really depends what you are doing with the text.
I guess what I really mean is
"Why would anyone suggest treating
UTF-16 as fixed encoding when it seems
bogus?"
Two words: Backwards compatibility.
Unicode was originally intended to use a fixed-width 16-bit encoding (UCS-2), which is why early adopters of Unicode (e.g., Sun with Java and Microsoft with Windows NT), used a 16-bit character type. When it turned out that 65,536 characters wasn't enough for everyone, UTF-16 was developed in order to allow this 16-bit character systems to represent the 16 new "planes".
This meant that characters were no longer fixed-width, so people created the rationalization that "that's OK because UTF-16 is almost fixed width."
But I'm still not convinced this is
any different to assuming UTF-8 is
single-byte characters!
Strictly speaking, it's not any different. You'll get incorrect results for things like "\uD801\uDC00".lower().
However, assuming UTF-16 is fixed width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.
You can use the same special handling
that you use for combining characters
to also handle surrogate pairs in
UTF-16
I don't know what he's talking about. Combining sequences, whose constituent characters have an individual identity, are nothing at all like surrogate characters, which are only meaningful in pairs.
In particular, the characters within a combining sequence can be converted to a different encoding form one characters at a time.
>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'
But not surrogates:
>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding like fixed (constant length) you risk introducing error whenever you use "number of 16-bit numbers" to mean "number of characters", since the number of characters might actually be less than the number of 16-bit quantities.
The Unicode standard has changed several times along the way. For example, UCS-2 is not a valid encoding anymore. It has been deprecated for a while now.
As mentioned by user 9000, even in UTF-32, you have sequences of characters that are interdependent. The à is a good example, although this character can be canonicalized to \x00E1. So you can make it simple.
Unicode, even when using the UTF-32 encoding, supports up to 30 code points, one after the other, to represent the most complex characters. (The existing characters do not use that many, I think the longest in existence is currently 17 if I'm correct.)
For that reason, Unicode developed Normalization Forms. It actually considers five different forms:
Unnormalized -- a sequence you create manually, for example; text editors are expected to save properly normalized (NFC) code sequences
NFD -- Normalization Form Decomposition
NFKD -- Normalization Form Compatibility Decomposition
NFC -- Normalization Form Canonical Composition
NFKC -- Normalization Form Compatibility Canonical Composition
Although in most situations it does not matter much because long compositions are rare, even in languages that use them.
And in most cases, your code already deals with canonical compositions. However, if you create strings manually in your code, you are not unlikely to create an unnormalized string (assuming you use such long forms).
Properly implemented servers on the Internet are expected to refused strings that are not canonical compositions as per Unicode. Long forms are also forbidden over connections. For example, the UTF-8 encoding technically allows for ASCII characters to be encoded using 1, 2, 3, or 4 bytes (and the old encoding allowed up to 6 bytes!) but those encoding are not permitted.
Any comment on the Internet that contradicts the Unicode Normalization Form document is simply incorrect.