Inconsistent word and character boundaries in ICU - unicode

I'm using ICU's break iterators for both characters and words, as described here. I expect the character break iterator to stop more frequently, and its break points to be a superset of the word break iterator's. For instance, if I pass abc, I get a, b, and c from the character break iterator, while I get abc from the word break iterator.
Now, I have the Thai string ด้าน้ำ. The problem is that the behavior of these two break iterators is inconsistent. Given that the above string is 6 code points long, I get these results from ICU 61.1 on macOS:
Word boundaries:
[0, 5)
[5, 6)
Character boundaries:
[0, 2)
[2, 3)
[3, 6)
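For reference, the six code points of the test string can be listed with a quick Python check:

>>> import unicodedata
>>> ["U+%04X %s" % (ord(c), unicodedata.name(c)) for c in 'ด้าน้ำ']
['U+0E14 THAI CHARACTER DO DEK',
 'U+0E49 THAI CHARACTER MAI THO',
 'U+0E32 THAI CHARACTER SARA AA',
 'U+0E19 THAI CHARACTER NO NU',
 'U+0E49 THAI CHARACTER MAI THO',
 'U+0E33 THAI CHARACTER SARA AM']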
As you can see, the character break iterator breaks the word at [3, 6) (which seems correct), while the word break iterator breaks it at [5, 6). Here's a small Python 3 script which uses PyICU to reproduce the issue:
import PyICU

def wordBreakIterator():
    return PyICU.BreakIterator.createWordInstance(PyICU.Locale("th"))

def charBreakIterator():
    return PyICU.BreakIterator.createCharacterInstance(PyICU.Locale("th"))

def printBoundaries(txt, bi):
    bi.setText(txt)
    start = bi.first()
    try:
        while True:
            end = next(bi)
            print("[{}, {})".format(start, end))
            start = end
    except StopIteration:
        pass

if __name__ == "__main__":
    text = u'ด้าน้ำ'
    print("Word boundaries:")
    printBoundaries(text, wordBreakIterator())
    print("Character boundaries:")
    printBoundaries(text, charBreakIterator())

Related

Are NFC normalization boundaries also extended grapheme cluster boundaries?

This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location and make sure that the resulting text is also in NFC. You also want to move the cursor to the first grapheme boundary that immediately follows the inserted text.
Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A
Suppose you want to insert the two-code-point string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. Then the result will be:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
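This concatenation behavior is easy to check in Python (unicodedata.is_normalized requires Python 3.8+; on older versions, compare against unicodedata.normalize instead):

>>> import unicodedata
>>> a = "be\u0331"        # b, e, COMBINING MACRON BELOW -- already NFC
>>> b = "\u0301\u0307a"   # COMBINING ACUTE ACCENT, COMBINING DOT ABOVE, a -- also NFC
>>> unicodedata.is_normalized("NFC", a + b)
False
>>> [unicodedata.name(c) for c in unicodedata.normalize("NFC", a + b)]
['LATIN SMALL LETTER B', 'LATIN SMALL LETTER E WITH ACUTE', 'COMBINING MACRON BELOW',
 'COMBINING DOT ABOVE', 'LATIN SMALL LETTER A']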
Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes. So the following sample Python code would work:
import unicodedata

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    if new_cursor_pos == 0:
        # grapheme_break_after is a function that
        # returns the offset of the first grapheme
        # boundary after the given index
        new_cursor_pos = grapheme_break_after(new_text, 0)
    return new_text, new_cursor_pos
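(Here grapheme_break_after is left abstract. Assuming the PyICU bindings, a possible sketch of it could be:)

import icu  # PyICU

def grapheme_break_after(text, index):
    # Return the offset of the first grapheme cluster boundary strictly after `index`.
    bi = icu.BreakIterator.createCharacterInstance(icu.Locale())
    bi.setText(text)
    return bi.following(index)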
But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact with what precedes it during normalization, such that NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text) is always true?
Update
@nwellnhof's excellent analysis below motivated me to investigate things
further. So I followed the "When in doubt, use brute force" mantra and wrote a
small script that parses grapheme break properties and examines each code point
that can appear at the beginning of a grapheme, to test whether it can
possibly interact with preceding code points during normalization. Here's the
script:
from urllib.request import urlopen
import icu, unicodedata

URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"

break_props = {}
with urlopen(URL) as f:
    for line in f:
        line = line.decode()
        p = line.find("#")
        if p >= 0:
            line = line[:p]
        line = line.strip()
        if not line:
            continue
        fields = [x.strip() for x in line.split(";")]
        codes = [int(x, 16) for x in fields[0].split("..")]
        if len(codes) == 2:
            start, end = codes
        else:
            assert(len(codes) == 1)
            start, end = codes[0], codes[0]
        category = fields[1]
        break_props.setdefault(category, []).extend(range(start, end + 1))

# The only code points that can't appear at the beginning of a grapheme
# are those in the following categories. See the regexps in
# UAX #29 Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark") for c in break_props[name])

nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
    if c in to_ignore:
        continue
    if not nfc.hasBoundaryBefore(chr(c)):
        print("U+%04X %s" % (c, unicodedata.name(chr(c))))
Looking at the output, it appears that there are about 40 code points that are grapheme starters but still compose with preceding code points in NFC. Basically, they are non-precomposed Hangul syllables of type V (U+1161..U+1175) and T (U+11A8..U+11C2). Things make sense when you examine the regular expressions in UAX #29, Table 1c, together with what the standard says about Jamo composition (section 3.12, p. 147 of version 13 of the standard).
The gist of it is that Hangul sequences of the form {L, V} can compose to a
Hangul syllable of type LV, and similarly sequences of the form {LV, T} can
compose to a syllable of type LVT.
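For instance, with Python's unicodedata:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\u1100\u1161")   # L + V
'가'   # U+AC00 HANGUL SYLLABLE GA, an LV syllable
>>> unicodedata.normalize("NFC", "\uAC00\u11A8")   # LV + T
'각'   # U+AC01 HANGUL SYLLABLE GAG, an LVT syllable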
To sum up, and assuming I'm not mistaken, the above Python code could
be corrected as follows:
import unicodedata
import icu  # pip3 install PyICU

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    new_text = unicodedata.normalize("NFC", new_text)
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(new_text)
    if new_cursor_pos == 0:
        # Move the cursor to the first grapheme boundary > 0.
        new_cursor_pos = next(break_iter)
    elif new_cursor_pos > len(new_text):
        new_cursor_pos = len(new_text)
    elif not break_iter.isBoundary(new_cursor_pos):
        # isBoundary() moves the iterator to the first boundary >= the given
        # position.
        new_cursor_pos = break_iter.current()
    return new_text, new_cursor_pos
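As a quick sanity check, running the b/e example from the top of the question through this corrected function should reproduce the expected result:

>>> text = "be\u0331a"                      # b, e, COMBINING MACRON BELOW, a
>>> new_text, pos = insert(text, 3, "\u0301\u0307")
>>> [unicodedata.name(c) for c in new_text]
['LATIN SMALL LETTER B', 'LATIN SMALL LETTER E WITH ACUTE', 'COMBINING MACRON BELOW',
 'COMBINING DOT ABOVE', 'LATIN SMALL LETTER A']
>>> pos   # first grapheme boundary after the inserted text, just before the 'a'
4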
The (possibly) pointless test new_cursor_pos > len(new_text) is there to
catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can
actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have
a set of three code points A, B, and C, and two precomposed forms A+B and
A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
If this case doesn't occur in practice (which is very likely, especially with
"real" texts), then the above Python code will necessarily locate the first
grapheme boundary after the end of the inserted text. Otherwise, it will merely
locate some grapheme boundary after the inserted text, but not necessarily the
first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave
my investigation at that for now.
As mentioned in my comment, the actual boundaries can differ slightly. But AFAICS, there should be no meaningful interaction. UAX #29 states:
6.1 Normalization
[...] the grapheme cluster boundary specification has the following features:
There is never a break within a sequence of nonspacing marks.
There is never a break between a base character and subsequent nonspacing marks.
This only mentions nonspacing marks. But with extended grapheme clusters (as opposed to legacy ones), I'm pretty sure these statements also apply to "non-starter" spacing marks[1]. This would cover all normalization non-starters (which must be either nonspacing (Mn) or spacing (Mc) marks). So there's never an extended grapheme cluster boundary before a non-starter[2] which should give you the guarantee you need.
Note that it's possible to have multiple runs of starters and non-starters ("normalization boundaries") within a single grapheme cluster, for example with U+034F COMBINING GRAPHEME JOINER.
[1] Some spacing marks are excluded, but these should all be starters.
[2] Except at the start of text.
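To see the U+034F case concretely, here is a quick sketch using the same PyICU APIs as above:

import icu

nfc = icu.Normalizer2.getNFCInstance()
s = "e\u034F\u0301"  # e, COMBINING GRAPHEME JOINER, COMBINING ACUTE ACCENT

# U+034F is a normalization starter, so a normalization boundary precedes it...
print(nfc.hasBoundaryBefore("\u034F"))  # True

# ...yet there is no grapheme cluster boundary inside the sequence.
bi = icu.BreakIterator.createCharacterInstance(icu.Locale())
bi.setText(s)
print(list(bi))  # [3] -> the whole string is a single grapheme cluster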

Using M4, how to extract code from a string and increase indentation?

I need to write an M4 macro that extracts and transforms code between curly braces.
I want to transform
{
import math
a_list = [1, 4, 9, 16]
if val:
    print([math.sqrt(i) for i in a_list])
else:
    print("val is False")
    print("bye bye")
}
to
import math
a_list = [1, 4, 9, 16]
if val:
    print([math.sqrt(i) for i in a_list])
else:
    print("val is False")
    print("bye bye")
The macro has to trim the whitespace before the first { and after the last }.
Because this is Python code, the relative indentation inside the curly braces must be preserved.
Because the output of the macro will be emitted somewhere that needs a certain level of indentation, the macro should also be able to add extra indentation (some number of spaces), e.g. given as an argument.
The project is already using m4sugar, so the quotes are [ and ].
Thanks.

Calculating upper case letters, lower case letters, and other characters

Write a program that accepts a sentence as console input and calculates the number of upper case letters, lower case letters, and other characters.
Suppose the following input is supplied to the program:
Hello World;!#
Since this question sounds like a programming assignment, I've written this in a more wordy manner. This is standard Python 3, not JES.
#! /usr/bin/env python3
import sys

upper_case_chars = 0
lower_case_chars = 0
total_chars = 0
found_eof = False

# Read character after character from stdin, processing it in turn.
# Stop if an error is encountered, or End-Of-File happens.
while (not found_eof):
    try:
        letter = str(sys.stdin.read(1))
    except:
        # handle any I/O error somewhat cleanly
        break
    if (letter != ''):
        total_chars += 1
        if (letter >= 'A' and letter <= 'Z'):
            upper_case_chars += 1
        elif (letter >= 'a' and letter <= 'z'):
            lower_case_chars += 1
    else:
        found_eof = True

# write the results to the console
print("Upper-case Letters: %3u" % (upper_case_chars))
print("Lower-case Letters: %3u" % (lower_case_chars))
print("Other Letters: %3u" % (total_chars - (upper_case_chars + lower_case_chars)))
Note that you should modify the code to handle end-of-line characters yourself; currently they're counted as "other". I've also left out handling of binary input, where the str() conversion would probably fail.
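If the assignment allows it, a much shorter version (a sketch of my own, not required by the question) reads all of standard input at once and uses the built-in str methods. Note that isupper() and islower() are Unicode-aware, unlike the explicit 'A'..'Z' checks above, so the counts can differ for non-ASCII input:

#! /usr/bin/env python3
import sys

text = sys.stdin.read()
upper_case_chars = sum(1 for c in text if c.isupper())
lower_case_chars = sum(1 for c in text if c.islower())
# Everything else (digits, punctuation, whitespace, newlines) counts as "other".
other_chars = len(text) - upper_case_chars - lower_case_chars

print("Upper-case Letters: %3u" % upper_case_chars)
print("Lower-case Letters: %3u" % lower_case_chars)
print("Other Letters: %3u" % other_chars)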

How is the full substring different from using .text()?

I'm failing to see how taking the full substring is different from just using .text()?
This is a snippet from a larger piece of code that I'm trying, and failing, to understand:
$(this).text().substring(0, ($(this).text().length - 1))
Substring takes a portion of the full text/string, but in this case it is taking the whole string, correct?
No: here substring returns the first n-1 characters of an n-character string, i.e. everything except the last character.
x = "hello";
>>> "hello"
x.substring(0, x.length - 1)
>>> "hell"
From the MDN documentation linked:
substring extracts characters from indexA up to but not including indexB. In particular:
If indexA equals indexB, substring returns an empty string.
If indexB is omitted, substring extracts characters to the end of the string.
If either argument is less than 0 or is NaN, it is treated as if it were 0.
If either argument is greater than stringName.length, it is treated as if it were stringName.length.

Comparing characters in Rebol 3

I am trying to compare characters to see if they match. I can't figure out why it doesn't work. I'm expecting true on the output, but I'm getting false.
character: "a"
word: "aardvark"
(first word) = character ; expecting true, getting false
So "a" in Rebol is not a character, it is actually a string.
A single Unicode character is its own independent type, with its own literal syntax, e.g. #"a". For example, it can be converted back and forth from INTEGER! to get a code point, which the single-letter string "a" cannot:
>> to integer! #"a"
== 97
>> to integer! "a"
** Script error: cannot MAKE/TO integer! from: "a"
** Where: to
** Near: to integer! "a"
A string is not a series of one-character STRING!s, it's a series of CHAR!. So what you want is therefore:
character: #"a"
word: "aardvark"
(first word) = character ;-- true!
(Note: Interestingly, binary conversions of both a single character string and that character will be equivalent:
>> to binary! "μ"
== #{CEBC}
>> to binary! #"μ"
== #{CEBC}
...those are UTF-8 byte representations.)
For cases like this, when things start to behave differently than you expected, I recommend using things like probe and type?. They will help you get a sense of what's going on, and you can try small pieces of code in the interactive Rebol console.
For instance:
>> character: "a"
>> word: "aardvark"
>> type? first word
== char!
>> type? character
== string!
So you can indeed see that the first element of word is the character #"a", while your character is the string! "a". (Although I agree with @HostileFork that, to a human, comparing a length-1 string and a single character amounts to the same thing.)
Other places where you can test things are http://tryrebol.esperconsultancy.nl or the chat room with RebolBot.