Are NFC normalization boundaries also extended grapheme cluster boundaries?

This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location and make sure that the resulting text is also in NFC. You also want to move the cursor to the first grapheme boundary that immediately follows the inserted text.
Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A
And you want to insert the two-code-point string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. Then the result will be:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
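This is easy to reproduce with Python's standard unicodedata module (a minimal check mirroring the example above; the cursor sits at offset 3 in the original string):

import unicodedata

text = "be\u0331a"          # b, e, COMBINING MACRON BELOW, a (already NFC)
to_insert = "\u0301\u0307"  # COMBINING ACUTE ACCENT, COMBINING DOT ABOVE (already NFC)
# Normalize the prefix plus the inserted text, then reattach the suffix:
result = unicodedata.normalize("NFC", text[:3] + to_insert) + text[3:]
for c in result:
    print("U+%04X %s" % (ord(c), unicodedata.name(c)))
# U+0062 LATIN SMALL LETTER B
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+0331 COMBINING MACRON BELOW
# U+0307 COMBINING DOT ABOVE
# U+0061 LATIN SMALL LETTER A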
Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes. So the following sample Python code would work:
import unicodedata

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    if new_cursor_pos == 0:
        # grapheme_break_after is a function that returns the offset of
        # the first grapheme boundary after the given index.
        new_cursor_pos = grapheme_break_after(new_text, 0)
    return new_text, new_cursor_pos
But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact with what precedes it during normalization, such that NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text) is always true?
Update
nwellnhof's excellent analysis below motivated me to investigate things
further. So I followed the "When in doubt, use brute force" mantra and wrote a
small script that parses grapheme break properties and examines each code point
that can appear at the beginning of a grapheme, to test whether it can
possibly interact with preceding code points during normalization. Here's the
script:
from urllib.request import urlopen
import icu, unicodedata

URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"

break_props = {}
with urlopen(URL) as f:
    for line in f:
        line = line.decode()
        p = line.find("#")
        if p >= 0:
            line = line[:p]  # strip trailing comment
        line = line.strip()
        if not line:
            continue
        fields = [x.strip() for x in line.split(";")]
        codes = [int(x, 16) for x in fields[0].split("..")]
        if len(codes) == 2:
            start, end = codes
        else:
            assert len(codes) == 1
            start, end = codes[0], codes[0]
        category = fields[1]
        break_props.setdefault(category, []).extend(range(start, end + 1))

# The only code points that can't appear at the beginning of a grapheme
# cluster are those that appear in the following categories. See the
# regexps in UAX #29, Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark")
                for c in break_props[name])

nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
    if c in to_ignore:
        continue
    if not nfc.hasBoundaryBefore(chr(c)):
        print("U+%04X %s" % (c, unicodedata.name(chr(c))))
Looking at the output, it appears that there are about 40 code points that are
grapheme starters but can still compose with preceding code points in NFC.
Basically, they are non-precomposed Hangul syllables of type V
(U+1161..U+1175) and T (U+11A8..U+11C2). Things make sense when you examine
the regular expressions in UAX #29, Table 1c, together with what the standard
says about Jamo composition (section 3.12, p. 147 of version 13.0 of the
standard).
The gist of it is that Hangul sequences of the form {L, V} can compose to a
Hangul syllable of type LV, and similarly sequences of the form {LV, T} can
compose to a syllable of type LVT.
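For example (a quick check with Python's unicodedata, using code points taken from the ranges above):

import unicodedata

# {L, V}: HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A composes to an LV syllable.
lv = unicodedata.normalize("NFC", "\u1100\u1161")
print("U+%04X %s" % (ord(lv), unicodedata.name(lv)))    # U+AC00 HANGUL SYLLABLE GA

# {LV, T}: HANGUL SYLLABLE GA + HANGUL JONGSEONG KIYEOK composes to an LVT syllable.
lvt = unicodedata.normalize("NFC", "\uAC00\u11A8")
print("U+%04X %s" % (ord(lvt), unicodedata.name(lvt)))  # U+AC01 HANGUL SYLLABLE GAG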
To sum up, and assuming I'm not mistaken, the above Python code could
be corrected as follows:
import unicodedata
import icu  # pip3 install PyICU

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    new_text = unicodedata.normalize("NFC", new_text)
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(new_text)
    if new_cursor_pos == 0:
        # Move the cursor to the first grapheme boundary > 0.
        new_cursor_pos = break_iter.following(0)
    elif new_cursor_pos > len(new_text):
        new_cursor_pos = len(new_text)
    elif not break_iter.isBoundary(new_cursor_pos):
        # isBoundary() moves the iterator to the first boundary >= the
        # given position.
        new_cursor_pos = break_iter.current()
    return new_text, new_cursor_pos
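For illustration, running the corrected function on the example from the top of the question (this usage snippet assumes the definitions above) yields the expected position:

text = "be\u0331a"  # b, e, COMBINING MACRON BELOW, a
new_text, pos = insert(text, 3, "\u0301\u0307")
# new_text is b, é, COMBINING MACRON BELOW, COMBINING DOT ABOVE, a
print(pos)  # 4: the grapheme boundary right after the inserted marks, before the 'a'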
The (possibly) pointless test new_cursor_pos > len(new_text) is there to
catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can
actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have
a set of three code points A, B, and C, and two precomposed forms A+B and
A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
If this case doesn't occur in practice (which is very likely, especially with
"real" texts), then the above Python code will necessarily locate the first
grapheme boundary after the end of the inserted text. Otherwise, it will merely
locate some grapheme boundary after the inserted text, but not necessarily the
first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave
my investigation at that for now.

As mentioned in my comment, the actual boundaries can differ slightly. But AFAICS, there should be no meaningful interaction. UAX #29 states:
6.1 Normalization
[...] the grapheme cluster boundary specification has the following features:
There is never a break within a sequence of nonspacing marks.
There is never a break between a base character and subsequent nonspacing marks.
This only mentions nonspacing marks. But with extended grapheme clusters (as opposed to legacy ones), I'm pretty sure these statements also apply to "non-starter" spacing marks[1]. This would cover all normalization non-starters (which must be either nonspacing (Mn) or spacing (Mc) marks). So there's never an extended grapheme cluster boundary before a non-starter[2], which should give you the guarantee you need.
Note that it's possible to have multiple runs of starters and non-starters ("normalization boundaries") within a single grapheme cluster, for example with U+034F COMBINING GRAPHEME JOINER.
[1] Some spacing marks are excluded, but these should all be starters.
[2] Except at the start of text.
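To illustrate that last point, here is a small PyICU check (in the spirit of the script in the question; U+034F has ccc=0, making it a normalization starter, but its Grapheme_Cluster_Break property is Extend):

import icu

s = "a\u034F\u0301"  # a + COMBINING GRAPHEME JOINER + COMBINING ACUTE ACCENT
nfc = icu.Normalizer2.getNFCInstance()
print(nfc.hasBoundaryBefore("\u034F"))  # True: a normalization boundary before the CGJ

break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
break_iter.setText(s)
print(break_iter.following(0))  # 3: yet all three code points form one grapheme cluster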

Related

How can I obtain only words, without any punctuation marks, when I read a text file?

The text file abc.txt is an arbitrary article that has been scraped from the web. For example, it is as follows:
His name is "Donald" and he likes burger. On December 11, he married.
I want to extract only the words (in lower case) and numbers, excluding all kinds of periods and quotes, from the above article. In the case of the above example:
{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}
My code is as follows:
filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"','''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));
Is there any simple code to extract only alphabetic words and numbers, excluding all symbols?
Two steps are necessary as you want to convert certain words to lowercase.
regexprep converts words, which are either at the start of the string or follow a full stop and whitespace, to lower case.
In the regexprep function, we use the following pattern:
(?<=^|\. )([A-Z])
to indicate that:
(?<=^|\. ) We want to assert that before the word of interest either the start of the string (^) or (|) a full stop followed by a space (\. ) is found. This type of construct is called a lookbehind.
([A-Z]) This part of the expression matches and captures (stores the match of) an upper case letter (A-Z).
The ${lower($0)} component of the replacement expression is called a dynamic expression, and converts the contents of the captured group (([A-Z])) to lower case. This syntax is specific to the MATLAB language.
Once the lower case conversions have occurred, regexp finds all occurrences of one or more digits, lower case and upper case letters.
The pattern [a-zA-Z0-9]+ matches lower case letters, upper case letters and digits.
text = fileread('abc.txt')
data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'}
>>data{1}
13×1 cell array
{'his' }
{'name' }
{'is' }
{'Donald' }
{'and' }
{'he' }
{'likes' }
{'burger' }
{'on' }
{'December'}
{'11' }
{'he' }
{'married' }

Extended Grapheme Clusters stop combining

I have a question about extended grapheme clusters.
For example, look at following code:
let message = "c\u{0327}a va bien" // => "ça va bien"
How does Swift know it needs to be combined (i.e. ç) rather than treating it as a small letter c AND a "COMBINING CEDILLA"?
Use the unicodeScalars view on the string:
let message1 = "c\u{0327}".decomposedStringWithCanonicalMapping
for scalar in message1.unicodeScalars {
    print(scalar) // prints c and COMBINING CEDILLA separately
}
let message2 = "c\u{0327}".precomposedStringWithCanonicalMapping
for scalar in message2.unicodeScalars {
    print(scalar) // prints LATIN SMALL LETTER C WITH CEDILLA
}
Note that not all composite characters have a precomposed form, as noted by Apple's Technical Q&A:
Important: Do not convert to precomposed Unicode in an attempt to simplify your text processing. Precomposed Unicode can still contain composite characters. For example, there is no precomposed equivalent of U+0065 U+030A (LATIN SMALL LETTER E followed by COMBINING RING ABOVE)
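That claim is easy to verify with the Python tooling used earlier on this page (a quick cross-check in Python rather than Swift):

import unicodedata

s = unicodedata.normalize("NFC", "e\u030A")  # e + COMBINING RING ABOVE
print(len(s))  # 2: NFC leaves the sequence decomposed; no precomposed form exists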

Matching Unicode punctuation using LPeg

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:
local unicode = require("unicode")
local lpeg = require("lpeg")
local any = lpeg.P(1)  -- any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s, i, a)
    local match = unicode.utf8.match(a, "^%p")
    if match == nil then
        return false
    else
        return i + #match
    end
end)
This appears to work, but it will miss punctuation characters that are a combination of several Unicode code points (if such characters exist); since I am reading only 4 bytes ahead, it probably hurts the performance of the parser; and it is undefined what the library's match function will do when I feed it a string that contains a truncated UTF-8 character (although it appears to work now).
I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.
The correct way to match UTF-8 characters is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes are part of it:
local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function somewhat like you proposed:
local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)
Note that we return i; this is in accordance with what Cmt expects:
The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.
This means we should return the same number the function receives, that is, the position immediately after the UTF-8 character.

Created unicode & unicode without whitespace generators in ScalaCheck

During testing we want to exercise Unicode characters, sometimes from wide ranges and sometimes from narrower ones. I've created a few specific generators:
// Generate a wide variety of latin Unicode characters, restricted to defined code points:
val latinUnicodeCharacter = Gen.choose('\u0041', '\u01B5').filter(Character.isDefined)

// Generate latin Unicode strings with all legal characters (21-40 characters):
val latinUnicodeGenerator: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter))
}

// Generate latin Unicode strings without whitespace (21-40 characters): !! COMES UP SHORT...
val latinUnicodeGeneratorNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter)).map(_.replaceAll("[\\p{Z}\\p{C}]", ""))
}
The latinUnicodeCharacter generator picks from characters ranging from standard latin ("A", "B", etc.) up to higher-order latin characters (Germanic/Nordic and others). This is good for testing latin-based character input for, say, names.
The latinUnicodeGenerator creates strings of 21-40 characters in length. These strings include horizontal space (not just a space character but other "horizontal space").
The final example, latinUnicodeGeneratorNoWhitespace, is used for, say, email addresses. We want the latin characters, but we don't want spaces, control codes, and the like. The problem: because I'm mapping over the final result String and filtering out the control characters, the String shrinks, and I sometimes end up with a total length of less than 21 characters.
So the question is: How can I implement latinUnicodeGeneratorNoWhitespace but do it inside the generator in such a way that I always get 21-40 character strings?
You could do this by putting together a sequence of your non-whitespace characters, another of whitespace, and then picking from either only the non-whitespace, or from both together:
import org.scalacheck.Gen

val myChars = ('A' to 'Z') ++ ('a' to 'z')
val ws = Seq(' ', '\t')

val myCharsGenNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars))
}

val myCharsGen: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars ++ ws))
}
I would suggest considering what you're really testing for, though—the more you restrict the test cases, the less you're checking about how your program will behave on unexpected inputs.

Capitalizing only the first letters without changing any numbers or punctuation

I would like to modify a string so that the first letter of each word is capitalized, all other letters are lower cased, and everything else is unchanged.
I tried this:
function new_string=switchCase(str1)
%str1 represents the given string containing a word or phrase
str1Lower=lower(str1);
spaces=str1Lower==' ';
caps1=[true spaces];
%we want the first letter and the letters after a space to be capitalized.
strNew1=str1Lower;
strNew1(caps1)=strNew1(caps1)-32;
new_string=strNew1;
end
This function works nicely if there is nothing other than a letter after a space. If we have anything else, for example:
str1='WOW ! my ~Code~ Works !!'
Then it gives
new_string =
'Wow My ^code~ Works !'
However, it has to give (according to the requirement),
new_string =
'Wow! My ~code~ Works !'
I found some code that is similar to this problem, but it is ambiguous; here I can ask questions if I don't understand.
Any help will be appreciated! Thanks.
Interesting question +1.
I think the following should fulfil your requirements. I've written it as an example sub-routine and broken down each step so it is obvious what I'm doing. It should be straightforward to condense it into a function from here.
Note, there is probably also a clever way to do this with a single regular expression, but I'm not very good with regular expressions :-) I doubt a regular expression based solution will run much faster than what I've provided (but am happy to be proven wrong).
%# Your example string
Str1 ='WOW ! my ~Code~ Works !!';
%# Convert case to lower
Str1 = lower(Str1);
%# Convert to ascii
Str1 = double(Str1);
%# Find an index of all locations after spaces
I1 = logical([0, (Str1(1:end-1) == 32)]);
%# Eliminate locations that don't contain lower-case characters
I1 = logical(I1 .* ((Str1 >= 97) & (Str1 <= 122)));
%# Check manually if the first location contains a lower-case character
if Str1(1) >= 97 && Str1(1) <= 122; I1(1) = true; end;
%# Adjust all appropriate characters in ascii form
Str1(I1) = Str1(I1) - 32;
%# Convert result back to a string
Str1 = char(Str1);