Filtering out all non-kanji characters in a text with Python 3 - character

I have a text in which there are latin letters and japanese characters (hiragana, katakana & kanji).
I want to filter out all latin characters, hiragana and katakana but I am not sure how to do this in an elegant way.
My direct approach would be to just filter out every single letter of the latin alphabet in addition to every single hiragana/katakana but I am sure there is a better way.
I am guessing that I have to use regex but I am not quite sure how to go about it. Are letters somehow classified in roman letters, japanese, chinese etc.
If yes, could I somehow use this?
Here some sample text:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
The program should only return the kanjis (chinese character) like this:
`私、人,方,皆`

I found the answer thanks to Olsgaarddk on reddit.
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
katakana_half_width = r'[⦅-゚]'
alphanum_full = r'[!-~]'
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
''' extracts and returns all texts from a unicode block from string argument.
Note that you must use the unicode blocks defined above, or patterns of similar form '''
return re.findall( unicode_block, string)
def remove_unicode_block(unicode_block, string):
''' removes all chaacters from a unicode block and returns all remaining texts from string argument.
Note that you must use the unicode blocks defined above, or patterns of similar form '''
return re.sub( unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))

Related

Are NFC normalization boundaries also extended grapheme cluster boundaries?

This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location, and make sure that the resulting text is also in NFC. You also want to move the cursor on the first grapheme boundary that immediately follows the inserted text.
Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A
And you want to insert a 2-codepoints string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. Then the result will be:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes. So the following sample Python code would work:
import unicodedata
def insert(text, cursor_pos, text_to_insert):
new_text = text[:cursor_pos] + text_to_insert
new_text = unicodedata.normalize("NFC", new_text)
new_cursor_pos = len(new_text)
new_text += text[cursor_pos:]
if new_cursor_pos == 0:
# grapheme_break_after is a function that
# returns the offset of the first grapheme
# boundary after the given index
new_cursor_pos = grapheme_break_after(new_text, 0)
return new_text, new_cursor_pos
But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact with what precedes it during normalization, such that NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text) is always true?
Update
#nwellnhof's excellent analysis below motivated me to investigate things
further. So I followed the "When in doubt, use brute force" mantra and wrote a
small script that parses grapheme break properties and examines each code point
that can appear at the beginning of a grapheme, to test whether it can
possibly interact with preceding code points during normalization. Here's the
script:
from urllib.request import urlopen
import icu, unicodedata
URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"
break_props = {}
with urlopen(URL) as f:
for line in f:
line = line.decode()
p = line.find("#")
if p >= 0:
line = line[:p]
line = line.strip()
if not line:
continue
fields = [x.strip() for x in line.split(";")]
codes = [int(x, 16) for x in fields[0].split("..")]
if len(codes) == 2:
start, end = codes
else:
assert(len(codes) == 1)
start, end = codes[0], codes[0]
category = fields[1]
break_props.setdefault(category, []).extend(range(start, end + 1))
# The only code points that can't appear at the beginning of a grapheme boundary
# are those that appear in the following categories. See the regexps in
# UAX #29 Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark") for c in break_props[name])
nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
if c in to_ignore:
continue
if not nfc.hasBoundaryBefore(chr(c)):
print("U+%04X %s" % (c, unicodedata.name(chr(c))))
Looking at the output, it appears that there are about 40 code points that are
grapheme starters but still compose with preceding code points in NFC.
Basically, they are non-precomposed Hangul syllables of type V
(U+1161..U+1175) and T (U+11A8..U+11C2). Things makes sense when you examine
the regular expressions in UAX #29, Table
1c together with what
the standard says about Jamo composition (section 3.12, p. 147 of the version
13 of the standard).
The gist of it is that Hangul sequences of the form {L, V} can compose to a
Hangul syllable of type LV, and similarly sequences of the form {LV, T} can
compose to a syllable of type LVT.
To sum up, and assuming I'm not mistaken, the above Python code could
be corrected as follows:
import unicodedata
import icu # pip3 install icu
def insert(text, cursor_pos, text_to_insert):
new_text = text[:cursor_pos] + text_to_insert
new_text = unicodedata.normalize("NFC", new_text)
new_cursor_pos = len(new_text)
new_text += text[cursor_pos:]
new_text = unicodedata.normalize("NFC", new_text)
break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
break_iter.setText(new_text)
if new_cursor_pos == 0:
# Move the cursor to the first grapheme boundary > 0.
new_cursor_pos = breakIter.nextBoundary()
elif new_cursor_pos > len(new_text):
new_cursor_pos = len(new_text)
elif not break_iter.isBoundary(new_cursor_pos):
# isBoundary() moves the cursor on the first boundary >= the given
# position.
new_cursor_pos = break_iter.current()
return new_text, new_cursor_pos
The (possibly) pointless test new_cursor_pos > len(new_text) is there to
catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can
actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have
a set a three code points A, B and C and two precomposed forms A+B and
A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
If this case doesn't occur in practice (which is very likely, especially with
"real" texts), then the above Python code will necessarily locate the first
grapheme boundary after the end of the inserted text. Otherwise, it will merely
locate some grapheme boundary after the inserted text, but not necessarily the
first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave
my investigation at that for now.
As mentioned in my comment, the actual boundaries can differ slightly. But AFAICS, there should be no meaningful interaction. UAX #29 states:
6.1 Normalization
[...] the grapheme cluster boundary specification has the following features:
There is never a break within a sequence of nonspacing marks.
There is never a break between a base character and subsequent nonspacing marks.
This only mentions nonspacing marks. But with extended grapheme clusters (as opposed to legacy ones), I'm pretty sure these statements also apply to "non-starter" spacing marks[1]. This would cover all normalization non-starters (which must be either nonspacing (Mn) or spacing (Mc) marks). So there's never an extended grapheme cluster boundary before a non-starter[2] which should give you the guarantee you need.
Note that it's possible to have multiple runs of starters and non-starters ("normalization boundaries") within a single grapheme cluster, for example with U+034F COMBINING GRAPHEME JOINER.
[1] Some spacing marks are excluded, but these should all be starters.
[2] Except at the start of text.

Extended Grapheme Clusters stop combining

I am having one question with the Extended Grapheme Clusters.
For example, look at following code:
let message = "c\u{0327}a va bien" // => "ça va bien"
How does Swift know it needs to be combined (i.e. ç) rather than treating it as a small letter c AND a "COMBINING CEDILLA"?
Use the unicodeScalars view on the string:
let message1 = "c\u{0327}".decomposedStringWithCanonicalMapping
for scalar in message1.unicodeScalars {
print(scalar) // print c and Combining Cedilla separately
}
let message2 = "c\u{0327}".precomposedStringWithCanonicalMapping
for scalar in message2.unicodeScalars {
print(scalar) // print Latin Small Letter C with Cedilla
}
Note that not all composite characters have a precomposed form, as noted by Apple's Technical Q&A:
Important: Do not convert to precomposed Unicode in an attempt to simplify your text processing. Precomposed Unicode can still contain composite characters. For example, there is no precomposed equivalent of U+0065 U+030A (LATIN SMALL LETTER E followed by COMBINING RING ABOVE)

Created unicode & unicode without whitespace generators in ScalaCheck

During testing we want to qualify unicode characters, sometimes with wide ranges and sometimes more narrow. I've created a few specific generators:
// Generate a wide varying of Unicode strings with all legal characters (21-40 characters):
val latinUnicodeCharacter = Gen.choose('\u0041', '\u01B5').filter(Character.isDefined)
// Generate latin Unicode strings with all legal characters (21-40 characters):
val latinUnicodeGenerator: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter))
}
// Generate latin unicode strings without whitespace (21-40 characters): !! COMES UP SHORT...
val latinUnicodeGeneratorNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter)).map(_.replaceAll("[\\p{Z}\\p{C}]", ""))
}
The latinUnicodeCharacter generator picks from characters ranging from standard latin ("A," "B," etc.) up to higher order latin character (Germanic/Nordic and others). This is good for testing latin-based character input for, say, names.
The latinUnicodeGenerator creates strings of 21-40 characters in length. These strings include horizontal space (not just a space character but other "horizontal space").
The final example, latinUnicodeGeneratorNoWhitespace, is used for say email addresses. We want the latin characters but we don't want spaces, control codes, and the like. The problem: Because I'm mapping the final result String and filtering out the control characters, the String shrinks and I end up with a total length that is less than 21 characters (sometimes).
So the question is: How can I implement latinUnicodeGeneratorNoWhitespace but do it inside the generator in such a way that I always get 21-40 character strings?
You could do this by putting together a sequence of your non-whitespace characters, another of whitespace, and then picking from either only the non-whitespace, or from both together:
import org.scalacheck.Gen
val myChars = ('A' to 'Z') ++ ('a' to 'z')
val ws = Seq(' ', '\t')
val myCharsGenNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars))
}
val myCharsGen: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars ++ ws))
}
I would suggest considering what you're really testing for, though—the more you restrict the test cases, the less you're checking about how your program will behave on unexpected inputs.

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through datastructures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As #MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä for example). See #MartinR's answer here for more info.
If you need to check if a character is in between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to an unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
println("yep")
} else {
println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
println("yep")
} else {
println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7 year old posts? Fun I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the User perspective, which is a sequence of grapheme clusters, each (usually) which a visually separable appearance, or from the encoding perspective, which can be one of several (UTF32, UTF16, UTF8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design the encoding of ASCII values 0-127 will have an identical encoding in UTF-8, so loading that stream with a UTF8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé#¶œŎO!##"
foo.forEach { c in
print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
+ ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
# is ascii with value 64; # is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
# is ascii with value 64; # is not a letter
# is ascii with value 35; # is not a letter

How to convert Unicode characters to escape codes

So, I have a bunch of strings like this: {\b\cf12 よろてそ } . I'm thinking I could iterate over each character and replace any unicode (Edit: Anything where AscW(char) > 127 or < 0) with a unicode escape code (\u###). However, I'm not sure how to programmatically do so. Any suggestions?
Clarification:
I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the rtf text.
You can simply use the AscW() function to get the correct value:-
sRTF = "\u" & CStr(AscW(char))
Note unlike other escapes for unicode, RTF uses the decimal signed short int (2 bytes) representation for a unicode character. Which makes the conversion in VB6 really quite easy.
Edit
As MarkJ points out in a comment you would only do this for characters outside of 0-127 but then you would also need to give some other characters inside the 0-127 range special handling as well.
Another more roundabout way, would be to add the MSScript.OCX to the project and interface with VBScript's Escape function. For example
Sub main()
Dim s As String
s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
Debug.Print MyEscape(s)
End Sub
Function MyEscape(s As String) As String
Dim scr As Object
Set scr = CreateObject("MSScriptControl.ScriptControl")
scr.Language = "VBScript"
scr.Reset
MyEscape = scr.eval("escape(" & dq(s) & ")")
End Function
Function dq(s)
dq = Chr$(34) & s & Chr$(34)
End Function
The Main routine passes in the original Japanese characters and the debug output says:
%u3088%u308D%u3066%u305D
HTH