Delete greek letters in string - postgresql

I am pre-processing dirty text fields. I have managed to delete single characters and numbers, but there are still Greek letters (from formulas) that I want to delete altogether.
Any type of Greek letter can occur at any position in the string.
Any ideas how to do it?
select regexp_replace(' ω ω α ω alkanediylbis alkylimino bis alkanolpolyethoxylate the formula where straight branched chain alkylene group also known alkanediyl group that has the range carbon atoms and least carbon atoms length and can the same different and are primary alkyl groups which contain carbon atoms each and can the same different and are alkylene groups which contain the range from carbon atoms each and and are the same different numerals the range each ', '\W+', '')

[Α-Ωα-ω] will match the standard Greek alphabet. (Note that the Α here is a distinct character from the Latin A, though they probably look identical).
Some commonly-used symbols are outside of the standard alphabet, so at the very least, you probably want to match the whole Greek Unicode block using [\u0370-\u03FF].
Unicode also has:
The Greek Extended block containing letters with diacritics
The Coptic block with some very similar-looking characters
The Mathematical Operators block with its own ∆/∏/∑ symbols
Several copies of the Greek alphabet in the Mathematical Alphanumeric Symbols block
...and probably more.
Rather than trying to list everything you want to replace, it might be easier to list the things you want to keep. For example, to remove everything outside the printable ASCII range:
select regexp_replace(
'ABCΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω123',
'[^\u0020-\u007E]', '', 'g'
);
regexp_replace
----------------
ABC123
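For comparison outside the database, the two strategies above (remove the Greek blocks, or keep only printable ASCII) can be sketched in Python. This is a minimal sketch, not part of the original answer: the names strip_greek and keep_printable_ascii are hypothetical, and it assumes that the Greek and Coptic block plus the Greek Extended block are enough for your data. The mathematical copies of the alphabet are deliberately not covered.

```python
import re

# Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF).
# Assumption: these two blocks are sufficient; symbols such as U+2211
# (N-ARY SUMMATION) live elsewhere and are left untouched.
GREEK = re.compile(r'[\u0370-\u03FF\u1F00-\u1FFF]')

# Everything outside the printable ASCII range U+0020-U+007E.
NON_PRINTABLE_ASCII = re.compile(r'[^\u0020-\u007E]')


def strip_greek(text):
    """Remove characters from the two Greek blocks only."""
    return GREEK.sub('', text)


def keep_printable_ascii(text):
    """Remove everything that is not printable ASCII."""
    return NON_PRINTABLE_ASCII.sub('', text)
```

The first function is the blacklist approach and the second is the whitelist approach; the whitelist is stricter and will also drop accented Latin letters.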

Related

What is the difference between "combining characters" and "modifier letters"?

In the Unicode standard, there are diacritical marks, such as U+0302, COMBINING CIRCUMFLEX ACCENT (◌̂), and U+02C6, MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ). I know that combining characters are combined with the previous letter to, say, make a letter like "ô", but what are modifier letters used for? Is it just a printable representation of the combining character, and if so, how is that different from the plain U+005E, CIRCUMFLEX ACCENT (^)?
[I'm not interested in the circumflex itself, but rather this class of characters (there seem to be many of them, as you can see here).]
What is the difference between “combining characters” and “modifier letters”?
Combining characters
Combining characters are always applied against a preceding base character. Here is an example taken from section 5.13 Rendering Nonspacing Marks of The Unicode Standard Version 11.0 – Core Specification, where a sequence of four combining characters is applied to the base character a:
Here's another example. Running this trivial Java code...
System.out.println("Base character: \u0930");
System.out.println("Base with combining characters: \u0930\u0903\u0951");
...yielded this output:
In this case the output was wider than the base character; one of the combining characters was placed above the base character, and the other was placed to the right of the base character.
I've provided both examples as screen shots because it can be difficult to find a font to render the resulting glyphs correctly.
Modifying Letters
In contrast to combining characters, modifying letters are freestanding. While they also usually modify another character (normally but not necessarily the preceding character) they are base characters themselves, and visually distinct. To use your example, here is the output from a Java application printing the base character A followed by U+0302, COMBINING CIRCUMFLEX ACCENT (◌̂) and U+02C6, MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ) respectively:
A 0302: Â
A 02C6: Aˆ
The MODIFIER LETTER CIRCUMFLEX ACCENT is rendered to the right of the A whereas the COMBINING CIRCUMFLEX ACCENT is rendered above it.
The actual meaning (semantics) of the circumflex character as a modifying letter is context driven. For example, in French, the circumflex on the o in côté affects its pronunciation, but the circumflex on the u in sûr does not; instead it is used to visually distinguish sûr (meaning sure) from the identically pronounced sur (meaning on). In French a circumflex on o always affects pronunciation, and on u it never does.
Is it just a printable representation of the combining character...
No - the modifying letter carries meaning. In the case of the French circumflex that meaning may be context driven based on the letter it modifies, as described above. But the meaning can be contained within the modifying letter itself. For example:
Modifier letters are commonly used in technical phonetic transcriptional systems, where they augment the use of combining marks to make phonetic distinctions. Some of them have been adapted into regular language orthographies as well. For example, U+02BB MODIFIER LETTER TURNED COMMA is used to represent the 'okina (glottal stop) in the orthography for Hawaiian.
That example also shows that a modifying letter need not be associated with any other character. That is never the case with combining characters.
Also note that a modifier letter is not necessarily a letter in any alphabet, and the majority of modifier letters are actually symbols (e.g. the circumflex).
How is that different from the plain U+005E, CIRCUMFLEX ACCENT (^)?
That is simply the character used to represent a circumflex accent. Unlike combining characters and modifier letters, it cannot be semantically or visually associated with any other character.
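The distinction between the three circumflex characters can also be checked programmatically. The following is a small sketch using Python's standard unicodedata module (Python is my choice here, the answer itself used Java): it reads each character's General Category and canonical combining class, and shows that only the combining mark composes with a preceding o under NFC.

```python
import unicodedata

# General Category and canonical combining class of the three circumflexes.
# Mn = nonspacing mark, Lm = modifier letter, Sk = modifier symbol.
info = {}
for ch in ('\u0302', '\u02c6', '\u005e'):
    info[ch] = (unicodedata.category(ch), unicodedata.combining(ch))
    print(f'U+{ord(ch):04X}', unicodedata.name(ch), *info[ch])

# Only the combining mark merges with the base letter under NFC.
composed = unicodedata.normalize('NFC', 'o\u0302')
with_modifier = unicodedata.normalize('NFC', 'o\u02c6')
with_ascii = unicodedata.normalize('NFC', 'o\u005e')
```

Here composed is the single character U+00F4, while with_modifier and with_ascii remain two separate characters.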
See the following sections in The Unicode® Standard Version 11.0 – Core Specification for lots more detail:
7.8 Modifier Letters
7.9 Combining Marks
Modifier letters don't combine. They are semantically used as a modifier, unlike the plain equivalents like U+005E.
https://www.unicode.org/versions/Unicode11.0.0/ch07.pdf#G15832
7.8 Modifier Letters
Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written
adjacent to other letters and which modify their usage in some way.
They are not formally combining marks (gc=Mn or gc=Mc) and do not
graphically combine with the base letter that they modify. They are
base characters in their own right. The sense in which they modify
other letters is more a matter of their semantics in usage; they often
tend to function as if they were diacritics, indicating a change in
pronunciation of a letter, or otherwise distinguishing a letter’s use.
Typically this diacritic modification applies to the character
preceding the modifier letter, but modifier letters may sometimes
modify a following character. Occasionally a modifier letter may
simply stand alone representing its own sound.
Example of five U+0302 vs. U+02C6 vs. U+005E:
ô̂̂̂̂
oˆˆˆˆˆo^^^^^

How to compare the same character with multiple code points? [duplicate]

Is there a standard way, in Python, to normalize a Unicode string, so that it only contains the simplest Unicode entities that can be used to represent it?
I mean, something which would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE']?
See where is the problem:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "á"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the chars and do manual replacements, etc., but it is not efficient, and I'm pretty sure I would miss half of the special cases and make mistakes.
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A - U+0301 COMBINING ACUTE ACCENT combination and U+00E1 LATIN SMALL LETTER A WITH ACUTE code points you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters (a base character followed by combining characters).
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
Note that composition and decomposition do not always round-trip: normalizing a precomposed character to NFD form, then converting the result back to NFC form does not always result in the original character. The Unicode standard maintains a list of exceptions; characters on this list have a canonical decomposition, but are excluded from being recomposed, for various reasons. Also see the documentation on the Composition Exclusion Table.
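A concrete case of a composition exclusion, as a minimal sketch: U+0958 DEVANAGARI LETTER QA is on the exclusion list (the choice of this particular character is mine, not the answer's), so it decomposes under NFD but NFC does not put it back together. An ordinary character such as U+00E1 round-trips as expected.

```python
import unicodedata

qa = '\u0958'  # DEVANAGARI LETTER QA, a composition exclusion

# NFD splits it into KA + NUKTA ...
decomposed = unicodedata.normalize('NFD', qa)
# ... and NFC leaves that sequence decomposed instead of restoring U+0958.
recomposed = unicodedata.normalize('NFC', decomposed)

# For comparison, a character that is not excluded survives the round trip.
a_acute = '\u00e1'
round_trip = unicodedata.normalize('NFC', unicodedata.normalize('NFD', a_acute))
```

This is why comparisons should normalize both sides to the same form rather than assume a string is already in its "shortest" representation.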
Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.
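For the comparison asked about in the title, a minimal sketch is to normalize both strings to the same form before comparing. The helper name same_text is hypothetical; which form to pick is an assumption that depends on whether compatibility characters should count as equal.

```python
import unicodedata


def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

With the default NFC, the precomposed and decomposed spellings of á compare equal, while U+2167 ROMAN NUMERAL EIGHT and the letters VIII only compare equal under NFKC or NFKD.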

Replacing non-Latin chars with Latin counterparts like Phonebook App

I have a phonebook app where I generate the title for section headers by comparing the first letters of entries.
The indexes are predefined so I expect letters to be assigned from A-Z and for numbers #.
The problem is there are many letters with accents including ü, İ, ç etc. in many languages. In my approach, since these chars do not fall under the range A-Z, they are assigned to #, which is not desired.
The native iOS Phonebook app assigns for example ü to U and so on. Is there a simple way to make this casting without defining a set of chars?
Thanks.
Check out Unicode Normalization. You probably want some combination of NFD and extraction of the adequate data. If you look at this file from Unicode, you will see something like
00E9;LATIN SMALL LETTER E ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9
Where 00E9, i.e. 'é', is decomposed as 0065 0301. You pick up 0065 (e), and discard 0301 (´). This file should get you started nicely. There may be equivalent functions in Objective-C/iOS, but I wouldn't know where to start...
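The decompose-and-keep-the-base-letter idea can be sketched as follows. This is in Python rather than Objective-C, purely to illustrate the logic; the function name section_key is hypothetical, and it assumes the index is exactly A-Z plus #.

```python
import unicodedata


def section_key(name):
    """Return the A-Z section letter for a name, or '#' if there is none."""
    # NFD puts the base letter first and any accents after it.
    decomposed = unicodedata.normalize('NFD', name.strip())
    if not decomposed:
        return '#'
    first = decomposed[0].upper()
    # upper() can yield more than one character (e.g. for U+00DF),
    # so accept only a single ASCII capital letter.
    if len(first) == 1 and 'A' <= first <= 'Z':
        return first
    return '#'
```

Letters that have no canonical decomposition, such as ø, still fall through to #; handling those would need an explicit mapping or a locale-aware collation library.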

How to enumerate all Unicode canonically equivalent sequences in Perl?

Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?
For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:
0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD
(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)
Is this an XY problem? If you want to compare/match 2 unicode strings and you're worried that different ways of encoding the accented characters would create false negatives, then the best way to do this would be to normalize the 2 strings using one of the normalization functions from Unicode::Normalize, before doing the comparison or match.
Otherwise it gets a little messy.
You could get the complete character name using charnames::viacode(0x1EAD); (for U+1EAD it would be LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW), and get the various composing characters by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character + modifiers and the other modifiers. At this point you will run into the problem of matching the combining characters names in the full name (eg CIRCUMFLEX) with the combining character real name (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
This would be my naive attempt, there may be better ways of doing this, but since so far no one has volunteered the information...
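An alternative to parsing character names is to search outward from the input using the Unicode data itself. The sketch below is in Python rather than Perl and is only a brute-force illustration, not a standard module: the function names are hypothetical, and I have not verified that it is complete for every input (Hangul syllables, for instance, are not decomposed by unicodedata.decomposition). It repeatedly decomposes one character, swaps two adjacent combining marks, or composes an adjacent pair, and keeps every candidate whose NFD form matches the input's NFD form.

```python
import unicodedata
from collections import deque


def _neighbors(s):
    """Yield strings that differ from s by one decompose, swap, or compose step."""
    for i, ch in enumerate(s):
        # One-step canonical decomposition (compatibility mappings start with '<').
        mapping = unicodedata.decomposition(ch)
        if mapping and not mapping.startswith('<'):
            pieces = ''.join(chr(int(cp, 16)) for cp in mapping.split())
            yield s[:i] + pieces + s[i + 1:]
        if i + 1 < len(s):
            nxt = s[i + 1]
            # Swap adjacent combining marks that have different combining classes.
            ccc_a, ccc_b = unicodedata.combining(ch), unicodedata.combining(nxt)
            if ccc_a and ccc_b and ccc_a != ccc_b:
                yield s[:i] + nxt + ch + s[i + 2:]
            # Compose the pair if NFC turns it into a single character.
            pair = unicodedata.normalize('NFC', ch + nxt)
            if len(pair) == 1:
                yield s[:i] + pair + s[i + 2:]


def canonical_equivalents(text):
    """Breadth-first search over strings canonically equivalent to text."""
    target = unicodedata.normalize('NFD', text)
    seen = {text}
    queue = deque([text])
    while queue:
        current = queue.popleft()
        for candidate in _neighbors(current):
            # Keep only candidates that are canonically equivalent to the input.
            if candidate not in seen and unicodedata.normalize('NFD', candidate) == target:
                seen.add(candidate)
                queue.append(candidate)
    return sorted(seen)
```

For U+1EAD this search reaches the five sequences listed in the question.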

What is the range of Unicode Printable Characters?

Can anybody please tell me what is the range of Unicode printable characters? [e.g. Ascii printable character range is \u0020 - \u007f]
See, http://en.wikipedia.org/wiki/Unicode_control_characters
You might want to look especially at C0 and C1 control character http://en.wikipedia.org/wiki/C0_and_C1_control_codes
According to the wiki, the C0 control characters are in the range U+0000—U+001F plus U+007F (the same range as in ASCII), and the C1 control characters are in the range U+0080—U+009F.
Other than the C0 and C1 control characters, Unicode also has hundreds of formatting control characters, e.g. the zero-width non-joiner, which prevents adjacent characters from being joined (as a ligature or cursive connection), or the bidirectional text controls. These formatting control characters are rather scattered.
More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solve your problem.
This is an old question, but it is still valid and I think there is more to usefully, but briefly, say on the subject than is covered by existing answers.
Unicode
Unicode defines properties for characters.
One of these properties is "General Category" which has Major classes and subclasses. The Major classes are Letter, Mark, Punctuation, Symbol, Separator, and Other.
By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.
You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.
Programming Language support
Some programming languages assist with this problem.
For example, the Go language has a "unicode" package which provides many useful Unicode-related functions including these two:
func IsGraphic(r rune) bool
IsGraphic reports whether the rune is defined as a Graphic by Unicode. Such
characters include letters, marks, numbers, punctuation, symbols, and spaces,
from categories L, M, N, P, S, Zs.
func IsPrint(r rune) bool
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
Notice that it says "defined as printable by Go", not "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
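The same category-based definitions can be approximated in Python with the standard unicodedata module. This is a sketch modelled on the Go documentation quoted above, with hypothetical function names; results may differ from Go for individual characters if the two runtimes ship different Unicode versions.

```python
import unicodedata


def is_graphic(ch):
    """Letters, marks, numbers, punctuation, symbols, and space separators (Zs)."""
    category = unicodedata.category(ch)
    return category[0] in 'LMNPS' or category == 'Zs'


def is_print(ch):
    """Like is_graphic, except the only accepted space is ASCII U+0020."""
    if ch == ' ':
        return True
    return unicodedata.category(ch)[0] in 'LMNPS'
```

Python's built-in str.isprintable() follows a similar rule: it treats the "Other" and "Separator" categories as non-printable, except for the ASCII space.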
Printable
The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.
In particular whether a particular "character" is printable is not always obvious.
Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
Footnotes
ASCII printable character range is \u0020 - \u007f
No it isn't. \u007f is DEL which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL" whose earliest purpose was to command the deletion of a character from some medium (display, file etc).
In fact many 8-bit character sets have many non-consecutive ranges which are non-printable. See for example C0 and C1 controls.
First, you should remove the word 'UTF8' in your question, it's not pertinent (UTF8 is just one of the encodings of Unicode, it's something orthogonal to your question).
Second: the meaning of "printable/non printable" is less clear in Unicode. Perhaps you mean a "graphical character"; and one can even dispute if a space is printable/graphical. The non-graphical characters would consist, basically, of control characters: the range 0x00-0x1f plus some others that are scattered.
Anyway, the vast majority of Unicode characters (more than 100,000) are "graphical". But this certainly does not imply that they are printable in your environment.
It seems to me a bad idea, if you intend to generate a "random printable" unicode string, to try to include all "printable" characters.
What you should do is pick a font, and then generate a list of which Unicode characters have glyphs defined for your font. You can use a font library like freetype to test glyphs (test for FT_Get_Char_Index(...) != 0).
Taking the opposite approach to @HoldOffHunger, it might be easier to list the ranges of non-printable characters, and negate the match to test if a character is printable.
In the style of Regex (so if you wanted printable characters, place a ^):
[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]
Which accounts for things like separator spaces and joiners
Note that unlike their answer, which is a whitelist that ignores all non-Latin languages, this blacklist won't permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes the Non-Latin, Language Supplement blocks as 'printable', even though they contain things like 'zero-width non-joiner').
Be aware though, that if using this or any other solution, for sanitization for example, you may want to do something more nuanced than a blanket replace.
Arguably in that case, non-breaking spaces should change to space, not be removed, and invisible separator should be replaced with comma conditionally.
Then there's invalid character ranges, either [yet] unused or reserved for encoding purposes, and language-specific variation selectors..
NB when using regular expressions, that you enable unicode awareness if it isn't that way by default (for javascript it's via /.../u).
You can tell if you have it correct by attempting to create the regular expression with some multi-byte character ranges.
For example, the above, plus the invalid character range \u{E0100}-\u{E01EF} in javascript:
/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u
Without u, \u{E0100}-\u{E01EF} equates to \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF). When replacing, you should always enable u, even when the regex itself contains no multibyte Unicode, as you might otherwise break surrogate pairs that exist in the text.
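For reference, the same blacklist can be written in Python, where str patterns are matched per code point, so no flag is needed and the supplementary range is spelled with \U escapes. This is a sketch that simply transcribes the ranges given above; the names NON_PRINTABLE and strip_non_printable are hypothetical.

```python
import re

# Same ranges as the JavaScript pattern above. Tab (U+0009) and line feed
# (U+000A) are intentionally not in the list, so they are kept.
NON_PRINTABLE = re.compile(
    r'[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F'
    r'\u2028-\u202F\u205F-\u206F\u3000\uFEFF\U000E0100-\U000E01EF]'
)


def strip_non_printable(text):
    """Remove every character matched by the blacklist."""
    return NON_PRINTABLE.sub('', text)
```

As noted above, a blanket removal is the bluntest option; replacing some of these characters with an ordinary space may be more appropriate for real data.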
What characters are valid?
At present, Unicode is defined as starting from U+0000 and ending at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F and the last block, Supplementary Private Use Area-B, spans U+100000 to U+10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.
Let's break down what's valid/invalid in the Latin block.
The Latin Block: TLDR
If you're interested in filtering out invisible characters, you'll want to filter out:
U+0000 to U+0008: Control
U+000E to U+001F: Device (i.e., Control)
U+007F: Delete (Control)
U+0080 to U+009F: C1 controls (i.e., Control)
The Latin Block: Full Ranges
Here's the Latin block, broken up into smaller sections...
U+0000 to U+0008: Control
U+0009 to U+000D: Whitespace (tab, line feed, vertical tab, form feed, carriage return)
U+000E to U+001F: Device (i.e., Control)
U+0020: Space
U+0021 to U+002F: Symbols
U+0030 to U+0039: Numbers
U+003A to U+0040: Symbols
U+0041 to U+005A: Uppercase Letters
U+005B to U+0060: Symbols
U+0061 to U+007A: Lowercase Letters
U+007B to U+007E: Symbols
U+007F: Delete (Control)
U+0080 to U+009F: C1 controls (i.e., Control)
U+00A0: Non-breaking space (i.e., &nbsp;)
U+00A1 to U+00BF: Symbols.
U+00C0 to U+00FF: Accented characters.
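The control ranges in this part of the code space can be derived from the Unicode data instead of being listed by hand. The sketch below uses Python's unicodedata module to collect every code point from U+0000 to U+00FF whose General Category is Cc; the function name control_ranges is hypothetical. Note that Unicode classifies tab, line feed, and carriage return as Cc as well, so whether to keep them is a separate decision.

```python
import unicodedata


def control_ranges(start=0x00, end=0xFF):
    """Return (first, last) pairs of consecutive code points with category Cc."""
    ranges = []
    for cp in range(start, end + 1):
        if unicodedata.category(chr(cp)) != 'Cc':
            continue
        if ranges and ranges[-1][1] == cp - 1:
            # Extend the current run of control characters.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges
```

Called with the defaults, this yields two runs: U+0000 to U+001F, and U+007F to U+009F.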
The Other Blocks
Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview, see the wikipedia.org page for the full, complete list.
Latin1 & Latin1-Related Blocks
U+0000 to U+007F : Basic Latin
U+0080 to U+00FF : Latin-1 Supplement
U+0100 to U+017F : Latin Extended-A
U+0180 to U+024F : Latin Extended-B
Combinable blocks
U+0250 to U+036F: 3 Blocks.
Non-Latin, Language blocks
U+0370 to U+1C7F: 55 Blocks.
Non-Latin, Language Supplement blocks
U+1C80 to U+209F: 11 Blocks.
Symbol blocks
U+20A0 to U+2BFF: 22 Blocks.
Ancient Language blocks
U+2C00 to U+2C5F: 1 Block (Glagolitic).
Language Extensions blocks
U+2C60 to U+FFEF: 66 Blocks.
Special blocks
U+FFF0 to U+FFFF: 1 Block (Specials).
One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.
I've written such a program and used it to determine there are roughly 467241 printable characters within the first 471859 code points. I've selected this number because it covers all of the first 4 Planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)
I would much like to refine my program to produce the list of ranges, but for now here's what I am working with for anyone who needs immediate answers:
https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9
I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.
The printable ASCII character range, written as decimal integers rather than hex, is 32 to 126.
Unicode, in the strict sense, has no range. Numbers can go infinite.
What you gave is not UTF8 which has 1 byte for ASCII characters.
As for the range, I believe there is no range of printable characters. It always evolves. Check the page I gave above.