unicode character for a with e above it - unicode

The letters ä, ö, ü in German were written (e.g., in Gutenberg's Bible) with the respective vowels that had a tiny e printed right above them. Are these characters available in Unicode? They looked something like:
e e e e e e
A O U a o u
If they are not available in Unicode as single glyphs, perhaps they can be "produced" using Unicode control characters? For example, I thought of using Unicode character 1d49 ("MODIFIER LETTER SMALL E"), but the glyph does not appear above the previous vowel, but on its upper-right.

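What you need is not a "control code" but a combining mark: U+0364 COMBINING LATIN SMALL LETTER E, from the Combining Diacritical Marks block. It goes directly after the letter it should sit above.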

Hex 0364 is 868 in decimal, so in Word: type your letter, hold down the ALT key, type 0 8 6 8 on the numeric keypad, and release the ALT key. The e is added above the preceding letter.
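As a rough illustration (a Python sketch, not part of the original answer; the variable names are made up for the example), the mark can be appended to each base vowel and the hex-to-decimal conversion checked. Whether the e really stacks above the letter on screen depends on the font.

```python
import unicodedata

COMBINING_E = "\u0364"  # COMBINING LATIN SMALL LETTER E
MODIFIER_E = "\u1d49"   # MODIFIER LETTER SMALL E (the character tried in the question)

# A combining mark is placed directly after the base letter it modifies.
with_e_above = [base + COMBINING_E for base in "AOUaou"]
print(" ".join(with_e_above))

# U+0364 has a non-zero combining class, so renderers attach it to the
# previous letter. U+1D49 is an ordinary spacing letter (class 0), which is
# why it ends up to the upper right of the vowel instead of above it.
print(unicodedata.name(COMBINING_E), unicodedata.combining(COMBINING_E))
print(unicodedata.name(MODIFIER_E), unicodedata.combining(MODIFIER_E))

# The decimal value typed with the ALT key in Word.
alt_code = int("0364", 16)
print(alt_code)
```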

Related

Delete greek letters in string

I am pre-processing dirty text fields. I have managed to delete single characters and numbers, but there are still Greek letters (from formulas) that I want to delete altogether.
Any type of Greek letter can occur at any position in the string.
Any ideas how to do it?
select regexp_replace(' ω ω α ω alkanediylbis alkylimino bis alkanolpolyethoxylate the formula where straight branched chain alkylene group also known alkanediyl group that has the range carbon atoms and least carbon atoms length and can the same different and are primary alkyl groups which contain carbon atoms each and can the same different and are alkylene groups which contain the range from carbon atoms each and and are the same different numerals the range each ', '\W+', '')
[Α-Ωα-ω] will match the standard Greek alphabet. (Note that the Α here is a distinct character from the Latin A, though they probably look identical).
Some commonly-used symbols are outside of the standard alphabet, so at the very least, you probably want to match the whole Greek Unicode block using [\u0370-\u03FF].
Unicode also has:
- The Greek Extended block containing letters with diacritics
- The Coptic block with some very similar-looking characters
- The Mathematical Operators block with its own ∆/∏/∑ symbols
- Several copies of the Greek alphabet in the Mathematical Alphanumeric Symbols block
...and probably more.
Rather than trying to list everything you want to replace, it might be easier to list the things you want to keep. For example, to remove everything outside the printable ASCII range:
select regexp_replace(
'ABCΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω123',
'[^\u0020-\u007E]', '', 'g'
);
regexp_replace
----------------
ABC123
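The same two approaches can be tried outside the database. The following is a Python sketch using the standard re module rather than PostgreSQL's regexp_replace; the function names are illustrative only. It strips either the Greek and Coptic block or everything outside printable ASCII:

```python
import re

# Greek and Coptic block only (U+0370..U+03FF).
GREEK_BLOCK = re.compile(r"[\u0370-\u03FF]")
# Anything that is not printable ASCII (U+0020..U+007E).
NOT_PRINTABLE_ASCII = re.compile(r"[^\u0020-\u007E]")

def strip_greek_block(text: str) -> str:
    """Blacklist approach: remove characters in the Greek and Coptic block."""
    return GREEK_BLOCK.sub("", text)

def keep_printable_ascii(text: str) -> str:
    """Whitelist approach: remove everything except printable ASCII."""
    return NOT_PRINTABLE_ASCII.sub("", text)

# Greek letters plus U+2211 N-ARY SUMMATION from Mathematical Operators.
sample = "ABC\u0391\u03b1\u03a9\u03c9123 \u2211"
print(strip_greek_block(sample))     # the summation sign is outside the block
print(keep_printable_ascii(sample))  # only ASCII characters remain
```

The summation sign shows why the whitelist tends to be easier to get right: it is not in the Greek block, so the blacklist leaves it in place.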

How can 2 different UTF-8 encoded data generate the same Unicode string?

I have very little knowledge about this, and would like to ask for help. The language I'm using is C#.
I have 2 words which are encoded to a UTF-8 CSV file:
TĂ´̀‰ng
Tổng
I created a test Windows Form with 2 text boxes, and added the following event:
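private void textBox3_TextChanged(object sender, EventArgs e)
{
    // Turn the text into bytes using the system default (ANSI) code page,
    // then interpret those bytes as UTF-8.
    byte[] decodeutf = Encoding.Default.GetBytes(textBox3.Text);
    textBox4.Text = Encoding.UTF8.GetString(decodeutf);
}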
Whether I enter the first or the second word into textBox3, I get the same result in textBox4: Tổng.
When I try to reverse the process by entering Tổng, hoping to get the second word back, I always get the first one instead.
The reverse code is as follows (with another 2 text boxes):
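private void textBox1_TextChanged(object sender, EventArgs e)
{
    // Turn the text into UTF-8 bytes, then interpret those bytes
    // using the system default (ANSI) code page.
    Encoding utf8 = new UTF8Encoding();
    byte[] encodedBytes = utf8.GetBytes(textBox1.Text);
    textBox2.Text = Encoding.Default.GetString(encodedBytes);
}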
The result is only TĂ´̀‰ng.
What is the difference between encoding to UTF-8 and reading back to Unicode as above?
Your Encoding.Default is Windows-1258 (Vietnamese).
'TĂ´̀‰ng' encoded with Windows-1258, then decoded with UTF-8 generates the following five Unicode codepoints:
LATIN CAPITAL LETTER T
LATIN SMALL LETTER O WITH CIRCUMFLEX
COMBINING HOOK ABOVE
LATIN SMALL LETTER N
LATIN SMALL LETTER G
'Tổng' encoded as Windows-1258 and decoded as UTF-8 generates four Unicode codepoints:
LATIN CAPITAL LETTER T
LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
LATIN SMALL LETTER N
LATIN SMALL LETTER G
The difference is that the first is a decomposed form that uses two codepoints for ổ (a base letter plus a combining mark), while the second uses a single precomposed codepoint, which is what Unicode normalization form NFC produces. The two are canonically equivalent and look the same.
Here's a third one, 'Tò‚̀‰ng', that becomes six Unicode characters after the encode/decode used above:
LATIN CAPITAL LETTER T
LATIN SMALL LETTER O
COMBINING CIRCUMFLEX ACCENT
COMBINING HOOK ABOVE
LATIN SMALL LETTER N
LATIN SMALL LETTER G
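To check that the three spellings are the same text in different forms, they can be compared after normalization. This is a Python sketch rather than C#. It assumes Python's cp1258 codec matches the Windows-1258 code page that Encoding.Default resolved to, and the helper names are hypothetical.

```python
import unicodedata

precomposed = "T\u1ed5ng"        # single codepoint for the accented o
partly = "T\u00f4\u0309ng"       # o with circumflex + COMBINING HOOK ABOVE
decomposed = "To\u0302\u0309ng"  # o + COMBINING CIRCUMFLEX ACCENT + COMBINING HOOK ABOVE

# List the codepoints of each spelling.
for text in (partly, precomposed, decomposed):
    print(len(text), [unicodedata.name(ch) for ch in text])

def nfc(text: str) -> str:
    """Normalize to the composed form (NFC)."""
    return unicodedata.normalize("NFC", text)

# Different codepoint sequences, identical after normalization.
same_after_nfc = nfc(partly) == nfc(precomposed) == nfc(decomposed)
print(same_after_nfc)

def misread_as_cp1258(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1258."""
    return text.encode("utf-8").decode("cp1258")

def reread_as_utf8(garbled: str) -> str:
    """Undo the misreading: back to bytes via Windows-1258, then decode as UTF-8."""
    return garbled.encode("cp1258").decode("utf-8")

# Each spelling has its own garbled form, so running the process in reverse
# can only reproduce the garbled form of the spelling that was actually typed.
print(misread_as_cp1258(partly) == misread_as_cp1258(precomposed))
```

In C#, string.Normalize() (which defaults to form C) should serve the same purpose as nfc() here, so normalizing both sides before comparing is one way to treat the spellings as equal.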

Diacritical marks in man pages

I have written a man page in the nroff syntax. The text is in English but I want to make sure that a name containing the character "ö" is displayed correctly (even on a non-UTF-8 system). Is there a way to specify this character in nroff, similar to &ouml; in HTML? Or can I specify the encoding in the file?
GNU troff (groff), which seems to be the de facto standard, accepts the named glyph \[:o] for the character "ö":
http://man7.org/linux/man-pages/man7/groff_char.7.html
I don't know troff but maybe this helps:
accent mark    input      output
acute accent   e\*'       é
grave accent   e\*`       è
circumflex     o\*^       ô
cedilla        c\*,       ç
tilde          n\*~       ñ
question       \*?
exclamation    \*!
umlaut         u\*:       ü
digraph s      \*8        ß
hacek          c\*v
macron         a\*_
underdot       s\*.
o-slash        o\*/       ø
angstrom       a\*o       å
yogh           kni\*3t
Thorn          \*(Th      Þ
thorn          \*(th      þ
Eth            \*(D-      Ð
eth            \*(d-      ð
hooked o       \*q
ae ligature    \*(ae      æ
AE ligature    \*(Ae      Æ
oe ligature    \*(oe
OE ligature    \*(Oe
These new diacritical marks will not appear or will be placed on the wrong letter if .AM is not at the top of your file. If .AM is at the top of your file, the default -ms accent marks will be placed on the wrong letter. Choose one set or the other and use it consistently.
As an aid in producing text that will format correctly with both nroff and troff, there are some new string definitions that define dashes and quotation marks for each of these two formatting programs. The \*- string will yield two hyphens in nroff, but in troff it will produce an em dash--like this one. The \*Q and \*U strings will produce open and close quotes in troff, but straight double quotes in nroff. (In typesetting, the double quote character is traditionally considered bad form.)

Evaluating non-latin characters which are created by a combination of keystrokes

Our app compares an expected string against what the user is typing, and only allows expected characters to appear.
So, given an expected string:
foo
The app evaluates each character like so:
Type f
Is typed char (case & diacritic insensitive) f?
yes? do nothing
no? remove 1 char from string
This works wonderfully for Latin characters. However, as soon as you hit non-Latin characters (such as Thai or Kanji), things turn to custard.
Specifically:
Some non-latin characters appear to be typed by a combination of characters, so
Is typed char (case & diacritic insensitive)ยั?
is always false, because it appears that ยั is typed ย ั
That also causes a problem, because
remove 1 char from string
isn't reliable, because some non-latin chars have a Range length of >1 in Swift (this isn't too hard to fix).
For case & diacritics, æ and œ I just search case and diacritic insensitive (so typing e makes é appear if it is the next expected character, or typing a makes æ appear if it is next expected). This works well, and I was hoping there was a way I could do something similar with [the entire world of non-latin character sets] 😳.
How can I practically compare an expected character (such as ยั) with user intent (ย) in AppKit or with Swift? I don't have the language to express the problem, but basically "Is the character just typed a character that can, when combined with another character, create the expected (composed unicode) character?"

Unicode characters between \u0003 and \u00ff

I have a piece of Java code that checks whether a character lies between two Unicode characters:
LA(2) >= '\u0003' && LA(2) <= '\u00ff'
I understand that \u0003 represents END OF TEXT and \u00ff is LATIN SMALL LETTER Y WITH DIAERESIS, but what lies between these points? (what is it checking that LA(2) is?)
e.g. is it all Latin characters, or number characters, or characters with accents, all ascii characters, or something else?
It's Latin 1 minus the code points U+0000, U+0001 and U+0002. This includes the usual stuff that can be found on the US keyboard, plenty of control characters (below U+0020 and between U+007F and U+009F) and a few other Latin characters that can be used to write the majority of Western European languages.
The following ranges are declared:
0000 - 007F C0 Controls and Basic Latin
0080 - 00FF C1 Controls and Latin-1 Supplement
To check which Unicode value represents which character, I advise having a look at one of the following links:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
http://unicode.org/
It's the basic latin1 character set except the first 3 codes.
0x0000 - 0x007F : Basic Latin (128)
0x0080 - 0x00FF : Latin-1 Supplement (128)
The code probably checks whether the character can be output as a single byte char (latin1 encoded).
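The claims above can be checked with a short script. This is a Python sketch rather than Java, with made-up names; it enumerates the range, counts the control characters in it, and confirms that every member fits in one Latin-1 byte:

```python
import unicodedata

LOW, HIGH = "\u0003", "\u00ff"

def in_range(ch: str) -> bool:
    # Mirrors the Java test: c >= U+0003 && c <= U+00FF
    return LOW <= ch <= HIGH

# Scan a little past U+00FF so the upper bound is actually exercised.
members = [chr(cp) for cp in range(0x0200) if in_range(chr(cp))]

# Control characters have general category "Cc" (the C0 and C1 controls plus DEL).
controls = [ch for ch in members if unicodedata.category(ch) == "Cc"]

# Every character in the range encodes to exactly one byte in Latin-1.
single_byte = all(len(ch.encode("latin-1")) == 1 for ch in members)

print(len(members), len(controls), single_byte)
```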