I have written a man page in nroff syntax. The text is in English, but I want to make sure that a name containing the character "ö" is displayed correctly (even on a non-UTF-8 system). Is there a way to specify this character in nroff, similar to &ouml; in HTML? Or can I specify the encoding in the file?
GNU troff (groff), which seems to be the de facto standard, accepts the named glyph \[:o] for the character "ö":
http://man7.org/linux/man-pages/man7/groff_char.7.html
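For example, in the man-page source (a minimal sketch, assuming groff; the page name and text are made up):

```nroff
.TH EXAMPLE 1
.SH NAME
example \- a page that mentions August M\[:o]bius
```

groff resolves the named glyph \[:o] to "o with umlaut" on whatever output device is in use, independent of the input encoding of the file.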
I don't know troff but maybe this helps:
accent mark     input     output
acute accent    e\*'      é
grave accent    e\*`      è
circumflex      o\*^      ô
cedilla         c\*,      ç
tilde           n\*~      ñ
question        \*?       ¿
exclamation     \*!       ¡
umlaut          u\*:      ü
digraph s       \*8       ß
hacek           c\*v      č
macron          a\*_      ā
underdot        s\*.      ṣ
o-slash         o\*/      ø
angstrom        a\*o      å
yogh            kni\*3t   kniȝt
Thorn           \*(Th     Þ
thorn           \*(th     þ
Eth             \*(D-     Ð
eth             \*(d-     ð
hooked o        \*q
ae ligature     \*(ae     æ
AE ligature     \*(Ae     Æ
oe ligature     \*(oe     œ
OE ligature     \*(Oe     Œ
These new diacritical marks will not appear or will be placed on the wrong letter if .AM is not at the top of your file. If .AM is at the top of your file, the default -ms accent marks will be placed on the wrong letter. Choose one set or the other and use it consistently.
As an aid in producing text that will format correctly with both nroff and troff, there are some new string definitions that define dashes and quotation marks for each of these two formatting programs. The \*- string will yield two hyphens in nroff, but in troff it will produce an em dash--like this one. The \*Q and \*U strings will produce open and close quotes in troff, but straight double quotes in nroff. (In typesetting, the double quote character is traditionally considered bad form.)
Every programming language has its own interpretation of \n and \r.
Unicode supports multiple characters that can represent a new line.
From the Rust reference:
A whitespace escape is one of the characters U+006E (n), U+0072 (r),
or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or
U+0009 (HT) respectively.
Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.
What about the following?
Next line character (U+0085)
Line separator character (U+2028)
Paragraph separator character (U+2029)
In my opinion, we are missing something like a char.is_new_line().
I looked through the Unicode Character Categories but couldn't find a definition for new-lines.
Do I have to come up with my own definition of what a Unicode new-line character is?
There is considerable practical disagreement between languages like Java, Python, Go, and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement shows up in how their batteries-included regex engines treat a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says), or four (\r, \r, \n, \n, as JavaScript sees it)? Go and Python do not treat \r\n as a single $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.
So the takeaway here is
It is agreed upon that \n is a newline
\r\n may be a single newline
unless \r\n is treated as two newlines
unless \r\n is "some character followed by a newline"
You shall not have any more newlines beside that.
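For comparison, Rust's standard library (outside the regex crate) takes the "\r\n is one terminator" view: str::lines splits on \n and strips at most one trailing \r from each line.

```rust
fn main() {
    // str::lines splits on '\n' and strips at most one trailing '\r',
    // so "\r\r\n\n" yields two lines: "\r" and "".
    let lines: Vec<&str> = "\r\r\n\n".lines().collect();
    assert_eq!(lines, vec!["\r", ""]);
    println!("{:?}", lines);
}
```

So for this input the standard library lands in the "two lines" camp from the question above.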
If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to actually use them, though. After all, we have had the ASCII record separator for a gazillion years and everybody uses \t instead anyway.
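Such a function is short to write. This is a sketch, not a standard API: the name is_newline is made up, and the character set is the one listed in the question (LF, CR, NEL, LS, PS) plus the vertical tab and form feed that Unicode also treats as mandatory breaks.

```rust
/// Hypothetical helper: true for characters commonly treated as
/// mandatory line breaks (LF, VT, FF, CR, NEL, LS, PS).
fn is_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // line feed
        | '\u{000B}' // vertical tab
        | '\u{000C}' // form feed
        | '\u{000D}' // carriage return
        | '\u{0085}' // next line (NEL)
        | '\u{2028}' // line separator
        | '\u{2029}' // paragraph separator
    )
}

fn main() {
    assert!(is_newline('\n'));
    assert!(is_newline('\u{2028}'));
    assert!(!is_newline('\t'));
    println!("all newline checks passed");
}
```

Note that a char-by-char predicate like this cannot express "treat \r\n as one newline"; that decision has to live in the caller.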
Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
The newline character is defined as 0xA in this documentation:
Sample: Rust Playground
// c is our `char`
// note the u8 suffix: only u8 can be cast to char in Rust
if c == 0x0Au8 as char {
    println!("got a newline character")
}
I am trying to get an MFC Static Text control to display an ASCII Unicode character, specifically Omega (Ω). When I use just &#937; the & doesn't display and the rest of the text does. But if I set the 'No Prefix' property of the control to True, then it removes the & and everything after it.
Is this possible to do through a project setting or am i just inputting the string wrong?
Here is what I am using for the string: VDC Resistance (k&#937;), where I want &#937; to come out as the omega symbol.
First of all Ω isn't an ASCII character, it is a Unicode character: GREEK CAPITAL LETTER OMEGA.
&#937; is the HTML escape sequence for an omega, but a static text control doesn't translate HTML escape sequences. If you are entering the text in C/C++ source, then use the C escape sequence L"\u03A9" (3A9 in hex equals 937 in decimal). This assumes that you are building a Unicode application; in an ANSI build it won't work. I'm not sure how you would do it in that case.
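The hex/decimal equivalence the answer relies on is easy to check; this quick sketch uses Rust only because it is compact, not because it has anything to do with MFC:

```rust
fn main() {
    // 0x3A9 in hex is 937 in decimal, so the HTML entity &#937; and the
    // C escape L"\u03A9" name the same character:
    // GREEK CAPITAL LETTER OMEGA.
    assert_eq!(0x3A9, 937);
    assert_eq!('\u{03A9}', 'Ω');
    println!("Ω = U+03A9 = decimal 937");
}
```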
I'm giving a tech talk about Unicode and encoding in my company, in which I'm trying to make the point that strings are always encoded, and developers should never carelessly assume that everything is 0-127 ASCII.
I have numerous examples of problems caused by mis-encoded text, but I didn't find any example of simple English text with numbers that's encoded above Unicode code point 127.
The basic English alphabet is mapped in Unicode to the same numerical value as the plain old ASCII: The range A-Z is mapped to [65-90] (or [0x41-0x5a] in hex), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
Does the English alphabet appear elsewhere in the code charts? I do not mean circumflex letters or other Latin variants, just the plain English alphabet.
CJK characters are generally monospaced in all fonts, since that's how those languages tend to be written.
When mixing CJK and English characters, however, you run into a problem: ASCII characters do not in general have the width of a CJK character. This means that if you use ASCII, you lose the monospaced property - which may not always be desirable.
For this purpose, fullwidth characters (U+FF00–U+FFEF, Wikipedia, Unicode code chart) may be used in place of "regular" characters. These have the property that they have the same width as a single CJK character.
Note, however, that fullwidth characters are virtually never used outside of a CJK context, and even in those contexts, plain ASCII is frequently used as well, when monospacing is considered unimportant.
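The mapping is mechanical: the fullwidth forms of the printable ASCII characters U+0021–U+007E sit at a fixed offset of 0xFEE0, and the ideographic space U+3000 stands in for the ASCII space. A sketch (the function name is made up):

```rust
/// Map printable ASCII (U+0021..U+007E) to its fullwidth form via the
/// fixed offset 0xFEE0; map the ASCII space to the ideographic space.
fn to_fullwidth(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            ' ' => '\u{3000}', // ideographic space
            '!'..='~' => std::char::from_u32(c as u32 + 0xFEE0).unwrap(),
            _ => c, // leave everything else (including CJK) untouched
        })
        .collect()
}

fn main() {
    assert_eq!(to_fullwidth("kHz 42"), "ｋＨｚ　４２");
    println!("{}", to_fullwidth("kHz 42"));
}
```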
Plenty of punctuation and symbols have code point values above U+007F:
“Hello.”
He had been given the comprehensive sixty-four-crayon Crayola box—including the gold and silver crayons—and would not let me look.
x ≠ y
The above examples use:
U+201C and U+201D — smart quotes
U+2014 — em-dash
U+2260 — not equal to
See the Unicode charts for more.
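These code points are easy to verify programmatically; a quick check in Rust:

```rust
fn main() {
    // Each of these looks like plain English punctuation but has a
    // code point above U+007F.
    assert_eq!('“' as u32, 0x201C); // left double quotation mark
    assert_eq!('”' as u32, 0x201D); // right double quotation mark
    assert_eq!('—' as u32, 0x2014); // em dash
    assert_eq!('≠' as u32, 0x2260); // not equal to
    println!("all four characters sit above U+007F");
}
```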
Well, if you just mean a-z and A-Z then no, there are no English characters above 127. But words like fiancé, resumé etc are sometimes spelled like that in English and use codepoints above 127.
Then there are various punctuation signs, currency symbols and so on that are above 127. Not sure if this counts as simple English text.
Is anyone aware of a way to add diacritics from different unicode blocks to say, latin letters (or latin diacritics to say, Devanagari letters)? For instance:
Oै
I tried the zero-width-joiner in between, but it had no effect. Any ideas?
I know, for instance, that the Arabic combining diacritics will work on latin letters, but Hebrew will not. Is this random?
According to the Unicode Standard, Chapter 2, Section 2.11, “All combining characters can be applied to any base character and can, in principle, be used with any script.” So the Latin letter O followed by the Devanagari vowel sign ai U+0948 is permitted. But the standard adds: “This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.”
So it is up to implementations. But there are some “cross-script” diacritics. For example, the acute accent has been unified with the Greek tonos mark, so the Latin letter é and the Greek letter έ, when decomposed, contain the same diacritic U+0301. Moreover, this combining mark can be placed after a Cyrillic letter, and this can be regarded as normal (though relatively rare) usage, so we can expect good implementations to render it properly.
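You can construct such sequences in any language with Unicode strings; whether they render well is the implementation's business. A small Rust illustration:

```rust
fn main() {
    // Latin base + Devanagari vowel sign AI (U+0948): permitted by the
    // standard, but rendering quality is up to the implementation.
    let s = "O\u{0948}";
    assert_eq!(s.chars().count(), 2); // two scalar values...
    // ...that a capable renderer draws as one visual unit.

    // Cross-script reuse: Latin é and Greek έ decompose to the same
    // combining acute accent, U+0301.
    let e_acute = "e\u{0301}";
    assert_eq!(e_acute.chars().nth(1), Some('\u{0301}'));
    println!("combining-mark checks passed");
}
```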
Worked fine for me. I just typed in the characters. Probably depends on the program rendering the text.
Oै
I have a piece of Java code that is checking it is between two unicode characters:
LA(2) >= '\u0003' && LA(2) <= '\u00ff'
I understand that \u0003 represents END OF TEXT and \u00ff is LATIN SMALL LETTER Y WITH DIAERESIS, but what lies between these points? (what is it checking that LA(2) is?)
e.g. is it all Latin characters, or number characters, or characters with accents, all ascii characters, or something else?
It's Latin 1 minus the code points U+0000, U+0001 and U+0002. This includes the usual stuff that can be found on the US keyboard, plenty of control characters (below U+0020 and between U+007F and U+009F) and a few other Latin characters that can be used to write the majority of Western European languages.
The following ranges are declared:
0000 - 007F C0 Controls and Basic Latin
0080 - 00FF C1 Controls and Latin-1 Supplement
To check out which unicode value represents which character, I advise to have a look at one of the following links:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
http://unicode.org/
It's the basic Latin-1 character set minus the first 3 code points.
0x0000 - 0x007F : Basic Latin (128)
0x0080 - 0x00FF : Latin-1 Supplement (128)
The code probably checks whether the character can be output as a single byte char (latin1 encoded).
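The Java condition is straightforward to port; here is the same test sketched in Rust (the function name is made up):

```rust
/// Hypothetical port of the Java check: true when `c` falls in
/// Latin-1 minus U+0000..U+0002, i.e. it fits in a single Latin-1 byte.
fn in_range(c: char) -> bool {
    ('\u{0003}'..='\u{00FF}').contains(&c)
}

fn main() {
    assert!(in_range('A'));
    assert!(in_range('ÿ'));          // U+00FF, the upper bound
    assert!(!in_range('\u{0002}'));  // below the lower bound
    assert!(!in_range('Ω'));         // U+03A9, outside Latin-1
    println!("range checks passed");
}
```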