I am working on an RTF file made by someone else on an unknown platform. Everything is interpreted correctly except a few characters, whichever character set I choose when opening it in OpenOffice. Here is the plain text, after interpretation:
"Même taille que la Terre, même masse, même âgec Vénus a souvent été qualifiée de sœur de la Terre. "
and here is the original ANSI paragraph:
"M\u234\'3fme taille que la Terre, m\u234\'3fme masse, m\u234\'3fme \u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus a souvent \u233\'3ft\u233\'3f qualifi\u233\'3fe de s\u339\'3fur de la Terre."
To zoom in:
"âgec Vénus" becomes "\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
and finally, what we come up with:
"\uc2 \u61825\'ff\'81\uc1 c"
here \uc2 and \uc1 are to say we are going back and forth between 4-bytes and 2-bytes Unicode encoding.
\u61825 is an unknown Unicode character. Indeed, according to the RTF specification, any UTF character greater than 2^15 should be written in a negative form; negative form with ANSI characters should make the "-" (minus) sign visible to the notepad, am I right? So here already I have something I don't understand, how the RTF writer used by the person who made the rtf file in the first place could have done it. Maybe I missed something in the specification, specific versions, character sets, I don't know. If taken as is, 61825 would correspond to F181 which is in a private area of the Unicode table.
And then, the \'ff\'81 would be some use of the ANSI equivalent field of the whole "specific character" group (whose structure is usually \uN\'XX), to code something that would be 4-byte long. And here again, I could not find:
what is the code page (Windows-1252, ISO-8859-1, other?) being referred to (in all the other places in the file where a \uN\'XX sequence appears, XX is always 3F, the Windows-1252 code for "?", so it did not give me much information)
what does the \'FF (which looks like some control character inside an escape sequence!) stand for, and then why \'81... Actually, the translation of \u61825 to hex is F181, not FF81...I am lost here!
Finally, what the translated text (in French) would make us expect is the ":" (colon): "Same size as Earth, same mass, same age: Venus has often been qualified as Earth's sister". It would make sense. But what RTF writer could imagine such a complicated code for the colon?
So again, after 1 hour of search, I open the question to you fellows: does someone recognize this, and could tell me what control word encoding is used, is there a big endian/little endian/2's complement mess here with the 61825, and same with the \'ff\'81, which would assemble as FF81 instead of F181, which itself doesn't mean anything as is...here my question is only to know if there would be a way to find the complete original text back from the bizarre RTF encoding!
what the translated text (in French) would make us expect is the ":" (colon)
Nearly: it should be the ellipsis. You can see the source text e.g. here.
The ellipsis should normally be written simply as three periods, but there has traditionally been a separate character representing ellipsis in order better to control their spacing, back before complex text layout algorithms existed that could do automatic glyph replacement. Consequently there exists a Unicode compatibility character U+2026 HORIZONTAL ELLIPSIS to allow round-tripping to legacy encodings such as Windows code page 1252, where it is byte 133.
That, however, is not what has been encoded in your RTF document. That would be too easy.
61825 is an unknown Unicode character.
It's a Private Use Area character, which means it could represent absolutely anything. Word has exported certain common symbol fonts as PUA characters - see this post for the background.
So someone at some point may have used a symbol font where code unit 129 (the 0x81 in U+F181, 61825) maps to something that looks like an ellipsis. Quite what that font is, I have no idea! It doesn't seem to be one of the usual suspects (Symbol, Wingdings, Webdings). You might just have to manually replace U+F181 with U+2026 for now unless you can find out more about the source.
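To make the mechanics concrete, here is a minimal Python sketch of how a reader might decode the fragment from the question. It assumes, as I read the RTF specification, that \ucN gives the number of fallback bytes to skip after each \uN, and that a negative \uN value has 65536 added to it. The names TOKEN, decode_fragment and PUA_TO_ELLIPSIS are my own, the sketch handles only the four kinds of token that occur in the sample (anything else is silently ignored), and the U+F181 to U+2026 mapping is the guess made above, not something the file itself tells us.

```python
import re

TOKEN = re.compile(
    r"\\uc(\d+) ?"            # \ucN: number of fallback bytes after each \u
    r"|\\u(-?\d+) ?"          # \uN: 16-bit value, possibly written as a negative
    r"|\\'([0-9a-fA-F]{2})"   # \'hh: one byte in the document code page
    r"|([^\\]+)"              # plain text
)

# Assumption from the answer above: the unknown symbol font used 0x81 for an ellipsis.
PUA_TO_ELLIPSIS = {0xF181: "\u2026"}


def decode_fragment(rtf, codepage="cp1252"):
    """Decode a small RTF text fragment, honouring \\ucN fallback skipping."""
    out = []
    uc = 1      # RTF default: one fallback byte follows each \u
    skip = 0    # fallback bytes still to be ignored
    for match in TOKEN.finditer(rtf):
        uc_n, u_n, hex_byte, text = match.groups()
        if uc_n is not None:
            uc = int(uc_n)
        elif u_n is not None:
            n = int(u_n)
            if n < 0:
                n += 65536          # signed 16-bit form, e.g. -3711 means 61825
            out.append(chr(n))
            skip = uc
        elif hex_byte is not None:
            if skip:
                skip -= 1           # this byte is only the ANSI fallback
            else:
                out.append(bytes.fromhex(hex_byte).decode(codepage, errors="replace"))
        else:
            drop = min(skip, len(text))   # plain characters can be fallback too
            skip -= drop
            out.append(text[drop:])
    return "".join(out)


fragment = r"\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
decoded = decode_fragment(fragment)
repaired = decoded.translate(PUA_TO_ELLIPSIS)
```

Under these assumptions the two bytes \'ff\'81 are only the fallback for readers that do not understand \u, so they are skipped and never need to add up to F181. The writer emitting 61825 rather than -3711 is looser than the specification, but a reader that adds 65536 to negative values handles both forms.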
Related
Where can I get a complete list of all Unicode characters that don't behave as simple characters? Examples: character 0x0363 (won't be printed without another one before it), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters so I can replace them with something harmless and avoid unwanted output effects. Regular characters (those not in this list) should use exactly one character cell when printed (cursor moves +1 to the right), should not depend on the previous or next characters, and should not affect the printing style in any way.
Edit because of multiple comments:
I have some Unicode string, usually consisting of "usual" characters like 0x20-0x7E or Cyrillic letters. There are also a lot of other Unicode characters that are usual and may be safely assumed to have strlen() = 1. The string is printed on the terminal and I need to know the resulting position of the cursor. I don't want to use complex and unstable libraries to do that; I want the simplest possible logic. Every problematic character may be replaced with U+FFFD or something like "<U+0363>" (an ASCII string with its code point instead of the character itself). I want a list of "possibly problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not many.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❤️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
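As a rough illustration of the filtering described so far, here is a Python sketch built on the standard unicodedata module. It is deliberately conservative and rests on a few assumptions of mine: the names is_simple and sanitize are made up, the bidi check rejects right-to-left and explicit-formatting classes rather than requiring Left_To_Right (otherwise spaces, digits and punctuation would be rejected too), and Joining_Type is not checked at all because unicodedata does not expose it.

```python
import unicodedata

# Bidi classes treated as "not simple": right-to-left text, Arabic numbers,
# explicit embedding/override/isolate controls, non-spacing marks, boundary neutrals.
REJECTED_BIDI = {
    "R", "AL", "AN", "NSM", "BN",
    "LRE", "LRO", "RLE", "RLO", "PDF",
    "LRI", "RLI", "FSI", "PDI",
}


def is_simple(ch):
    """Return True if ch is likely to occupy exactly one cell on its own."""
    if ord(ch) > 0xFFFF:                      # stay inside the BMP
        return False
    if unicodedata.category(ch)[0] in "MC":   # marks, controls, unassigned, etc.
        return False
    if unicodedata.combining(ch) != 0:        # Canonical_Combining_Class
        return False
    if unicodedata.bidirectional(ch) in REJECTED_BIDI:
        return False
    return True


def sanitize(text):
    """Replace every non-simple character with an ASCII <U+XXXX> marker."""
    return "".join(ch if is_simple(ch) else "<U+%04X>" % ord(ch) for ch in text)
```

This says nothing about glyph width, so it is only a first filter. Checking unicodedata.east_asian_width for the W and F classes might be a reasonable addition if CJK text has to be allowed, since many terminals draw those characters two cells wide.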
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is DŽ, but many fixed-width fonts have a narrow glyph for it: DŽ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But DŽ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).
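A small Python sketch of the two encodings of é and of what NFC does to them, using the standard unicodedata module (the variable names are only for illustration):

```python
import unicodedata

composed = "\u00e9"        # é as a single code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# The two strings look identical when rendered but are different sequences.
same_before = composed == decomposed

# NFC composes the pair into the single code point where one exists.
normalized = unicodedata.normalize("NFC", decomposed)
same_after = normalized == composed
```

Normalizing to NFC before filtering means the decomposed form is not rejected merely because it contains a combining mark.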
I am working in a PostgreSQL database and I have a text column that holds text in various languages such as Russian, Chinese, Korean, English, etc. Although our application handles these languages well, we are having an issue dealing with non-UTF-8 characters.
For example, if you see the image from Notepad++ where I have done Encoding > Encode in UTF-8, it neatly shows all the non-recognizable characters.
However, we are having trouble marking such records as non-processable in Postgres. A simple flag would do. I am trying something like the query below, but it flags the valid Russian records as well, whereas Notepad++ explicitly shows the hidden/non-UTF-8 characters.
Notepad++
The weird thing about these characters is that they do not show up in a regular select query, but when I convert them to "UTF-8", they show up as below.
Database
I tried the query below, but it does not give me the desired output. The expectation is to set a flag on records that contain invalid hidden HTML references without losing valid text, such as the valid Russian sentence in the snapshot. I should be able to distinctly identify only such texts.
select text, text ~ '[^[:ascii:]]', text ~ '^[\x00-\x7F]*$'
from sample_data;
Sample Data -
"Я не наркоман. Это у меня всегда, когда мне афигитительно. А если серьёзно, это интересно,…"
"Ya le dieron amor a la foto de instagram de mi #UberCALAVERITA?"
"Executive Admininstrative Assistant in Toronto, ON for a Group"
"Сегодня валютные стратеги BMO обновили прогнозы по основным валютам на ближайшие пять кварталов (на конец периода): читать далее…"
"Flicitations Gestion d'actifs pour 6 Trophes #FundGradeA+2016 de fonds communs de placement :"
This answer might help you go back to fix problems. It doesn't directly help you to go forward in the direction you are asking about.
Looking at Flicitations and F\302\202licitations, the escapes look like octal, which is possibly a presentation choice of your "IDE" and/or the convert_to function. From octal, \302\202 is 0xC2 0x82, and decoding that as UTF-8 gives U+0082. In Unicode, that's a control character; in ISO 8859-1 it's a non-character. Either might explain why some renderings make it invisible or give it no space.
Now, Google tells me that Flicitations is almost like a French word, Félicitations. So, perhaps there is a character set and encoding where é is encoded as 0x82. Wikipedia helps here—Indeed there is: IBM850, which has been used for some French text.
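The byte-level reasoning above can be checked with a few lines of Python. This is only a sketch under the assumption that the text really was IBM850 (Python codec name "cp850") read with the wrong encoding; the helper names has_c1_controls and repair_cp850 are mine, and the repair is a guess that should be verified against known-good data before being applied to a whole table.

```python
import re

# The octal escapes \302\202 are the two bytes 0xC2 0x82.
raw = bytes([0o302, 0o202])
as_utf8 = raw.decode("utf-8")       # a single code point, U+0082

# In IBM850 the single byte 0x82 is the letter é.
e_acute = b"\x82".decode("cp850")

# C1 control characters, U+0080 to U+009F, rarely occur in genuine text.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def has_c1_controls(text):
    """Flag text that contains a C1 control, a hint that bytes were misread."""
    return C1_CONTROLS.search(text) is not None


def repair_cp850(text):
    """Reinterpret each C1 control as the cp850 byte it may originally have been."""
    return C1_CONTROLS.sub(
        lambda m: bytes([ord(m.group())]).decode("cp850"), text
    )
```

A check like has_c1_controls flags the damaged French record without touching the Russian ones, because Cyrillic letters are ordinary code points far outside U+0080 to U+009F.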
So, it seems that someone has mishandled the user's text, causing data loss. The fundamental rule of text encoding is that text bytes must be read with the same encoding they were written with. Don't guess; ask, or reference a standard, specification, documentation, or convention. Maybe you can go back and find the misbehaving process/code. At least that would prevent future data loss.
"Dealing with non-UTF-8 characters": There aren't really any non-UTF-8 characters. UTF-8 is an encoding of the Unicode character set. There are areas with exceptions but, practically speaking, Unicode has all characters, and UTF-8 can encode them all. So, if you think there are non-UTF-8 characters, the writer is either non-compliant or the reader is using the wrong encoding.
While trying to parse some unicode text strings, I'm hitting an invisible character that I can't find any definition for. If I paste it in to a text editor and show invisibles, I can see that it looks like a bullet point (• alt-8), and by copy/pasting them, I can see it has an effect like a space or tab, but it's none of those.
I need to test for it, something like...
if(uniChar == L'\t')
But of course I need to provide something to match to.
It has bytes 0xc2 0xa0 in UTF-8.
If no-one has a definition, is there any devious way to test for something I can't define!?
(I happen to be using NSStrings in Objective-C, OSX, Xcode, but I don't think that has any bearing.)
Bytes C2 A0 in UTF-8 encode U+00A0 ɴᴏ-ʙʀᴇᴀᴋ sᴘᴀᴄᴇ, which can be used, for example, to display combining marks in isolation. It is available as the named HTML entity &nbsp;. It is almost the same as a U+0020 sᴘᴀᴄᴇ, except it prevents line breaks before or after it, and acts as a numerical separator for bidirectional layout.
The dot you see when you ask a text editor to show invisibles just happens to be what glyph the text editor chose to display spaces. It does not mean the character in question is U+00B7 ᴍɪᴅᴅʟᴇ ᴅᴏᴛ, which is definitely not invisible.
In code, if you have it as a unichar, you can compare it to L'\x00A0'.
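If you want to double-check which character a pair of bytes encodes, a quick Python sketch does it. Python is used here only as a convenient calculator, and is_no_break_space is a made-up helper name; the same comparison against 0x00A0 applies to a unichar.

```python
import unicodedata

# Decode the two bytes from the question as UTF-8.
ch = b"\xc2\xa0".decode("utf-8")

code_point = ord(ch)            # 0xA0
name = unicodedata.name(ch)     # official Unicode character name


def is_no_break_space(c):
    """Test a single character against U+00A0."""
    return c == "\u00a0"
```

Note that generic whitespace tests also match it: in Python, ch.isspace() is true for U+00A0, so a test for "space or tab" has to decide explicitly whether the no-break space should count.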
What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
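For what it's worth, reading the General Category out of that file takes only a few lines. The Python sketch below parses the three sample lines quoted above instead of the real file, treats each First/Last pair as one range, and uses made-up names (SAMPLE, parse_categories, category_of). Returning "Cn" for anything not listed is my assumption for the sketch.

```python
SAMPLE = """\
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_categories(lines):
    """Return a list of (first, last, general_category) tuples."""
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code_point           # remember where the range opens
        elif name.endswith(", Last>"):
            ranges.append((range_start, code_point, category))
            range_start = None
        else:
            ranges.append((code_point, code_point, category))
    return ranges


def category_of(code_point, ranges):
    """Look up the General Category, defaulting to Cn (unassigned)."""
    for first, last, category in ranges:
        if first <= code_point <= last:
            return category
    return "Cn"


RANGES = parse_categories(SAMPLE.splitlines())
```

A whitelist is then just a set of accepted category values, for example everything starting with L, N, P or Z, checked with category_of for each input character.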
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski
Does someone know an easy way to find characters in Unicode that are similar to ASCII characters? An example is the "CYRILLIC SMALL LETTER DZE (ѕ)". I'd like to do a search and replace for similar characters. By similar I mean similar to a human reader: you can't see a difference by looking at it.
As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)
If I were you, to spare you the tedious work of assembling a list of characters yourself, I'd search for resources on homograph attacks: this is a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also (and that may be what you most need) a "confusables" table. Here's another article, mainly on punctuation marks, some of which are ASCII, that have visually similar counterparts in the non-ASCII code tables.
What I do hope is that you aren't asking the question to construct such an attack.
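Once you have such a table, the search and replace itself is simple. The Python sketch below uses a tiny hand-picked mapping of four Cyrillic look-alikes purely for illustration; it is not taken from the confusables table, and the names LOOKALIKES, find_lookalikes and fold_lookalikes are hypothetical.

```python
# Hand-picked examples only; a real table would be far larger.
LOOKALIKES = {
    "\u0455": "s",   # CYRILLIC SMALL LETTER DZE
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
    "\u043e": "o",   # CYRILLIC SMALL LETTER O
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
}

_TABLE = str.maketrans(LOOKALIKES)


def find_lookalikes(text):
    """List (index, character, ASCII replacement) for each look-alike found."""
    return [(i, ch, LOOKALIKES[ch]) for i, ch in enumerate(text) if ch in LOOKALIKES]


def fold_lookalikes(text):
    """Replace every known look-alike with its ASCII counterpart."""
    return text.translate(_TABLE)
```

Blindly folding is only appropriate for text that is supposed to be ASCII. Applied to genuine Russian text it would corrupt perfectly valid letters.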
See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
Each line describes a Unicode character, for example:
1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;
If there are any similar (compatible) characters for that symbol, they appear in the sixth field of the entry, the decomposition, tagged <compat>. In this example, LATIN SMALL LETTER A WITH RIGHT HALF RING has a compatibility decomposition to 0061 (ASCII a) followed by 02BE.
As for your character, the entry is
0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405
which, as you can see, does not specify a compatibility character.
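The same information is available without parsing the file yourself. A short Python sketch using the standard unicodedata module, which exposes the decomposition field directly (the variable names are only for illustration):

```python
import unicodedata

half_ring_a = "\u1e9a"   # LATIN SMALL LETTER A WITH RIGHT HALF RING
dze = "\u0455"           # CYRILLIC SMALL LETTER DZE

# The decomposition field, exactly as it appears in UnicodeData.txt.
half_ring_decomposition = unicodedata.decomposition(half_ring_a)
dze_decomposition = unicodedata.decomposition(dze)   # empty: no mapping at all

# Compatibility normalization therefore changes the first character only.
half_ring_nfkc = unicodedata.normalize("NFKC", half_ring_a)
dze_nfkc = unicodedata.normalize("NFKC", dze)
```

So normalization leaves the Cyrillic ѕ untouched, which is why this route cannot find visually similar characters.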